

HiveMind: A Scalable and Serverless Coordination Control Platform for UAV Swarms

Justin Hu, Ariana Bruno, Brian Ritchken, Brendon Jackson, Mateo Espinosa, Aditya Shah, and Christina Delimitrou
Cornell University

Abstract— Swarms of autonomous devices are increasing in ubiquity and size. There are two main schools of thought for controlling devices in such swarms: centralized and distributed control. Centralized platforms achieve higher output quality but result in high network traffic and limited scalability, while decentralized systems are more scalable but less sophisticated.

In this work we present HiveMind, a centralized coordination control platform for IoT swarms that is both scalable and performant. HiveMind leverages a centralized cluster for all resource-intensive computation, deferring lightweight and time-critical operations, such as obstacle avoidance, to the edge devices to reduce network traffic. HiveMind employs an event-driven serverless framework to run tasks on the cluster, guarantees fault tolerance both in the edge devices and the serverless functions, and handles straggler tasks and underperforming devices. We evaluate HiveMind on a swarm of 16 programmable drones in two scenarios: searching for given items, and counting unique people in an area. We show that HiveMind achieves better performance and battery efficiency compared to fully centralized and fully decentralized platforms, while also handling load imbalances and failures gracefully, and allowing edge devices to leverage the cluster to collectively improve their output quality.

I. INTRODUCTION

Swarms of autonomous edge devices are increasing in number, size, and popularity [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]. From UAVs to self-driving cars and supply-chain robots, swarms are enabling new distributed applications, which often experience intermittent activity, and are interactive and latency-sensitive [1, 2, 10, 11, 22, 23, 24].

Coordination of large swarms of edge devices usually follows one of two main approaches. The first approach argues for decentralized control and smart edge devices, which perform most computation in situ, only transferring the results to the backend system [1, 10, 11, 14, 15]. This design avoids the high network traffic of a centralized system; however, it either requires edge devices to work individually, limiting their use cases and missing the potential for collective optimizations, or involves peer-to-peer communication between devices, which again increases network traffic.

The second approach uses centralized coordination systems [3, 5, 9, 12, 25], where the edge devices are merely a way to collect sensor data, while all computation and state management happens in a backend cluster. This approach benefits from the ample cloud resources, hence it can explore more sophisticated techniques than what is possible on edge devices, but it also experiences high overheads from network communication, as data is transferred to and from the cloud.

In response to the emergence of this new class of systems and services, cloud computing operators have developed new programming frameworks to make it easier for edge devices to leverage cloud resources, and to express the event-driven, interactive nature of their computation [26, 27, 28, 29, 30, 31, 32]. Serverless compute frameworks, for example, specifically target highly-parallel, intermittent computation, where maintaining long-running instances is not cost efficient [25, 33, 34, 35, 36, 37, 38, 39, 40, 41]. Serverless frameworks additionally simplify cloud management by letting the cloud provider handle application placement, resource provisioning, and state management, with users being charged on a per-request basis [34, 36, 41]. Serverless functions are instantiated in short-lived containers to improve portability, resource isolation, and security, and the containers are terminated shortly after the process they host completes, freeing up resources for other workloads.
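The per-request, event-driven model described above can be sketched as an OpenWhisk-style Python action: the platform invokes `main(params)` with a JSON payload and returns the result as a dict. The payload field names and the stubbed classifier below are illustrative assumptions, not HiveMind's actual interface.

```python
import base64

def classify_image(image_bytes):
    # Stand-in for a real recognition model (e.g., an SVM classifier);
    # a deployed action would run inference here.
    return {"found": False, "coords": None}

def main(params):
    # OpenWhisk-style entry point: params is the event payload (a dict),
    # and the return value becomes the function's JSON result.
    image = base64.b64decode(params["photo_b64"])
    result = classify_image(image)
    return {"drone_id": params["drone_id"], **result}
```

Because each invocation is independent and short-lived, the provider can pack many such functions onto shared servers and bill only for the milliseconds each one runs.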

From the cloud operator's perspective, serverless has two benefits: first, it gives the cloud provider better visibility into the characteristics of external workloads, allowing it to better optimize for performance, resource efficiency, and future growth. Second, it reduces the long-term resource overprovisioning current cloud systems suffer from [23, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52], by not requiring the end user to handle provisioning, instead allocating resources at fine granularity and for short periods of time.

While serverless is not exclusively applicable to swarm applications [22, 25, 36, 53, 54, 55, 56], it is well-suited to their requirements. AWS Greengrass, for example, is a variant of AWS's general serverless framework tailored to the requirements and characteristics of IoT applications, which allows devices to both train their ML models in the cloud and launch serverless functions for inference [57]. Finally, serverless frameworks offer a centralized persistent storage system for sensor data that the swarm can use to improve its decision quality over time. Despite the increasing prevalence of both IoT swarms and serverless compute, there are currently no systems that compare the advantages and issues of centralized and decentralized platforms, and highlight the potential of serverless in addressing their performance requirements.

In this work we first explore the performance and efficiency trade-offs between centralized and decentralized swarm coordination control. Based on the findings of this comparison, we present HiveMind, a centralized and scalable coordination control platform for swarms of edge devices. HiveMind is designed to optimize task latency, battery efficiency, and fault

arXiv:2002.01419v1 [cs.DC] 4 Feb 2020


[Fig. 1a workflow: 1. Send borders & route; 2. Move & take photo; 3. Send photo; 4. Schedule functions (λ); 5. Run image recognition; 6. Return & land. Input: Find 15×; Output: <x1,y1> <x2,y2> <x3,y3> … <x15,y15>.]

Fig. 1: (a) High-level operations for the first scenario in the centralized platform. (b) The drone swarm executing the first scenario, and (c) the second scenario. Red bounding boxes show each of the 16 drones in the swarm, yellow boxes show the 15 tennis balls and 25 people respectively, the blue box shows the router the drones use to communicate with the backend cluster, and the purple box shows the approximate location of the cluster. Buildings are blurred for double-blind review.

tolerance across the platform's cloud and edge components. It leverages event-driven computation using serverless compute to expose the fine-grained parallelism in operations triggered by edge devices, and improve their performance and efficiency. HiveMind keeps network traffic low by tasking the edge devices with filtering sensor data, and only transferring the most meaningful information to the cloud for further computation. It additionally exploits the centralized system to continuously improve the decision quality of edge devices, by letting them learn from each other's mistakes. Finally, HiveMind implements fault tolerance, load rebalancing, and straggler mitigation techniques to further improve performance predictability. We evaluate HiveMind using a 16-drone swarm; however, the platform's design and programming framework are not drone-specific, and can be used to port applications to other IoT swarms, such as self-driving vehicles.
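The edge-side filtering described above might look like the following sketch, in which a drone uploads only the frames whose cheap on-board pre-screen suggests a target. The confidence field and the 0.3 threshold are hypothetical values for illustration, not parameters from the paper.

```python
def frames_to_upload(frames, threshold=0.3):
    # frames: list of (frame_id, prescreen_confidence) pairs produced by a
    # lightweight on-board detector. Only frames at or above the
    # (illustrative) threshold are sent to the cluster, cutting traffic.
    return [fid for fid, conf in frames if conf >= threshold]
```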

We explore two application scenarios: locating stationary items, and counting the unique people in a bounded area. We show that HiveMind achieves better and more predictable performance and battery efficiency compared to fully centralized and fully decentralized systems, while also handling cloud and edge failures gracefully. HiveMind also achieves higher decision quality than a fully decentralized system, and better resource efficiency than a fully centralized system, enabling the swarm to efficiently scale to large numbers of edge devices.

II. APPLICATION SCENARIOS

A. Methodology

Drones: The swarm consists of 16 programmable Parrot AR.Drone 2.0 drones [58]. Each drone is equipped with an ARM Cortex A8 1GHz, 32-bit processor running Linux 2.6.32. There is 1GB of on-board RAM, which we complement with a 32GB USB flash drive to store the image recognition model and sensor data. Each drone also has by default a vertical 720p front camera used for obstacle avoidance, and the following sensors: gyroscope, accelerometer, magnetometer, pressure sensor, and altitude ultrasound sensor. We additionally fit each drone with a 12MP camera connected to the underside of the device over USB, which is used for high-definition photos.

Server cluster: We use a dedicated local cluster with 12 two-socket, 40-core servers with 128-256GB of RAM each, running Ubuntu 16.04. Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Servers communicate with the drone swarm over an 867Mbps LinkSys AC2200 MU-MIMO wireless router [59] using TCP. Finally, we deploy OpenWhisk [31] on the cluster to launch serverless functions, and instantiate new jobs inside single-concern Docker containers.

B. Stationary Item Detection

We first explore a scenario where the swarm needs to locate 15 tennis balls placed within the 2D borders of a baseball field. Fig. 1a shows the high-level order of operations, and Fig. 1b shows the swarm executing the scenario.

Drones fly at a height of 4-6m, move at 4m/s, and take photos of the ground every 1s to ensure full coverage of the terrain without excessive photo overlap. Fig. 2 shows an example of consecutive photo-taking intervals for a drone flying at an average 5m altitude. The camera has a 92° field of view (FoV), which results in an approximate coverage of 6.7m × 8.75m. At 4m/s, this ensures full coverage of the assigned space with some overlap between photos, to improve detection accuracy for items close to the borders of the drone's FoV. Duplicate items are disambiguated using their <x, y> coordinates.
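The coverage arithmetic above can be reproduced with a small helper. The per-axis FoV values used below (roughly 82° × 67°) are back-computed assumptions that yield the paper's ≈8.75m × 6.7m footprint at 5m; the paper only quotes the combined 92° figure.

```python
import math

def ground_footprint(altitude_m, hfov_deg, vfov_deg):
    # Each footprint side is 2 * h * tan(FoV/2) for a downward-facing camera.
    w = 2 * altitude_m * math.tan(math.radians(hfov_deg) / 2)
    h = 2 * altitude_m * math.tan(math.radians(vfov_deg) / 2)
    return w, h

def along_track_overlap_m(footprint_along_m, speed_mps, interval_s):
    # Overlap between consecutive photos along the direction of travel.
    return footprint_along_m - speed_mps * interval_s
```

At 4m/s with a 1s interval and a 6.7m along-track footprint, consecutive photos overlap by 6.7 − 4.0 = 2.7m, which is the "some overlap" the routing relies on.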

While Parrot AR 2.0 drones can move at a maximum speed of 12m/s, which would allow faster space coverage, we have found that at speeds over 7m/s control becomes difficult and flight becomes severely unstable, leading to crashes and equipment damage. This is especially the case when computation, such as obstacle avoidance, happens on-board. The photo-taking interval is also dictated by the fact that collecting photos more frequently than every 0.5s can lead to long network queueing delays for swarms larger than 15 drones. In Sec. VI we also explore photo intervals of 0.5s at 6m/s speeds for cases where not all sensor data are transferred to the cluster. We show that while this allows faster space coverage, it can also lead to flight instability when the on-board resources are highly utilized. Finally, to avoid all devices transferring data to the cloud at the exact same time, we also insert an initial 0.1s delay between the times at which the drones start their missions.



We build and deploy a centralized and a decentralized coordination control platform on the swarm, and explore their performance, reliability, and efficiency trade-offs in the next section. In both systems, there is a centralized controller that performs the initial work assignment between drones, and stores the final output in persistent storage. The controller evenly divides the area across all drones, and sends them the assigned border coordinates. It also communicates to each drone the routing strategy it should follow, and the interval at which photos should be collected. The route is derived using A* [60], where each drone tries to minimize the total distance traveled by photographing neighboring points in sequence.
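A minimal sketch of the controller's even work division, assuming equal-width vertical strips (the paper says the area is "evenly divided" without specifying the tiling, so the strip shape here is an assumption):

```python
def partition_area(x0, y0, x1, y1, n_drones):
    # Split the bounding box into n_drones vertical strips of equal width;
    # each drone receives one strip's border coordinates.
    width = (x1 - x0) / n_drones
    return [(x0 + i * width, y0, x0 + (i + 1) * width, y1)
            for i in range(n_drones)]
```

Each drone then plans its own route within its strip, e.g., with the A* search mentioned above.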


Fig. 2: Field of view (FoV), altitude, and speed of a Parrot drone.

Once the drones receive their assignment, they move to their starting points: the corner border point of each of their assigned regions that is closest to their take-off point. All drones take off from the same location. Once each drone reaches

its starting point, it starts collecting photos of the ground every 1s. Depending on the structure of the coordination control platform deployed, the drone either transfers the data over Wi-Fi to the backend cluster, or performs image recognition on-board. In both the centralized and decentralized platforms, obstacle avoidance happens on-board to avoid catastrophic failures caused by delays in network transfers. Obstacle avoidance leverages the drone's medium-resolution, non-pivotable front camera to detect solid objects in the drone's vicinity and adjust its route to avoid them. We use the obstacle avoidance SVM classifier in the ardrone-autonomy library [61], and train it, in addition to generic solid objects, on trees, people, other drones, and walls.

Centralized platform: In this case all sensor data are transferred to the cloud. Once the backend cluster receives a new image, the controller triggers the serverless framework to launch the recognition task. The OpenWhisk master finds and allocates cluster resources for the new job, and launches the serverless functions. Image recognition uses an SVM classifier, implemented in OpenCV based on the cylon framework [62], and trained on a dataset of various balls used in sports. Once the job completes, OpenWhisk informs the controller whether the image contained a unique tennis ball by comparing its coordinates to those of previously-identified balls. In the meantime, the drones continue their routes, and send new photos.

Decentralized platform: In the decentralized system, image recognition happens exclusively at the edge, using the same SVM classifier as in the centralized system, adjusted to account for the drone's different OS and hardware stack. Since there is no centralized state where the locations of identified balls are stored, the drones need to disambiguate their findings and discard any duplicate balls.
Once a drone covers its assigned region, it shares all coordinates containing tennis balls with its neighboring drones (any drones with which it shares a border). The recipients check for duplicates and only retain uniquely-identified balls. The process continues across the swarm, ensuring that disambiguation between a pair of drones is unidirectional, to avoid discarding the same balls twice.
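The coordinate-based disambiguation could be sketched as follows; the 1m merge radius is an illustrative assumption, as the paper does not state the threshold it uses:

```python
def merge_detections(detections, radius_m=1.0):
    # detections: (x, y) coordinates of detected balls, possibly containing
    # duplicates reported by neighboring drones. Two detections closer than
    # radius_m are treated as the same ball; the first one seen is kept.
    unique = []
    for x, y in detections:
        if all((x - ux) ** 2 + (y - uy) ** 2 >= radius_m ** 2
               for ux, uy in unique):
            unique.append((x, y))
    return unique
```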

In both platforms, if at any point one or more drones fail, or are close to running out of battery, the centralized controller repartitions their region equally among the neighboring drones. Rebalancing policies are discussed in detail in Sec. IV-F. Once all drones complete their routes, they return to their take-off location and land. The final output is stored in the cluster, and includes photos and coordinates for each tennis ball.

C. Mobile People Recognition

We now require the swarm to recognize a total of 25 people present on the baseball field, and count their number; the number of people is not known to the drones or the cluster in advance. People are allowed to move within the borders of the field while the scenario takes place. This introduces additional challenges as, unlike in the previous scenario, the number of people in a region can change over time, resulting in counting the same person multiple times as they cross between regions assigned to different drones. People can also be double-counted by standing close to the borders between regions. Finally, if a person knows a drone's route, they can move between regions such that they remain outside the FoV of nearby drones. We assume that people do not know the drones' routes, and do not actively try to avoid being photographed.

Figure 1c shows the swarm executing the scenario. As before, the controller assigns regions to each drone, and communicates the coordinates and route to them.

Centralized platform: Once the cloud receives a new image, it invokes OpenWhisk to launch the new face recognition task. Once the job completes, the OpenWhisk master informs the controller whether there was a person in the given photograph. Human recognition is based on the TensorFlow Detection Model Zoo [63, 64], a set of pre-trained models compatible with TensorFlow's Object Detection API, an open-source library used for training and testing object detection models. The TensorFlow Detection Model Zoo consists of 16 object detection models, trained on the COCO dataset [65]. The models provide bounding boxes around target objects as output, and are capable of detecting 80 types of objects, including humans.
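A detector in the Model Zoo family emits labeled, scored bounding boxes, so counting raw person detections (before the face-based disambiguation described next) reduces to a filter like the one below; the 0.5 score cut-off is an illustrative assumption.

```python
def count_person_detections(detections, min_score=0.5):
    # detections: (label, score, box) triples as emitted by an object
    # detection model; the box format is irrelevant to the count.
    return sum(1 for label, score, _box in detections
               if label == "person" and score >= min_score)
```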

Given that a person can move within a given area, we need to disambiguate between detected people. TensorFlow's Detection Model Zoo achieves high accuracy in full-body object detection, but is not designed for face recognition. Therefore, to disambiguate between people we use a face recognition framework in OpenCV based on FaceNet [66], a CNN-based system that directly learns a mapping between face images and a compact Euclidean space where distances correspond to a measure of face similarity. Having this mapping makes it easy to compute similarities between faces with FaceNet embeddings as feature vectors. Disambiguation uses the serverless framework; it starts as soon as the first people are identified, and usually finishes after the drones have completed their routes and transferred all data. To avoid cases where the



[Fig. 3 plots: per-task latency (ms, log scale) over time for each of the 16 drones, for centralized and decentralized processing. Task types: motion, image recognition, obstacle avoidance, data transfer, scheduling, and load (re)balancing.]

Fig. 3: Latency per task type for centralized (top) and decentralized processing (bottom) in the first scenario (locating tennis balls). The vertical line in the bottom graph signifies when the centralized platform completes the scenario.

[Fig. 4 plots: per-drone battery level (%) over time for the centralized and decentralized platforms.]

Fig. 4: Per-drone battery level for the centralized and decentralized platforms in the first scenario.

same person is photographed from the front and the back by one or more drones, in which case disambiguation based on face recognition is impossible, we limit TensorFlow's Detection Model Zoo to people whose face is at least partially visible.

Decentralized platform: As with the previous scenario, recognition and disambiguation happen at the edge, using the same models as in the centralized system, rewritten in OpenCV, since TensorFlow's dependencies could not run on the drones. Disambiguation follows a similar procedure as before, with the difference that drones now exchange photos instead of coordinates to differentiate people, since people may have moved between pictures, which increases network traffic.
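The FaceNet-style disambiguation described above reduces to thresholding the Euclidean distance between face embeddings: embeddings of the same person lie close together in the learned space. A minimal sketch follows; the 1.1 threshold is an illustrative assumption, not a value from HiveMind or FaceNet.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(emb_a, emb_b, threshold=1.1):
    # FaceNet-style decision: small embedding distance -> same identity.
    # The threshold value here is an illustrative assumption.
    return l2_distance(emb_a, emb_b) < threshold
```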

Once all drones complete their missions, they return to their take-off location and land. The output is stored in the cluster, and contains the photos and the total number of identified people.

III. CENTRALIZED VS. DECENTRALIZED COORDINATION

We now examine the trade-offs between centralized and distributed coordination platforms for the scenarios of Sec. II.

Performance: Fig. 3 shows the latency of different types of tasks throughout the duration of the first scenario. The top figure shows task latencies for the centralized platform and the bottom for the decentralized system. The placement of tennis balls in the field is identical in both cases. In the top figure there are six task types, three executing on the drones (motion control, obstacle avoidance, and data transfer) and three on the cluster (image recognition, scheduling of serverless tasks, and load (re)balancing). In the decentralized platform, there is no need for the serverless framework, since image recognition happens at the edge. The cluster is only used for work assignment and rebalancing.

The centralized platform completes the scenario in almost half the time (74s) required by the decentralized system (141s). The most time-consuming tasks involve motion control, as drones move between locations, followed by tasks transferring data to the cloud. Inserting a small amount of delay between drones avoids high spikes in data transfer,

and reduces network latencies, although queueing latencies can still occur when all drones are active. Once data are transferred to the cluster, image recognition happens fast, taking 23ms on average and 49ms at the 99th percentile of inferences. Scheduling serverless tasks is almost instantaneous, taking 3.2ms on average and 5.4ms in the worst case. Finally, work assignment happens at time 0 and only needs to be revisited once, towards the end of execution, when drone 14's battery starts draining disproportionately fast and its work is assigned to its neighboring drones to avoid it running out of battery. This also results in more obstacle avoidance tasks, as more drones congregate in the same area. The higher movement latencies at the end of the scenario correspond to drones returning to their take-off point, which can be far from their current location.
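The average and tail figures above are standard summary statistics over per-task latency samples; a nearest-rank p99 can be computed as follows (the estimator choice is ours, since the paper does not specify one):

```python
import math

def percentile(samples_ms, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p% of all samples are less than or equal to it.
    xs = sorted(samples_ms)
    k = math.ceil(p / 100 * len(xs)) - 1
    return xs[max(k, 0)]
```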

The main difference in the decentralized platform is that image recognition at the edge takes 1-2 orders of magnitude longer compared to serverless. This is not surprising, given the limited on-board resources, and the fact that serverless can leverage fine-grained parallelism even within a single recognition task. However, it has severe implications for execution time and for the drones' reliability. Occupying the drone's resources with image recognition means that the device is less agile and less able to adjust its course in a timely manner when needed, e.g., to avoid an obstacle. This makes motion control slower and more unpredictable, seen in a number of slow motion operations in the middle of the scenario; in the case of drones 4 and 14, it also resulted in them crashing at t = 74s, and being unable to continue their missions. This forces the backend controller to rebalance the load several times as the scenario progresses, both due to failures and due to low battery reserves for several drones. The benefit of the decentralized platform is reduced network traffic, as only photos with the target object are shared with the backend cluster. Both coordination platforms correctly identified all 15 tennis balls.

Fig. 5 shows a similar comparison for the second scenario.


Page 5: HiveMind: A Scalable and Serverless Coordination Control ... · of AWS’s general serverless framework tailored to the require-ments and characteristics of IoT applications, which

[Fig. 5 plots: per-task latency (ms, log scale) over time for centralized and decentralized processing in the second scenario, including people disambiguation tasks.]

Fig. 5: Latency per task type for centralized (top) and decentralized platforms (bottom) in the scenario where we want to count the unique people in an area. We show when the decentralized platform is unable to complete execution.

[Fig. 6 plots: per-drone battery level (%) over time for the centralized and decentralized platforms in the second scenario.]

Fig. 6: Per-drone battery level for the centralized and decentralized platforms in the second scenario.

This scenario is more computationally intensive, given the diversity in people's anatomy compared to uniform tennis balls, and the need to disambiguate people using their faces as opposed to coordinates. As with the first scenario, movement is the most time-consuming operation for the centralized platform, followed by data transfer and image recognition. Compared to the first scenario, recognizing people takes longer: 59ms on average and 159ms at the 99th percentile. Disambiguation is the last set of tasks to finish, after all drones have completed their missions, and incurs similar latencies to image recognition. Scheduling serverless tasks incurs similar latencies to the first scenario, despite the higher intra-job parallelism, because cluster resources never become oversubscribed. Between 72-100s the controller has to rebalance the work assignment to account for a subset of drones whose batteries are draining quickly, and to counteract the fact that drones 4, 5, and 14 are moving slower than the rest, and hence would degrade overall execution time. This causes their neighboring drones to move further away to cover the additional areas, seen in the higher movement latencies after t = 72s.

In contrast, the decentralized platform progresses at a slower pace, despite avoiding most data transfers, and is eventually unable to complete the scenario due to several drones running out of battery. The centralized controller rebalances the load several times; however, the remaining devices ultimately do not have sufficient battery to accommodate the extra work. Both people recognition and face disambiguation take 1-3 orders of magnitude longer than in the centralized platform, affecting the drones' flight stability and draining their batteries. Drone 5 is the last to run out of battery, at t = 243s. Most of the drones return to their take-off location before completely running out of battery, with the exception of drones 4 and 14, which ran out of battery before returning to the base station.

Unlike the centralized platform, the decentralized system is also penalized by the sensor data remaining isolated across drones, preventing drones from benefiting from each other's decisions. In Sec. VI we study the impact of a centralized data repository on a swarm's ability to continue learning online. Finally, the centralized system correctly identified all 25 people in the field, while the decentralized system missed 7 people due to drones running out of battery. To keep the comparison between the two systems fair, we instruct people to move in exactly the same way in both experiments.

Battery efficiency: Fig. 4 shows the per-drone battery level for the first scenario across the two platforms. All drones start with 100% battery charge. In the case of the centralized platform, battery depletion is mostly uniform, with the exception of drone 14, whose battery drains at a faster rate due to a fault in the power controller's firmware that kept the core always at the highest frequency. To address this, the centralized controller rebalances the load between drone 14 and its neighboring drones to avoid completely draining its battery. The average battery level across all drones at the end of the scenario is 73.2%.

Battery depletion is less uniform in the decentralized platform, with some drones losing charge at higher rates, depending on how quickly they perform the on-board image recognition. The increased resource load also results in less reliable motion control, causing drones 4 and 14 to crash and power off. The average battery level in the end is 29.8%.

Fig. 6 shows the per-drone battery level for the second scenario. The trade-offs are clearer here due to the increased computational requirements of people recognition, compared to recognizing a single stationary item. The average battery level at the end of the scenario for the centralized platform is 65%, lower than before, but much higher than in the decentralized system, where almost all drones ran out of battery.

Computation vs. communication: Fig. 7 shows the breakdown into network communication, computation, and management operations for different latency percentiles. Communication includes data transfer to/from the cloud and edge devices, computation includes the image recognition and disambiguation tasks, and management includes the overhead of serverless task scheduling. We omit the load (re)balancing tasks, as they are very infrequent. In the centralized platform, data transfer accounts for the largest latency fraction, especially at high percentiles. In comparison, the decentralized platform incurs lower data transfer latencies, which would allow the swarm to scale to larger device counts. On the other hand, computation is much more costly in the decentralized platform, especially in the second scenario, with the latency of recognition tasks additionally being highly variable. For example, the 99th percentile of image recognition for the second scenario is 159ms for the centralized platform compared to 2006ms for the decentralized one, hurting performance predictability. Finally, management tasks introduce negligible overheads, less than 5.5ms in all cases.

Fig. 7: Task latency breakdown (data transfer, Lambda scheduling, image recognition) at the 50th, 90th, and 99th request percentiles for the centralized and decentralized platforms across the two scenarios (tennis ball detection and people detection).

IV. HIVEMIND DESIGN

A. Overview

The analysis of Sec. III showed that centralized platforms achieve better performance, battery efficiency, and output quality compared to decentralized coordination control systems, but at the cost of much higher bandwidth usage, which limits their scalability. In this section we present HiveMind, a coordination control platform for large IoT swarms designed to achieve the best of both centralized and decentralized platforms, and to optimize for performance, battery efficiency, and fault tolerance. To evaluate HiveMind, we use the same 16-drone swarm as before; however, HiveMind's design principles are not drone-specific, and the platform can be used for diverse types of IoT swarms, including autonomous vehicles.

HiveMind is designed with the following principles, each of which is detailed below:
• centralized control,
• event-driven, fine-grained cloud computation,
• scalable scheduling and resource allocation,
• on-board preprocessing,
• fault tolerance,
• dynamic load balancing,
• continuous learning, and
• a generalizable programming framework.

B. Centralized Controller

HiveMind uses a centralized controller to obtain global visibility into the state and sensor data of all devices in a swarm. This allows the platform to ensure higher-quality routing, better fault tolerance (Sec. IV-E) and load balancing (Sec. IV-F), as well as to leverage the entire swarm's data to continuously improve its decision quality (Sec. IV-G).

The controller consists of a load balancer, which for the examined scenarios also performs the swarm's route mapping, an interface to the serverless framework responsible for cloud-side computation, an interface to communicate with the edge devices, and a monitoring system that collects tracing information from the edge devices and cloud servers. The controller also has visibility over the sequence of operations in a scenario, and is responsible for training and deploying the initial models to the edge devices, and for retraining them later, if necessary. The controller is implemented as a centralized process in C++, with two hot standby copies of the master that can take over quickly in the case of a failure.

C. On-Board Preprocessing

The analysis of Sec. III showed that offloading all computation to the cloud can lead to network link saturation, limiting the swarm's scalability. Additionally, there are mission-critical tasks, such as obstacle avoidance, that cannot afford the latency of sending data to/from the cloud. HiveMind leverages the on-board resources to preprocess sensor data, and only sends to the backend cluster the data deemed important enough for further processing. In the case of the first scenario we examine, drones perform an initial image recognition on-board to find objects with an approximately circular shape using a simple and lightweight detection library, and only offload to the cloud images that contain such objects. Although this approach is prone to false positives, and in some cases false negatives too, it greatly reduces bandwidth usage, allowing HiveMind to support a larger number of IoT devices. Similarly, HiveMind uses the on-board compute resources and low-power front camera to detect obstacles in the drone's vicinity, and adjusts its route to avoid them. Offloading this operation to the backend cluster can be subject to high latencies that often result in crashes and catastrophic equipment failures.
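As a concrete illustration, the offload decision for the first scenario can be sketched as follows; the function names, thresholds, and detection representation are our assumptions, not HiveMind's actual code:

```python
# Sketch of an on-board prefilter: a lightweight circle detector produces
# candidate detections, and only frames with a plausible candidate are
# offloaded to the cloud; everything else stays on the local flash drive.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    x: int            # circle center, pixels
    y: int
    radius: int       # pixels
    confidence: float # detector score in [0, 1]

def should_offload(detections: List[Detection],
                   min_conf: float = 0.5,
                   radius_range=(4, 60)) -> bool:
    """Offload only if at least one detection looks like a plausible
    tennis-ball candidate (assumed confidence and size thresholds)."""
    lo, hi = radius_range
    return any(d.confidence >= min_conf and lo <= d.radius <= hi
               for d in detections)

# One plausible circle -> offload; tiny or low-confidence ones -> keep local.
frame_a = [Detection(120, 88, 17, 0.81)]
frame_b = [Detection(10, 10, 2, 0.9), Detection(40, 40, 30, 0.2)]
```

A prefilter like this trades some false positives (circular non-balls still get offloaded) for a large reduction in transferred frames, which matches the bandwidth behavior reported in Sec. VI.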

D. Serverless Cloud Framework

HiveMind uses OpenWhisk to launch and execute tasks on the backend cluster. Serverless computing allows exploiting fine-grained parallelism within a single job, decreasing task latency. Once a new photo arrives from a drone, the centralized controller invokes the OpenWhisk scheduler to launch a new job. Each job is divided into several serverless functions, either determined empirically by the OpenWhisk master based on the amount of data processed, or defined by the user. Each function is then spawned in a Docker container and allocated one core, 2GB of memory, and 512MB of disk storage by default, consistent with resource allocations on AWS Lambda [26], Google Functions [28], and Azure Functions [27]. Each function can also access the remote shared persistent storage system holding all the training datasets and the sensor data transferred by the edge devices, similar to AWS Lambdas accessing S3 storage.
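A minimal sketch of how a job might be split into functions under the defaults above; the chunk size and helper names are assumptions (OpenWhisk's real split is determined empirically by its master or by the user):

```python
# Sketch: split a recognition job into serverless functions sized by the
# amount of data to process, attaching the paper's default per-function
# allocation (1 core, 2GB memory, 512MB disk).
DEFAULTS = {"cores": 1, "memory_mb": 2048, "disk_mb": 512}

def plan_job(image_bytes, chunk_bytes=512 * 1024, user_functions=None):
    """Return one resource spec per serverless function for a single job."""
    if user_functions is not None:
        n = max(1, user_functions)              # user-defined split wins
    else:
        n = max(1, -(-image_bytes // chunk_bytes))  # ceil-divide by chunk size
    return [{"function_id": i, **DEFAULTS} for i in range(n)]

specs = plan_job(image_bytes=1_300_000)  # ~1.3MB photo -> 3 functions
```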

Each worker node runs a worker monitor, which tracks and reports resource utilization to the OpenWhisk master, allowing it to adjust the amount of allocated resources via the Docker resource interface, if necessary. A user can also bypass the scheduler's resource allocation policy and customize the amount of resources per serverless function. Similarly, users can express priorities for different types of jobs, or for different edge devices. In our scenarios we assume that all drones have equal priority. When cluster resources are plentiful, CPUs are dedicated to a single container to avoid contention. Given that each function lasts at most a few hundred milliseconds in our scenarios, this does not result in new jobs being queued waiting for resource allocations. We plan to explore more resource-efficient allocation strategies in future work.

We have also implemented a task monitoring infrastructure in OpenWhisk that checks the progress of active serverless functions, and flags potential stragglers that can degrade execution time. If a serverless function takes longer than the 90th percentile of functions in that job, OpenWhisk respawns the misbehaving tasks on new physical servers, and uses the results of whichever tasks finish first [67, 68]. The exact percentile that signals a straggler can be tuned depending on the importance of a job. If several underperforming tasks all come from the same physical node, that server is put on probation for a few minutes until its behavior recovers. OpenWhisk also respawns any failed tasks by default.
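The straggler check can be sketched as follows; the helper names and sample latencies are ours, and only the 90th-percentile rule comes from the text:

```python
# Sketch: a task still running past the job's 90th-percentile completion
# latency is flagged as a straggler and would be respawned on another
# server; the first copy to finish wins.
import statistics

def p90(latencies_ms):
    """90th percentile of completed-function latencies in this job."""
    qs = statistics.quantiles(latencies_ms, n=10, method="inclusive")
    return qs[8]  # the 9th of 9 cut points is the 90th percentile

def find_stragglers(completed_ms, running_elapsed_ms, pct_threshold=None):
    """Flag running tasks whose elapsed time exceeds the (tunable) threshold."""
    threshold = pct_threshold if pct_threshold is not None else p90(completed_ms)
    return [task for task, elapsed in running_elapsed_ms.items()
            if elapsed > threshold]

completed = [40, 45, 50, 52, 55, 58, 60, 62, 70, 80]   # finished functions (ms)
running = {"f-07": 35.0, "f-11": 210.0}                # f-11 is past the p90
```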

Once a job completes, the OpenWhisk master informs the centralized controller and passes the job's output to it. By default, a container is terminated when its function completes execution. This can introduce significant instantiation overheads if job churn is high. In Sec. VI we explore different container keep-alive policies to reduce start-up overheads.

E. Fault Tolerance at the Edge

Fig. 8: Load repartitioning to handle a drone failure (area assignment after the drone failure, and after load rebalancing).

Edge devices are prone to failures and unpredictable behavior. In the scenarios we examine, all drones send a periodic heartbeat to HiveMind (once per second). If the controller does not receive a heartbeat for more than 3s, it assumes that the drone has failed. HiveMind handles such failures by repartitioning the load among the remaining drones. Fig. 8 shows such an example for our scenarios. Immediately after HiveMind realizes that the red-marked drone has failed, it repartitions its assigned area equally among its neighboring drones, assuming they have sufficient battery, and updates their routing information. Depending on which drone has failed, this involves reassigning work to 3-8 drones for this example. If the failed drone had already executed part of its route, HiveMind only repartitions the remaining area. If the non-responsive drone makes contact before the other drones start its work, the adjustment is reverted.

F. Dynamic Load Rebalancing

Failure is not the only reason why load may need to be repartitioned among edge devices. Often some devices deplete their battery reserves faster, either due to hardware/software bugs, or due to performing more resource-intensive computation or movement. In any of these cases, HiveMind repartitions the work assigned to edge devices. When repartitioning work, HiveMind tries to accommodate the extra work using neighboring devices only, if possible, to avoid long travel times. We plan to extend load rebalancing to account for heterogeneous edge devices as part of future work.
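The failure detection and repartitioning steps of the two subsections above can be sketched as follows; the grid-cell area representation and round-robin split are simplifying assumptions (the paper splits a failed drone's area equally among neighbors with sufficient battery):

```python
# Sketch of heartbeat-based failure detection and neighbor repartitioning:
# drones beacon once per second, and a drone silent for more than 3s is
# presumed failed; its unvisited area is split over its neighbors.
FAILURE_TIMEOUT_S = 3.0

def detect_failures(last_heartbeat, now):
    """Return drones whose last heartbeat is older than the 3s timeout."""
    return [d for d, t in last_heartbeat.items() if now - t > FAILURE_TIMEOUT_S]

def repartition(remaining_cells, neighbors):
    """Round-robin the failed drone's unvisited grid cells over neighbors."""
    plan = {n: [] for n in neighbors}
    for i, cell in enumerate(remaining_cells):
        plan[neighbors[i % len(neighbors)]].append(cell)
    return plan

beats = {"drone-3": 10.0, "drone-4": 6.5, "drone-5": 9.8}
failed = detect_failures(beats, now=10.2)   # drone-4 has been silent for 3.7s
extra = repartition([(0, 0), (0, 1), (1, 0)], ["drone-3", "drone-5"])
```

If the "failed" drone reports back before its cells are picked up, the plan would simply be discarded, mirroring the revert behavior described above.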

G. Continuous Learning

A benefit of centralized coordination is that data from all devices can be used collectively to improve the learning ability of the swarm. Once HiveMind receives the first few images that edge devices have tagged as containing the target object, it verifies whether their detection was accurate. In the case of a false positive, HiveMind penalizes the incorrect decision, periodically retrains the on-board detection engines, and redeploys the new model to the edge devices. In Sec. VI we show that leveraging decisions from all devices improves decision quality much more quickly than retraining a device only based on its own decisions. To handle false negatives, after the end of a scenario's execution, HiveMind verifies that any images not sent to the cluster did indeed not contain the target object. These images are stored on the drone's local flash drive for validation. In case of undetected objects, HiveMind retrains the on-board models, and redeploys them for the next execution.
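A sketch of this feedback loop, with an assumed retraining threshold (the paper only says retraining happens periodically; class and method names are ours):

```python
# Sketch: the cloud verifies each on-board detection, accumulates labeled
# feedback, and schedules a retrain/redeploy once enough false positives
# accumulate. The retrain itself is a placeholder.
RETRAIN_AFTER_FP = 5   # assumed threshold for "periodic" retraining

class FeedbackLoop:
    def __init__(self):
        self.false_positives = []   # (drone_id, image_id) pairs awaiting retrain
        self.retrains = 0

    def verify(self, drone_id, image_id, onboard_said_target, cloud_says_target):
        # A false positive: the drone offloaded an image the cloud rejected.
        if onboard_said_target and not cloud_says_target:
            self.false_positives.append((drone_id, image_id))
            if len(self.false_positives) >= RETRAIN_AFTER_FP:
                self.retrain_and_redeploy()

    def retrain_and_redeploy(self):
        # Placeholder for retraining in the cloud (see footnote 2) and
        # pushing the updated model back to the edge devices.
        self.retrains += 1
        self.false_positives.clear()

loop = FeedbackLoop()
for i in range(7):                          # 7 incorrect on-board detections
    loop.verify("drone-2", i, True, False)  # triggers one retrain at the 5th
```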

H. Programming Framework

HiveMind uses a high-level, event-driven programming framework to allow users to express new application scenarios. The framework supports applications in node.js and Python, and we are currently expanding it to Scala. Users have to express the sequence of operations in their scenario, as well as define which operations run on the edge devices, and which on the serverless framework. Users are also responsible for providing any ML models and training datasets needed for their applications. Finally, users can optionally express serverless scheduling policies that deviate from what HiveMind supports by default, as well as priorities between edge devices. They can also customize the fault tolerance, load rebalancing, and continuous learning techniques. HiveMind automatically handles the communication interfaces between OpenWhisk, the controller, and the edge devices, as well as the monitoring and tracing frameworks in the cloud and edge.
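A hypothetical example of what expressing a scenario could look like; the decorator-style registry below is our invention for illustration, not HiveMind's documented interface:

```python
# Sketch of an event-driven scenario definition: the user registers each
# operation and declares whether it runs on the edge device or as a
# serverless function in the cloud.
EDGE, CLOUD = "edge", "cloud"

class Scenario:
    def __init__(self, name):
        self.name, self.steps = name, []

    def step(self, where):
        """Decorator: record the handler and its placement, in order."""
        def register(fn):
            self.steps.append((fn.__name__, where))
            return fn
        return register

find_balls = Scenario("tennis-ball-search")

@find_balls.step(EDGE)
def shape_detection(photo):    # lightweight on-board prefilter
    ...

@find_balls.step(CLOUD)
def item_recognition(photo):   # detailed recognition as a serverless function
    ...
```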

I. Putting It All Together

Fig. 9 shows an overview of the platform's cloud and edge components, and the sequence of operations when executing the first scenario. The controller first communicates to the drones their work assignment and route, after which drones start their mission, collecting photographs and analyzing them on-board. In the first scenario, this involves drones using a simple SVM classifier implemented in OpenCV based on cylon [62] to put bounding boxes around any circular objects. Images with such objects are transferred to the cloud for detailed image recognition. After the serverless functions complete, OpenWhisk informs the HiveMind controller whether a unique tennis ball indeed existed in the image. The scenario continues until all drones cover their assigned regions.


Fig. 9: HiveMind platform overview, and sequence of operations for the first scenario, including steps such as AssignRegion(dronek), Move(x,y,z), TakePhoto, on-board ShapeDetection(photo,y/n), cloud ItemRecognition(photo), and ExistsTargetItem(y/n).

In the second scenario, the drones perform an initial human detection, placing bounding boxes around shapes that resemble humans, using the rectangular and oval item recognition model of cylon in OpenCV [62]. They then only transfer images with such shapes to the cloud. The backend cluster uses the detailed TensorFlow model to verify that the image indeed contained a human, and to disambiguate them against previously-detected people. In Sec. VI we evaluate the accuracy of on-board detection for both scenarios.

V. IMPLEMENTATION

HiveMind controller: The HiveMind controller is written in approximately 10,000 lines of C++ and currently supports Ubuntu 14.04 and newer versions. It includes the load balancer, route planner, and the interface to the edge devices via the wireless router, as well as the interface to invoke and receive information from the serverless framework. The controller communicates with edge devices over TCP, although UDP is also supported. We have also implemented a monitoring system in the controller that tracks application progress, errors, and device status, and verified that the tracing system has no meaningful impact on performance: less than 0.1% impact on task tail latency, and less than 0.2% on throughput.

Serverless framework: We use OpenWhisk v0.10.0 and extend the OpenWhisk master by 4,000 LoC in Go to support the task scheduling and straggler detection policies described in Sec. IV, as well as to implement the interface with the HiveMind controller. We also implement a fine-grained monitoring system both in the OpenWhisk master and in each worker node to track task latency, resource utilization, and function errors.

Edge devices: The platform on the edge devices includes the network interface to talk to the HiveMind controller over TCP using the wireless router, the motion controller that executes the drone's route, the obstacle avoidance engine, the on-board object detection engines, and a logger to track task performance, errors, and battery levels. Finally, the code on the edge devices also handles state management by storing on the on-board flash drive any images not sent to the cloud.

Applications: HiveMind currently supports applications in node.js and Python, and is being extended to also support Scala. The two application scenarios we evaluate primarily use OpenCV libraries for item and people recognition at the edge, and OpenCV and TensorFlow models when recognition happens in the backend serverless framework.

VI. EVALUATION

Performance: The top rows of Fig. 10 and 11 show the latencies of tasks executing on the edge devices with HiveMind for the two scenarios. On-board recognition now looks for circular objects in the first scenario, and human-shaped objects in the second. Photos with such objects are transferred to the cloud, while other images are stored on board. On-board recognition is considerably faster (18-117×) than in the decentralized platform, where drones had to recognize the precise target objects, allowing the devices to consume less battery and maintain more reliable motion control. Since HiveMind transfers much less data to the cloud, it can also collect images at a higher rate (twice per second as opposed to once), covering a given area faster. Below we also explore a similar policy for the centralized and decentralized platforms.

Motion and obstacle avoidance take similar time to the centralized platform, while data transfer is significantly reduced, and does not suffer the high queueing latencies of the centralized system. Movement again takes longer towards the end of the scenario, when drones return to their take-off location. The results are consistent across the two scenarios, with people-shaped detection being slightly more computationally intensive, causing the scenario to take longer to complete.

Bandwidth utilization: The next rows of Fig. 10 and 11 show the network bandwidth utilization for all three platforms. For HiveMind, we show two photo-taking intervals: 1s at 4m/s speed and 0.5s at 6m/s speed. We also attempted to do the same with the centralized and decentralized platforms. In the centralized case, bandwidth quickly saturates once all drones start sending data at peak frequency, resulting in high packet drops. In the decentralized system, on-board recognition already stresses the drones' capabilities, so requiring the process to happen at twice the rate caused their batteries to deplete even faster, leaving both scenarios incomplete.

Even at 1s intervals, the centralized platform uses much higher bandwidth than the other platforms in both scenarios. The decentralized platform uses the least bandwidth, as only photos with detected objects are transferred to the cloud, with the exception of slightly higher bandwidth usage towards the end of the scenarios, when drones exchange their individual results to perform object and people disambiguation.

HiveMind uses more bandwidth than the decentralized platform but less than the centralized system, as it filters out sensor data with no objects resembling the target. Unsurprisingly, when the rate of sensor data collection is higher, the bandwidth usage is higher too, although the whole scenario takes less time to complete. The lower bandwidth usage of HiveMind also means that the platform can scale to a larger number of edge devices. For example, while the current wireless router would saturate after 26 drones in the centralized case, with HiveMind it can support up to approximately 150 drones.


Fig. 10: From top to bottom for the first scenario: (a) latency of different tasks running on the drones with HiveMind, (b) bandwidth usage across platforms, (c) performance and battery efficiency comparison, and (d) the impact of online retraining on the accuracy of on-board recognition.

Fig. 11: From top to bottom for the second scenario: (a) latency of tasks running on the drones with HiveMind, (b) bandwidth usage across platforms, (c) performance and battery efficiency comparison, and (d) the impact of online retraining on the accuracy of on-board recognition.

Battery usage: The following rows of Fig. 10 and 11 show the performance and battery levels across platforms and scenarios. HiveMind at 1s photo intervals achieves very similar performance to the centralized platform, while consuming more battery due to the on-board detection. At 0.5s photo intervals both scenarios complete faster, and the reduced total execution time allows the swarm to consume less battery than when moving at a slower speed, almost canceling out the effect of on-board recognition, and finishing the scenario with similar battery levels to the centralized platform.1

Continuous learning at the edge: We now explore the benefit of retraining the on-board image recognition engines. We also show the impact of centralized learning on decision quality, compared to decentralized learning at the endpoints.

The last rows of Fig. 10 and 11 show the accuracy of on-board recognition with HiveMind at 1s and 0.5s photo intervals. We show the number of correctly-identified items by drones, any false negatives, i.e., the tennis balls and people the drones missed, respectively, any false positives, i.e., the number of photos transferred to the cloud which did not end up containing the target objects, any objects detected multiple times (duplicates), and finally the total number of recognition jobs triggered in the cloud by edge devices.

The left-most figures show these statistics when there is no retraining of the on-board recognition engines during the scenario's execution. Although the drones successfully detect several tennis balls and a few people, they also have several false negatives. Since the images for those items/people were not transferred to the cloud, there is no way for the centralized system to identify them, except by mining all on-board data after the end of the scenario. Drones also experience a large number of false positives, by flagging items that, e.g., have circular shapes but are not tennis balls, which increases the total amount of cloud resources used and data transferred.

1 Speed does not affect battery consumption severely, with most battery depletion happening during take-off and while keeping the drone airborne.

We now enable online retraining, where once a drone starts transferring images to the cloud, the cluster uses this drone's images to retrain its on-board recognition engine by penalizing false positives, and redeploys the newly-trained model to the drone over the network, allowing it to improve as the scenario is running.2 While this significantly reduces the false positives, as seen in the middle graphs, it does not solve the problem of false negatives. To address this, once the scenario completes, the cluster mines the on-board storage to flag any false negatives, retrains the on-board models, and redeploys them for the next time the scenario executes. This results in fewer, although still existing, false negatives, as well as a lower total number of cloud recognition jobs, which saves cloud resources and network bandwidth. Duplicates are the result of partial overlap between images from the same or different drones, so the technique above cannot address them, although they are not as impactful as false negatives and false positives.

Finally, we investigate the potential of using the false negatives/positives of the entire swarm to improve their on-board models in a collective way, given that their images are

2 We retrain models in the cloud to avoid depleting the on-board resources.


Fig. 12: From top to bottom for the first scenario: (a) latencies for cloud tasks with HiveMind, (b) number of active CPUs, (c) CPU utilization per server and cost for serverless resources, and (d) performance (left) and number of reused and non-reused containers across platforms (right).

Fig. 13: From top to bottom for the second scenario: (a) latencies for tasks executing in the cloud with HiveMind, (b) number of active CPUs, (c) CPU utilization per server and cost for serverless resources, and (d) performance (left) and number of reused and non-reused containers across platforms (right).

transferred to the centralized cluster. The right-most graphs of the last rows of Fig. 10 and 11 show the impact of global retraining on the drones' detection accuracy. False negatives are completely eliminated for both scenarios, while false positives are further reduced. This shows that a centralized platform allows edge devices to benefit from each other and learn faster than in a fully decentralized system.

Cloud task performance: We now explore metrics related to the centralized cluster. The top rows of Fig. 12 and 13 show the latencies for cloud image recognition, task scheduling, and load (re)balancing. The results are consistent with those for the centralized platform, with the difference that there are many fewer image recognition tasks in HiveMind. This also simplifies the job of the serverless scheduler, which now takes an average of 1.8ms to schedule a new job, compared to 3.2ms in the centralized platform. The load balancer had to rebalance the work once for the first scenario, when drone 14's battery started draining abnormally quickly, and a couple of times for the second scenario to account for drones in highly-populated areas that consumed more battery by needing to transfer more data to the cloud.

Cloud utilization: The next two rows in Fig. 12 and 13 show the number of active CPUs in the cluster for HiveMind and the centralized platform, and the CPU utilization per server for HiveMind with 0.5s photo intervals. The centralized platform uses 4-6× more CPUs than HiveMind, since it performs image recognition on all sensor data. Between the two variants of HiveMind, collecting data at a higher rate also means more cloud resources per unit time, although this also results in faster execution. The second scenario involves more computationally-intensive work, resulting in more serverless functions per job, and therefore a higher number of CPUs.

The heatmaps below show that despite 30-75 CPUs being active at a time with HiveMind, the actual CPU utilization is low, as each serverless function only keeps a single CPU occupied for a few msec. This can cause resource underutilization; however, we currently focus primarily on optimizing inference latency by scheduling new jobs on available, uncontended resources as fast as possible. We plan to explore more resource-efficient placement policies in future work.

Serverless cost: The right graphs of the third rows of Fig. 12 and 13 show the cost of hosting the cluster on a commercial serverless framework, specifically AWS Lambda [26], and executing each scenario once to completion. We use pricing information from the time of submission, and observe that the majority of the cost for all platforms comes from hosting the training and output data on S3, as opposed to processing lambdas. The S3 cost for the centralized platform and HiveMind also includes the read and write accesses needed to perform image recognition in the cloud, with the overall processing cost for the centralized platform being higher than HiveMind's.

Container reuse: Finally, we examine the impact of container reuse on performance and resource efficiency. The last rows of Fig. 12 and 13 show the performance and number of



reused and uniquely-used cloud containers for the centralized platform and HiveMind, when no reuse is allowed, when containers remain alive only for 100ms after their task completes, and when they remain alive for an additional 1s and 5s after the end of a task's execution. By default, each serverless function has its own container, consistent with the allocation policies of AWS Lambda [26], Google Functions [28], and Azure Functions [27]. Disallowing container sharing means that only a single function can use a container before it is terminated, which results in several thousand containers being spawned throughout the execution of each scenario, especially for the more computationally-intensive people recognition. This causes significant degradation in execution time, as each serverless function must sustain the overhead of initializing a new container on a new CPU. This is especially problematic when all drones are active, and unallocated CPUs are sparse.
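The keep-alive tradeoff above can be illustrated with a minimal warm-container pool: a container released by a finished task stays warm for a fixed window and is reused if a new task arrives in time, otherwise a cold start spawns a new one. This is a simplified sketch (the class and names are illustrative assumptions, not HiveMind's actual scheduler):

```python
class ContainerPool:
    """Reuse warm containers for up to `keep_alive` seconds after a task ends."""

    def __init__(self, keep_alive=1.0):
        self.keep_alive = keep_alive  # e.g. 0.1, 1.0, or 5.0 seconds
        self.warm = []                # list of (container_id, release_time)
        self.spawned = 0              # uniquely-used containers (cold starts)
        self.reused = 0

    def acquire(self, now):
        # Evict containers whose keep-alive window has expired.
        self.warm = [(c, t) for (c, t) in self.warm if now - t <= self.keep_alive]
        if self.warm:
            cid, _ = self.warm.pop(0)  # reuse the oldest warm container
            self.reused += 1
            return cid
        self.spawned += 1              # cold start: pay instantiation overhead
        return self.spawned

    def release(self, cid, now):
        self.warm.append((cid, now))

pool = ContainerPool(keep_alive=1.0)
# Tasks arriving 0.5s apart reuse the warm container; a 2.3s idle gap
# exceeds the 1s window and forces a second cold start.
c1 = pool.acquire(0.0); pool.release(c1, 0.2)
c2 = pool.acquire(0.5); pool.release(c2, 0.7)
c3 = pool.acquire(3.0)
print(pool.spawned, pool.reused)  # -> 2 1
```

Setting `keep_alive=0` in this model reproduces the no-reuse policy, where every function pays the cold-start cost.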

Fig. 14: Performance and battery under drone failures (execution time and average remaining battery for 0-3 failures, Tennis Balls and People scenarios).

Allowing containers to be reused reduces instantiation overheads, but can introduce resource contention by keeping compute and memory allocated without doing useful work. For this experiment we do not use sleep states in the containers; adding sleep states would free up more resources but delay task start times.

The figure shows that keeping containers alive for up to 100ms after their task completes significantly reduces the uniquely-used containers in the centralized system, where there is a high probability that a new job arrives within this interval, but does not have a significant impact on the performance and reused containers of HiveMind. Keeping containers alive for an additional 1s reduces the total number of containers in HiveMind, and improves the scenarios' execution time by avoiding most instantiation overhead. Keeping containers alive for 5s almost eliminates uniquely-used containers, although it does not have an additional benefit on performance, given that there should already be new data arriving every 1s from the drones, and that the total number of active tasks is relatively low, such that incurring some instantiation overhead does not lead to long queueing delays in the serverless scheduler. Nonetheless, between the two variants of HiveMind, collecting images every 0.5s means that there is a higher chance of reusing containers than when data arrives at a lower frequency. Finally, people recognition takes longer than item recognition, which increases the probability of reusing a keep-alive container. Unless otherwise specified, we keep containers alive for 1s past their task's execution.
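Under a simplified periodic-arrival model (an illustration, not HiveMind's measured behavior), these results follow from whether the idle gap between consecutive tasks fits inside the keep-alive window:

```python
def container_reused(arrival_interval, service_time, keep_alive):
    """With periodic arrivals, a released container is reused iff the idle
    gap until the next task fits inside the keep-alive window (all in sec)."""
    return (arrival_interval - service_time) <= keep_alive

# Recognition lambdas run for roughly tens of msec; drones send data every 1s.
print(container_reused(1.0, 0.05, 0.1))  # -> False: a 100ms window is too short
print(container_reused(1.0, 0.05, 1.0))  # -> True: a 1s window covers the gap
```

This matches the observation that extending keep-alive beyond the 1s data-arrival period (e.g., to 5s) eliminates few additional cold starts.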

Fault tolerance: Fig. 14 shows the performance and average battery at the end of each scenario with HiveMind, excluding failed drones, when we force a randomly-selected subset of drones to land and power down at random points during execution. When a drone fails, the controller redistributes its assigned region to its neighboring drones, as discussed

Fig. 15: (a) Impact of straggler mitigation on the latency of serverless tasks. (b) Performance variance across runs for the Centralized, Decentralized, and HiveMind (1s and 0.5s interval) platforms.

in Sec. IV-E. One failure is easily absorbed by the swarm, without significant performance or battery degradation. Two failures have a visible impact on battery, although the performance impact is limited: 10.3% for the first scenario, and 11.8% for the second. Three failures lower the battery by the end to 48% for the first scenario and 43% for the second, and degrade performance by 24% and 26% respectively. In all cases HiveMind is able to redistribute the load and complete each scenario.

Straggler mitigation: Fig. 15a shows the impact of straggler mitigation on serverless task latency for the second scenario; results are similar for the first scenario. By default, a small number of serverless tasks can take orders of magnitude longer to complete than the average, either due to faults or resource contention, delaying execution and keeping resources busy. By detecting stragglers eagerly, HiveMind is able to reduce long tails in their execution. The larger the swarm, the more tasks are spawned in the backend cluster, and the more critical it becomes to handle straggler tasks quickly.

Performance predictability: Finally, Fig. 15b shows violin plots of execution time variability for both scenarios across platforms. Across all experiments, we use the same placement for tennis balls, and instruct people to follow the same route. The centralized system achieves low and predictable execution time, since it leverages the ample resources of the backend cluster for most computation. The decentralized system has the largest performance variability, especially in the second scenario, where computation is more resource-demanding. HiveMind experiences slightly higher performance jitter compared to the centralized system, since it relies on the drones for the initial image recognition, although its overall execution time is better than the fully centralized system's, because it reduces both data transfer and cloud resource contention.
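The eager straggler detection described above can be sketched as a simple rule: a still-running task whose elapsed time exceeds some multiple of the median latency of its peers is flagged for speculative re-execution. The threshold and function below are illustrative assumptions, not HiveMind's exact policy:

```python
import statistics

def find_stragglers(elapsed, threshold=3.0):
    """Return ids of tasks running longer than threshold x median latency.

    `elapsed` maps task id -> seconds the task has been running so far;
    completed tasks should be excluded by the caller. Flagged tasks can
    then be relaunched speculatively on uncontended resources.
    """
    if len(elapsed) < 2:
        return []  # not enough peers to establish a baseline
    median = statistics.median(elapsed.values())
    return [tid for tid, t in elapsed.items() if t > threshold * median]

# Most image-recognition lambdas finish within tens of msec; task 7 is stuck.
running = {1: 0.04, 2: 0.05, 3: 0.06, 7: 2.1}
print(find_stragglers(running))  # -> [7]
```

A relative threshold like this scales naturally with swarm size: as more tasks are spawned, the median baseline becomes more robust, which matters since larger swarms make prompt straggler handling more critical.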

VII. CONCLUSIONS

We have presented HiveMind, a scalable and performant coordination control platform for IoT swarms. HiveMind uses a centralized controller to improve decision quality, load balancing, and fault tolerance, and leverages a serverless framework to ensure fast and cost-efficient cloud execution. HiveMind additionally employs lightweight on-board pre-processing to reduce network bandwidth congestion and scale to larger swarms. We evaluated HiveMind on a 16-drone swarm and showed that it achieves better performance, efficiency, and



reliability than decentralized platforms, and better network efficiency and scalability than fully centralized systems. We also showed that HiveMind seamlessly handles failures, as well as straggler cloud tasks, to improve performance predictability.



REFERENCES

[1] L. Tong, Y. Li, and W. Gao, "A hierarchical edge cloud architecture for mobile computing," in Proc. of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM). San Francisco, 2016.

[2] A. Singhvi, S. Banerjee, Y. Harchol, A. Akella, M. Peek, and P. Rydin, "Granular computing and network intensive applications: Friends or foes?" in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, ser. HotNets-XVI. New York, NY, USA: ACM, 2017, pp. 157-163.

[3] “Who will control the swarm?” https://platformlab.stanford.edu/news.php.

[4] Y. Han, X. Wang, V. C. Leung, D. Niyato, X. Yan, and X. Chen, "Convergence of edge computing and deep learning: A comprehensive survey," arXiv:1907.08349 [cs.NI], Jul. 2019.

[5] G. Vasarhelyi, C. Viragh, G. Somorjai, T. Nepusz, A. E. Eiben, and T. Vicsek, "Optimized flocking of autonomous drones in confined environments," Science Robotics, vol. 3, no. 20, Jul. 2018.

[6] M. A. Estrada, S. Mintchev, D. L. Christensen, M. R. Cutkosky, and D. Floreano, "Forceful manipulation with micro air vehicles," Science Robotics, vol. 3, no. 23, 2018.

[7] “Mit creates control algorithm for drone swarms,” https://techcrunch.com/2016/04/22/mit-creates-a-control-algorithm-for-drone-swarms/.

[8] “Fleets of drones could aid searches for lost hikers,” http://news.mit.edu/2018/fleets-drones-help-searches-lost-hikers-1102.

[9] M. Almeida, H. Hildmann, and G. Solmaz, "Distributed uav-swarm-based real-time geomatic data collection under dynamically changing resolution requirements," ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLII-2/W6, pp. 5-12, 08 2017.

[10] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, "The architectural implications of autonomous driving: Constraints and acceleration," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '18. Williamsburg, VA, USA: ACM, 2018, pp. 751-766. [Online]. Available: http://doi.acm.org/10.1145/3173162.3173191

[11] S.-C. Lin, "Cross-layer system design for autonomous driving," Doctoral Dissertation, The University of Michigan, 2019.

[12] S. Allen, C. Aniszczyk, C. Arimura, B. Browning, L. Calcote, A. Chaudhry, D. Davis, L. Fourie, A. Gulli, Y. Haviv, D. Krook, O. Nissan-Messing, C. Munns, K. Owens, M. Peek, and C. Zhang, "CNCF serverless whitepaper v1.0," CNCF Technical Report.

[13] R. N. Akram, K. Markantonakis, K. Mayes, O. Habachi, D. Sauveron, A. Steyven, and S. Chaumette, "Security, privacy and safety evaluation of dynamic and static fleets of drones," in Proc. of the IEEE/AIAA 36th Digital Avionics Systems Conference (DASC). 2017.

[14] F. Faticanti, F. D. Pellegrini, D. Siracusa, D. Santoro, and S. Cretti, "Swarm coordination of mini-uavs for target search using imperfect sensors," arXiv:1810.04442, 2018.

[15] A. L. Alfeo, M. Cimino, N. D. Francesco, A. Lazzeri, M. Lega, and G. Vaglini, "Swarm coordination of mini-uavs for target search using imperfect sensors," Intelligent Decision Technologies, IOS Press, vol. 12, no. 2, pp. 149-162, 2018.

[16] D. Albani, D. Nardi, and V. Trianni, "Field coverage and weed mapping by uav swarms," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2017.

[17] Z. Yuan, J. Jin, L. Sun, K.-W. Chin, and G.-M. Muntean, "Ultra-reliable iot communications with uavs: A swarm use case," IEEE Communications Magazine, vol. 56, no. 12, December 2018.

[18] M. Campion, P. Ranganathan, and S. Faruque, "Uav swarm communication and control architectures: a review," Journal of Unmanned Vehicle Systems, vol. 7, no. 2, pp. 93-106, 2019.

[19] M. Dogar, A. Spielberg, S. Baker, and D. Rus, "Multi-robot grasp planning for sequential assembly operations," in IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, 2015.

[20] S. Nasser, A. Barry, M. Doniec, G. Peled, G. Rosman, D. Rus, M. Volkov, and D. Feldman, "Fleye on the car: Big data meets the internet of things," in Proc. of the 14th International Conference on Information Processing in Sensor Networks (IPSN). Seattle, 2015.

[21] L. Sanneman, D. Ajilo, J. DelPreto, A. Mehta, S. Miyashita, N. A. Poorheravi, C. Ramirez, S. Yim, S. Kim, and D. Rus, "A distributed robot garden system," in IEEE International Conference on Robotics and Automation (ICRA), 2015.

[22] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou, "An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems," in Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.

[23] Y. Gan, Y. Zhang, K. Hu, Y. He, M. Pancholi, D. Cheng, and C. Delimitrou, "Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices," in Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.

[24] Y. Zhang, Y. Gan, and C. Delimitrou, "uqSim: Enabling Accurate and Scalable Simulation for Interactive Microservices," in Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), March 2019.

[25] S. Fouladi, R. S. Wahby, B. Shacklett, K. V. Balasubramaniam, W. Zeng, R. Bhalerao, A. Sivaraman, G. Porter, and K. Winstein, "Encoding, fast and slow: Low-latency video processing using thousands of tiny threads," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, Mar. 2017, pp. 363-376.

[26] "AWS Lambda," https://aws.amazon.com/lambda.
[27] "Azure Functions," https://azure.microsoft.com/en-us/services/functions/.
[28] "Google Cloud Functions: Event-driven serverless compute platform," https://cloud.google.com/functions/.
[29] "Fission: Serverless functions for Kubernetes," http://fission.io.
[30] "OpenLambda," https://open-lambda.org.
[31] "OpenWhisk: Open source serverless cloud platform," http://openwhisk.apache.org/.
[32] S. Hendrickson, S. Sturdevant, T. Harter, V. Venkataramani, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Serverless computation with OpenLambda," in 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16). Denver, CO: USENIX Association, Jun. 2016.

[33] J. Hellerstein, J. Faleiro, J. Gonzalez, J. Schleier-Smith, V. Sreekanti, A. Tumanov, and C. Wu, "Serverless computing: One step forward, two steps back," in Proc. of the Conference on Innovative Data Systems Research (CIDR), 2018.

[34] E. van Eyk, L. Toader, S. Talluri, L. Versluis, A. Uta, and A. Iosup, "Serverless is more: From paas to present cloud computing," IEEE Internet Computing, vol. 22, no. 5, Sep./Oct. 2018.

[35] W. Lloyd, S. Ramesh, S. Chinthalapati, L. Ly, and S. Pallickara, "Serverless computing: An investigation of factors influencing microservice performance," in Proc. of the IEEE International Conference on Cloud Engineering (IC2E). 2018.

[36] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis, "Pocket: Elastic ephemeral storage for serverless analytics," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). Carlsbad, CA: USENIX Association, Oct. 2018, pp. 427-444. [Online]. Available: https://www.usenix.org/conference/osdi18/presentation/klimovic

[37] L. Baresi, D. Filgueira Mendonca, and M. Garriga, "Empowering Low-Latency Applications Through a Serverless Edge Computing Architecture," in 6th European Conference on Service-Oriented and Cloud Computing (ESOCC), ser. Service-Oriented and Cloud Computing, F. D. Paoli, S. Schulte, and E. B. Johnsen, Eds., vol. LNCS-10465. Oslo, Norway: Springer International Publishing, Sep. 2017, pp. 196-210, part 6: Internet of Things and Data Streams.

[38] T. Lynn, P. Rosati, A. Lejeune, and V. Emeakaroha, "A preliminary review of enterprise serverless cloud computing (function-as-a-service) platforms," in IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 2017.

[39] I. E. Akkus, R. Chen, I. Rimac, M. Stein, K. Satzke, A. Beck, P. Aditya, and V. Hilt, "SAND: Towards high-performance serverless computing," in 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, Jul. 2018, pp. 923-935.

[40] E. Oakes, L. Yang, D. Zhou, K. Houck, T. Harter, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "SOCK: Rapid task provisioning with serverless-optimized containers," in 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, Jul. 2018, pp. 57-70.



[41] J. R. Gunasekaran, P. Thinakaran, M. T. Kandemir, B. Urgaonkar, G. Kesidis, and C. R. Das, "Spock: Exploiting serverless functions for slo and cost aware resource procurement in public cloud," in Proc. of IEEE Cloud Computing (CLOUD). 2019.

[42] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch, "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in Proceedings of SOCC. 2012.

[43] C. Delimitrou and C. Kozyrakis, "Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Houston, TX, USA, 2013.

[44] ——, "Quasar: Resource-Efficient and QoS-Aware Cluster Management," in Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, USA, 2014.

[45] ——, "HCloud: Resource-Efficient Provisioning in Shared Cloud Systems," in Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2016.

[46] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, "Heracles: Improving resource efficiency at scale," in Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, OR, 2015.

[47] L. Barroso and U. Hoelzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers, 2009.

[48] L. Barroso, "Warehouse-scale computing: Entering the teenage decade," in Proceedings of the 38th Intl. Symposium on Computer Architecture, San Jose, CA, 2011.

[49] C. Delimitrou, D. Sanchez, and C. Kozyrakis, "Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters," in Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC), August 2015.

[50] C. Delimitrou and C. Kozyrakis, "Bolt: I Know What You Did Last Summer... In The Cloud," in Proceedings of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2017.

[51] ——, "Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon," in IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.

[52] ——, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," in ACM Transactions on Computer Systems (TOCS), vol. 31, no. 4. December 2013.

[53] Y. Gan and C. Delimitrou, "The Architectural Implications of Cloud Microservices," in Computer Architecture Letters (CAL), vol. 17, no. 2, Jul-Dec 2018.

[54] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein, "From laptop to lambda: Outsourcing everyday jobs to thousands of transient functional containers," in 2019 USENIX Annual Technical Conference (USENIX ATC 19). Renton, WA: USENIX Association, Jul. 2019, pp. 475-488.

[55] S. Fouladi, J. Emmons, E. Orbay, C. Wu, R. S. Wahby, and K. Winstein, "Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol," in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). Renton, WA: USENIX Association, Apr. 2018, pp. 267-282.

[56] D. R. Horn, K. Elkabany, C. Lesniewski-Lass, and K. Winstein, "The design, implementation, and deployment of a system to transparently compress hundreds of petabytes of image files for a file-storage service," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, Mar. 2017, pp. 1-15.

[57] "AWS IoT Greengrass," https://aws.amazon.com/greengrass/.
[58] "Parrot AR.Drone 2.0 Elite Edition," https://www.parrot.com/us/drones/parrot-ardrone-20-elite-edition.
[59] "Linksys EA8300 Max-Stream AC2200 Tri-Band WiFi Router," https://www.linksys.com/us/p/P-EA8300/.
[60] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms.
[61] "ardrone-autonomy," https://ardrone-autonomy.readthedocs.io/en/latest/.
[62] "Cylon.js," https://cylonjs.com/.
[63] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,

G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," in Proceedings of OSDI, 2016.

[64] https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc.

[65] "COCO: Common Objects in Context," http://cocodataset.org/.
[66] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[67] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, "Sparrow: Distributed, low latency scheduling," in Proceedings of SOSP. Farmington, PA, 2013.

[68] J. Dean and L. A. Barroso, "The tail at scale," in CACM, vol. 56, no. 2, pp. 74-80.
