
Vulcan: Lessons in Reliability of Wear OS Ecosystem through State-Aware Fuzzing

ABSTRACT
As we look to use Wear OS (formerly known as Android Wear) devices for fitness and health monitoring, it is important to evaluate the reliability of the operating system and its ecosystem. Prior work has analyzed the OS in isolation without considering the state of the device, hardware sensors, or the interaction between the wearable device and the mobile device. In this paper, we study the impact of the state of a device on the overall reliability of the apps running in Wear OS. We propose a stateful approach which infers a state model for the app and allows our fuzz testing-based tool, Vulcan, to drive the app to any specific state. We then use this model in Vulcan to uncover new vulnerabilities and app crashes on the wearable. We show how system reboots can be triggered through user-level fuzzing without any elevated system privileges. We then develop a proof-of-concept solution that can be implemented on a per-app basis without requiring any framework changes.

Keywords
Wearable, Android, Wear OS, Reliability, Fuzzer

1. INTRODUCTION
Google’s Wear OS has been one of the dominant OSes for wearable devices, which include smartwatches, smart glasses, and fitness trackers. Wear OS is based on Android with similar features such as kernel, programming model, app life cycle, development framework, etc. However, unlike Android apps on mobile devices, wearable apps have somewhat different characteristics, as does Wear OS relative to Android. The fundamental driver of the differences is the limited display area and the difficulty of executing interactive work (such as typing) on the wearable device. As a result, wearable apps tend to have more Service components (which run in the background) relative to Activities (which run in the foreground), have fewer GUI components, and are tethered to a counterpart app on the mobile [1]. Moreover, wearable devices are often fitted with a variety of sensors (e.g., heart rate monitor, pulse oximeter, and even electrocardiogram or ECG sensor), each with its own device driver software. Device drivers have been found to be reliability weak spots in the server

world [2, 3]. Further, wearable apps are generally more dependent on these sensors, have unique interaction patterns, both intra-device and inter-device (between the smartwatch and the phone it is paired with), and are physical-context aware. These lead to distinctive software and software-hardware interaction patterns in wearable systems. As such systems are poised to be used for serious apps, such as clinical-grade health monitoring, it is important to understand the vulnerabilities in their software architecture and how best to mitigate them1.

Although there have been several previous works focusing on the reliability of the Android ecosystem [4–9], only a few study Wear OS [1, 10–13]. However, none of them consider the effects of inter-device communication (between the mobile device and the wearable device) and of sensor activities on the reliability of wearable apps, even though these are dominant software patterns in wearable systems. The authors in [12] presented a black-box fuzzing tool, QGJ, which defines a set of fuzzing campaigns to test the robustness of wearable apps. However, QGJ is agnostic about the state of a wearable app and generates a large number of inputs that are invalid and silently discarded by Wear OS. Consequently, it cannot trigger the most severe failures, and even for the ones that it can trigger, it is much less efficient at triggering them relative to our solution. Apart from QGJ, Monkey [14] and APE [9] are two tools for testing Android apps by generating different UI events, which simulate how a user might interact with the apps. However, we show that UI fuzzers by themselves are not effective in uncovering vulnerabilities in wearable systems because the UI interactions are fundamentally less complex here.

Our Solution: Vulcan
In this paper, we present Vulcan2, a state-aware fuzzer for wearable apps. Its workflow is shown in Figure 1 and

1We use the term “vulnerability” not only in the security sense, but for any bug that may naturally lead to failure or can be exploited through malicious intent.
2Vulcan is the Roman God of fire and metalworking, the blacksmith of the Gods. The aspiration is for our tool to forge more reliable wearable systems.


Edgardo Barsallo Yi, Heng Zhang, Kefan Xu, Amiya Maji, Saurabh Bagchi
Purdue University


includes the offline training and model building and the online fuzzing components. Vulcan brings three innovations, which work together to improve the fuzzing effectiveness. First, Vulcan separately triggers the intra-device and inter-device interactions, thus leveraging the insight that the system is more vulnerable when wearable-smartphone synchronization is taking place. Second, we define the notion of vulnerable states where bugs are more likely to be manifested. Vulcan steers the app to the vulnerable states and fuzzes the app in those states, which significantly increases the fault activation rate. This leverages the insight that reliability bottlenecks have been found in other platforms due to OS-sensor interactions. Third, Vulcan parses the Android Intent specification and creates a directed set of Intents which are injected into the wearable apps to effectively trigger bugs. Specifically, the created Intents are based on what action-data pairs are valid according to the specification, e.g., ACTION_DIAL can take either empty data or the URI of a phone number. Summarizing, Vulcan needs neither the source code of the wearable apps (or their mobile counterpart apps) nor root privilege on Wear OS. Yet it improves the efficiency of triggering bugs in wearable apps and for the first time causes device reboots through inputs from user-level apps.

Figure 1: Workflow of Vulcan. This shows the offline part whereby, through simulated UI events, the tool learns the finite state machine model of the app. In the online phase, the tool steers the app to the vulnerable states and injects messages or Intents in a directed manner, which leads to a higher fault activation rate and even system reboots. Offline, the collected logs are analyzed to deduplicate the failure cases and identify the manifestations and root causes.

We conduct a detailed study of the top 100 apps from the Google Play Store — in terms of share, the categories are, in order, health & fitness, maps and navigation, shopping, and personalization. Using Vulcan, we inject over 1M Intents across inter-device and intra-device interactions. We uncover some foundational reliability weak spots in the design of Wear OS and several

vulnerabilities that are common across multiple apps. We categorize the failure events into three classes—app not responding, app crash, and the most catastrophic failure, system reboot, which refers to the case where our injection, without elevated system-level privileges, is able to cause the entire wearable device to reboot, thus pointing to a vulnerability in the system software stack. Compared to the three prior tools, Monkey [14], APE [9], and QGJ [12], Vulcan is the only one that can trigger system reboot. Vulcan does this deterministically by sending fuzzed inputs from a synthetic user-level app. Further, for crash failures that both QGJ and Vulcan can trigger, Vulcan triggers them more efficiently, at 5.5X the rate of QGJ (Table 4).

We delve into the details to analyze the system reboots in Section 5.6. We find that these reboots are due to the Watchdog monitor of Wear OS over-aggressively terminating the Android System Server, a critical system-level process, under various conditions of resource contention. In our experiments, we found 18 system reboots from 13 apps, while none of the prior techniques uncovered any. We show that the system reboots can be deterministically triggered and that they are correlated with a high degree of concurrency, such as in sensor activity or inter-device communication. We also design and develop a Proof-Of-Concept (POC) solution to show how system reboots can be avoided. We evaluate the solution's effectiveness and its usability, the latter through a small user study. In the user study, only 6.7% of the users felt that the solution significantly affected their use of a third-party calendar app on the wearable. Overall, our state-aware tool is more effective at triggering app crashes and is the only one to trigger system reboots, compared to state-agnostic fuzzers such as QGJ as well as UI fuzzers such as Monkey and APE.

Our results reveal several interesting root causes for the failures—an abundance of improper exception handling, the presence of legacy code in Wear OS copied from Android, and error propagation across devices. We find that the distribution of exceptions that cause app crashes is different from that in Android [6]. While NullPointerException is still a dominant culprit, its relative incidence has decreased, while IllegalArgumentException and IllegalStateException have become more numerous in wearable apps.

Our key contributions in this paper are:
1. State-aware fuzzing: We present a state-aware fuzzing tool for wearable apps. Our tool, Vulcan, can infer the state model of a wearable app without source code access, guide it to specific states, and then run targeted fuzzing campaigns. Our tool can be downloaded from [15] and used with the latest version of Wear OS.
2. Higher failure activation: We show that stateful fuzzing increases the fault activation rate compared to a state-agnostic approach and compare it to three


state-of-the-art tools, QGJ, Monkey, and APE. We find several novel and critical failure cases through these experiments and suggest software architecture improvements to make the wearable ecosystem more reliable.
3. System reboot: We demonstrate that it is possible to trigger system reboots deterministically through Intent injection, without elevated system-level privileges. We show that the Wear OS vulnerability is correlated with a high degree of concurrency in sensor activity and inter-device communication between mobile and wearable devices. Our POC solution shows an approach to preventing these system reboots without needing Wear OS framework changes.

The rest of the paper is organized as follows. After presenting an overview of Wear OS and testing approaches on Android in Section 2, we discuss the design of the fuzzing tool in Section 3. Then, in Section 4 we introduce the relevant aspects of our implementation and present the detailed experiments and results in Section 5. We discuss lessons and threats to validity in Section 6. Then we discuss related work and conclude the paper.

2. BACKGROUND
Wear OS apps, compared to their mobile counterparts,

have a minimalist UI design, more focus on services (which are background processes), and a sensor-rich application context. Although Wear OS supports standalone apps, some wearable apps are tightly coupled with their mobile counterparts. Those wearable apps frequently communicate with their companion apps on the mobile device to share data or to exchange IPC messages. In our evaluation of the top 100 apps, we find that approximately 40 either require a mobile companion app or are significantly improved by one.

Interaction in wearable apps can happen either intra-device or inter-device, the latter typically between the mobile and the wearable devices. We will refer to the intra-device interaction as “Intent” and the inter-device interaction as “Communication”. For communication, two types of messages may be exchanged: message passing (asynchronous) or data synchronization (synchronous). These are commonly used, respectively, to send commands or small data items, or to synchronize data among devices. The inter-device communication is also used for delegation of actions from the wearable to the mobile. An example is WiFi setup on the wearable, where the password entry happens on the mobile.

Intra-device interaction typically happens through the binder IPC framework [4] and takes the form of Intent messages. An Intent is an abstraction of an operation in the Android ecosystem. In an Intent message, the developer can specify many attributes such as a specific action (e.g., ACTION_CALL) and data (e.g., the phone number of the callee). For example, the main activity of a hiking app can periodically send Intent messages to a background service to request the GPS location to build up the hiking map of the user. In this paper, we evaluate the robustness of wearable apps by fuzzing both inter-device communication and intra-device Intents, under different intensities of sensor and communication activities.
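As a rough illustration of how such an explicit Intent can be injected from a host machine over ADB, the sketch below composes an `adb shell am start` command (the component name and phone number are hypothetical; the sketch only builds the command string rather than executing it):

```python
# Sketch: compose an adb command that injects an explicit Intent into a
# wearable app. The package/activity name and phone number are made up.
def build_intent_cmd(component, action, data=None):
    cmd = ["adb", "shell", "am", "start", "-n", component, "-a", action]
    if data is not None:
        cmd += ["-d", data]  # optional data URI, e.g. a tel: number
    return " ".join(cmd)

cmd = build_intent_cmd("com.example.wear/.DialActivity",
                       "android.intent.action.DIAL",
                       "tel:5551234")
print(cmd)
```

The same pattern extends to Services (`am startservice`) and broadcasts (`am broadcast`).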

3. DESIGN
A key motivation behind Vulcan comes from the fact

that the behaviors of wearable apps are often state (or context) dependent, where state can either be the state of the app (e.g., the operation being performed) or the state of the device (e.g., the physical location or movement of the wearable device). We hypothesize that this translates to failures of the wearable software stack that are also state-dependent. To validate this hypothesis, we design and implement Vulcan with the following objectives:
1) to define a state model that is widely applicable to most wearable apps,
2) to fuzz popular apps using this state model, and
3) to verify if it can activate more faults compared to state-agnostic fuzzers or UI fuzzers.

In the subsequent sections, we elaborate on how each of these objectives is met.

3.1 Overview
In order to fuzz wearable apps with awareness

of their states, Vulcan needs to build a state model for each app (offline phase in Figure 1). The target app and a UI testing tool run concurrently on the wearable device. The UI testing tool is used to explore the different possible UI interactions that a user might have with the target app. During this exploration, Vulcan collects logs for later parsing to build the state model. We adopt the widely used Droidbot UI testing tool and run it until a time threshold. Then we terminate the training process, parse the collected logs, and identify the intra-device interaction events (through Intents) and the inter-device interaction events (through communication messages). We track how the app behaves on receipt of each of these events and use this information to build the state model of the app (Fig. 2). We also record these events as a way for Vulcan to steer the app to specific states. In Vulcan, we fuzz the app in the states where it is involved in a large number of tasks, implying a high degree of concurrency. This is based on prior insight from server-class platforms and distributed systems that failures are correlated with a high degree of concurrency [16, 17]. This design feature allows us to increase the fault activation rate and thus to reduce the testing time, relative to state-agnostic fuzzers (as experimentally validated in Section 5.4). We analyze the logs and deduplicate the stack traces to identify the unique app crashes. The following sections describe the various components of Vulcan in detail.


3.2 State Model
During our initial fuzzing experiments with wearable

apps, we found that several apps crashed while accessing sensors. Moreover, sensors often determine the context of a wearable app (e.g., location or accelerometer readings). We therefore decided to use sensor activation as a characteristic of the wearable state. Another unique characteristic of wearable apps is that they are often tightly coupled with their mobile counterpart apps. This means that a wearable app needs a mobile app for interactive or power-hungry operations. After multiple sensor data items are collected at the wearable, they are sent to the mobile, or control commands are sent from the mobile app to the wearable app. This background data or control communication may lead to new bugs not yet seen in the mobile app. Therefore, in our model, a combination of sensor activation and communication activity together determines the state of the app.

Figure 2: Simplified state transition diagram for a fitness app that has a wearable and a mobile device component. There are two sensors and inter-device communication, leading to a state vector of three variables.

As an example, consider a fitness app that helps people keep track of their workouts (Figure 2). These apps commonly consist of two paired apps, installed on the mobile and the wearable device. Before starting the workout, a person uses the mobile to indicate to the app that she is going to start (START_TRIP). The same happens once the workout is done; the person uses the mobile app to stop tracking the workout session (STOP_TRIP).

The state of the app is defined by a vector <COMM, GPS, HR>, where GPS and HR are the GPS and heart rate sensors that can have values 0 (deactivated) or 1 (activated), and COMM is the data synchronization activity with values 0 (stopped) or 1 (started). Initially, both sensors and COMM are stopped and the app has state <0, 0, 0>. After receiving the START_TRIP message, the app goes to state <1, 0, 0>. At this point the user may decide to start monitoring her heart rate and activate the HR sensor (state <1, 0, 1>). If she starts tracking her movement using GPS, the state changes to <1, 1, 1>. After some time, she may choose to stop either of the sensors. Finally, the app goes back to state <0, 0, 0> after receiving STOP_TRIP.
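A minimal sketch of this state model follows. The event names mirror the example above, but the transition function is our illustration of the idea, not Vulcan's actual implementation:

```python
# Sketch: the fitness-app state model as a small FSM. The state vector is
# (COMM, GPS, HR); each event toggles the relevant variable.
def step(state, event):
    comm, gps, hr = state
    if event == "START_TRIP":
        return (1, gps, hr)
    if event == "STOP_TRIP":   # stopping the trip resets everything
        return (0, 0, 0)
    if event == "HR_ON":
        return (comm, gps, 1)
    if event == "GPS_ON":
        return (comm, 1, hr)
    return state               # unknown events leave the state unchanged

s = (0, 0, 0)
for ev in ["START_TRIP", "HR_ON", "GPS_ON", "STOP_TRIP"]:
    s = step(s, ev)
print(s)
```

Replaying such an event sequence is exactly how a fuzzer can drive the app from the idle state to any target state.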

In general, if there are N sensors in the device, then the state of an app can be represented by a tuple <COMM, S1, S2, ..., SN>. Even though wearable devices can have in excess of 20 sensors, leading to many possible states in theory, we found that, in practice, any given app uses only a few sensors. Therefore, we restrict the dimensionality of our state definition to include only the

sensors used in the app. Further, apps also often activate sensors in groups, e.g., an app with 3 sensors can go from state <0, 0, 0, 0> to <1, 1, 1, 1> without traversing through other combinations of the sensors, leading to a further reduction of the state space. We found that this relatively coarse definition of states reduces the state-space size but is sufficient for our state-aware fuzzing, which targets vulnerable states and leads to more fault activations.

3.3 Target States for Fuzzing
It is well known in the area of server reliability that

a higher degree of concurrency can lead to higher failure rates [16, 17] in software. During preliminary experiments, we also observed the negative effect of concurrent activities on the overall reliability of the device. We therefore use the degree of concurrency of an app as an indicator of its vulnerability. The underlying hypothesis is that by fuzzing the app in its vulnerable states, we can discover more bugs in a shorter time. We show empirical validation of this hypothesis in Section 5.4.

In Vulcan, higher concurrency corresponds to sensor activity on, and handling of communication messages at, the wearable, each of which runs in a separate thread. Therefore, we define the vulnerable states of a given app as the states whose degree of concurrency is at least a certain percentage of the maximum observed during the training runs. Vulcan steers the app to these states and then fuzzes communication messages or Intents in those states. Our approach has the potential shortcoming that bugs that are not related to concurrency will be missed. One way to mitigate this problem would be to broaden the definition of vulnerable states to include other factors, such as the use of deprecated API calls.
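The selection rule can be sketched as follows; the 80% threshold and the concurrency counts are illustrative assumptions, not values reported in the paper:

```python
# Sketch: pick "vulnerable" states whose degree of concurrency is within a
# threshold of the maximum observed during training. Threshold is illustrative.
def vulnerable_states(concurrency, threshold=0.8):
    """concurrency: dict mapping state tuple -> observed degree of concurrency."""
    peak = max(concurrency.values())
    return {s for s, c in concurrency.items() if c >= threshold * peak}

# Hypothetical training observations for the fitness-app states:
obs = {(0, 0, 0): 1, (1, 0, 0): 2, (1, 0, 1): 3, (1, 1, 1): 4}
print(vulnerable_states(obs))
```

Lowering the threshold trades fuzzing focus for coverage of less concurrent states.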

3.4 Fuzzing Strategies
Vulcan applies two different fuzzing strategies depending on the state of the app. If the app is in a state that has ongoing communication, i.e., COMM = 1, then both intra-device interaction (Intent fuzzing) and inter-device interaction (Communication fuzzing) are fuzzed. In contrast, if no communication is ongoing, i.e., COMM = 0, only intra-device fuzzing is performed. This is because when COMM = 0, communication messages from the mobile to the wearable are silently ignored. Below, we detail each of the two types of fuzzing and the method to automate the process.
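In sketch form, the strategy selection reduces to a check on the COMM bit of the state vector (function and campaign names are our illustration):

```python
# Sketch: choose fuzzing campaigns based on the COMM bit of <COMM, S1, ..., SN>.
def campaigns_for(state):
    comm = state[0]
    if comm == 1:
        return ["intent", "communication"]
    return ["intent"]  # COMM = 0: inter-device messages would be silently ignored

print(campaigns_for((1, 0, 1)))
print(campaigns_for((0, 0, 0)))
```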

3.4.1 Intent Injection
While fuzzing intra-device interaction, we generate

explicit Intents targeting Activities and Services within an app. We modify the Action, Data, and Extra fields of Intents based on three pre-defined strategies: semi-valid (use a combination of known Action and Data values, but one that is incorrect according to the actual usage), random (use random strings as Action or Data), and empty (keep one of the fields blank). Vulcan generates semi-valid injections following the API specification [18]. The Intents are composed using valid combinations of Action, Data, and Extra fields. For instance, using the semi-valid strategy, the fuzzer will generate Intent{component=WearAct, act=action.RUN} when the app is expecting Intent{component=WearAct, act=fitness.TRACK}.
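The three mutation strategies can be sketched roughly as below; the candidate action pool and the use of (action, data) pairs to stand in for full Intents are our simplifications:

```python
import random

# Sketch: mutate an (action, data) pair per the three strategies. Real
# Intents carry more fields; the action pool here is illustrative.
KNOWN_ACTIONS = ["action.RUN", "fitness.TRACK", "android.intent.action.VIEW"]

def mutate(action, data, strategy, rng=random):
    if strategy == "semi-valid":
        # swap in a known action that is wrong for this target
        choices = [a for a in KNOWN_ACTIONS if a != action]
        return (rng.choice(choices), data)
    if strategy == "random":
        return ("".join(rng.choice("abcdef") for _ in range(8)), data)
    if strategy == "empty":
        return (action, None)
    raise ValueError(strategy)

print(mutate("fitness.TRACK", "uri://trip/1", "empty"))
```

The semi-valid branch is what distinguishes Vulcan's directed injection from purely random fuzzing: the mutated value is still drawn from the specification, so the runtime does not silently discard it.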

3.4.2 Communication Fuzzing
We fuzz the communication messages between the

mobile and the wearable based on two strategies: random (use a random message) or empty (use a null value as the message). For example, for a message like [/getOffDismissed, SomeMessage], the fuzzer replaces the message with [/getOffDismissed, null] using the empty fuzzing strategy. We did not add a semi-valid fuzzing strategy here because we are unable to read the structures of inter-device communication messages, due to the lack of source code and the lack of any standard format for inter-device messages. To the best of our knowledge, no other Android testing tool fuzzes inter-device communication messages. We found that communication fuzzing led to some novel failures, including error propagation from a wearable app to a mobile app, and a higher rate of failures, as we discuss in Section 5.4.
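A rough sketch of the two message-level strategies, treating each message as a (path, payload) pair as in the example above (the random payload generator is our assumption):

```python
import random

# Sketch: fuzz a captured (path, payload) message. The path is kept intact so
# the wearable app's listener still receives the mutated message.
def fuzz_message(path, payload, strategy, rng=random):
    if strategy == "empty":
        return (path, None)  # e.g. [/getOffDismissed, null]
    if strategy == "random":
        junk = bytes(rng.randrange(256) for _ in range(16))
        return (path, junk)
    raise ValueError(strategy)

print(fuzz_message("/getOffDismissed", "SomeMessage", "empty"))
```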

3.4.3 Automated Intent Specification Generation
During our initial fuzzing experiments, we randomly

generated Intents, but we found that a majority of the randomly generated Intents were discarded by the Android Runtime, since the target apps only subscribed to specific Intents. To maximize the effectiveness of our semi-valid Intent injection, we wanted to generate the Intents following the Android specification [18]. We found that the Android documentation specifies the valid fields (Data, Action, Category, Extra) for each Intent and the possible valid combinations of those fields. Depending on the purpose of the Intent, the valid type for each of its fields is also specified. For example, an Intent could specify the action ACTION_PACKAGE_REPLACED to broadcast an updated app package. This Intent will have the data field set to the name of the package and an extra field called EXTRA_UID that indicates a unique integer identification number associated with the updated package. This motivates us to intelligently generate Intents by combining only valid fields instead of randomized values for the fields (e.g., as used in QGJ).

We automate the Intent specification extraction process by designing a text analysis tool. This tool takes the Android specification HTML page [18] as input and iterates through this page, as well as any other hyperlinked pages, to compose all possible Intents.

Two core text extraction techniques are used to extract information from the HTML pages: lexical matching and pattern matching. Lexical matching is used to extract specific keywords such as “Constant Value”.

The text analysis tool produces a structured JSON file describing the Intent specifications, and this file is used by Vulcan to randomly inject Intents that follow the Android specification. We evaluate the correctness of this automated tool by manually reviewing the generated specifications. The manual process was performed independently by three graduate students to minimize mistakes. We found that the text analysis tool achieved a reasonably high accuracy of 93.5%. We measured accuracy through a conservative calculation whereby, if the tool got any of the fields wrong in an action, it was taken to have made an error on that action. This automation gives Vulcan the capability to dynamically adapt to changes in the Android Intent specification in future API releases.
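As a rough illustration of the extraction step, the sketch below pulls action constants out of a documentation-style HTML fragment and emits a small JSON spec. The HTML snippet and the JSON field layout are invented; the real tool walks the actual Android documentation pages:

```python
import json
import re

# Sketch: pattern-match action constants in a documentation-style HTML
# fragment, keyed off the "Constant Value" keyword. The HTML is hypothetical.
html = """
<p>ACTION_PACKAGE_REPLACED</p>
<p>Constant Value: "android.intent.action.PACKAGE_REPLACED"</p>
"""

def extract_constants(page):
    # find every quoted string following the "Constant Value" keyword
    return re.findall(r'Constant Value:\s*"([^"]+)"', page)

spec = {"actions": extract_constants(html)}
print(json.dumps(spec))
```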

3.5 Vulcan Components
Vulcan consists of an ensemble of tools that perform

various functions at different stages of the offline training or the online testing process. The overall architecture of our testing tool and the interaction between its components is shown in Figure 3. Since Vulcan targets both interaction within the wearable device (inter-process interaction via Intents) and across devices (mobile and wearable), its components are spread across the two devices.

3.5.1 Offline Training
Training runs for each target app are done using the

Droidbot UI testing tool [19], which uses the Android Debug Bridge (ADB) to send UI events to the wearable device. Training is initiated with a script on the host computer. It is run for a pre-defined time interval (2 hours for each app), by which time we estimate (and empirically validate for a sampled set of apps) that a majority of the app's states are explored. Further, if new states are discovered during online operation, we add them to the state model.
Instrumentation. The instrumentation module captures and logs inter-device communication messages that are sent from the mobile to the wearable. These are later used to generate mutated communication messages during the fuzzing phase. A prominent design goal for Vulcan was to make it less intrusive and to keep the wearable device and the wearable apps unmodified. This is the primary reason why the instrumentation module resides on the mobile device and intercepts the communication to alter messages before they are signed by the Android runtime. Such an instrumentation infrastructure does not yet exist on the wearable side. If we were to alter the messages from the wearable to the


Figure 3: Vulcan architecture. The diagram shows the location of the various components and their interaction in each of the two stages: (a) the tools collect information from the log traces to build a model that represents the app; (b) the model is used to guide fuzzing campaigns targeting the wearable app.

mobile in transit, or at the mobile end, these messages would fail the Android check and be silently dropped.

Monitor. The Monitor captures log traces from various sources such as logcat [20] and dumpsys [21]. From these it calculates the values of the state variables, such as, for each sensor, whether it is active or not.

Model Builder. It parses the logs collected by the Monitor and the traces of UI events from Droidbot and determines the state model of the wearable app. If multiple UI events occur sequentially without causing any change to the state variables, then Vulcan creates one transition event as the concatenation of the multiple UI events. This design helps in part to contain the explosion of the state space.
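The transition-collapsing rule can be sketched as follows. This is a minimal illustration under our own assumptions (representing state snapshots as tuples of sensor flags; all names are hypothetical), not the actual Model Builder code.

```python
def build_transitions(trace):
    """Collapse consecutive UI events that leave the state variables
    unchanged into a single transition event.

    `trace` is a list of (ui_event, state_after) pairs, where
    `state_after` is a hashable snapshot of the state variables
    observed after the event was injected. Returns a list of
    (src_state, [ui_events...], dst_state) edges.
    """
    transitions = []
    current_state = ("INIT",)  # assumed known initial state
    pending = []               # UI events accumulated since the last state change
    for event, state_after in trace:
        pending.append(event)
        if state_after != current_state:
            # State changed: the whole pending sequence becomes one edge.
            transitions.append((current_state, list(pending), state_after))
            current_state = state_after
            pending = []
    return transitions

trace = [
    ("tap_start", ("INIT",)),     # no state change yet
    ("tap_confirm", ("hr_on",)),  # heart-rate sensor activates
    ("swipe_left", ("hr_on",)),   # no change
    ("tap_stop", ("INIT",)),      # sensor deactivates
]
edges = build_transitions(trace)
# Two edges; the first concatenates tap_start and tap_confirm.
```

The concatenation keeps the model small: only event sequences that actually change a state variable appear as edges.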

3.5.2 Online Fuzzing

During online testing, fuzzing happens in an automated manner managed by the Orchestrator, given the state model and the replay traces.

Orchestrator. The Orchestrator is the core of Vulcan and determines what actions will be performed next. Given a target app and its state model, the Orchestrator can run the app in one of two modes: replay mode and fuzzing mode. In the replay mode, the Orchestrator guides an app to a target state. This is used to steer the wearable app to a vulnerable state as described earlier. In the fuzzing mode, it directs the Fuzzer to inject inter- or intra-device interaction messages.

Replay Module. The replay module is responsible for guiding the target app to a given state. For this, it uses the state model of the app and the sequence of UI events needed to drive the app to a target state, both of which have been determined offline and are provided by the Model Builder. This ordered set of UI events is injected into the mobile device using ADB [22]. In case of a crash of the wearable app during fuzzing, the replay module also steers the app back to the last valid state before the failure.

Fuzzer. The Fuzzer in Vulcan has two submodules, one for each of the fuzzing strategies described in Section 3.4. Intra-device interaction fuzzing (Intents) is

done by a service component in the wearable device, whereas inter-device communication fuzzing is done using the instrumentation module in the mobile device as described above. Depending on the state of the wearable (i.e., whether communication is in progress between the mobile and the wearable), one or both of the fuzzing strategies may be applied. In either of these campaigns, if the app state changes while fuzzing, the control goes back to the Orchestrator to decide the next fuzzing strategy to execute.

Monitor. As in the offline mode, the Monitor keeps track of the current state of the wearable app during the fuzz testing. If during fuzzing it discovers a new state (one not observed during the training phase), it updates the state model with this new state and records the events that caused the transition. The new state is not immediately fuzzed during the current iteration, but is considered in future runs if it meets the definition of a vulnerable state.
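One way to derive the replayed event sequence from the state model is a breadth-first search to the target state. The sketch below is an illustration with a hypothetical model encoding, not our replay module's code.

```python
from collections import deque

def replay_sequence(model, start, target):
    """Return the shortest sequence of UI-event batches that drives the
    app from `start` to `target`, via breadth-first search over the
    state model. `model` maps state -> list of (events, next_state).
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for events, nxt in model.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [events]))
    return None  # target unreachable in the learned model

# Hypothetical three-state model of a fitness app:
model = {
    "idle":      [(["tap_start", "tap_confirm"], "hr_on")],
    "hr_on":     [(["tap_gps"], "hr_gps_on"), (["tap_stop"], "idle")],
    "hr_gps_on": [],
}
path = replay_sequence(model, "idle", "hr_gps_on")
```

Each batch in the returned path would then be injected via ADB, in order, to reach the target (e.g., vulnerable) state.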

4. IMPLEMENTATION

This section describes major aspects of the implementation. We implemented Vulcan using Java (Intent fuzzer), Python (scripts for training and execution), and JavaScript (communication fuzzer). Vulcan does not require access to the source code of the target app and can work with binaries. This is crucial for the adoption of any tool and is different from many tools in this space, such as ACTEve [23], JPF-Android [24], and AW UIAutomator [11]. All our instrumentation and fuzzing are done without any modifications to the wearable app itself.

Data Collection and State Model Building. The fact that Vulcan is designed to fuzz apps without access to the source code posed a challenge for inferring the state model. The events relevant to our approach for creating the state model are interactions between devices and hardware sensor activations, and these are observed using adb logcat, dumpsys, and the communication traces collected from the callback functions (instrumentation). The communication between a wearable device



and a mobile device is visible via the Google Play services framework, relying on the Wearable Data Layer API. We collected the log traces of inter-device interactions by enabling logging in Android's WearableListenerService. The sensor events are readily available by invoking the dumpsys service through adb. This service provides the current status of each sensor and, if it is activated, which process is interacting with it.

Dynamic Binary Instrumentation. In order to fuzz inter-device communication messages, we alter the messages at the source (i.e., on the mobile side). This was achieved by dynamically instrumenting the binary of the mobile app and registering a callback for all its Data Layer API calls. The instrumentation module was implemented using the Frida dynamic binary instrumentation framework [25].

Fuzzing Components. Vulcan is deployed across both the mobile and the wearable device as shown in Fig. 3. Our inter-device interaction fuzzer is implemented as JavaScript code that is invoked by Frida when the mobile app communicates with the wearable. Our intra-device interaction (Intent) fuzzer was implemented as a separate app.

5. EXPERIMENTS

For the experiments we used 4 Google Nexus 6P smartphones with Android 8.1, paired with 4 different smartwatches (two Huawei Watch 2 and two Fossil Gen 4). All the smartwatches ran Wear OS 2.8 (released in July 2019). We evenly distributed the 100 apps that we tested, 25 apps to each smartwatch. During the experiments, the mobile was connected to a computer via USB to collect the log traces using adb, while the smartwatch was connected to adb over Bluetooth.

5.1 Experiment Setup

We evaluated our fuzzing tool on the top 100 wearable apps available in the Google Play Store in the Android Wear section. The three predominant categories that these apps fall under are (in order) Health and Fitness, Tools, and Productivity. The ranking criteria are as per Google's definition: the number of downloads is a primary one, but there are several other factors as well. We downloaded each app from the Google Play Store and installed it on the wearable devices. If an app had a companion app, we installed the companion app on the mobile device. Table 1 shows detailed information on the selected apps, which combine for a total of 861 activities and 604 services.

Table 1: Overview of selected apps for the evaluation, grouped by category. These are the top 100 apps on the Google Play Store between Mar and Aug 2019.

Category               # Apps   # Act.   # Serv.
Books and references        1        3        1
Communication               7      149      173
Entertainment               7       14        0
Finance                     5       69       30
Food and drink              1        7        6
Game arcade                 1        2        0
Game puzzle                 1        2        2
Health and fitness         16      160       85
Lifestyle                   2       18        1
Maps and navigation         5       24       16
Medical                     1        3        1
Music and audio             7       58       57
News and magazines          2        7        7
Personalization             5       62       29
Photography                 1        6        4
Productivity               10       82       65
Shopping                    2       24       14
Sports                      5       25       16
Tools                      11       95       51
Travel and local            4       16       11
Weather                     6       35       35
Total                     100      861      604

5.2 Failure Categories

From our experiments, we found failures to have three types of manifestations, all of which are user-visible failures. ANR: This condition is triggered when the foreground app stops responding, making the system appear frozen. The system generates a log entry containing the message ANR (App Not Responding). App Crash: This corresponds to an app crash. The system generates a log entry with the message FATAL EXCEPTION. System Reboot: This is the most severe of the failures and triggers a soft reboot of the Android runtime, causing the device to reboot. The system generates a log entry with the message android.os.DeadSystemException. These manifestations are equivalent to the Catastrophic, Restart, and Abort failures proposed in [26] for classifying the robustness of server OSes. Another interesting class of errors, silent data corruption, is kept outside the scope of this paper because detecting it requires expert users to write specific data validity checks. We capture the stack trace following a failure message and identify unique failures by looking for matching stack traces (after removing non-deterministic fields like the pid). For our results, we only count unique failures after deduplication.
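The deduplication step can be sketched as follows. The pid removal is from the text above; the other normalization rules (addresses, timestamps) are assumptions for illustration.

```python
import re

def normalize(stack_trace):
    """Strip non-deterministic fields (pids/tids, memory addresses,
    timestamps) so that identical failures hash to the same signature."""
    trace = re.sub(r"\b[pt]id[:=]?\s*\d+", "PID", stack_trace)
    trace = re.sub(r"0x[0-9a-fA-F]+", "ADDR", trace)
    trace = re.sub(r"\d{2}:\d{2}:\d{2}\.\d+", "TIME", trace)
    return trace

def unique_failures(stack_traces):
    """Count failures after deduplication by normalized stack trace."""
    return len({normalize(t) for t in stack_traces})

traces = [
    "FATAL EXCEPTION pid: 1234 at com.app.Foo.bar (lock 0x05804bd6)",
    "FATAL EXCEPTION pid: 5678 at com.app.Foo.bar (lock 0x0bb02a11)",
    "FATAL EXCEPTION pid: 1234 at com.app.Baz.qux (lock 0x05804bd6)",
]
# The first two traces differ only in pid and lock address, so they
# collapse into one unique failure; the third is a distinct failure.
```
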

5.3 State-aware Injection Campaigns

One of the primary objectives of Vulcan is to evaluate the effectiveness of a state-aware fuzzing strategy in comparison with state-agnostic fuzzing. To this end, we ran our experiments with the fuzzing strategies described in Sec. 3.5. Based on the state of an app, our fuzzer either alters intra-device interaction messages (Intents) or inter-device interaction messages (control commands or data synchronization messages from the mobile to the wearable). We hypothesize that altering the global state of the device (how many and which sensors are activated) can lead to different failure manifestations. One approach could be to stress the sensors from within the target app, e.g., querying the same sensors but at a higher rate. However, doing so would require alteration of the target app, and our usage model does not allow modification of apps. Therefore, we decided to stress the apps, and thus the device, by activating sensors through an external app, introduced earlier as the "Manipulator app". The Manipulator app either activates the same set of sensors as those in an app (thereby causing contention for those sensors) or it activates the complementary set of sensors (increasing load on the SensorManager, the base class that lets apps access a device's sensors). Thus, we ran Vulcan under three scenarios as shown in Figure 4.
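The Manipulator app's sensor selection for these scenarios reduces to simple set operations. The sketch below uses hypothetical sensor names and function names; it is an illustration of the selection rule, not our Manipulator code.

```python
def manipulator_targets(app_sensors, device_sensors, mode):
    """Choose which sensors the Manipulator app activates.

    mode "same"          -> contend for the sensors the target app uses
                            (Expt III);
    mode "complementary" -> activate every other sensor on the device,
                            loading the SensorManager (Expt II);
    anything else        -> device state not modified (Expt I).
    """
    app = set(app_sensors)
    device = set(device_sensors)
    if mode == "same":
        return app & device
    if mode == "complementary":
        return device - app
    return set()

device = {"heart_rate", "accelerometer", "gyroscope", "step_counter"}
app = {"heart_rate", "step_counter"}
same = manipulator_targets(app, device, "same")
comp = manipulator_targets(app, device, "complementary")
```
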

Some of the apps in our sample set did not use any sensor; therefore, Expt III is not applicable to them. Due to this, we split the results into two categories: (a) apps that use sensors, for which all three experiments are run, and (b) apps that do not use sensors, for which only experiments I and II are run. We found that despite not using any sensors, these apps still showed a larger number of ANRs and reboots for Expt II compared to Expt I.

Baselines

We compare the results of Vulcan against three state-of-the-art tools for testing Android or Wear OS: QGJ [12], the Android Monkey tool [14], and the recently released APE [9]; the first two are state-agnostic while the last is state-aware. QGJ, a black-box fuzz testing tool, injects malformed Intents into wearable apps based on four campaigns which vary the Intent's input (e.g., Action, Data, or Extras). Monkey is a widely used UI fuzzing tool that generates pseudo-random UI events and injects them into the device or an emulator. We assigned an equal probability to each of the possible events in Monkey. Finally, APE, built on top of Monkey, is a GUI injection tool that dynamically refines the statically determined state model to increase the effectiveness of Monkey's testing strategies.

5.4 State-Aware Injection Results

Figure 5 shows the results obtained from our experiment. In general, we found that our state-aware fuzzing with Vulcan generated more crashes than the state-agnostic fuzzers. Moreover, Vulcan is the only one that is able to trigger system reboots in the wearable device. However, Monkey generated more ANRs than Vulcan, since it sends events at a higher rate; Vulcan pauses injections after each set and invokes the Garbage Collector to prevent overloading the system. To see the effect of altering the global state of the device, according to Experiments I-III, we look at Tables 2 (for all apps, both using and not using sensors) and 3 (only for the apps that use sensors). As we can observe, ANRs and system reboots increase when the sensors on the device are stressed. We omit Expt III from Table 2 since it covers fewer apps than the other experiments and cannot be compared directly. Table 3 depicts the distribution of failure manifestations for all

three experiments.

We also found differences in failure manifestations across Experiments I, II, and III, which supports our hypothesis that fuzzing in the vulnerable states can trigger new bugs on the wearable app. Although the number of crashes remains the same across the experiments (Table 3), the number of ANRs is much higher for Experiments II and III. This is expected considering the Manipulator app is stressing the SensorService on the wearable device. Similarly, the number of reboots is higher in Expt II and III compared to Expt I (Tables 2 and 3). This indicates that sensor activation (even through an external app) has a negative effect on the overall reliability of the system. A state-aware fuzzer like Vulcan can automatically guide a wearable app to a vulnerable state and then fuzz that state, thus increasing the rate at which bugs can be discovered.

Table 2: Failure manifestations for all apps. This indicates that state-aware fuzzing leads to more ANRs, crashes, and reboots. For Vulcan, the values are broken down in parentheses as (Intent fuzzing, communication fuzzing).

Tool                #ANR   #Crashes     #Reboots
Vulcan (Expt. I)      12   44 (39, 5)    3 (3, 0)
Vulcan (Expt. II)     20   45 (40, 5)   12 (12, 0)
QGJ                   12   38            0
Monkey                57   17            0
APE                   20   15            0

Table 3: Failure manifestations for apps that use sensors. This indicates that the state-aware approach can trigger more failures than state-agnostic approaches like QGJ or Monkey. For Vulcan, all the failures correspond to the Intent fuzzing strategy.

Tool                 #ANR   #Crashes   #Reboots
Vulcan (Expt. I)        1         10          2
Vulcan (Expt. II)      12         10          3
Vulcan (Expt. III)      9         10          3
QGJ                     2          8          0
Monkey                 18          5          0
APE                    10          0          0

Figure 6 shows a comparison of the unique and overlapping crashes triggered by each of the tools. It can be seen that QGJ and Vulcan have a large degree of overlap, though there are 8 crashes that were not triggered by QGJ. On the contrary, Monkey and APE have very little overlap among their crashes because they generate UI events very differently, and they have no overlap with the Intent fuzzing tools, QGJ and Vulcan. This suggests using UI and Intent fuzzing tools as complementary approaches for testing Wear OS apps.

To see the distribution of exception types that resulted in the crashes across the experiments, we look at Figure 7. It can be seen that NullPointerException dominates as the leading cause of failures, followed by IllegalArgumentException and IllegalStateException. This ordering is quite consistent across the



Figure 4: Variation of sensor activations across experiments.

Exp.   Device State
I      Not modified from target app
II     Activate complementary sensors
III    Activate same sensors as in target app

Figure 5: Distribution of failure manifestations for all the tools. Vulcan leads to more crashes and reboots than the others.

Figure 6: Unique crashes. Comparison of unique crashes across tools after normalizing stack traces.

Figure 7: Distribution of exception types across runs that resulted in a crash. NullPointerException dominates across all tools.

four fuzzing-based tools. Previous work [5, 6] has shown that input validation bugs (null pointer and illegal argument exceptions) have been a common category in Android apps over the years. Our results indicate that a similar situation persists in the current generation of wearable apps. Most of these crashes can be avoided by proper exception handling code in the apps. IDEs such as Android Studio could enforce better exception handling: if a system service (e.g., Google Fit in our case) throws an exception, the IDE should check for the absence of exception handling code and raise a compile-time error.

From Tables 2 and 3, we see that Monkey triggers a high number of ANRs. We performed an experiment to understand the reason behind this. When Monkey is run with its defaults, there is no delay between UI events and this overloads the app. When we ran Monkey with a delay of 500 ms between UI events, to make it comparable to the inter-Intent delay of Vulcan, the number of ANRs came down significantly. For example, for the 7 most ANR-prone apps, the number of ANRs went down from 16 with the default execution to 4 with the delay (the number of crashes also incidentally went down, from 12 to 8, while no system reboots were triggered). We also observed that ANRs were rather non-deterministic, unlike crashes and system reboots. This is understandable because ANRs correspond to individual apps' responses to overload conditions, which tend to be variable themselves.

5.5 Efficiency of QGJ vs Vulcan

Here we take a look at the efficiency of triggering the crashes in Table 4 for QGJ versus Vulcan (Expt. I). Efficiency is measured by how many Intents have to be injected to cause a single failure. We replicate the four fuzzing campaigns of QGJ described in [12].

Table 4: Efficiency of triggering failures between QGJ and Vulcan.

Tool                                         QGJ         Vulcan
# Intents injected                           3,390,010   631,049
# crashes                                    7,220       1,990
# unique crashes                             38          39
# unique crashes per 100K injected Intents   1.12        6.18 (5.52X)

QGJ is able to trigger comparable numbers of unique crashes through Intent fuzzing as Vulcan (Table 2, Expt. I). However, its efficiency in triggering the crashes is much lower: Table 4 shows that Vulcan is 5.5X more efficient than QGJ in inducing the crashes. The efficiency difference comes from the fact that QGJ creates fuzzed Intents without awareness of whether the combination of Intent fields is valid according to the Android Intent specification [18]. Hence, a significant number of Intents sent by QGJ are rejected by Wear OS. Vulcan, on the other hand, parses and follows the specification to create specific Intents so that far fewer Intents are rejected. Thus, Vulcan shows better efficiency in finding bugs and is the only one that can trigger the most catastrophic of the failure categories, system reboot.
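The per-100K normalization in Table 4 is a simple calculation; the small script below reproduces the arithmetic (the 5.52X factor follows from the rounded per-100K rates).

```python
def crashes_per_100k(unique_crashes, intents_injected):
    """Unique crashes normalized per 100K injected Intents."""
    return unique_crashes / intents_injected * 100_000

qgj = crashes_per_100k(38, 3_390_010)     # ~1.12
vulcan = crashes_per_100k(39, 631_049)    # ~6.18
# Relative efficiency, computed from the rounded rates as in Table 4:
factor = round(round(vulcan, 2) / round(qgj, 2), 2)  # 5.52
```
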

5.6 System Reboots

In this section, we show the cases where Vulcan causes the wearable device to reboot and delve into the root causes of these system reboots. We found 18 system reboots across 13 apps, out of the 100 evaluated apps. As part of responsible disclosure, we have shared the failure details with the OS vendor. System reboots are identified by the exception DeadSystemException

in the Logcat logs, which essentially means that the Android System Server is dead. The Android System Server is the middleware that supports user-level apps such as a browser. It creates most of the core components in Android, such as the Activity Manager, Package Manager, etc. If the System Server is down, user apps will not



function, so the whole system has to reboot to restore normal operation.

In the system reboots that Vulcan triggered, the System Server is killed by the Android Watchdog. The Watchdog is a protection mechanism to prevent the wearable from becoming unresponsive; it monitors the system processes in a while(true) loop with a fixed delay between iterations. The Watchdog, as well as all the components shown in Figure 8, run as threads within the System Server process. If the Watchdog finds a component that has hung for more than 60 seconds (the default timeout value), it calls Process.killProcess(Process.myPid()) to send a SIGKILL signal. Because the monitored components share the same process id with the Watchdog as well as the System Server, this essentially means killing the System Server, which causes the device to reboot.
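The Watchdog's behavior can be approximated by the following simulation. This is an illustration only, under the assumption that thread liveness is tracked as heartbeat timestamps; the names and structure are ours, not the Android source.

```python
WATCHDOG_TIMEOUT = 60  # seconds: the Watchdog's default hang threshold

def watchdog_check(monitored, now, kill):
    """One iteration of the Watchdog's monitoring loop.

    `monitored` maps a System Server thread name to the timestamp of its
    last heartbeat. If any thread has been hung for longer than the
    timeout, the Watchdog invokes `kill` (standing in for
    Process.killProcess(Process.myPid())). Since the monitored threads,
    the Watchdog, and the System Server share one process id, this takes
    down the System Server and the device reboots.
    """
    hung = [name for name, last in monitored.items()
            if now - last > WATCHDOG_TIMEOUT]
    if hung:
        kill()
    return hung

# Simulated heartbeats: ActivityManager last responded 61 s ago.
threads = {"ActivityManager": 100.0, "PackageManager": 158.0}
killed = []
hung = watchdog_check(threads, now=161.0, kill=lambda: killed.append(True))
```
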

By delving into the logs and stack traces of those cases, we categorize the 18 reboots into two groups: 8 reboots are due to a blocked lock and 10 are due to busy waiting. Blocked lock means that a thread is trying to acquire a lock but is blocked because the lock is held by another thread. Busy waiting means a thread is waiting for some other thread, such as the UI thread (android.ui) or an I/O thread (android.io), to finish a read or write operation.

We give an example of each of the two categories. For blocked lock, Figure 9 shows the stack trace where the ActivityManager is waiting for the lock (0x05804bd6 in line 266) that is held by thread 82. This wait exceeds the 60-second default timeout value. As a thread created by the System Server, the ActivityManager shares the same process id with the System Server. There is no way for the Watchdog to kill just the ActivityManager; it has to kill the System Server process, thereby rebooting the system. For busy waiting, Figure 10 shows an example: here the ActivityManager calls __epoll_pwait at line 513, a Linux function that waits for an I/O event. This wait exceeds the 60-second threshold in the Watchdog, so the System Server is again killed, thereby rebooting the device.

5.7 Mitigation of System Reboots

Keeping all the core system services as threads in the System Server makes for a simple design. However, it comes at the price of unnecessary reboots of the whole system because of the death of a single service. With an increasing number of system services, the reboots will become more frequent and reinitializing the System Server process will take longer.

Our solution strategy is to use an Intent buffer to mitigate the Intent injection-based attack. Since Wear OS is closed-source, we cannot modify the framework, so we implement a proof-of-concept (POC) solution. We implement an Intent buffer as a middle layer for Intent handling. All the Intents sent from one app to another are stored in the buffer. We then have a Fetcher process fetch a single Intent periodically and forward the Intent to the destination(s). When the buffer is full, the system rejects any incoming Intent to keep the device operational, thus preventing resource starvation from a burst of Intents. Our experiments show that a period of 2.5 seconds (or more) between two consecutive fetches is fully effective and removes all system reboots. It also does not overflow the buffer, for a buffer capacity of 5K Intents.
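The buffer-and-Fetcher design can be sketched as follows. This is a minimal simulation with hypothetical names, not our actual POC code; the capacity and fetch period come from the experiment above.

```python
from collections import deque

class IntentBuffer:
    """POC Intent buffer: a bounded queue between sender and receiver.

    Intents beyond the capacity are rejected to prevent resource
    starvation from a burst of Intents; a Fetcher drains one Intent per
    period (2.5 s was sufficient in our experiments) and forwards it to
    the destination app via the `deliver` callback.
    """
    def __init__(self, capacity=5000, deliver=print):
        self.queue = deque()
        self.capacity = capacity
        self.deliver = deliver

    def send(self, intent):
        if len(self.queue) >= self.capacity:
            return False  # buffer full: reject to keep the device alive
        self.queue.append(intent)
        return True

    def fetch_one(self):
        """Called by the Fetcher once per period (e.g., every 2.5 s)."""
        if self.queue:
            self.deliver(self.queue.popleft())

delivered = []
buf = IntentBuffer(capacity=3, deliver=delivered.append)
accepted = [buf.send(f"intent-{i}") for i in range(5)]  # burst of 5
buf.fetch_one()  # one Fetcher period elapses
```

Rejecting at the buffer boundary trades completeness (some Intents are dropped under attack) for availability, which is the desired behavior when the alternative is a full device reboot.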

Next, we discuss a small user study to quantify the effect on usability of the delay interposed by the Fetcher. First, we developed a (trivial) wearable app, which communicates with a third-party calendar app [27] using Intents. Our app can interact with the calendar app using two approaches: either our POC solution or the standard mechanism. Then, we asked 15 random people from our department to use our app to schedule events on the smartwatch. Each person repeated the process three times for each approach, run as a blind study. Finally, we asked the users to evaluate their experience with the app. As Table 5 shows, most of the users experienced no difference (60.0%) or little difference (33.3%) while using the app with our POC solution versus the standard mechanism. In contrast, only 6.7% noted a significant difference between the two approaches. This shows that an appropriately tuned delay for our Intent buffer solution can potentially mitigate system reboot attacks while not affecting the usability of wearable apps. A full validation will require a larger user study with a greater and more diverse set of wearable activities.

Table 5: User experience reported while using the smartwatch with our POC solution.

User Experience          % Response
No difference            60.0%
Little difference        33.3%
Significant difference    6.7%

Relation to prior Android reboot and mitigation. In prior work [7], it was shown that Android suffers from coarse-grained locking in its system services, whereby seemingly innocuous apps can cause starvation of system resources by acquiring these locks in a loop. This would cause the Android device to reboot. We tried to replicate this failure mode on Wear OS and could not. On investigating, we found that their attacker app, designed for Android 5, relies on successfully receiving two Broadcast Intents: PACKAGE_REMOVED and ACTION_PACKAGE_REPLACED. But since Android 8.0, receiving Broadcast Intents has been severely restricted for user-level apps, and these two specifically can only be received by system apps. In contrast, our Intent fuzzer is a user-level app and can cause the system reboots without system-level privileges.



Figure 8: Various system service threads monitored by the Watchdog. The system will reboot if any thread is killed by the Watchdog.

Figure 9: Example stack trace of the blocked-lock system reboots. ActivityManager is waiting for the lock at line 266.

Figure 10: Example stack trace of the busy-waiting system reboots. ActivityManager calls an I/O wait function at line 513.

Figure 11: Effect of sensor load on system reboots: (a) varying the number of sensors activated; (b) varying the sensor sampling period. Fewer Intents are needed to trigger a system reboot as (a) the number of sensors being activated increases or (b) the sensor sampling period decreases. The bars represent the range from the min to the max values.

The mitigations proposed in [7] are finer-grained locking in Android system services, a smarter Watchdog (one that does not automatically kill a non-responsive process), and restricting the number of resources (e.g., sensors) an app can acquire. All their defenses require modification of the Android framework and are therefore not applicable to (closed-source) Wear OS. In comparison, our solution can be implemented on a per-app basis (an Intent buffer at the receiving app), or at the framework level with vendor cooperation.

5.8 Effect of Load on System Reboots

Sensor activity. Our central hypothesis is that the reliability of the software system is compromised by the degree of concurrency. Here, we quantify this effect by varying the number of sensors that are activated and measuring how many Intents are needed to trigger a device reboot. For the experiment, we choose a popular health and fitness app, the Cardiogram app [28], one that has been implicated in our earlier experiments with system reboots. We vary the number of sensors activated (by the Manipulator app) from 0 to the maximum available in the device, 22 sensors (15 hardware sensors and 7 software sensors), while concurrently injecting fuzzed Intents into the Cardiogram app. In each trial, we keep the rate of activation of each sensor fixed at once every 60 ms. We repeated the experiment 5 times for each number of activated sensors. Importantly, device reboots were triggered deterministically in every single trial. As shown in Figure 11(a), as the number of sensors activated increases, fewer Intents are needed to trigger a system reboot. This number drops sharply for the first 5 sensors, and then drops slowly. This result indicates that with increasing concurrency, the vulnerability of the system to the most catastrophic of failures, system reboots, increases.

Sensor sampling period. In this part, we show the effect of changing sensor sampling periods on the ease of triggering system reboots. We inject Intents into the Cardiogram app while concurrently using our Manipulator app to sample all 22 sensors in sync, with varying periodicities. We show the results in Figure 11(b). For each sensor sampling period, we run 5 trials and in each trial, we measure the number of injected Intents until the wearable device reboots. The Android Developer documentation [29] specifies four different sampling periods and we use all of them in our experiments. In every single trial, the device eventually rebooted (approximately in 20-30 minutes), emphasizing the deterministic nature of this failure. The result can be explained by the fact that sensor sampling consumes system resources. Thus, the faster the sensors are sampled, the more resources they consume and, therefore, the fewer injected Intents are required to trigger system reboots.

5.9 Failure Case Studies

In this section, we present various failure categories that highlight some nuances of the Android Wear ecosystem and provide motivation for future research.

Android-Wear OS Code Transfer. Among the crashes observed, we found that every time our fuzzer injected a KEYCODE_SEARCH event, the wearable app crashed. This is a widespread problem: among the 100 tested apps, 95 crashed when this event was injected. We found empirically that no other key event triggered a similar vulnerability. On Android, the KEYCODE_SEARCH event is used to launch a text search inside the app (apps can also silently ignore the event if they have not implemented this feature). However, in the case of Wear OS, when the event is injected, the apps crash with an IllegalStateException. Digging deeper, we found that Wear OS, when inheriting code from Android, removed the SearchManager service class, probably assuming that text searches are not suited to the form factor of wearables, but did not silently drop the corresponding Intent. This behavior can be exploited by a malicious app (albeit one with system-level privileges)



to crash other apps. The root problem can be handled at the Wear OS level by simply discarding the search event in the base Activity class (WearableActivity) instead of trying to launch the SearchManager.

Error Propagation. Vulcan found several vulnerabilities related to ongoing synchronization between the mobile and the wearable (refer to Table 2). It is noteworthy that some of these failures even propagated back to the mobile device, crashing not only the wearable app but also its mobile counterpart. The crash of the wearable app triggers a communication from the wearable to the mobile carrying the Exception object. However, this causes a NullPointerException in the mobile app when it tries to invoke the getMessage method. This exception is caused by a mismatch between the serialized object sent by the wearable and that expected by the mobile: the wearable sends the exception wrapped as a GSON object, while the mobile device expects a Throwable object. This can be handled better by strongly typing the data shared between devices and throwing a compile-time error in case of a type mismatch.
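Until such stronger typing exists, a receiving side can avoid this class of propagation by validating the payload's shape before use. The sketch below is an illustration in Python with assumed field names (the real apps exchange serialized Java objects), not the apps' code.

```python
import json

def handle_crash_report(payload):
    """Defensively parse a crash report received from the wearable.

    The failure described above arises when the mobile side assumes one
    serialized type (a Throwable) while the wearable sends another (a
    GSON/JSON object). Validating the payload's shape before use keeps
    an unexpected type from crashing the mobile counterpart as well.
    """
    try:
        report = json.loads(payload)
    except (TypeError, json.JSONDecodeError):
        return "malformed crash report"
    # Only dereference fields after confirming the expected shape.
    message = report.get("message") if isinstance(report, dict) else None
    return message if message is not None else "unknown wearable failure"

ok = handle_crash_report('{"message": "NullPointerException in sync"}')
bad_type = handle_crash_report(None)          # unexpected payload type
bad_shape = handle_crash_report('"a string"')  # valid JSON, wrong shape
```
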

6. LESSONS AND THREATS

Lessons Learned. From the empirical evidence, we have drawn several lessons for the better design of Wear OS apps and framework. We have mentioned these at the respective locations together with the insights, and we summarize them here. We have argued for better exception handling being enforced by IDEs to reduce the incidence of crashes (Section 5.4). We have demonstrated how a better design of the System Server process can mitigate the problem of system reboots, and demonstrated a POC solution (Section 5.7). We have proposed how error propagation across devices can be reduced through stronger typing of the exception classes (Section 5.9).

Threats to Validity. The experiments presented in this paper have a few shortcomings that may bias our observations. First, our study is based on two different smartwatch models, each of which can have individual vendor-specific customizations. Second, our inter-device communication fuzzing is currently one-directional, for messages going from the mobile to the wearable. This is primarily to keep the wearable device unaltered and because the injection tool Frida only exists for Android. Third, even though we selected popular apps from different categories, our sample size of 100 can be considered small. However, this is a shortcoming of most fuzzing studies, which are usually time-consuming. Even with 100 apps, Vulcan tested more than 1,400 components and took more than two weeks for the experiments of Table 2, running on 4 devices in parallel.

7. RELATED WORK

Stateful Fuzzers. Most of the previous work on stateful fuzzing has focused on communication protocols, such as SIP or FTP [30–33]. A common approach is to infer a finite state machine by analyzing the network traffic. This model is then used to guide black-box fuzz testing of the communication messages, rather than the intra-protocol interactions.

Android. Since its release in 2008, there have been numerous studies on the reliability of Android OS. These works range from random fuzz testing tools based on UI events and system events [34,35], to fuzzers focused on Android IPC [6,36,37], model-based testing tools based on static analysis [38], and specialized approaches based on concolic testing [23]. Previous works based on model representations of the apps focused on GUI navigation [9,19,38–42], rather than a stateful approach of communication and sensors as in Vulcan.

Wear OS. Research on Wear OS focuses on the performance and the reliability of the OS itself [1,10,13,43]. These studies found inefficiencies in the OS because of deficiencies in the design. However, those design flaws were never tied to any vulnerabilities that made Wear OS prone to system reboots. Recent studies have addressed the need for testing tools for the Wear OS ecosystem [9,11,12]. Yi et al. [12] proposed QGJ, a testing tool for Wear OS that creates four campaigns to inject faulty Intents to the wearable. These campaigns are less efficient than the strategies of Vulcan and, being state-agnostic, QGJ is unable to trigger system reboots. To the best of our knowledge, Vulcan is the first tool that fuzz tests Wear OS using a stateful approach.
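The "infer a state machine, then fuzz messages valid for a given state" pattern common to the stateful protocol fuzzers above can be sketched in a few lines. This is an illustrative toy, not code from Vulcan or any of the cited tools; the state names and message types (INIT, HELLO, etc.) are made up. The key idea shown is that the fuzzer first replays valid messages to walk the model into a target state, and only then mutates a message that is legal in that state, so the input is not silently discarded.

```java
import java.util.*;

public class StatefulFuzzSketch {
    // Inferred model: state -> (accepted message type -> next state).
    // A real tool would learn this from captured traffic or logs.
    static final Map<String, Map<String, String>> FSM = Map.of(
        "INIT",   Map.of("HELLO", "READY"),
        "READY",  Map.of("DATA", "READY", "BYE", "CLOSED"),
        "CLOSED", Map.of()
    );

    static final Random RNG = new Random(42);

    // Walk the model to the target state by sending valid messages,
    // then emit one mutated message that is legal *in that state*.
    static List<String> fuzzInState(String target) {
        List<String> trace = new ArrayList<>();
        String state = "INIT";
        while (!state.equals(target)) {
            Map<String, String> out = FSM.get(state);
            String msg = out.keySet().iterator().next(); // follow any edge
            trace.add(msg);
            state = out.get(msg);
        }
        // Pick a state-appropriate message and corrupt it, e.g. with an
        // oversized payload, instead of sending a random blob.
        List<String> candidates = new ArrayList<>(FSM.get(state).keySet());
        String chosen = candidates.get(RNG.nextInt(candidates.size()));
        trace.add(chosen + "A".repeat(4096));
        return trace;
    }

    public static void main(String[] args) {
        List<String> trace = fuzzInState("READY");
        System.out.println(trace.get(0));          // valid prefix: HELLO
        System.out.println(trace.get(1).length()); // mutated message length
    }
}
```

The same shape underlies Vulcan's approach at the app level: the replayed prefix corresponds to steering the app to a target state from its inferred model, and the final mutated message corresponds to the state-aware fuzz input.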

8. CONCLUSION

We presented the design and implementation of Vulcan, a state-aware fuzzing tool for wearable apps. Vulcan can automatically build a state model for an app from its logs and steer the app to specific states by replaying the traces. It then launches state-aware fuzzing on the wearable app, targeting both inter-device communication and intra-device Intents. It focuses on what it determines to be vulnerable states, which are states with high degrees of concurrent activity. We found that state-aware fuzzing leads to more crashes of apps and reboots of the device compared to stateless fuzzing. We also found that it is possible to predictably reboot a wearable device from a user app, with no system-level privileges, by targeting specific states. We provide a proof-of-concept solution to mitigate system reboots. We draw lessons for improving the wearable ecosystem: better exception handling, type checking of communication between mobile and wearable devices, and diagnosing and terminating user-level components that cause starvation of sensor resources. Our future work will focus on automatic identification of the minimum working example for triggering failures, and on improvements to the exception handling mechanism through static and dynamic techniques.

9. REFERENCES

[1] R. Liu and F. X. Lin, “Understanding the characteristics of android wear os,” in Mobisys, 2016, pp. 151–164.

[2] J. LeVasseur, V. Uhlig, J. Stoess, and S. Gotz, “Unmodified device driver reuse and improved system dependability via virtual machines,” in OSDI, vol. 4, no. 19, 2004, pp. 17–30.

[3] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy, “Recovering device drivers,” ACM Transactions on Computer Systems (TOCS), vol. 24, no. 4, pp. 333–360, 2006.

[4] E. Chin, A. P. Felt, K. Greenwood, and D. Wagner, “Analyzing inter-application communication in android,” in Mobisys. ACM, 2011, pp. 239–252.

[5] A. K. Maji, K. Hao, S. Sultana, and S. Bagchi, “Characterizing failures in mobile oses: A case study with android and symbian,” in Software Reliability Engineering (ISSRE), 2010 IEEE 21st International Symposium on. IEEE, 2010, pp. 249–258.

[6] A. K. Maji, F. A. Arshad, S. Bagchi, and J. S. Rellermeyer, “An empirical study of the robustness of inter-component communication in android,” in DSN, 2012, pp. 1–12.

[7] H. Huang, S. Zhu, K. Chen, and P. Liu, “From system services freezing to system server shutdown in android: All you need is a loop in an app,” in ACM CCS, 2015, pp. 1236–1247.

[8] A. K. Iannillo, R. Natella, D. Cotroneo, and C. Nita-Rotaru, “Chizpurfle: A gray-box android fuzzer for vendor service customizations,” in ISSRE, 2017.

[9] T. Gu, C. Sun, X. Ma, C. Cao, C. Xu, Y. Yao, Q. Zhang, J. Lu, and Z. Su, “Practical gui testing of android applications via model abstraction and refinement,” in Proceedings of the 41st International Conference on Software Engineering. IEEE Press, 2019, pp. 269–280.

[10] X. Liu, T. Chen, F. Qian, Z. Guo, F. X. Lin, X. Wang, and K. Chen, “Characterizing smartwatch usage in the wild,” in Mobisys, 2017, pp. 385–398.

[11] H. Zhang and A. Rountev, “Analysis and testing of notifications in android wear applications,” in ICSE, 2017, pp. 347–357.

[12] E. Barsallo Yi, A. K. Maji, and S. Bagchi, “How reliable is my wearable: A fuzz testing-based study,” in DSN, 2018, pp. 410–417.

[13] H. Zhang, H. Wu, and A. Rountev, “Detection of energy inefficiencies in android wear watch faces,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 691–702.

[14] Android, “UI/Application Exerciser Monkey,” 2017. [Online]. Available: https://developer.android.com/studio/test/monkey.html

[15] “Vulcan: A wearable app fuzzing tool (anonymized repo),” Dec. 2019. [Online]. Available: https://github.com/vulcanisgreat/vulcan

[16] B. Lucia and L. Ceze, “Cooperative empirical failure avoidance for multithreaded programs,” ACM SIGPLAN Notices, vol. 48, no. 4, pp. 39–50, 2013.

[17] F. A. Bianchi, M. Pezze, and V. Terragni, “Reproducing concurrency failures from crash stacks,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 705–716.

[18] Google, “Android Developers. Intent Specification,” 2019. [Online]. Available: https://developer.android.com/reference/android/content/Intent

[19] Y. Li, Z. Yang, Y. Guo, and X. Chen, “Droidbot: a lightweight ui-guided test input generator for android,” in Software Engineering Companion (ICSE-C), 2017 IEEE/ACM 39th International Conference on. IEEE, 2017, pp. 23–26.

[20] Google, “Android Developers. Logcat,” 2019. [Online]. Available: https://developer.android.com/studio/command-line/logcat

[21] ——, “Android Developers. Dumpsys,” 2019. [Online]. Available: https://developer.android.com/studio/command-line/dumpsys

[22] ——, “Android Debug Bridge,” 2017. [Online]. Available: https://developer.android.com/studio/command-line/adb

[23] S. Anand, M. Naik, M. J. Harrold, and H. Yang, “Automated concolic testing of smartphone apps,” in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, 2012, p. 59.

[24] H. van der Merwe, B. van der Merwe, and W. Visser, “Verifying android applications using java pathfinder,” ACM SIGSOFT Software Engineering Notes, vol. 37, no. 6, pp. 1–5, 2012.

[25] O. Andre, “Frida,” 2018. [Online]. Available: https://www.frida.re

[26] P. Koopman, J. Sung, C. Dingman, D. Siewiorek, and T. Marz, “Comparing operating systems using robustness benchmarks,” in Proceedings of SRDS’97: 16th IEEE Symposium on Reliable Distributed Systems. IEEE, 1997, pp. 72–79.

[27] appfour. (2019, Nov) Calendar for wear os (android wear). [Online]. Available: https://play.google.com/store/apps/details?id=com.appfour.wearcalendar

[28] I. Cardiogram. (2019, Nov) Cardiogram: Wear os, fitbit, garmin, android wear. [Online]. Available: https://play.google.com/store/apps/details?id=com.cardiogram.v1&hl=en_US

[29] A. Developers. Sensors overview. [Online]. Available: https://developer.android.com/guide/topics/sensors/sensors_overview.html

[30] H. Gascon, C. Wressnegger, F. Yamaguchi, D. Arp, and K. Rieck, “Pulsar: stateful black-box fuzzing of proprietary network protocols,” in International Conference on Security and Privacy in Communication Systems. Springer, 2015, pp. 330–347.

[31] G. Banks, M. Cova, V. Felmetsger, K. Almeroth, R. Kemmerer, and G. Vigna, “Snooze: toward a stateful network protocol fuzzer,” in International Conference on Information Security. Springer, 2006, pp. 343–358.

[32] P. M. Comparetti, G. Wondracek, C. Kruegel, and E. Kirda, “Prospex: Protocol specification extraction,” in Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009, pp. 110–125.

[33] J. De Ruiter and E. Poll, “Protocol state fuzzing of tls implementations,” in USENIX Security Symposium, 2015, pp. 193–206.

[34] A. Machiry, R. Tahiliani, and M. Naik, “Dynodroid: An input generation system for android apps,” in FSE, 2013, pp. 224–234.

[35] H. Ye, S. Cheng, L. Zhang, and F. Jiang, “Droidfuzzer: Fuzzing the android apps with intent-filter tag,” in Proceedings of International Conference on Advances in Mobile Computing & Multimedia. ACM, 2013, p. 68.

[36] J. Burns, “Intent Fuzzer,” 2012. [Online]. Available: https://www.nccgroup.trust/us/about-us/resources/intent-fuzzer

[37] R. Sasnauskas and J. Regehr, “Intent fuzzer: crafting intents of death,” in WODA and PERTEA, 2014, pp. 1–5.

[38] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon, “Using gui ripping for automated testing of android applications,” in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE 2012. New York, NY, USA: ACM, 2012, pp. 258–261. [Online]. Available: http://doi.acm.org/10.1145/2351676.2351717

[39] W. Choi, G. Necula, and K. Sen, “Guided gui testing of android apps with minimal restart and approximate learning,” in ACM SIGPLAN Notices, vol. 48, no. 10. ACM, 2013, pp. 623–640.

[40] T. Su, G. Meng, Y. Chen, K. Wu, W. Yang, Y. Yao, G. Pu, Y. Liu, and Z. Su, “Guided, stochastic model-based GUI testing of android apps,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 245–256.

[41] K. Mao, M. Harman, and Y. Jia, “Sapienz: multi-objective automated testing for android applications,” in Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 2016, pp. 94–105.

[42] H. Zheng, D. Li, B. Liang, X. Zeng, W. Zheng, Y. Deng, W. Lam, W. Yang, and T. Xie, “Automated test input generation for android: towards getting there in an industrial case,” in Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017 IEEE/ACM 39th International Conference on. IEEE, 2017, pp. 253–262.

[43] R. Liu, L. Jiang, N. Jiang, and F. X. Lin, “Anatomizing system activities on interactive wearable devices,” in APSys, 2015, pp. 1–7.