15
Scalable and Precise Taint Analysis for Android Wei Huang Google [email protected] Yao Dong Ana Milanova RPI {dongy6,milana2}@rpi.edu Julian Dolby IBM Research [email protected] ABSTRACT We propose a type-based taint analysis for Android. Con- cretely, we present DFlow, a context-sensitive information flow type system, and DroidInfer, the corresponding type in- ference analysis for detecting privacy leaks in Android apps. We present novel techniques for error reporting based on CFL-reachability, as well as novel techniques for handling of Android-specific features, including libraries, multiple entry points and callbacks, and inter-component communication. Empirical results show that our approach is scalable and precise. DroidInfer scales well in terms of time and memory and has false-positive rate of 15.7%. It detects privacy leaks in apps from the Google Play Store and in known malware. 1. INTRODUCTION Android is the most popular platform on mobile devices. As of the second quater of 2014, Android has reached 84.4% share of the global smartphone market [19]. Android’s success is partly due to the enormous number of applications available at the Google Play Store, as well as other third-party app stores. However, Android apps often collect sensitive data such as location and phone state, usually for the purpose of tracking and targeted advertising. In this paper we consider a threat model where an app, legitimate or malicious, obtains sensitive data and leaks this data to either logs or the network. Logs are an issue, because until Android 4.0 any app that held the READ LOG permission could read all logs. We track log flows, but we emphasize network flows (e.g., flows of the device identifier to the Internet through an Http request), which present a more pertinent and challenging problem. Taint analysis detects flows from sensitive data sources (e.g., location, phone state) to untrusted sinks (e.g., logs, the Internet). Many researchers have tackled taint analysis for Android. Dynamic analyses such as Google Bouncer [11], TaintDroid [4], DroidScope [41], CopperDroid [32], and Aura- sium [40] instrument the app bytecode and/or use customized execution environment to monitor the transition of sensitive Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. data. Unfortunately, dynamic analysis slows execution and typically lacks code coverage. Static taint analysis detects privacy leaks without running the app. There has been considerable effort on static taint analysis, with the majority of work focusing on dataflow and points-to-based approaches [23, 42, 20, 10, 22, 7, 1]. Yet a solution remains elusive. FlowDroid, a highly-precise, context-, flow-, field-, object- sensitive and lifecycle-aware static taint analysis for An- droid [1], is the state-of-the-art. Unfortunately, FlowDroid is computationaly- and memory-intensive. Further, while it re- ports numerous log flows in apps from the Google Play Store, it reports no network flows. This is surprising, given the common knowledge that apps track their users pervasively. We propose type-based taint analysis for Android lever- aging previous work on type-based taint analysis for web applications [16]. Our approach is modular and composi- tional. It can analyze any given set of classes. Modular analysis is particularly suitable for Android apps because 1) the Android app is an “open” program with multiple entry points through callbacks, and 2) it uses large libraries that can be suitably handled with conservative defaults. The anal- ysis requires annotations only on sources and sinks. Once the sources and sinks are built into annotated libraries, Android apps are analyzed without any input from the user. Concretely, we propose DFlow, a context-sensitive infor- mation flow type system and DroidInfer, the corresponding type inference analysis. In contrast to FlowDroid, Droid- Infer is lightweight and runs in 2 minutes on average, within a memory footprint of 2GB. It uncovers numerous network flows in apps from the Google Play Store and in known malware. DroidInfer posts an F-measure of 0.88 on DroidBench [7, 1], the standard for evaluating static taint analysis for Android. FlowDroid’s F-measure is 0.89. DroidInfer scales because it completely avoids points-to analysis. It is precise because in essence it is CFL-reachability computation, a highly-precise analysis technique [33]. An important contribution of our work is that it explains source- sink flows intuitively in terms of CFL-reachability paths. This paper makes the following contributions: DFlow, a context-sensitive information flow type system and DroidInfer, the corresponding worst-case cubic inference analysis which amounts to CFL-reachability. DroidInfer works on Application Package files (APKs). Effective handling of Android-specific features: “open” programs with multiple entry points, callbacks and large libraries, and a technique that improves precision in the handling of inter-component communication.

Scalable and Precise Taint Analysis for Android

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable and Precise Taint Analysis for Android

Scalable and Precise Taint Analysis for Android

Wei HuangGoogle

[email protected]

Yao Dong Ana MilanovaRPI

{dongy6,milana2}@rpi.edu

Julian DolbyIBM Research

[email protected]

ABSTRACTWe propose a type-based taint analysis for Android. Con-cretely, we present DFlow, a context-sensitive informationflow type system, and DroidInfer, the corresponding type in-ference analysis for detecting privacy leaks in Android apps.We present novel techniques for error reporting based onCFL-reachability, as well as novel techniques for handling ofAndroid-specific features, including libraries, multiple entrypoints and callbacks, and inter-component communication.Empirical results show that our approach is scalable andprecise. DroidInfer scales well in terms of time and memoryand has false-positive rate of 15.7%. It detects privacy leaksin apps from the Google Play Store and in known malware.

1. INTRODUCTIONAndroid is the most popular platform on mobile devices.

As of the second quater of 2014, Android has reached 84.4%share of the global smartphone market [19]. Android’s successis partly due to the enormous number of applications availableat the Google Play Store, as well as other third-party appstores. However, Android apps often collect sensitive datasuch as location and phone state, usually for the purpose oftracking and targeted advertising.

In this paper we consider a threat model where an app,legitimate or malicious, obtains sensitive data and leaksthis data to either logs or the network. Logs are an issue,because until Android 4.0 any app that held the READ LOGpermission could read all logs. We track log flows, but weemphasize network flows (e.g., flows of the device identifierto the Internet through an Http request), which present amore pertinent and challenging problem.

Taint analysis detects flows from sensitive data sources(e.g., location, phone state) to untrusted sinks (e.g., logs, theInternet). Many researchers have tackled taint analysis forAndroid. Dynamic analyses such as Google Bouncer [11],TaintDroid [4], DroidScope [41], CopperDroid [32], and Aura-sium [40] instrument the app bytecode and/or use customizedexecution environment to monitor the transition of sensitive

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

data. Unfortunately, dynamic analysis slows execution andtypically lacks code coverage.

Static taint analysis detects privacy leaks without runningthe app. There has been considerable effort on static taintanalysis, with the majority of work focusing on dataflow andpoints-to-based approaches [23, 42, 20, 10, 22, 7, 1]. Yet asolution remains elusive.

FlowDroid, a highly-precise, context-, flow-, field-, object-sensitive and lifecycle-aware static taint analysis for An-droid [1], is the state-of-the-art. Unfortunately, FlowDroid iscomputationaly- and memory-intensive. Further, while it re-ports numerous log flows in apps from the Google Play Store,it reports no network flows. This is surprising, given thecommon knowledge that apps track their users pervasively.

We propose type-based taint analysis for Android lever-aging previous work on type-based taint analysis for webapplications [16]. Our approach is modular and composi-tional. It can analyze any given set of classes. Modularanalysis is particularly suitable for Android apps because 1)the Android app is an “open” program with multiple entrypoints through callbacks, and 2) it uses large libraries thatcan be suitably handled with conservative defaults. The anal-ysis requires annotations only on sources and sinks. Once thesources and sinks are built into annotated libraries, Androidapps are analyzed without any input from the user.

Concretely, we propose DFlow, a context-sensitive infor-mation flow type system and DroidInfer, the correspondingtype inference analysis. In contrast to FlowDroid, Droid-Infer is lightweight and runs in ≈ 2 minutes on average,within a memory footprint of 2GB. It uncovers numerousnetwork flows in apps from the Google Play Store and inknown malware. DroidInfer posts an F-measure of 0.88 onDroidBench [7, 1], the standard for evaluating static taintanalysis for Android. FlowDroid’s F-measure is 0.89.

DroidInfer scales because it completely avoids points-toanalysis. It is precise because in essence it is CFL-reachabilitycomputation, a highly-precise analysis technique [33]. Animportant contribution of our work is that it explains source-sink flows intuitively in terms of CFL-reachability paths.

This paper makes the following contributions:

• DFlow, a context-sensitive information flow type systemand DroidInfer, the corresponding worst-case cubicinference analysis which amounts to CFL-reachability.DroidInfer works on Application Package files (APKs).

• Effective handling of Android-specific features: “open”programs with multiple entry points, callbacks andlarge libraries, and a technique that improves precisionin the handling of inter-component communication.

Page 2: Scalable and Precise Taint Analysis for Android

1 public class WallpapersMain extends Activity {2 private String BASE URL, deviceId;3 private int pageNum, catId;4 private DisplayMetrics metrics;5 private WebView browser1;6 protected void onCreate(Bundle b) {7 start();8 }9 protected void onActivityResult(int rq, int rs, Intent i){

10 navigate();11 }12 private void start() {13 BASE URL = ”getWallpapers Android2/”;14 TelephonyManager mgr =15 (TelephonyManager)this.getSystemService(”phone”);16 deviceId = mgr.getDeviceId(); // source17 }18 private void navigate() {19 String str = BASE URL + pageNum + ”/”+ catId + ”/”

+ deviceId + ”/”+ metrics.widthPixels + ”/”+metrics.heightPixels;

20 browser1.loadUrl(str); // sink21 }22 }

Figure 1: WallpapersMain leaks the device identifier(the source at line 16) to a content server (the sinkat line 20) in a URL.

• An extensive empirical evaluation on three sets of An-droid apps: 1) DroidBench [7, 1], 2) 22 malware appsfrom the Contagio website [24] and 3) 144 apps fromthe Google Play Store [12]. DroidInfer achieves thesame F-measure as FlowDroid on DroidBench. It un-covers all network flows in the Contagio apps, as wellas numerous network flows across 40 Play Store apps.

The rest of the paper is organized as follows. Sect. 2 givesa motivating example and a brief discussion of our type-based approach. Sect. 3 presents the DFlow type systemand the inference analysis. Sect. 4 outlines the connectionwith CFL-reachability and the reporting technique. Sect. 5describes the handling of Android-specific features. Sect. 6presents the empirical evaluation. Sect. 7 discusses relatedwork and Sect. 8 concludes.

2. OVERVIEWWe begin with a motivating example that shows a privacy

leak in an Android app and proceed to outline our approach.

2.1 Motivating ExampleThe example shown in Fig. 1 is refactored from one of our

benchmarks, Backgrounds HD Wallpapers version 2.0.1from the Google Play Store. The WallpapersMain activityfirst obtains the device identifier by calling the getDeviceIdmethod and stores it into a field deviceId when it is created(onCreate). Then it appends the deviceId into a search URLurl, which is sent to a content server in the navigate method.Finally, the navigate method is called in callback methodonActivityResult, resulting in a privacy leak.

This example illustrates several challenges. First, unlikeJava programs, Android apps do not have a single entry point.Instead, each callback method is a potential entry point asit could be called by the Android framework. In Wallpa-persMain, both onCreate and onActivityResult are callbackmethods that are implicitly called by the Android framework.

An Android app consists of a number of components, each ofwhich can be instantiated and run within the Android frame-work. The Android app defines its own behaviors at differentstates of the component lifecyle by overriding pre-definedcallback methods. Multiple entry points present a challengeto traditional points-to-based static analyses, which usuallyrequire whole-program analysis. Furthermore, control flow isinterrupted by callbacks from Android. In the Wallpapers-Main example, control from onCreate to onActivityResult mustbe captured in order to detect the leak.

2.2 Type QualifiersIn our type-based approach, each variable is typed by

a type qualifier. There are two basic qualifiers in DFlow:tainted and safe.

• tainted: A variable x is tainted, if there is flow from asensitive source to x. In the WallpapersMain example,the return value of TelephonyManager.getDeviceId istyped as tainted.

• safe: A variable x is safe if there is flow from x toan untrusted sink. For example, the parameter url ofWebView.loadUrl(String url) is a safe sink.

Note that our analysis for Android is actually a confiden-tiality analysis. We keep the term taint analysis and qualifierstainted and safe only in deference to previous work [4, 7].

In order to disallow flow from tainted sources to safe sinks,DFlow enforces the following subtyping hierarchy:

safe <: tainted1

where q1 <: q2 denotes q1 is a subtype of q2. (q is also asubtype of itself q <: q.) Therefore, it is allowed to assign asafe variable to a tainted one:

safe String s = ...;tainted String t = s;

However, it is not allowed to assign a tainted variable to asafe one:

tainted String t = ...;safe String s = t; // type error!

In the WallpapersMain example, the return value of getDe-viceId is typed as tainted and the url parameter of loadUrl istyped as safe, as they are a source and a sink, respectively.The field deviceId is tainted and so is the local variable strsince it contains the value of deviceId. Because it is not al-lowed to assign a tainted str to the safe parameter of loadUrl,the program results in a type error, signaling the leak.

Once the sources and sinks are given, type qualifiers areinferred automatically by our inference tool (Sect. 3.2). Ifthere is a valid typing, then there is no flow from a sourceto a sink. Otherwise, the tool reports type errors, signalingpotential privacy leaks.

A longstanding issue with type inference is explaining typeerrors [21, 39]. In general, the inference tool can issuea type error anywhere along the (usually long) flow pathfrom source to sink! We map each type error into intuitive,humanly-readable CFL-reachability flow paths (Sect. 4). Forexample, the type error in Fig. 1 (roughly) maps to

source[deviceId→ thisstart → thisnavigate

]deviceId→ str→ sink

1This is the desired subtyping. However, it is not safe whenmutable references are involved [29, 35].

Page 3: Scalable and Precise Taint Analysis for Android

1 class Util {2 poly String id(tainted Util this, poly String p) {3 return p;4 }5 }6 ...7 Util y = new Util();8 tainted String src = ...;9 safe String sink = ...;

10 tainted String srcId = y.id10(src);

11 safe String sinkId = y.id11(sink);

Figure 2: Context sensitivity example.

meaning that the source flows into field deviceId of implicitparameter this of start, which in turn flows into this of navigate,where field deviceId is read into str and str flows to sink.

The problem is not limited to type inference. Any staticanalysis (e.g., [1]) faces the issue of error reporting andthere are no satisfying solutions at this point. Our approachpresents a step forward.

2.3 Context SensitivityDFlow achieves context sensitivity by using a polymorphic

type qualifier, poly, and viewpoint adaptation [3].

• poly: poly is interpreted as tainted in some contextsand as safe in other contexts.

The subtyping hierarchy becomes

safe <: poly <: tainted

The concrete value of poly is interpreted by the viewpointadaptation operation. Viewpoint adaptation of a type q′

from the viewpoint of another type q, results in the adaptedtype q′′. This is written as q�q′ = q′′. Viewpoint adaptationadapts fields, formal parameters, and method return valuesfrom the viewpoint of the context at the field access or methodcall. DFlow defines viewpoint adaptation below:

� tainted = tainted� safe = safe

q � poly = q

The underscore denotes a “don’t care” value. Qualifierstainted and safe do not depend on the viewpoint (context).Qualifier poly depends on the viewpoint: e.g., if the viewpoint(context) is tainted, then poly is interpreted as tainted.

The type of a poly field f is interpreted from the viewpointof the receiver at the field access. If the receiver x is tainted,then x.f is tainted. If the receiver x is safe, then x.f is safe.The type of a poly parameter or return value is interpretedfrom the viewpoint of qi, the context at the method call.Consider the example in Fig. 2, where method id is typedas follows (code throughout the paper makes parameter thisexplicit when necessary):

poly String id(tainted Util this, poly String p)

This enables context sensitivity because id can take as inputa tainted String as well as a safe one. poly is interpreted astainted at callsite 10, and as safe at callsite 11.

3. TYPE SYSTEMIn this section, we define the DFlow type system and

present the type inference technique.

cd ::= class C extends D {fd md} classfd ::= t f fieldmd ::= t m(t this, t x) { t y s; return y } methods ::= s; s | x = new t() | x = y | x = y.f

| y.f = x | x = y.m(z) statementt ::= q C qualified typeq ::= tainted | poly | safe qualifier

Figure 3: Syntax. C and D are class names, f is a fieldname, m is a method name, and x, y, and z are namesof local variables, formal parameters, or parameterthis. As in the code examples, this is explicit. Forsimplicity, we assume all names are unique.

(tnew)

Γ(x) = qx q <: qx

Γ ` x = new q C

(tassign)

Γ(x) = qx Γ(y) = qy qy <: qx

Γ ` x = y

(twrite)

Γ(y) = qy typeof (f) = qf Γ(x) = qx qx <: qy � qf

Γ ` y.f = x

(tread)

Γ(y) = qy typeof (f) = qf Γ(x) = qx qy � qf <: qx

Γ ` x = y.f

(tcall)

typeof (m) = qthis, qp → qret Γ(y) = qy Γ(x) = qx Γ(z) = qzqy <: qi � qthis qz <: qi � qp qi � qret <: qx

Γ ` x = y.mi(z)

Figure 4: Typing rules. Function typeof retrievesthe DFlow types of fields and methods, Γ is a typeenvironment that maps variables to DFlow qualifiers.qi is the context of adaptation at call site i.

3.1 Typing RulesWe define our typing rules over a syntax in “named form”

where the results of field accesses, method calls, and instan-tiations are immediately stored in a variable. The syntaxis shown in Fig. 3. Without loss of generality, we assumethat methods have parameter this, and exactly one otherformal parameter. The DFlow type system is orthogonal to(i.e. independent of) the Java type system, which allows usto specify typing rules over type qualifiers q alone.

The typing rules are defined in Fig. 4. Rules (tnew) and(tassign) enforce the expected subtyping constraints. Therules for field access, (tread) and (twrite), adapt field f fromthe viewpoint of receiver y and then enforce the subtypingconstraints. Recall that the type of a poly field f is interpretedin the context of the receiver y. If the receiver y is tainted,then y.f is tainted. If the receiver y is safe, then y.f is safe.

The rule for method call, (tcall), adapts formal parametersthis and p and return value ret from the viewpoint of callsitecontext qi, and enforces the subtyping constraints that cap-ture flows from actual arguments to formal parameters, andfrom return value to the left-hand-side of the call assignment.

The callsite context qi is a value that is not important, ex-cept that it should exist. It can be any of {tainted, poly, safe}.

Page 4: Scalable and Precise Taint Analysis for Android

Consider the example in Fig. 2. At callsite 10, q10 is taintedand q10 � poly is interpreted as tainted. The following con-straints generated at callsite 10 are satisfied2:

y <: q10 � tainted src <: q10 � poly q10 � poly <: tainted

At callsite 11, q11 is safe and q11 � poly is interpreted as safe.Therefore, the constraints at callsite 11 are satisfied:

y <: q11 � tainted sink <: q11 � poly q11 � poly <: safe

We compose DFlow with ReIm, a reference immutabilitytype system [17]. This is necessary to overcome known issueswith subtyping in the presence of mutable references [29, 35].Specifically, if the left-hand-side of an assignment (explicitor implicit) is immutable according to ReIm, we enforcea subtyping constraint; otherwise, we enforce an equalityconstraint. For example, at (tassign) x = y, if x is immutable,i.e. x is not used to modify the referenced object, we enforceqy <: qx; otherwise, we enforce qy = qx. The more variablesare proven immutable, the more subtyping constraints thereare, and hence, the more precise DFlow is [25].

Method overriding is handled by the standard constraintsfor function subtyping. If n overrides m, we require typeof (n) <:typeof (m) and thus

(qthisn , qpn → qretn) <: (qthism , qpm → qretm)

This entails qthism<:qthisn , qpm<:qpn , and qretn<:qretm .

3.2 Type InferenceGiven sources and sinks, type inference derives a valid

typing, i.e. an assignment from program variables to typequalifiers that type checks with the typing rules in Fig. 4. Iftype inference succeeds, then there are no leaks from sourcesto sinks. If it fails the app may contain leaks.

Type inference first computes a set-based solution S, whichmaps variables to sets of potential type qualifiers. Then ituses method summary constraints, a technique that refinesthe set-based solution and helps derive a valid typing.

3.2.1 Set-based SolutionThe set-based solution is a mapping S from variables to

sets of qualifiers. For instance, if S(x) = {tainted, poly},that means variable x can be tainted or poly, but not safe.Programmer-annotated variables, including sources and sinks,are initialized to the singleton set that contains the providedtype qualifier. For example, sources and sinks from theannotated library map to {tainted} and {safe}, respectively.Fields f are initialized to S(f) = {tainted, poly}; we forgosafe fields, which makes the inference converge faster. Allother variables and callsite contexts qi are initialized to themaximal set of qualifiers, i.e. S(x) = {tainted, poly, safe}.

The inference then creates constraints for all program state-ments according to the typing rules in Fig. 4. It takes intoaccount the mutability of the left-hand-side of assignmentsas discussed in Sect. 3. Then the set-based solver iteratesover constraints c and calls SolveConstraint(c). Solve-Constraint(c) removes infeasible qualifiers from the set ofvariables in c [14]. Consider constraint c: qy <: qx whereS(y) = {tainted} and S(x) = {tainted, poly, safe} before solv-ing the constraint. The solver removes poly and safe fromS(x), because there does not exist a qy ∈ S(y) that satisfies

2For brevity and clarity, we omit q when dealing with vari-ables from code examples, i.e., we write y instead of qy.

1: procedure SummarySolver2: repeat3: for each c in C do4: SolveConstraint(c)5: if c is qx <: qy � qf and S(f) is {poly} then . Case 16: Add qx <: qy into C7: else if c is qx � qf <: qy and S(f) is {poly} then . Case 28: Add qx <: qy into C9: else if c is qx <: qy then . Case 310: for each qy <: qz in C do Add qx <: qz to C end for11: for each qw <: qx in C do Add qw <: qy to C end for

12: for each qw <: qi � qx and qi � qy <: qz in C do . Case 413: Add qw <: qz to C14: end for15: end if16: end for17: until S remains unchanged18: end procedure

Figure 5: Initially, S is the result of the set-basedsolution and C is the set of constraints for programstatements. Cases 1 and 2 add qx <: qy to C be-cause qy � poly always yields qy. Case 3 adds con-straints due to transitivity. Case 4 adds constraintsbetween actual(s) and left-hand-side(s) at calls: ifthere are constraints qw <: qi � qx (flow from actualw to formal x) and qi � qy <: qz (flow from returnvalue y to left-hand-side z), and also qx <: qy (flowfrom formal to return value), Case 4 adds qw <: qz.Line 4 calls SolveConstraint(c): the solver infers newconstraints, which remove additional infeasible qual-ifiers from S. This process repeats until S stays un-changed. We represent equality constraints as twosubtyping constraints (e.g., qx <: qy and qy <: qx),hence no explicit mention of equality constraints.

qy <: poly and qy <: safe. In the case that the infeasiblequalifier is the last element in S(x), the solver reports a typeerror. For example, y{tainted} <: x{safe} is a type errorbecause it is not satisfiable.

The solver keeps removing infeasible qualifiers for eachconstraint until it reaches a fixpoint. If there are type errors,this indicates potential flows from sources to sinks.

3.2.2 Valid TypingUnfortunately, even if the set-based solver terminates with-

out type errors, a valid typing still may not exist. That is,there still may be undiscovered flows from sources to sinks.

We adapt method summary constraints, a technique thatremoves additional infeasible qualifiers, and helps arrive at avalid typing or uncover additional type errors. The algorithm,adapted from [16] to DFlow, is shown in Fig. 5.

The method summary constraints are constraints that“connect” parameters to return values. Recall the id examplein Fig. 2. p <: ret is a method summary constraint reflectingthe flow from the parameter p to the return value ret.

In many cases however, the flow from formal parameters toreturn values is “connected” indirectly. For example, the pa-rameter p and the return value ret can be connected throughtwo constraints: qp <: qx and qx <: qret. Due to transitivity,we have qp <: qret. The algorithm “searches” for a subtypingchain from the formal parameter (including this) to the re-turn value of method m (Cases 1, 2, and 3 in Fig. 5). It usesthe method summary constraints to “connect” the actualargument and the left-hand-side of the call assignment atcalls to m (Case 4).

Page 5: Scalable and Precise Taint Analysis for Android

1 public class Data {2 String secret;

3 String get(Data this) {return this.secret;}4 void set(Data this, String p){this.secret = p;}5 }6 public class FieldSensitivity2 extends Activity {7 protected void onCreate(Bundle b) {8 Data dt = new Data();

9 TelephonyManager tm = (TelephonyManager)getSystemService(”phone”);

10 String sim = tm.getSimSerialNumber();

11 dt.set(sim);12

13 SmsManager sms = SmsManager.getDefault();14 String sg = dt.get();15

16 sms.sendTextMessage(”+123”,null,sg,null,null);17 }18 }

Constraints Set-based solution New constraints�� ��S(secret) = {poly}

thisget � secret <: retget�� ��S(retget) = {poly, safe} thisget <: retget

p <: thisset � secret�� ��S(p) = {tainted, poly} p <: thissecret

�� ��S(dt) = {tainted, poly, safe}

q10 � tainted <: sim�� ��S(sim) = {tainted}

sim <: q11 � p sim <: dt

dt = q11 � thisset�� ��S(thisset) = {tainted, poly}

dt <: q14 � thisget�� ��S(thisget) = {poly, safe}

q14 � retget <: sg dt <: sg�� ��TYPE ERROR!

sg <: q16 � safe�� ��S(sg) = {safe}

Figure 6: FieldSensitivity2 example refactored from DroidBench. The frame box beside each statement showsthe corresponding constraints the statement generates. We omitted uninteresting constraints. The ovalboxes show propagation during the set-based solution. 16 forces sg to be {safe}, then 14 forces retget to be{poly, safe} and then 3 forces thisget to be {poly, safe} and secret to be {poly} (recall that fields are initialized to{tainted, poly}, Sect. 3.2.1). 10 forces sim to be {tainted}, which in turn forces the parameters p and thisset to be{tainted, poly}. There are no type errors in the initial set-based solution. The red frame boxes in the fourthcolumn (New constraints) show the constraints due to SummarySolver. Since field secret is poly, constraintthisget � secret <: retget leads to method summary constraint thisget <: retget, which in turn leads to dt <: sg due tothe call at 14. Similarly, p <: thisset � secret leads to p <: thisset, which in turn leads to sim <: dt due to the callat 11. Since sim is {tainted} and sg is {safe}, these constraints cause a TYPE ERROR, detecting the leak.

Consider again the id method in Fig. 2. We have p <: retdue to the return statement return p. The inference addsconstraints between actual arguments and left-hand-sides atcallsites 10 and 11. First, p <: ret implies q10�p <: q10� ret.This constraint and the constraints at callsite 10

src <: q10 � p <: q10 � ret <: srcId

entail src <: srcId. The inference adds src <: srcId, connectingthe actual argument src and the left-hand side srcId at callsite10. Similarly, the inference adds sink <: sinkId at callsite 11.Such new constraints remove additional infeasible qualifiers,and help arrive at a valid typing or uncover new type errors.

When SummarySolver (Fig. 5) terminates without typeerrors, the inference derives a concrete typing by pickingup the maximal element of S(x) according to the rankingtainted > poly > safe. Such maximal typing almost alwaystype-checks, which guarantees that there is no unsafe flowfrom sources to sinks. Even in the rare cases when themaximal typing does not type check, there is no unsafeflow [27]. In contrast, the maximal typing derived from theset-based solution before running SummarySolver, usuallydoes not type check.SummarySolver, which dominates the inference, has

worst-case complexity of O(n3), where n is the size of theprogram. There are at most O(n) iterations of the outerloop (line 2) because at least one of O(n) variables is up-dated in each iteration. There are at most O(n2) iterationsof the inner loop (line 3) because in the worst case everytwo variables can form a constraint, resulting in O(n2) con-straints. Note that lines 10-12 need to run only when a newconstraint is first discovered, so they contribute O(n3) overall constraints. Altogether, we have worst-case complexity

of O(n3). Soundness of DFlow is argued as in [15].

3.2.3 ExampleLet us consider the FieldSensitivity2 example refactored

from DroidBench [7] in Fig. 6. The return of Telephony-Manager.getSimSerialNumber (line 10) is a source and theparameter msg of SmsManager.sendTextMessage (line 16) isa sink. The serial number of the SIM card is obtained andstored into a Data object. Later, it is retrieved from the Dataobject and sent out through an SMS message without userconsent. Fig. 6 demonstrates the analysis.

4. EXPLAINING TYPE ERRORSType inference produces type errors whenever there may

be flow from a source to a sink. Unfortunately, type errorsby themselves are rarely useleful. For example, DroidInferproduces the following type error at statement 10 in Fig. 6:

q10 � retgetSimSerialNumber{tainted} <: sim{safe}

meaning that the right-hand-side of the call assignment istainted while the left-hand-side is inferred safe. The challengeis to map each type error into a concise and intuitive source-sink path that explains the error.

In recent work [26], we studied the connection betweenDFlow/DroidInfer and CFL-reachability [33, 6]. The keyidea is that the type constraints in Fig. 4 correspond to edgesin an annotated dependence graph, and that type inferenceamounts to CFL-reachability computation over the graph.

Field access constraints correspond to field-annotated edges(those constraints account for structure-transmitted depen-dences in Reps’ terminology [34]). In the example in Fig. 6,

Page 6: Scalable and Precise Taint Analysis for Android

the field read return this.secret and its DFlow constraintthisget � secret <: retget correspond to edge

thisget]secret→ retget

As it is standard in CFL-reachability, the open bracket [fdenotes a write to field f, and the close bracket ]f denotesa read of f. Similarly, callsite constraints correspond tocallsite-annotated edges (those account for call-transmitteddependences). In the example in Fig. 6, callsite 14 gives riseto the following constraints:

dt <: q14 � thisget q14 � retget <: sg

which correspond to the following edges:

dt(14→ thisget retget

)14→ sg

Again standard in CFL-reachabililty, the open parenthesis(i denotes a call at callsite i, and the closed parenthesis )idenotes a return at callsite i.

The constraints in Fig. 6 give rise to source-sink path

source → sim(11→ p

[secret→ thisset)11→ dt

(14→ thisget]secret→ ret

)14→ sg→ sink

which presents an intuitive explanation of the type error weshowed at the beginning of this section: the source flows intolocal variable sim, which in turn flows to formal parameterp at callsite 11, where in turn p is written into field secretof this, etc. Perhaps the only unintuitive part is the edge

this)11→ dt (naturally, the flow at callsite 11 is from dt to this).

This inverse edge is due to the mutation of thisset, whichamounts to a return from set at 11.

Let L(F ) denote the context-free language of balancedopen and closed brackets, and let L(C) denote the analogouslanguage of balanced open and closed parentheses. For ex-ample, string [f ]f [g ]g is in L(F ) but [f ]g is not in L(F ). Forprecise treatment, we refer the interested reader to [26]. Afeasible source-sink path is a path where the field string be-longs to L(F ) and the call string belongs to L(C). The abovepath is feasible because its field string [secret ]secret ∈ L(F ) andits call string (11 )11 (14 )14 ∈ L(C). Our goal is to map eachtype error to one or more feasible source-sink paths.

We run DroidInfer with the option that pushes type errorstowards sources. (This can be done with a prioritization ofthe constraints in SummarySolver in Fig. 5.) The result isthat when DroidInfer terminates, the safe sinks have affectedthe set-based solution of each variable that flows to a sink.More precisely, if x flows to a sink, then tainted /∈ S(x).Thus, we can construct the dependency graph from theconstraints for program statements, omitting all nodes whoseset-based solution contains tainted. The resulting graphcan be viewed as a backward slice that excludes the partsof the program unaffected by the sinks. This significantlyreduces the size of the dependency graph and renders CFL-reachability reasoning practical.

For each type error, we run CFL-explain, which prints fea-sible paths from the source at the type error, to all reachablesinks. CFL-explain, a breadth-first-search (BFS) augmentedwith CFL-reachability, is described in detail in Fig. 7. Notethat one must restrict the keys of map M to ensure termi-nation. Currently we distinguish keys by the last two openparentheses and the last two open brackets. This meansthat if CFL-Explain has recorded in M a path to x with acall string, say, that ends at (i (j , and it later arrives at a

1: procedure CFL-Explain2: Add 〈start, ε, ε〉 to Q3: Add 〈start, ε, ε〉 → [] to M4: while Q is not empty do5: dequeue next node 〈n, f, c〉 from Q6: if n is a sink node then7: print the path in M associated with 〈n, f, c〉8: end if9: for each edge e = 〈n,m, f ′, c′〉 s.t. Method(m) ∈ CG do10: Let p be the path in M associated with 〈n, f, c〉11: Let p′ be the path formed by appending e to p12: if f+f ′∈L(F ) ∧ c+c′∈L(C) ∧ 〈m, f+f ′, c+c′〉 /∈M then13: Add 〈m, f+f ′, c+c′〉 → p′ to M14: Add 〈m, f+f ′, c+c′〉 to Q15: end if16: end for17: end while18: end procedure

Figure 7: CFL-Explain is a BFS augmented withCFL-reachability. M maps graph nodes n, aug-mented with field-access strings f ∈ L(F ) and callstrings c ∈ L(C), to paths in the graph. f ′ is afield write [f , a field read ]f or ε. Similarly, c′ is(i, )i or ε. For each edge, f ′ or c′ is empty (e.g.,e = (thisget, ret, ]secret, ε)) CG is a precomputed callgraph. Method(m) gives the enclosing method of m.

different path to x, whose call string also ends at (i (j , thelatter path will not be recorded.

Continuing with the example in Fig. 6, CFL-explain takesas input the variable on the right-hand-side of the type error,sim, and produces the source-sink path we showed earlier:

sim(11→ p

[secret→ thisset)11→ dt

(14→ thisget]secret→ ret

)14→ sg→ sink

Type inference and CFL-reachability inherently provide adata-flow guarantee but lack a control-flow guarantee. Inother words, in order for the flow from source to sink tohappen, control must reach the statements on the path inthe particular order specified by the path. But does controlreach the path? DroidInfer takes as input the entire APKand infers types and source-sink paths across the entire APK,even though some classes and methods may be unreachable.

To provide a (degree of) control-flow guarantee, we incor-porate a conservative call graph. Concretely, line 9 in Fig. 7verifies that the target node m appears in a method, whichis live in the call graph CG.

CFL-Explain can refute a type error reported by DroidInferfor one of two reasons: 1) one or more methods on the pathfrom source to sink is unreachable on the call graph and 2)the type error is a false positive due to the approximation instructure-transmitted dependences employed by DroidInfer(see [26]), and CFL-Explain cannot confirm a feasible path.

5. ANDROID-SPECIFIC FEATURESIn this section, we discuss our techniques for handling

Android-specific features, including libraries, multiple entrypoints and callbacks, and inter-component communication.

5.1 LibrariesLibraries are ubiquitous in Android apps. An effective

analysis should keep track of flows through library methodcalls. Unfortunately, analyzing the Android library is a signif-icant challenge. Computing safe summaries for the Androidlibrary is an open problem (to the best of our knowledge).

Page 7: Scalable and Precise Taint Analysis for Android

Analyzing library calls on-demand, i.e., using some form ofreachability analysis faces challenges due to callbacks andreflection, which are pervasive in Android. The most popularsolution appears to be manual summaries for common librarymethods [20, 7], which is clearly unsatisfying.

DroidInfer inserts annotations (type qualifiers) into theAndroid library for sources (e.g. location access, phone state)and for sinks (e.g., Internet access) by using the Stub Gener-ation Tool and the Annotation File Utility from the CheckerFramework [31]. DroidInfer uses conservative defaults forall unknown library methods. For any unanalyzed librarymethod m, it assumes the typing poly, poly → poly. Thistyping conservatively propagates a tainted receiver/argumentto the left-hand side of the call assignment. Similarly, itpropagates a safe left-hand-side to the receiver/arguments.Consider the following code snippet:

1 public class MyListener implements LocationListener {2 @Override3 public void onLocationChanged(Location loc){//source4 double lat = loc.getLatitude();5 Log.d(”History”, ”Latitude: ”+ lat); // sink6 }7 }

LocationListener.onLocationChanged(tainted Location l) is acallback method. Parameter l is a tainted source that mustpropagate throughout the overriding user-defined methodMyListener.onLocationChanged(Location loc). The methodoverriding constraints (Sect. 3) lead to:

typeof (MyListener.onLocationChanged(Location loc))<:

typeof (LocationListener.onLocationChanged(tainted Location l))

This entails l <: loc, forcing loc to be tainted as well.DroidInfer assumes that library method Location.getLatitude()

is typed as follows:poly double getLatitude(poly Location this)

and creates the following constraints at Statement 4:

loc <: q4 � poly q4 � poly <: lat

Because loc is tainted, the callsite context q4 is inferred astainted. Consequently, lat is inferred as tainted as well, whichleads to a type error because Statement 5 requires a safeargument. (Here the parameter msg of Log.d(String tag,String msg) is a safe sink.)

We apply these conservative defaults to the Java andAndroid libraries. We can apply these defaults to any third-party library we do not wish to analyze.

5.2 Multiple Entry Points and CallbacksMultiple entry points and the ubiquitous use of callbacks

in Android apps cause difficulty for traditional dataflow andpoints-to-based analysis. The Android app is not a closedprogram. Instead, it runs within the Android framework,which implicitly creates objects of the user-defined classesand calls user-defined methods in the app through callbacks.

DroidInfer is type-based and modular. Therefore, it cananalyze any given set of classes.

However, the analysis of an Android app is different fromthe analysis of an open library and it requires special con-sideration. Roughly speaking, we need to capture the “con-nections” among callback methods, or DroidInfer might missprivacy leaks through fields. Consider the LocationLeak2example refactored from DroidBench in Fig. 8. The tainted

1 public LocationLeak2 extends Activity implementsLocationListener {

2 private double latitude;3 protected void onResume() {4 double d = this.latitude; // TYPE ERROR!5 Log.d(”Latitude”, ”Latitude: ”+ d); // sink6 }7 public void onLocationChanged(Location loc) {8 double lat = loc.getLatitude(); // loc is a source9 this.latitude = lat;

10 }11 }

Figure 8: LocationLeak2 refactored from DroidBench,highlights DroidInfer’s novel handling of callbacks.

lat of the current location, obtained in callback method on-LocationChanged, flows through field latitude and reachesthe safe parameter of Log.d in another callback method,onResume. Local variables lat and d are tainted and safe,respectively. If DroidInfer analyzed the app as a stan-dard open library (e.g., as in [17]), it would infer this ofonResume as safe. This is because of (tread) constraintthisonResume � latitude <: d where S(latitude) = {tainted, poly}and S(d) = {safe}. Due to this constraint, S(latitude) wouldbe updated to {poly}. Further, DroidInfer would infer this ofonLocationChanged as tainted, because of (twrite) constraintlat <: thisonLocationChanged � latitude where S(lat) = {tainted}.The inferred typing would type check and the leak throughfield latitude would be missed.

If the app were a standard open library, it would be com-posed with user code, which would instantiate the Activityand reveal the leak. Consider this hypothetical user code:

1 Activity a = new LocationLeak2();2 a.onLocationChanged(loc);3 a.onResume();

When composing this code with the inferred typing for Loca-tionLeak2, there would be a type error because Statement 2requires a to be tainted (thisonLocationChanged is mutable; thus,there is equality constraint at 2: a = q2 � thisonLocationChanged),while Statement 3 requires a to be safe.

The Android app is not composed with user code. Instead,the Activity, as well as other component objects, are instanti-ated by the Android framework. Therefore, DroidInfer needsto handle the implicit instantiation of app objects. DroidIn-fer creates equality constraints for all pairs of this parametersof callback methods in the same class. This “connects” this ofcallback methods. If the app type checks, this means thereis a solution for constraints

qa <: qi1 � qthiscallback1qa = qi2 � qthiscallback2

...

which correspond to the calls to callback methods in theAndroid framework.

In the LocationLeak2 example, the inference creates anequality constraint for the this parameters of onResume andonLocationChanged:

thisonResume = thisonLocationChanged

thisonResume becomes tainted. There is a type error at State-ment 4, thus detecting the privacy leak.

Page 8: Scalable and Precise Taint Analysis for Android

1 public class SmsReceiver extends BroadcastReceiver {2 public void onReceive(Context c, Intent i) {3 Bundle bundle = intent.getExtras();4 Object[] pdusObj = (Object[]) bundle.get(”pdus”);5 StringBuilder sb = new StringBuilder();6 for (int i = 0; i < pdusObj.length; i++) {7 SmsMessage msg = SmsMessage.createFromPdu((byte

[]) pdusObj[i]); // source8 String body = msg.getDisplayMessageBody();9 sb.append(body);

10 }11 Intent it = new Intent(c, TaskService.class);12 it.putExtra(”data”, sb.toString());13 startService(i);14 }15 }16 public class TaskService extends Service {17 public void onStart(Intent it, int d) {18 String body = it.getSerializableExtra(”data”);19 List list = new LinkedList();20 list.add(body);21 HttpClient client = ....getHttpClient();22 HttpPost post = new HttpPost();23 post.setURI(URI.create(”http://103.30.7.178/getMotion.

htm”));24 Entity e = new UrlEncodedFormEntity(list, ”UTF8”);25 post.setEntity(e); // sink26 client.execute(post);27 }28 }

Figure 9: SMS message stealing in Fakedaum. TheSMS message is intercepted in SmsReceiver and passedto TaskService via Intent. Finally, the message is sentout to the Internet using HTTP post method, re-sulting in a message leak.

5.3 Inter-Component Communication (ICC)Android components (activity, service, broadcast receiver

and content provider) interact through ICC objects — mainlyIntents. Communication can happen across applications aswell, to allow functionality reuse. There are two forms ofIntent in Android: 1) Explicit Intents have an explicittarget component — the exact target class of the Intentis specified, and 2) Implicit Intents do not have a targetcomponent, but they include enough information for thesystem to implicitly determine the target component.

Capturing data flow through Intents is important for de-tecting privacy leaks in Android. Consider the examplerefactored from a real malware app, Fakedaum3 in Fig. 9,where the return value of SmsMessage.createFromPdu is asource and the parameter of HttpPost.setEntity is a sink. Thebroadcast receiver SmsReceiver intercepts the SMS messages,then puts the messages into an Intent and starts the back-ground service TaskService with the Intent. Then TaskServicesends the messages to the Internet without user consent. Ifthe communication between the broadcast receiver SmsRe-ceiver and the background service TaskService is not captured,there is no way to detect the privacy leak.

We propose a technique to improve analysis precision inthe presence of ICC through Intents. For an explicit Intentwhose target class is specified by a final or constant string,DroidInfer connects the data carried by the Intent usingplaceholders. DroidInfer replaces the Intent with a “typed”

3http://contagiominidump.blogspot.com/2013/11/fakedaum-vmvol-android-infostealer.html

Figure 10: Architecture of the inference framework.

Intent at both the sender and the receiver components. Inaddition, each putExtra and getExtra are treated as writingand reading a field in the “typed” Intent, respectively. Sincethe target class of Intent it in Fig. 9 (line 11) is specified byconstant TaskService.class, DroidInfer transforms the programinto:

10 ...11 TaskService Intent it = new TaskService Intent();12 TaskService Intent.data = sb.toString();13 ...18 String body = TaskService Intent.data;

As a result, the intercepted message is connected to the postdata via placeholder data of TaskService Intent. The leak iscaptured by DroidInfer.

For explicit Intents whose target class is not specifiedby a constant string, a string analysis, which we leave forfuture work, is required to determine the target. DroidInfermakes the worst-case assumption for such explicit Intents,as well as for implicit Intents carrying sensitive data, astheir content can be intercepted by any, possibly malicious,component. This is achieved by annotating as safe the Intentparameter of library methods that start new components,such as startActivity and startService. Suppose it2 refers to animplicit Intent carrying current location information, thenthere is a type error at statement startSerivce(it2) becausestartSerivce requires a safe argument, but it2 is tainted as itcontains tainted data.

6. EMPIRICAL RESULTSWe have built a type inference and checking framework

and we have instantiated the framework with several typesystems and their corresponding inferences. Initially, theframework had one front-end, built on top of the CheckerFramework [31] (CF). CF takes as input the Java source code,which unfortunately is not available for most Android apps, asthey are usually delivered as Android Package Files (APKs).Therefore, we extended our type inference framework bybuilding an Android constraint generation front-end, basedon Soot [37] and Dexpler [2]. The architecture of our inferenceframework is shown in Fig. 10. It is worth noting thatwe came upon DFlow and DroidInfer as instances of ourframework and they proved very effective.

The Android front-end takes as input the Jimple filestransformed by Soot and Dexpler and outputs the constraintsgenerated according to the typing rules in Sect. 3. Next, thegenerated constraints along with the annotated librarieswhere sources and sinks are defined, are supplied to the typeinference engine, which computes the set-based solution theneither extracts a valid typing or reports type errors for theanalyzed program. All sources and sinks are listed in the

Page 9: Scalable and Precise Taint Analysis for Android

Tool Name AppScan Source Fortify SCA FlowDroid DroidInferSum, Precision and Recall–excluding implicit flows√

, higher is better 14 17 26 28×, lower is better 5 4 4 8©, lower is better 14 11 2 0Precision p =

√/(√

+×) 74% 81% 86% 78%Recall r =

√/(√

+©) 50% 61% 93% 100%F-measure 2pr/(p + r) 0.60 0.70 0.89 0.88

Figure 11: Summary of comparison on DroidBench [7] with other taint analysis tools (√

= correct warning,× = false warning, © = missed flow).

Appendix. They are the union of the sources and sinks ofDroidBench [7, 1] and the network sinks of Contagio [24].The sources are various phone sata and location. The sinksinclude the expected log sinks and the following Internet sinks:WebView.loadUrl, URL.openConnection, and Http request. Thesinks include Intents, which are necessary for the flows inDroidBench. The type inference framework, including DFlowand DroidInfer, is publicly available at http://code.google.com/p/type-inference/.

We build 0-CFA callgraphs using WALA4. Recall that weuse the set of reachable methods from the call graph to checkthat the finding of DroidInfer occurs entirely within thosemethods (Sect. 4). We use support in WALA, contributedby SCanDroid[9], to build call graphs of APKs.

All experiments run on a server with IntelR© XeonR© CPUX3460 @2.80GHz and 8 GB RAM. The maximal heap sizeis set to 2 GB. The software environment consists of OracleJDK 1.6 and the Soot 2.5.0 nightly build.

6.1 HypothesesWe evaluate the DroidInfer system along three hypotheses:

(H1) High recall and precision. DroidInfer misses fewtrue flows and reports few false positive flows.(H2) Network flows. DroidInfer detects leaks of phone orlocation info to the network, in known malicious apps andin Google Play Store apps.(H3) Scalability. DroidInfer scales to large apps.

We run DroidInfer on three sets of apps: 1) DroidBench [7],2) 22 apps from the Contagio website [24], known to containleaks, and 3) 144 popular apps from the Google Play store,including the top 30 apps at the time of writing.

6.2 DroidBenchWe run DroidInfer on DroidBench, which is a suit of 39

Android apps designed by Fritz et al. [7, 1]. DroidBenchexercises many difficult flows, including flows through fieldsand method calls, as well as Android-specific flows. Droid-Bench is the standard evaluating taint analyses for Android.We compare with three other taint analysis tools – AppScanSource [18], Fortify SCA [13], and FlowDroid [7, 1], usingthe results presented by Fritz et al. [7]. Fig. 11 summarizesthe comparison. DroidInfer outperforms AppScan Sourceand Fortify SCA, which miss substantial amount of flows.The low recall contributes to the slightly higher precisionreported by Fortify SCA. FlowDroid is slightly more pre-cise than DroidInfer because it uses a flow-sensitive analysis.DroidBench tests for flow sensitivity and our analysis, whichis flow-insensitive, misses those tests. In our experiencewith real-world apps however, flow sensitivity will not help.Overall, the F-measures for FlowDroid and DroidInfer are

4http://wala.sourceforge.net

AwesomeJokes (4)DeviceId→URLConnection

Backflash (3)SmsMessage→URLConnection

BatteryDoctor (5)Location→HttpEntityDeviceId→HttpEntityDeviceId→WebView

BatteryImprove (1)DeviceId→URLConnection

Beita (4)DeviceId→HttpEntityLocation→HttpEntity

DroidKungFu (1)DeviceId→HttpEntity

FakeBanker (7)PhoneNumber→HttpEntitySimSerialNumber→HttpEntitySmsMessage→HttpEntityPhoneNumber →HttpEntity

Fakedaum (3)SimSerialNumber→HttpEntityPhoneNumber→HttpEntitySmsMessage→HttpEntity

FakeTaobao (2)PhoneNumber→HttpEntityDeviceId→HttpEntity

Godwon (5)DeviceId→HttpEntityPhoneNumber→HttpEntitySmsMessage→HttpEntity

Jollyserv (2)DeviceId→sendTextMessagePhoneNumber→HttpEntity

Kmin (2)SubscriberId→URLConnectionDeviceId→URLConnection

Loozfon (4)PhoneNumber→HttpEntityDeviceId→HttpEntityContact→HttpEntity

Roidsec (4)PhoneNumber→Socket OutputStreamLocaion→Socket OutputSteamContact→Socket OutputStreamDeviceId→Socket OutputStream

Scipiex (1)Contact→Socket OutputStream

Simhosy (3)DeviceId→URLConnectionSubscriberId→URLConnection

Skullkey (1)DeviceId→HttpEntity

Uranai (5)DeviceId→HttpEntityPhoneNumber→HttpEntitySimSerialNumber→HttpEntityContact→HttpEntity

Zertsecurity (3)DeviceId→HttpEntitySmsMessage→HttpEntity

Figure 12: Leaks detected in Malware.

essentially the same. This strongly supports hypothesis H1.

6.3 ContagioWe analyzed all 22 apps tagged as “infostealer” on the

contagio website [24]. Fig. 12 summarizes the analysis result.DroidInfer detects that 19 out of the 22 apps send out phonestate (e.g. DeviceId, SimSerialNumber, and PhoneNumber),SMS messages, and/or location information through HTTPor text messages, or write into a socket. DroidInfer detectsno leaks for the remaining 3 apps. For two of the APK files,FakePlay and Repane, Soot/Dexpler did not generate Jim-ple files and DroidInfer in turn did not generate constraints.DroidInfer reports zero type errors on Phospy. (Phospyappears to steal jpg and mp4 files, and such sources are notincluded in DroidInfer at this point). All type errors on theseapps are explained and there are no false positives. Theseresults strongly support hypotheses H1 and H2.

6.4 Google Play Store

We analyze 144 free Android apps from the official GooglePlay Store. These include the top 30 free apps (as of Jan5th 2015, the time of writing) as well as other popular apps

Page 10: Scalable and Precise Taint Analysis for Android

2048 Number puzzle game 6.06 (8)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

AccuWeather 3.2.14.1 (8)PhoneData→LogPhoneData→NetworkLocation→Network

Location→LogLocation→Intent

Amazon 2.9.7 (2)Location→Network

Backgrounds HD Wallpapers 2.0.1 (3)PhoneData→LogPhoneData→Network

Contact→Log

Chase Mobile 3.16 (2)Contact→Intent

Clash of Mafias 1.0.45 (9)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Clean Master 5.4.0 (6)PhoneData→NetworkLocation→LogLocation→Network

Location→IntentContact→Intent

CSI: Hidden Crimes 1.6.0 (3)PhoneData→Log Location→Log

Cut the Rope 2 1.3.0 (13)PhoneData→LogLocation→Log

Location→Network

DealMoon 3.4.0 (4)PhoneData→Log PhoneData→Network

Dice With Buddies 3.3.5 (14)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Dictionary.com 4.4 (11)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Don’t Tap The White Tile 2.2.3 (22)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Dumb Ways to Die 2 1.1.1 (1)PhoneData→Intent

ebay 2.6.0.98 (3)Location→Log

ES File Explorer 3.1.3 (5)PhoneData→LogPhoneData→Network

Location→Log

ES Task Manager 1.4.2 (1)PhoneData→Intent

ESPN Fantasy Football 1.1.6 (1)PhoneData→Log PhoneData→Network

Expedia 3.6 (2)PhoneData→Network Location→Log

Facebook 4.0.0.26.3 (2)Location→Log

Facebook Messenger 5.0.0.16.1 (2)Location→Log

Flow Free: Bridges 1.7 (3)PhoneData→Log

GO SMS Pro 5.44 (14)PhoneData→LogPhoneData→Network

PhoneData→IntentContact→Intent

Groupon 3.1.3273 (6)PhoneData→LogPhoneData→Network

Location→Log

GUNSHIP BATTLE 1.3.0 (3)Location→Log

HotPads 3.1 (2)Location→Log

iHeartRadio 4.11.0 (5)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Iron Force 1.5.2 (6)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Job Search 2.3 (1)Location→Log

Kakao Talk 4.6.8 (5)PhoneData→LogPhoneData→IntentLocation→Log

Contact→LogContact→Intent

LINE 4.3.0 (8)PhoneData→Intent Contact→Intent

LinkedIn 3.4.3 (2)PhoneData→Network Contact→Log

Looney Tunes Dash 1.45.11 (11)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Make It Rain: Love of Money 1.2 (5)Location→Log

Mint 3.1.1 (1)Location→Log

MITBBS reader 8.5 (7)Location→Log

Multi Touch Painting Demo 2.6.0 (3)PhoneData→Log Location→Log

My Talking Angela 1.1 (4)PhoneData→LogPhoneData→Intent

Location→Log

My Talking Tom 2.1.1 (5)PhoneData→LogPhoneData→Network

PhoneData→IntentLocation→Log

Netflix 3.9.1 (2)PhoneData→Log PhoneData→Network

Next Radio 2.0.757 (6)PhoneData→LogLocation→Log

Location→Network

Noom Weight Loss Coach 4.0.5 (4)PhoneData→LogPhoneData→Network

Location→Network

NYTimes 3.6.3 (3)Location→Log

ooVoo 2.1.3 (7)PhoneData→Log PhoneData→Network

Pandora 5.4 (2)Location→Log Location→Network

Paperama 1.1.0 (3)PhoneData→Log Location→Log

Pool Billiards Pro 2.49 (1)PhoneData→Log PhoneData→Network

Powerboat Racing 1.1 (1)PhoneData→Log

Priceline 2.8.21 (5)PhoneData→LogPhoneData→Network

Location→Log

Real Fingerprint Scanner 3.4 (3)PhoneData→Network

Skype 5.1.0.57240 (6)PhoneData→Log Location→Log

Smash Hit 1.3.3 (3)PhoneData→Log Location→Log

Snapchat 5.0.34.10 (3)PhoneData→Log Location→Log

Solitaire 3.0.3 (11)PhoneData→LogPhoneData→NetworkLocation→Log

Location→IntentLocation→Network

Sound Cloud 14.10.27 (2)Location→Intent

Spotify Music 1.0.0.82 (1)PhoneData→Log

StudyBlue 5.4.2 (3)PhoneData→LogPhoneData→Network

Location→Log

Subway Surfers 1.33.0 (8)PhoneData→Log Location→Log

Super-Bright LED FlashLight 1.0.3 (21)PhoneData→LogPhoneData→Network

Location→LogLocation→Network

Swype 1.7.3.28966 (4)Location→Log

Tango Messenger 3.6.84179 (10)PhoneData→LogPhoneData→Intent

Location→LogContact→Intent

Temple Run 2 1.9 (1)PhoneData→Log

textPlus 5.9.1.4671 (12)PhoneData→LogPhoneData→NetworkPhoneData→Intent

Location→LogLocation→Network

The Weather Channel 4.2.5 (6)Location→Log

Tinder 3.3.2 (6)PhoneData→LogPhoneData→Intent

Location→Log

Trigger 8.9.590 (2)Location→Log Location→Intent

Trivia Crack 1.9.3 (10)PhoneData→LogLocation→Log

Location→Network

TuneIn Radio 12.0 (3)Location→Log

Twitter 5.32.0 (4)PhoneData→Log Location→Intent

Uber 2.7.15 (5)Location→Log

Venmo 6.4.2 (8)PhoneData→LogLocation→Log

Contact→Intent

Viber 4.3.1.21 (18)PhoneData→Intent Location→Network

Virtual Pet Care 1.52 (2)Location→Log Location→Network

Walmart 1.7.2 (1)Location→Log

WeatherBug 3.4.33 (10)Location→Log Location→Network

WhatsApp Messenger 2.11.238 (16)PhoneData→Log Contact→Intent

Words 7.1.4 (17)PhoneData→LogPhoneData→NetworkPhoneData→Intent

Location→LogLocation→Network

World of Battleships 1.0.05 (3)PhoneData→LogPhoneData→Network

Location→Log

Yahoo Mail 3.1.3 (8)PhoneData→Network

Yik Yak 2.0.002 (2)PhoneData→Log

Yo 1.110640.50 (4)PhoneData→Log Location→Log

ZEDGE Ringtones Wallpapers 4.3.1 (6)Location→Log Contact→Intent

Zillow 5.7.257 (5)PhoneData→LogPhoneData→Network

Location→Log

Figure 13: Actual leaks (i.e., non-false-positives) in Google Play Store shown as Source→Sink pairs. Thenumber in parentheses (n) is the number of type errors reported by DroidInfer for the app (some of these nerrors are false positives and do not have Source→Sink pairs). Sometimes several errors correspond to oneSource→Sink pair. Some flows happen in advertising libraries such as InMobi, Millenial Media and Flurry,called from the apps. 5 of the 88 apps with type errors had only false positives or all type errors were refutedby CFL-Explain, and are not included here.

Page 11: Scalable and Precise Taint Analysis for Android

!"#$%&'()&*+&,-%&./( $0(

1-($&*( $23( $24( $22(*"5*$-6)( 788&6/(

$90(*"5*$-6)(

$022(788&6/(

$&*( :-!*'($4(

.*&$7*"$'($04( ;<=&%*'($3(

-*&$7*"$(

5>6)'($0?(@A( 8B(

!"#"$%&CD:'($3B(

!-6A'(CD:"8&6E"66&%>"6(

1=( $F(7//(

6&G*(

H&*I"/((J-A!#+&,-%&H767)&$K)&*+&,-%&./(

H&*I"/((L,&6*C8M"7/&$K<#-M/CD:(

H&*I"/((L,&6*C8M"7/&$K#8M"7/N"N$7%A-6)(

H&*I"/((L,&6*C8M"7/&$K/"C8M"7/(

Figure 14: A source-sink path in Fiksu. When a flowis triggered by a library call, CFL-Explain labels theedge with the corresponding library method. Whentypes change due to library calls, we show the newtype at the target (e.g., List r5). We keep the identi-fiers exactly as they appear in the Jimple code.

!"#$%&'(

)&*+&,-%&./($0(

1-(

$&*( $22( /#'(*3-!(45(

$2(16( 7

8(9:(

7;(9:(

<5(

=*$->)'($2?( @=AB'($CC(9#*7D#/-/EF$2?1(

$&*(!"#

$2:( @=AB'($G(

9#*7D/&,-%&H->5"EF$2:1( 18(

$&*( $CI(<)(5$'(*3-!(

4)(@=AB'($JI( =*$->)'($K:(

*"=*$->)(

92(7L(

92(7>( <

3(

5/'(*3-!(

1>(

$&*($J(

1L($KJ(

79(*3-!(

43(=*$->)'($JC( =*$->)M>N*O'($K:(

P->-*Q(

!->8'(

!&*M>N*O71(

&R&%#*&71(

Figure 15: A source-sink path in Tremorvideo.

471,

74.5%

161,

25.5%

Call-graph ReachableCall-graph Unreachable

397,

84.3%

74,

15.7%

True FlowsFalse Positives

113,

28.5%

284,

71.5%

Network FlowsOther Flows

Figure 16: Results.

from the Editor’s Choice list, and cover at least 24 categories.DroidInfer throws an Internal error in Dexpler on 1 appand an Out-of-memory error on 5 apps. (Recall that themax heap size is set to 2GB.) It analyzes all other 138 appssuccessfully.

6.4.1 ResultsDroidInfer identifies sources and sinks in 111 apps and

reports 632 type errors over 88 apps. Two authors of thepaper inspected all type errors with CFL-Explain.

Fig. 16 summarizes our results. Of those 632, 161 typeerrors are refuted by CFL-Explain. Almost all of the refu-tations are due to the call graph. The false positive rateis 15.7%, which is well within the accepted bounds. (Thereason false positives happen will be explained shortly.) 113true flows (29%), spanning 40 apps, are network flows (i.e.,Location or DeviceId flows to the Internet). The remainingflows are flows of Location or DeviceId to Logs and to a lesserextent to Intent. This strongly supports hypothesis H2.

DroidInfer takes 139 seconds per app on average. It takesless than 3 minutes on 99 of the 138 apps, between 3 and 5minutes on 31 apps, and between 5 and 8 minutes on 6 apps.The 2 outliers run in 18 and 19 minutes. The call graphs arebuilt in, on average, 90 seconds per APK, with a range of 6sto 373s and CFL-Explain prints source-sink paths instantly.This timings strongly support hypothesis H3.

Fig. 13 summarizes the leaks we found. In contrast to the

FlowDroid researchers [1], who report no network flows, weuncover many network flows. Almost one third of all appsand almost one half of the apps with errors collect sensitivedata and send this data over the network. In numerous cases,the DeviceId is sent over the network as part of the URLstring.

We examined several representative network flows. Droid-Infer reports the following type error in the Fiksu trackinglibrary (com.fiksu.asotracking.*) included in the Zillow app:

qi � retgetDeviceId{tainted} <: r2{safe}

The source-sink path reported by CFL-Explain is shownin Fig. 14. Source DeviceId is returned from method get-DeviceId into method buildUrl, which forms a URL string”https://...&deviceId=...&uiud=...”. buildUrl adds this stringto a list of saved URLs; subsequently it iterates over thelist, retrieves each URL string and sends the string as ar-gument to method doUpload. DroidInfer reports an in-teresting type error in the Tremorvideo video ad library(com.tremorvideo.sdk.android.videoad.*) included in the ac-cuweather app. The source-sink path reported by CFL-Explain is shown in Fig. 15. Note that the path has interleav-ing parentheses and brackets and yet the flow is quite obvious.The deviceId source is returned at callsite i and written intofield f of the du object (names of classes and methods inTremorvideo appear to have been obfuscated.) The du flowsthrough several calls, until it’s f field is retrieved into Stringr29 which is then put into a JSON object to form the uiud-deviceId key-value pair. This complex and yet feasible pathattests to the power of DroidInfer and CFL-Explain.

Similarly to the FlowDroid researchers [1], we uncovermany flows of DeviceId and Location to logs. In one inter-esting case, the Whatsapp app dumps the SMS messagebody into the log when a certain IOException occurs. In themajority of cases the logs appear for debugging purposes (tothe best of our understanding.) It is unclear why apps wouldlog so much sensitive info, usually in clear text, given thatmalicious apps may read the logs and retrieve the info (untilAndroid 4.0, any app that held the READ LOGS permissioncould read the logs).

One may wonder why false positives occur given that CFL-Explain filters out infeasible paths. Recall that the DroidInfersystem does not analyze libraries. Thus, constraints due tolibrary calls result in “local” edges by CFL-Explain, that is,edges connecting two local variables, with no field or callannotations. Edge r4 → r5 in Fig. 14, constructed fromDroidInfer constraint r4 <: r5 is an example of such localedge. These edges subsume the field accesses and methodcalls that happen inside the library.

In rare cases, these edges cause infeasible paths. The mostcommon case writes sensitive data (e.g., DeviceId) into a field,then calls a library method on the object: e.g., source →r1

[f→ UserActivity : r2getPackageName→ sink . We assume that the

library method does not retrieve sensitive information andcount these cases as false positives.

We conclude this section with a brief discussion of theusability of the system. DroidInfer is completely automatic.CFL-Explain requires users to enter an identifier and examinethe paths, because of the reason discussed above, i.e., thatlibrary calls may give rise to false positive paths. In ourexperience, it takes less than 1 minute to vet the flow pathsfor a given type error, 2 minutes in rare cases. The toolwas used successfully by two of the authors of this paper,

Page 12: Scalable and Precise Taint Analysis for Android

62,

81.6%

14,

18.4%

Not Covered Covered

32,

51.6%17,

27.4%

13,

21.0%

Ad Libraries Not Loaded

Main App/Ad Libraries Loaded

Impossible to Cover

Figure 17: Runtime results.

as well as an undergraduate research assistant with minimalknowledge of program analysis.

6.4.2 Runtime ResultsTo gauge the usefulness of the static results, we run 10

random apps and collect and analyze their logs using AndroidDevice Monitor. There are 76 type errors reported as trueflows across the 10 apps. Despite short runs we covered 14type errors, or almost 20% of all errors. These errors span8 apps and expose flows of DeviceId to both logs and thenetwork. The flows are obvious tracking, as in Fig. 14, whichis covered.

Fig. 17 summarizes the results. Of the 62 type errorswe did not cover, 13 are beyond our runtime analysis. Forexample, there are several type errors reported as flows tothe network. However, there are no logs around the networkcall and we cannot judge if the flow is covered or not.

This experiment shows that the analysis reports a sub-stantial number of type errors that reveal true, dangerousflows. In the same time, it reports many ”difficult” errors,i.e., type errors that are likely true flows, but are difficult totrigger with runtime analysis. A lot of the uncovered typeerrors are in ad libraries that never loaded during our runs.Yet we found it impossible to trigger a specific ad library.For example, in Cut the Rope 2 we observed ads fromAdMarvel and other libraries in unrecorded runs. (Our toolreported several type errors in AdMarvel.) Unfortunately,when recording the logs, we observed ads only from Unity3Duntil the app stopped serving ads altogether.

We conclude this section with one of the most interestingflows we observed. When the user requests the details of amortgage quote, the Zillow app grabs the phone number(and other info) stores it into an Object array, then proceedsto format the array into a URL string, log the string andsend it to a loadUrl. The CFL-Explain paths are as follows:

!"#$%&'()*"+&,#-.&$($/(

01($&2( $34( 5%67128'((2*1!(

9:$$:8(2*1!(;<( =>( $?4(

9>( @.<&%29='(($A((

>"$-:2( B2$1+C'(($D((

!1+E'(F"C(!1+E'(G":HI$G(

The URL string with the phone number appears in the log!

6.4.3 Comparison with FlowDroidWe ran FlowDroid on the top 30 free apps from the Google

Play Store with max heap size set to 6 GB. FlowDroid threwan Out-of-memory error on 28 of the apps (we confirmedwith the developers that FlowDroid indeed requires morethan 6 GB of memory5). In contrast, DroidInfer runs witha max heap size of 2 GB and succeeds on 28 of the 30apps. DroidInfer’s scales well because it avoids points-toanalysis, while FlowDroid uses computationally expensive

5Eric Bodden: personal communication.

field- and object-sensitive points-to analysis. This resultstrongly supports hypothesis H3 well.

Of the remaining 114 apps, FlowDroid succeeds on 48. Itreports more than 4000 flow paths over the 50 apps. Weexamined a random 21 of those apps and compared theresults with DroidInfer. In only 6 apps does FlowDroid report“classical”flows: there are 4 log flows (DeviceId or Location tolog) and 2 network flows (DeviceId or Location to Internet).In contrast, DroidInfer reports “classical” flows in all 21 apps.FlowDroid reports thousands of flows from Bundle, Intentand Context, as it is overly-conservative in its handling ofinter-process communication. These results are consistentwith Artz et al. [1]. It is unclear why FlowDroid reports nonetwork flows — it does specify DeviceId and Location assources and URL.openConnection and Http request as sinks.

7. RELATED WORKThere is a large body of work on Android malware analysis,

both dynamic and static. We focus the discussion on staticanalyses, excluding FlowDroid. LeakMiner [42] is a points-tobased static analysis for Android. It models the Androidlifecycle to handle callback methods. However, LeakMiner iscontext-insensitive which may lead to false positives. It isunclear whether LeakMiner supports ICC. SCANDAL [20] isa static analyzer that detects privacy leaks in Android apps.It directly processes Dalvik bytecode. SCANDAL is limitedby high false positive rate — the average false positive rate isabout 55%, primarily due to the unknown paths, which makeup more than half of the total paths [20]. AndroidLeaks [10]finds potential leaks of private information in Android apps.It uses WALA to construct a context-sensitive System De-pendence Graph (SDG) and a context-insensitive overlayfor tracking heap dependencies in the SDG. CHEX [23] canautomatically vet Android apps for component hijackingvulnerabilities. It models the vulnerabilities from a data-flow analysis perspective and detects possible hijack-enablingflows and data leakage. Unfortunately, these tools are notpublicly available and we cannot compare with DroidInfer.Fritz et al. have contacted the authors of these tools, butstill, they were unable to compare due to various reasons [7].

SCanDroid [8] focuses on ICC. It formalizes the data flowsthrough and across components in a core calculus. Epicc [30]discovers ICC for Android apps by identifying a specificationfor every ICC source and sink, including the ICC Intentaction, data type, category, etc. We plan to integrate Epiccin DroidInfer, which will provide more channels for privacyleaks.

In previous work we built SFlow [16], a type-based taintanalysis for Java web applications, which can also analyzeJava source of Android apps. Although we build upon thiswork, this paper has several substantial contributions. First,we interpret type errors in terms of CFL-reachability, whichis a major step towards usability of type-based tools. Second,we incorporate control-flow guarantees via call graphs; SFlowprovides no such guarantees, which means that many typeerrors may be unreachable. Another key difference is thatSFlow uses the receiver, while DFlow uses the callsite contextat method calls. Thus, DFlow is more precise than SFlowand accepts more programs. Consider again the examplein Fig. 2. SFlow generates these constraints at callsite 10:y <: y� tainted src <: y� poly y� poly <: srcId. Becausesrc = tainted, y must be tainted. However, y being tainteddoes not satisfy the SFlow constraints at callsite 11: y <:

Page 13: Scalable and Precise Taint Analysis for Android

y � tainted sink <: y � poly y � poly <: sinkIdwhere both sink and sinkId are safe. This is because y�poly =tainted is not a subtype of sinkId = safe. As a result, SFlowrejects this program, even though there is no flow fromthe source to the sink. In contrast, DFlow accepts thisprogram as shown in Sect. 3.1. In addition, SFlowInfer, theinference tool of SFlow, only works on Java source, which isnot available for most Android apps. DroidInfer works onboth Java source and Android APKs and therefore it cananalyze any real-world Android app. Last but not least, theextensive evaluation on Google Play Store apps is a majorcontribution over our previous work. IFC [5] is anotherrecent type-based taint analysis. It also works only on sourceand therefore cannot analyze real-world apps. Furthermore,it requires user annotations, while DroidInfer requires nouser annotations. Earlier work on type-based taint analysiscomes from Shankar et al. [36] who present a type system fordetecting string format vulnerabilities in C. Classical workon type-based information flow control includes the typesystems by Volpano et al. [38] and Myers [28]. DFlow andDroidInfer are substantially simpler and thus more practical.

8. CONCLUSIONWe have presented DFlow, a context-sensitive information

flow type system, and DroidInfer, the corresponding inferencetool for detecting privacy leaks in Android apps. Empiricalevaluation has shown that our approach is effective.

9. REFERENCES[1] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel,

J. Klein, Y. le Traon, D. Octeau, and P. McDaniel.FlowDroid: Precise context-, flow-, field-,object-sensitive and lifecycle-aware taint analysis forandroid apps. In PLDI, to appear, 2014.

[2] A. Bartel, J. Klein, Y. Le Traon, and M. Monperrus.Dexpler: Converting Android Dalvik bytecode toJimple for static analysis with Soot. In SOAP, pages27–38, 2012.

[3] W. Dietl and P. Muller. Universes: Lightweightownership for JML. Journal of Object Technology,4(8):5–32, 2005.

[4] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung,P. Mcdaniel, and A. N. Sheth. TaintDroid : Aninformation-flow tracking system for realtime privacymonitoring on smartphones. In OSDI, pages 1–6, 2010.

[5] M. D. Ernst, R. Just, S. Millstein, W. M. Dietl,S. Pernsteiner, F. Roesner, K. Koscher, P. Barros,R. Bhoraskar, S. Han, P. Vines, and E. X. Wu.Collaborative verification of information flow for ahigh-assurance app store. Technical ReportUW-CSE-14-04-02, University of WashingtonDepartment of Computer Science and Engineering, Apr.2014.

[6] M. Fahndrich, J. Rehof, and M. Das. Scalablecontext-sensitive flow analysis using instantiationconstraints. In ACM Conference on ProgrammingLanguages Design and Implementation, pages 253–263,2000.

[7] C. Fritz, S. Arzt, S. Rasthofer, E. Bodden, J. Klein,Y. le Traon, D. Octeau, and P. McDaniel. Highlyprecise taint analysis for Android application. TechnicalReport TUD-CS-2013-0113, EC SPRIDE, 2013.

[8] A. P. Fuchs, A. Chaudhuri, and J. S. Foster.SCanDroid : Automated security certification ofAndroid applications. unpublished.

[9] A. P. Fuchs, A. Chaudhuri, and J. S. Foster. Scandroid:Automated security certification of androidapplications.

[10] C. Gibler, J. Crussell, J. Erickson, and H. Chen.AndroidLeaks: Automatically detecting potentialprivacy leaks in Android applications on a large scale.In TRUST, pages 273–290, 2012.

[11] Google. Android and Security. http://googlemobile.blogspot.com/2012/02/android-and-security.html,2012.

[12] Google. Google Play Store. https://play.google.com,2014.

[13] HP. HP Fortify Static Code Analyzer.http://www8.hp.com/us/en/software-solutions/

application-security/, 2013.

[14] W. Huang, W. Dietl, A. Milanova, and M. D. Ernst.Inference and checking of object ownership. In ECOOP,pages 181–206, 2012.

[15] W. Huang, Y. Dong, and A. Milanova. Type-basedTaint Analysis for Java Web Applications. Technicalreport, Rensselaer Polytechnic Institute, 2013.

[16] W. Huang, Y. Dong, and A. Milanova. Type-basedtaint analysis for Java web applications. In FASE,pages 140–154, 2014.

[17] W. Huang, A. Milanova, W. Dietl, and M. D. Ernst.ReIm & ReImInfer: Checking and inference of referenceimmutability and method purity. In OOPSLA, pages879–896, 2012.

[18] IBM. IBM Security AppScan. http://www-03.ibm.com/software/products/en/appscan,2013.

[19] IDC. Smartphone OS Market Share, Q3 2014.http://www.idc.com/prodserv/

smartphone-os-market-share.jsp, 2014.

[20] J. Kim, Y. Yoon, and K. Yi. SCANDAL: Staticanalyzer for detecting privacy leaks in Androidapplications. In MoST, 2012.

[21] B. S. Lerner, M. Flower, D. Grossman, andC. Chambers. Searching for type-error messages. InProceedings of the 2007 ACM SIGPLAN Conference onProgramming Language Design and Implementation,PLDI ’07, pages 425–434, New York, NY, USA, 2007.ACM.

[22] S. Liang, A. W. Keep, M. Might, S. Lyde, T. Gilray,and P. Aldous. Sound and precise malware analysis forAndroid via pushdown reachability and entry-pointsaturation. In SPSM, pages 21–32, 2013.

[23] L. Lu, Z. Li, Z. Wu, W. Lee, and G. Jiang. CHEX:Statically vetting Android apps for componenthijacking vulnerabilities. In CCS, pages 229–240, 2012.

[24] Mila. contagio mobile.http://contagiominidump.blogspot.com, 2014.

[25] A. Milanova and W. Huang. Composing polymorphicinformation flow systems with reference immutability.In FTfJP, pages 5:1–5:7, 2013.

[26] A. Milanova, W. Huang, and Y. Dong. In ACMConference on Programming and Practice ofProgramming in Java, pages 99–109, 2014.

Page 14: Scalable and Precise Taint Analysis for Android

[27] A. Milanova, W. Huang, and Y. Dong. Cfl-reachabilityand context-sensitive integrity types. In Proceedings ofthe 2014 International Conference on Principles andPractices of Programming on the Java Platform:Virtual Machines, Languages, and Tools, PPPJ ’14,pages 99–109, New York, NY, USA, 2014. ACM.

[28] A. C. Myers. JFlow: Practical mostly-staticinformation flow control. In POPL, pages 228–241,1999.

[29] A. C. Myers, J. A. Bank, and B. Liskov. Parameterizedtypes for Java. In POPL, pages 132–145, 1997.

[30] D. Octeau, P. Mcdaniel, S. Jha, A. Bartel, E. Bodden,J. Klein, and Y. L. Traon. Effective inter-componentcommunication mapping in Android with Epicc: Anessential step towards holistic security analysis. InUSENIX Security, pages 543–558, 2013.

[31] M. M. Papi, M. Ali, T. L. Correa Jr, J. H. Perkins, andM. D. Ernst. Practical pluggable types for Java. InISSTA, pages 201–212, 2008.

[32] A. Reina, A. Fattori, and L. Cavallaro. A systemcall-centric analysis and stimulation technique toautomatically reconstruct Android malware behaviors.In EuroSec, 2013.

[33] T. Reps. Program analysis via graph reachability.Information and Software Technology, 40:5–19, 1998.

[34] T. Reps. Undecidability of context-sensitivedata-independence analysis. ACM Transactions onProgramming Languages and Systems, 22(1):162—-186,2000.

[35] A. Sampson, W. Dietl, and E. Fortuna. EnerJ:Approximate data types for safe and general low-powercomputation. In PLDI, pages 164–174, 2011.

[36] U. Shankar, K. Talwar, J. S. Foster, and D. Wagner.Detecting format string vulnerabilities with typequalifiers. In USENIX Security, pages 201–220, 2001.

[37] R. Vallee-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam,and V. Sundaresan. Soot: A Java bytecodeoptimization framework. In CASCON, pages 13–23,1999.

[38] D. Volpano, G. Smith, and C. Irvine. A sound typesystem for secure flow analysis. Journal of ComputerSecurity, pages 167–187, 1996.

[39] M. Wand. Finding the source of type errors. InProceedings of the 13th ACM SIGACT-SIGPLANSymposium on Principles of Programming Languages,POPL ’86, pages 38–43, New York, NY, USA, 1986.ACM.

[40] R. Xu, H. Saıdi, R. Anderson, and H. SaAsdi.Aurasium: Practical policy enforcement for Androidapplications. In USENIX Security, pages 539–552, 2012.

[41] L. Yan and H. Yin. Droidscope: seamlesslyreconstructing the os and dalvik semantic views fordynamic android malware analysis. In USENIXSecurity, pages 569–584, 2012.

[42] Z. Yang and M. Yang. LeakMiner: Detect informationleakage on Android with static taint analysis. 2012Third World Congress on Software Engineering, pages101–104, Nov. 2012.

Page 15: Scalable and Precise Taint Analysis for Android

APPENDIXA. LISTS OF SOURCES AND SINKS

PhoneDataIn package android.telephony:TelephonyManager.getDeviceId()TelephonyManager.getLine1Number()TelephonyManager.getSimSerialNumber()TelephonyManager.getSubscriberId()SmsManager.getAllMessagesFromIcc()SmsMessage.createFromPdu(byte[] arg0)SmsMessage.createFromPdu(byte[] arg0, String arg1)SmsMessage.newFromCMT(String[] arg0)SmsMessage.newFromParcel(Parcel arg0)SmsMessage.createFromEfRecord(int arg0, byte[] arg1)

LocationIn package android.location:LocationListener.onLocationChanged(Location arg0)LocationManager.onLocationChanged(Location arg0)LocationManager.getLastLocation()LocationManager.getLastKnownLocation(String arg0)ILocationListener.onLocationChanged(Location arg0)In package com.millennialmedia:android.MMRequest.getUserLocation()

Contactandroid.provider.Contacts.android.net.Uri.CONTENT URIandroid.provider.ContactsContract.android.net.Uri.CONTENT URI

Figure 18: Sources in DroidInfer: Location, PhoneData and Contact. In each category, the specific source isunderlined for each item. If the source is a parameter of a method, the parameter is underlined, e.g. onLoca-tionChanged(Location arg0). If the source is the return value, the method name is underlined, e.g. getLastLocation().If the source is a field of a class, the field name is underlined, e.g. Uri.CONTENT URI.

LogIn package android.util:Log.d(String arg0, String arg1)Log.d(String arg0, String arg1, Throwable arg2)Log.e(String arg0, String arg1)Log.e(String arg0, String arg1, Throwable arg2)Log.i(String arg0, String arg1)Log.i(String arg0, String arg1, Throwable arg2)Log.println(int arg0, String arg1, String arg2)Log.println native(int arg0, int arg1, String arg2, String arg3)Log.v(String arg0, String arg1)Log.v(String arg0, String arg1, Throwable arg2)Log.w(String arg0, String arg1)Log.w(String arg0, String arg1, Throwable arg2)Log.w(String arg0, Throwable arg1)Log.wtf(String arg0, String arg1)Log.wtf(String arg0, String arg1, Throwable arg2)

Networkandroid.telephony.SmsManager.sendDataMessage(String arg0, Stringarg1,

short arg2, byte[] arg3, PendingIntent arg4, PendingIntent arg5)android.telephony.SmsManager.sendTextMessage(String arg0,

String arg1, String arg2, PendingIntent arg3, PendingIntent arg4)android.webkit.WebView.loadUrl(String arg0, Map<String,String> arg1)android.webkit.WebView.loadUrl(String arg0)android.webkit.WebView.postUrl(String arg0, byte[] arg1)com.millennialmedia.android.MMRequest.setUserLocation(Location arg0)java.net.URL.openConnection(Proxy proxy) thisjava.net.URL.getOutputStream() thisjava.net.Socket.getOutputStream() thisorg.apache.http.client.methods.HttpPost.setEntity(HttpEntity arg0)org.apache.http.client.methods.HttpEntityEnclosingRequestBase

.setEntity(HttpEntity arg0)

IntentIn package android.app:Activity.setResult(int arg0, Intent arg1)Activity.startActivities(Intent[] arg0)Activity.startActivities(Intent[] arg0, Bundle arg1)Activity.startActivity(Intent arg0)Activity.startActivity(Intent arg0, Bundle arg1)

Intent (continued)In package android.app:Activity.startActivityAsUser(Intent arg0, Bundle arg1, UserHandle arg2)Activity.startActivityAsUser(Intent arg0, UserHandle arg1)Activity.startActivityForResult(Intent arg0, int arg1)Activity.startActivityForResult(Intent arg0, int arg1, Bundle arg2)Activity.startActivityFromChild(Activity arg0, Intent arg1, int arg2)Activity.startActivityFromChild(Activity arg0, Intent arg1, int arg2, Bundlearg3)Activity.startActivityFromFragment(Fragment arg0, Intent arg1, int arg2)Activity.startActivityFromFragment(Fragment arg0, Intent arg1,

int arg2, Bundle arg3)Activity.startActivityIfNeeded(Intent arg0, int arg1)Activity.startActivityIfNeeded(Intent arg0, int arg1, Bundle arg2)Activity.startIntentSender(IntentSender arg0, Intent arg1, int arg2,

int arg3, int arg4)Activity.startIntentSender(IntentSender arg0, Intent arg1, int arg2,

int arg3, int arg4, Bundle arg5)Activity.startIntentSenderForResult(IntentSender arg0, int arg1,

Intent arg2, int arg3, int arg4, int arg5)Activity.startIntentSenderForResult(IntentSender arg0, int arg1,

Intent arg2, int arg3, int arg4, int arg5, Bundle arg6)Activity.startIntentSenderFromChild(Activity arg0, IntentSender arg1,

int arg2, Intent arg3, int arg4, int arg5, int arg6)Activity.startIntentSenderFromChild(Activity arg0, IntentSender arg1,

int arg2, Intent arg3, int arg4, int arg5, int arg6, Bundle arg7)Activity.startNextMatchingActivity(Intent arg0)Activity.startNextMatchingActivity(Intent arg0, Bundle arg1)In package android.content:Context.startActivity(Intent arg0)Context.startActivityAsUser(Intent arg0, UserHandle arg1)Context.startActivity(Intent arg0, Bundle arg1)Context.startActivityAsUser(Intent arg0, Bundle arg1, UserHandle arg2)Context.startActivities(Intent[] arg0)Context.startActivities(Intent[] arg0, Bundle arg1)Context.startActivitiesAsUser(Intent[] arg0, Bundle arg1, UserHandle arg2)Context.startIntentSender(IntentSender arg0, Intent arg1, int arg2,

int arg3, int arg4)Context.startIntentSender(IntentSender arg0, Intent arg1, int arg2,

int arg3, int arg4, Bundle arg5)Context.startService(Intent arg0)Context.startServiceAsUser(Intent arg0, UserHandle arg1)

Figure 19: Sinks in DroidInfer: Log, Network and Intent. The specific sink is underlined for each item. Ifthe sink is a parameter, the parameter is underlined, e.g. Log.d(String arg0, String arg1). If the sink is the returnvalue, the method name is underlined, e.g. getOutputStream().