18
Technical Report RIA: an Audition-based Method to Protect the Runtime Integrity of MapReduce Applications Yongzhi Wang Xidian University Yulong Shen Xidian University Xiaohong Jiang Future University Hakodate

RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

TechnicalReport

RIA:anAudition-basedMethodtoProtecttheRuntimeIntegrityofMapReduceApplications

YongzhiWangXidianUniversity

YulongShen

XidianUniversity

XiaohongJiangFutureUniversityHakodate

Page 2: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

RIA: an Audition-based Method to Protect theRuntime Integrity of MapReduce Applications

Yongzhi Wang, Yulong Shen, Xiaohong Jiang, Senior Member, IEEE,

Abstract—Public cloud vendors have been offering varies big data computing services on their clouds. However runtime integrity isone of the major security concerns that hinders those services to be widely used. Without a secure solution, cloud users are reluctantto outsource their critical computations to the untrusted cloud. In this paper, we focus on MapReduce, a popular big data computingframework, propose the runtime integrity audition (RIA) technique, a solution that enables remote verification of the runtime integrity ofMapReduce applications. RIA records the execution trace of the MapReduce applications on the public cloud and checks the executiontraces against the application input and the application’s program on the private cloud. By doing so, RIA protects the runtime integrity ofMapReduce applications. We developed a prototype system, called MR Auditor, based on the idea of RIA, and tested its applicabilityand the performance with several applications. Our experimental results showed that MR Auditor is efficient to audit the runtimeintegrity of MapReduce applications and incurs a moderate performance overhead.

Index Terms—Runtime Integrity, Remote Verification, Big Data Computing, MapReduce.

F

1 INTRODUCTION

MAPREDUCE, a mainstream big data computing frame-work, has been gaining popularities in the past years.

By breaking down a computing job into tasks and executingthose tasks in a parallel manner, MapReduce enables usersto process the data with an arbitrary size. However, big datacomputing jobs usually is executed on a clusters, which con-sists of a number of hosts. For many users, setting up a ded-icated computing cluster incurs a significant financial cost,which is not affordable to them. In this case, public cloudsoffer a viable solution. On the public cloud, a user can rent acluster and pays the cost only for the rented resources. Manypublic cloud venders have offered MapReduce services,such as Amazon Elastic MapReduce, Azure HDInsight, etc.Such an outsource-based business model boosts the soft-ware development speed and reduces users’ cost. However,since the computation is performed remotely, the compu-tation security is not guaranteed. Security breach incidentsreported by mass media [1] and vulnerabilities discoveredby researchers [2] [3] have obstructed the wide adoption ofsuch a service.

Among varies security issues, runtime integrity is oneof the most critical ones in the big data scenario, includingMapReduce. Existing works mainly focus on protecting thecontrol flow integrity [4] [5] [6] or only support limited dataflow integrity check [7] [8]. Apparently, those works do notfit in the MapReduce’s scenario, where the computation in-

• Y. Wang is with the School of Computer Science and Technology, XidianUniversity, China, 710071, and the Key Laboratory of Grain InformationProcessing and Control (Henan University of Technology), Ministry ofEducation. E-mail: [email protected]

• Y. Shen is with the School of Computer Science and Technology, XidianUniversity, China, 710071. E-mail: [email protected]

• X. Jiang is with the School of Systems Information Science, FutureUniversity Hakodate, 116-2, Kameda Nakano-Cho, Hakodate, Hokkaido,041-8655, Japan, and the School of Computer Science and Technology,Xidian University, Shaanxi 710071, China. E- mail:[email protected].

Manuscript received April 19, 2005; revised August 26, 2015.

tegrity highly depends on the integrity of the data flow, andthe computation is highly parallelized due to the significantamount of data to be processed. Therefore, a novel remoteverification scheme that caters for the characteristic of thebig data computing is needed.

In the paper, we propose runtime integrity audition, (orRIA for short), a hybrid cloud-based remote verificationmethod. RIA can be used to verify the runtime integrityof MapReduce applications executed on the untrusted pub-lic cloud. Specifically, by inserting logging statements toMapReduce programs, RIA is able to record the executiontraces of the MapReduce applications, when executed onthe public cloud. By auditing the execution traces on theprivate cloud, RIA is able to verify the runtime integrity onthe input, the control flow, the data flow and the outputof each MapReduce phase, and thus protects the runtimeintegrity of MapReduce applications.

In order to record the execution traces, RIA inserts log-ging statements to the original program to record certainvariables’ runtime values. The logged variables should con-vey two aspects of runtime information: the runtime controlflow and the mathematical relationships among variables’runtime values. The audition on the private cloud will checkthe former to ensure the control flow integrity and willcheck the latter to ensure the data flow integrity.

When auditing the runtime integrity of the executiontrace, RIA uses the execution traces as a guide to simulatethe program execution on the private cloud. Since the exe-cution traces reflect the runtime control flow, the analysiscan reproduce the statement execution sequence. Duringthe simulation, RIA derives the mathematical relationshipsamong variables according to the executed statements. Bychecking the runtime variable values that are reflected fromthe execution traces against the mathematical relationships,we verify the integrity of the execution on the public cloud.Any inconsistency during the audition indicates a runtimeintegrity violation.

Page 3: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2

RIA removes the complicated function call relationshipsand “flattens” all the function executions. By doing so, eachfunction can be audited independantly. Such a techniqueremoves the cohesion between the function caller and thefunction callee. It enables users to audition the functionexecutions in a randomized manner, and thus gives users afreedom to make a trade-off between the audition accuracyand the audition speed.

The design of RIA enables users to audit the runtimeintegrity offline after the computation. Thus it gives usersa freedom to decide the audition time. In addition, theprocess of the audition can be reproduced, which facilitatesthe procedure of forensics.

It is necessary to point out that the application of RIAis not only restricted to MapReduce computation. The exe-cution log insertion and the execution audition techniquesproposed in Section 3.3 and Section 3.5 are applicable to thegeneral computations.

Based on the RIA technique, we implemented a proto-type system, called MR Auditor, which is able to performruntime integrity audition on Hadoop applications. Ourexperimental results showed that MR Auditor is a gener-alized tool that can efficiently protect the runtime integrityof Hadoop applications. It incurs a moderate performanceoverhead (with a 0.18% to 34% of extra execution time withan unoptimized implementation) and provides users a flex-ibility to make a trade-off between the audition’s accuracyand speed.

The contribution of this paper is as follows:

• We proposed a hybrid cloud-based remote verifica-tion method, called the runtime integrity audition, toverify the runtime integrity of MapReduce applica-tions executed on the public cloud. We performeda series of analysis to evaluate the security of RIA.Our analysis shows that RIA is effective in detectingthe runtime integrity violation of MapReduce appli-cations.

• We implemented a prototype system called MR Au-ditor, which can be applied to Hadoop applicationsand enforces the runtime integrity audition on eachHadoop jobs executed on the public cloud, andtherefore protects the runtime integrity of Hadoopapplications.

• We performed a set of experiments on MR Auditor.Our experimental results showed that MR Auditoris a general tool that is applicable to all Hadoopapplications that we tested. The results also indicatethat MR Auditor incurs a modest performance over-head on the application executions, and provides aflexibility for the user to make a trade-off betweenthe precision and the speed of the execution audition.

The paper is organized as follows: We introduce thepreliminary knowledge of MapReduce and describe thesecurity model in Section 2. We introduce the design detailsof RIA in Section 3. We performed a series of analysis of RIAin Section 4. We present the implementation details of MRAuditor and discuss our experimental results in Section 5.We discuss the related works in Section 6. We conclude thepaper in Section 7.

2 PRELIMINARIES AND SECURITY MODEL

2.1 MapReduce

MapReduce [9] is a parallel programming model for large-scale dataset processing. It usually has been implemented asa distributed system deployed on a cluster. In MapReduce,each computation request issued by the user is called ajob. Each job is usually broken down into multiple tasks.The traditional architecture of MapReduce consists of onemaster and multiple workers. The master controls the entirecomputation by managing jobs, scheduling tasks, and main-taining load balance, etc. Workers are hosts that contributecomputation resources to execute tasks assigned by themaster. In MapReduce, data are stored in the Distributed FileSystem (DFS). Each job usually retrieves input data fromthe DFS and stored the result back to DFS when the job isfinished.

Each MapReduce job consists of three consecutivephases: the map phase, the shuffle phase and the reduce phase.Workers executing tasks in the map phase are called map-pers. Tasks executed in the map phase are called map tasks.Workers executing tasks in the reduce phase are calledreducers. Tasks executed in the reduce phase are called reducetasks. In the map phase, input data are fetched from the DFSand are divided into multiple data blocks. Each data blockwill be assigned to one map task and processed indepen-dently. The result of each map task consists of a collectionof records with the format of <key, value> pairs. Recordsin each task result are sorted by the values of their keys.Map task results are stored in its worker’s local storage, andwill be used as input in the reduce phase. In the reducephase, each reduce task only processes the map task outputrecords with certain keys. Therefore a special function calledpartition determines to which reduce task a map task outputrecord should be assigned based on the content of the “key”in that record. Such a process is called the shuffle phase.After the shuffle phase, each reduce task receives its inputand executes the reduce function to aggregate input recordsinto the reduce task output, which is the final job output.

The computation infrastructure are managed by theMapReduce framework. The developer only needs to im-plement the map and the reduce functions, as well asthe partition function. The MapReduce framework willgenerate the map and the reduce tasks based on the imple-mented functions. Listing. 1 shows the implementation ofa map function. The function computes the sum of the firstvalue numbers in a Fibonacci sequence starting from thenumber key. If the sum is a prime number, the functionwill output the corresponding key and the sum. Note thatthe types of the key and value can be any object. In thisexample, we define the types as integers. Without losinggenerality, Listing. 1 shows the typical processing routineof a MapReduce function: read the data in the format of< key, value > pair, process the data, and output the datain the format of < key, value > pair.

MapReduce is usually deployed on a computer clusterin order to process computing tasks in a parallel manner.However, many users such as small companies and individ-uals do not want to or cannot afford investing a dedicatedcluster to process MapReduce jobs. Fortunately, public cloudvendors offer services that enable users to set up virtualized

Page 4: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3

1 int counter = value;2 int sum = 0;3 int fib1 = key;4 int fib2 = fib1+1;5 while(counter>0){6 sum = fib1+fib2;7 fib1 = fib2;8 fib2 = sum;9 counter��;

10 }11 boolean output = isPrime(sum);12 if(output)13 context.write(key, sum);14 }

Listing 1. The sample code of a map function

clusters on demand and execute their MapReduce jobs onthose clusters. As a return, the cloud vendor charges thecloud users based on the cluster size and the rent time. Forexample, Amazon EC2 and Windows Azure allow users torent virtualized computers and run their applications on it.By using such a service, MapReduce users can set up theirown clusters based on the computation size and pay thecloud vendor only based on the resources they rent.

2.2 Security ModelIn our paper, we focus on protecting the runtime integrity ofthe MapReduce applications, which is defined as follows:

Definition 1 (The runtime integrity). The runtime integrity ofa MapReduce application is preserved if and only if the result ofthe MapReduce application is correct and will be correctly used inthe future computations.

In the hybrid cloud architecture that are employed inthis paper, we assume the public cloud is not trusted.Therefore, the public cloud can be malicious. The behaviorof a malicious public cloud (MPC for short) usually can becategorized into the following two models.

• The lazy cheating model: In this model, the goal of themalicious cloud is to reduce the computation cost.Therefore, it will cheat only if cheating will reduceits computing workload.

• The malicious cheating model: In this model, the goalof the malicious cloud is to introduce errors to theapplication result, for the propose of disrupting thecomputation. Such a cloud will not consider thecomputational cost.

In this paper, we assume that the MPC falls into themalicious cheating model. Therefore, the MPC will com-promise the runtime integrity without considering the cost.For instance, the MPC can tamper the variable values ofa MapReduce job and therefore compromises the resultintegrity of the current job. The MPC can also first compute ajob faithfully. After sending the correct job results to the user,the MPC can tamper these results and feed the tamperedresults as an input to the next job to compromise the resultintegrity of the next job.

On the other hand, we assume that the private cloud istrusted. We claim that this assumption is reasonable, since

Fig. 1. The MapReduce Program Model

the private cloud can be deployed in the user’s organization,or can be deployed on a dedicated infrastructure protectedby the benign cloud vendor.

3 SYSTEM DESIGN

In this section, we discuss the design of the runtime integrityaudition. We firstly define the MapReduce model and intro-duce our design motivation. After that, we give a designoverview, followed by the detailed description.

3.1 The MapReduce Model and The Design MotivationWithout losing generality, we define the model of theMapReduce application, as shown in Fig. 1. The modelconsists of a sequence of map or reduce phases, executedin a interleaved sequence. Since our method can be directlyapplied to both the map task and the reduce task, wedo not distinguish the map phase and the reduce phasein this model. In this model, the application input I isoriginally stored in the private cloud and then transmittedto the public cloud. On the public cloud, the MapReduceapplication consists of n phases, marked as p1, p2, ..., p

n

,respectively. For each phase p

i

, it consists of qi

tasks, markedas t

i1, ti2, ..., t

iqi . The input and the output of task tij

are Iij

and Oij

, respectively. The output of the last job is the outputof the entire application, which is marked as O. O will besent back to the private cloud after the entire applicationcompletes. To facilitate our description in this paper, wealso mark the application input I as O0, meaning the outputof phase 0. Similarly, we mark the application output O asIn+1, meaning the input of phase n+ 1.

According to this model, the runtime integrity of a phaseis guaranteed only when integrity is preserved in the taskinput, the task output and the task execution for each taskin that phase. For example, in Fig. 1, the integrity of phasei is preserved only when the input set of that phase I

i

={I

i1, ..., Iiqi} equals to the output set of the previous phase(O

i�1 = {Oi�11, ..., Oi�1qi�1}); the tasks of that phase t

i

={t

i1, ..., tiqi} are faithfully executed; and the output set ofphase i (O

i

= {Oi1, ..., Oiqi}) equals to the input set of the

next phase (Ii+1 = {I

i+11, ..., Ii+1qi+1}).For a MapReduce application, which consists of n

phases, if we have the input data set, the execution trace andthe output data set of each phase, we can audit the resultintegrity of that application in a chain. Since the applicationinput I is maintained in the private cloud, by using it asa trusted anchor, we can audit the consistency between Iand the input set of phase 1; when the input set of phase1 is audited, we can use it as the trusted anchor to verify

Page 5: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

the execution integrity of the tasks in phase 1; when theexecution integrity of phase 1 is trusted, we can use theoutput of this phase 1 to verify the integrity of the task inputof phase 2, and so on...By auditing the runtime integrity ofeach phase sequentially, we are able to verify the integrityof the last phase’s output set, i.e., O. If the audition fails atany point in the chain, we simply reject O; otherwise, O isaccepted.

To achieve such a goal, we need to log the input data, theexecution traces and the output data of each task while theapplication is executed on the public cloud. The executiontraces will be sent to the private cloud to perform runtimeaudition.

3.2 System Overview3.2.1 System Architecture

The runtime integrity audition (RIA) is performed on ahybrid cloud environment, shown in Fig. 2. On the privatecloud, RIA transforms the original programs by insertinglogging statements to them. The transformed program iscalled the Program with Log, or PWL for short (in step 1).The PWLs are then sent to the public cloud to execute (instep 2). During the execution, the PWL generates executionlogs, which reflect the statements execution sequence andthe runtime mathematical relationships among programvariables. When the application execution completes, the ex-ecution logs will be transferred back to the private cloud tofacilitate the audition of the application’s runtime integrity.(in step 3). The audition mainly checks the data integrity ofthe input, the runtime integrity of each task, and the dataintegrity of the output on each phase.

To check the input integrity of a certain phase, RIA usesthe output data of the previous phase as a baseline andleverages the Counting Bloom Filter to check the consistencybetween the input of the current phase and the output of theprevious phase. Such a procedure is called the input audition.Following the input audition is the execution audition, whichis to check the integrity of the execution and the output ofeach task.To check the execution integrity of each task, theprivate cloud simulates the task’s execution based on theruntime variable values obtained from the execution logsand compares the runtime information against the originalprogram’s control flow and the data flow, respectively. Tocheck the control flow integrity, the simulation reproducesthe runtime control flow based on the runtime variablevalues and compares the runtime control flow against theoriginal program’s control flow graph. To check the dataflow integrity, the simulation derives the mathematical re-lationships among variables and checks the feasibility ofthe runtime variable values against the derived relation-ships. The output integrity of each task is checked duringits data flow integrity check: while the simulation derivesthe mathematical relationship between the task output andother variables, RIA checks the feasibility of the task outputrecorded in the execution logs against that relationships.

The overall audition procedure is described in Algorithm1. The algorithm uses the application input I as the trustedanchor and, phase by phase, audits the integrity of the phaseinputs, the task executions and the task output. If a phasepasses the audition, its output becomes the baseline and will

Fig. 2. the Architecture of Runtime Integrity Audition

be used to audit the integrity of its succeeding phase. If theoutput of the last phase passes the audition, it will be usedto audit the application’s output O. During the execution ofthe algorithm, any inconsistency indicates a violation of theruntime integrity.

Algorithm 1 Runtime Integrity AuditionRequire: The application model is defined in Fig. 1.

1: inputBaseLine I2: for i = 1 to n do3: for j = 1 to q

i

do4: ETL

ij

retrieveExecutionTraceLog(tij

)5: end for6: phaseInput extractInput({ETL

i1, · · · , ETLiqi})

7: inputAudition(phaseInput, inputBaseLine)8: for j = 1 to q

i

do9: executionAudition(ETL

ij

)10: end for11: inputBaseline extractOutput({ETL

i1, · · · , ETLiqi})

12: end for13: inputAudition(O, inputBaseline)

With a global picture in mind, we are now delving intotechnical details. Specifically, we answer three questions insequence. How to generate the execution logs? How to auditthe phase inputs? How to audit the task executions and thephase outputs?

3.3 The Execution Log GenerationWe transform the original program by inserting loggingstatements, so that the execution logs will be recorded whilethe program is executed on the public cloud. We insert fourtypes of logs: the input log, the output log, the branch log andthe invoke log.

The input log records the input data read from the pro-gram. In MapReduce, task input is read in as the parametersof the map or reduce function, in the format of <key,value> pair. Therefore, the logging statements are insertedat the beginning of the map or reduce function, recordingvalues of the <key, value> pairs. Listing. 2 shows thePWL of the example function (in Listing. 1), on which aninvocation of function logInput is inserted in line 2.

The output log records the data written to the DFS orthe local storage. In MapReduce, task output is written withan invocation of context.write(outKey, outValue),where <outKey, outValue> is one record of the task

Page 6: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5

TABLE 1The generated execution log when invoking map(3, 5,

context)

log file name log contentinput log of the current task < 3, 5 >output log of the current task < 3, 47 >branch log of function map 5, 4, 3, 2, 1, 0, trueinvoke log of function map < pre : 47, post : {true, 47} >

output. Therefore, the values of <outKey, outValue>in function map or reduce will be logged. As shown inListing. 2, the log will be inserted immediately before theinvocation of context.write(key, sum) as an invoca-tion of the function logOutput in line 17.

The branch log records the variable values involved inthe predicate of the branch statement. For example, thefunction calls of logBranch are inserted in line 7 and 16in Listing. 2.

Each invoke log consists of two statements, insertedbefore and after each function invocation, respectively. Theone inserted before the function invocation is called thepre-invoke log, which records the values of parameters inthe function invocation before the function was invoked.The one inserted after the function invocation is called thepost-invoke log, which records the values of the parametersafter the function was invoked, as well as its return value.The reason to record the parameter values before and aftereach invocation is that the parameters may be modifiedwhen they are passed to a function as references and up-dated inside that function. For example, the function call oflogPreInvoke is inserted in line 13 and logPostInvoke isinserted in line 15 in Listing. 2.

As a concrete example, Table 1 shows the log recordsgenerated in each log file while invoking map(3, 5,context). In this example, the variable values in thebranch statements (i.e., in line 7 and 16 in Listing. 2) areadded to the branch statement. There are one input recordand one output record that are added to the input log andthe output log, respectively. There is only one invoke logrecord pair recorded in the invoke log, corresponding to theinvocation in line 14.

The inserted logging statements will generate the exe-cution logs while the program is executed on the publiccloud. The execution logs will be sent to the private cloudto perform the runtime integrity audition. Specifically, theinput audition and the execution audition will be performedbased on the execution log. We elaborate the design of thetwo auditions in the following sections.

3.4 The Input Audition3.4.1 The basic algorithm

The input audition is performed on a phase basis. For eachphase, the input audition is performed based on the inputlog of the current phase and the output log of the previousphase. Specifically, the audition first constructs the input dataset from the input log of the current phase and the outputdata set from the output log of the previous phase (or theapplication input, if the current phase is the first phase).Since the output records of a task in the previous phase canbe distributed to multiple tasks in the current phase, the

1 void map(int key, int value, Context context){2 logInput(key, value);3 int counter = value;4 int sum = 0;5 int fib1 = key;6 int fib2 = fib1+1;7 while(logBranch(counter)&&counter>0){8 sum = fib1+fib2;9 fib1 = fib2;

10 fib2 = sum;11 counter��;12 }13 logPreInvoke(sum);14 boolean shouldOutput = isPrime(sum);15 logPostInvoke(shouldOutput, sum);16 if(logBranch(shouldOutput)&&shouldOutput){17 logOutput(key, sum);18 context.write(key, sum);19 }20 }

Listing 2. The PWL code of the sample map function.

input data set should contain the input records from all thetasks in the current phase, and the output data set shouldcontain the output records from all the tasks the previousphase. After that, the audition compares the input data setagainst the output data set.

Suppose the input data set of all tasks in phase i areIi

= {Ii1 , Ii2 , · · · , Iipi } and the output data set of all tasks

in phase i�1 are Oi�1 = {O

i�11 , Oi�12 , · · · , Oi�1pi�1}.The

input integrity of phase i is assured only when Ii

= Oi�1.

In other words, the input integrity is assured only when thefollowing tests are passed.

|Ii

| = |Oi�1| (1)

8e 2 Oi�1, COUNT (e, I

i

) = COUNT (e,Oi�1) (2)

The function COUNT (e, S) returns the number of e inset S. Test (1) ensures that the sizes of I

i

and Oi�1 are

equivalent. Test (2) ensures that if any record e appears inO

i�1, the number of its occurrence in Oi�1 is equal to its

counterpart in Ii

. The fulfillment of the two tests ensuresthat O

i�1 = Ii

.We claim that test (1) and test (2) are sufficient to check

the set equality. It is unnecessary to perform the followingtest (test (3)).

8e 2 Ii

, COUNT (e, Ii

) = COUNT (e,Oi�1) (3)

It is because although test (3) can help to detect the illegalinsertion of elements to I

i

, if the attacker inserts elementsto I

i

, she has to remove other elements from Ii

due tothe existence of test (1), which can be detected by test (2).Therefore, test (3) is redundant.

3.4.2 Speed up with probabilistic checks

The number of input/output records in each phase is usu-ally significantly large. We therefore employ the CountingBloom filter (CBF) to speed up the element query and testthe authenticity of the phase input in a sampling manner.Specifically, we construct a Counting Bloom Filter CBFI

i

Page 7: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6

based on Ii

and another Counting Bloom Filter CBFOi�1

based on Oi�1. While constructing CBFI

i

and CBFOi�1,

we compute |Ii

| and |Oi�1| and test their equality, i.e., the

test (1). If test (1) is passed, we perform test (2). For eachrecord r from set O

i�1, with a certain probability tr

, wequery CBFI

i

and CBFOi�1 for the occurrence of records

with the same value and test their equality. Any occurrenceinequality occurred on any selected record indicates theviolation of the input data integrity in phase i. We call theprobability t

r

as the CBF test probability.Notice that to compromise the integrity of the input in

phase i, the attacker may randomly tamper records in SIi

ina hope of not to be found. We defer the quantitative analysison the relationship among the attacker’s cheat behavior, theCBF test probability on test 3, the CBF property and theprobability of detecting errors in Section 4.1.

3.5 The Execution AuditionThe execution audition is performed on a function executionbasis. Specifically, we serialize the application executioninto function executions, so that we can audit them in arandomized manner. The functions to be audited include themap and reduce function and other functions implementedfor the application, including the partition function.

3.5.1 The basic algorithm

To audit the execution of a function, we generate the controlflow graph (CFG) of that function and simulate the executionbased on its CFG and its execution logs. With the branchlog, the simulation can derive which branch was taken whileexecuted on the public cloud, and thus to reproduce the run-time control flow. During the simulation, we also derive themathematical relationships among runtime variable valuesbased on the semantics of each simulated statement. Thederived mathematical relationships are stored in a constraintset, marked as C . The runtime variable values in the exe-cution log will be checked against the constraints set duringthe simulation. Any violation of the constraints indicates theruntime integrity violation on the public cloud. Otherwise,we claim that the runtime integrity of the audited functionexecution is preserved.

During the simulation, for each statement s that is tra-versed, C will be updated based on the semantics of s. If shas a corresponding log record r, it will be used to eitherupdate the constraint set C or will be checked against C ,depending on the semantics of the statement. The algorithmof the execution audition is shown in Algorithm 2.

We explain the algorithm as follows case by case:

case 1 If s is an input statement, i.e., it is the entry pointof a map or reduce function, its corresponding logrecord r is an input log record. The integrity of r isverified by the input audition of the current phase,we therefore trust it and use it in the constraint set.Thus, r is converted into a constraint and added toC .

case 2 If s is an output statement, i.e., it is an invocationof the function context.write, its correspondinglog record r is an output log record. We need toverify the integrity of such a log record. Thus, r ischecked against the current C .

case 3 If s is a branch statement, i.e., it is in a form ofif(condition), its corresponding log record r isa branch log record. We first verify the integrityof the log record r by converting r to a constraintand check it against the current C . After that, thecondition is verified according to the values inr, deciding the runtime branch that is simulated.Based on the condition and its decided branch, wegenerate a new constraint and add it to C . Sincer is verified, the runtime branch in the simulationshould be consistent to the execution on the publiccloud.

case 4 If s is a function invocation statement, and rpre

and rpost

are a pre-invoke and a post-invoke logrecord corresponding to the function invocation, wefirst check r

pre

against C to ensure the integrityof the parameters passed to the function. Then, weconvert r

post

into a constraint and add it into C . Theidea of this design is that, we temporarily trust theintegrity of r

post

and use it to verify the integrityof the execution after the invocation statement. Wedefer the verification of r

post

to the time when theinvoked function execution is audited. To achievethat, we need to maintain a connection between thecaller and the callee, which is described in Case 5)and 6).

case 5 If s is the entry point of a function that is neither amap nor a reduce function, r

pre

is its correspond-ing pre-invoke log record, we initialize C as anempty set, convert r

pre

into a constraint and addit to C . Since r

pre

is verified in Case 4), we can trustits integrity and use it to audit the current functionexecution.

case 6 If s is a return statement of a function that is neithera map nor a reduce function, r

post

is its correspond-ing post-invoke log record, we check r

post

againstC . Checking the integrity of r

post

assures the callerfunction that the post-invoke log is trusted and canbe used to verify the integrity of the remainingexecution of the caller function.

case 7 If s is an assignment statement where the right handside is primitive operations on variables, no logrecord corresponds to such a statement. However,C has to be updated due to the possible change ofvariable values.

a) If s is in the format of y = �(· · · ), where � isprimitive operations on variables that do notinclude y, only the value of y is changed afterthe execution of s. Therefore, any constraintsin C pertaining to y will not hold. Thusthe algorithm replaces those constraints withthe equality constraints generated from thestatement s. For example, if s is x=m+n, anyconstraints containing x will be removed andnew constraint x = m + n will be added toC .

b) If s is in the format of y = �(· · · , y), wherethe value of y is updated after s is executed,the algorithm updates the constraints in Cthat contain y. To make the description clear,

Page 8: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

Algorithm 2 Function Execution Audition (Function f, ExecutionLog log)Require: f is the function to be audited, log contains the execution log corresponding to one execution of the function f .

1: C {;}2: s f.nextStatement()3: while s! = End Of Function do4: switch (s)5: case 1 (s is the entry point of a map/reduce function):6: r log.getInputLogRecord()7: C C [ r.toConstraint()8: case 2 (s is an invocation of the context.write function) :9: r log.getOutputLogRecord()

10: Check(C, r.toConstraint())11: case 3 (s is a branch statement) :12: r log.getBranchLogRecord()13: check(C, r.toConstraint())14: branch evaluate(s, r)15: c s.getCondition(branch)16: C C [ c.toConstraint()17: case 4 (s is an invocation of a function :18: r

pre

log.getPreInvokeLogRecord()19: r

post

log.getPostInvokeLogRecord()20: check(C, r

pre

.toConstraint())21: C C [ r

post

.toConstraint()22: case 5 (s is the entry point of a non-map/reduce function :23: r

pre

log.getPreInvokeLogRecord()24: C C [ r

pre

.toConstraint()25: case 6 (s is a return statement of a non-map/reduce function :26: r

post

log.getPostInvokeLogRecord()27: check(C, r

post

.toConstraint()28: case 7 (s is an assignment statement, where the R.H.S. expression consists of primitive operations on variables) :29: if s is in a format of y = �(X), where y /2 X then # case 7(a)30: Def

y

C.getDefinition(y) # Find the definition of y: y = Defy

31: C C � {y = Defy

}32: C.replace(y,Def

y

)33: C C [ s.toConstraint()34: C.putDefinition(y = �(X))35: else if s is in a format of y = �(X, y) then # case 7(b)36: y

old

��1y

(X, y)37: C.replace(y, y

old

)38: end if39: end switch40: s f.nextStatement()41: end while

we represent s as ynew

= �(· · · , yold

). Ifthe operation � on y is reversible, the al-gorithm replaces y in C with a representa-tion of y

old

. Suppose the existing constraintcontaining y is C(y), we know that thisconstraint only holds for y

old

. We thus haveyold

= ��1y

(· · · , ynew

), where ��1y

is thereverse operation on variable y. Therefore,after y = �(· · · , y) is executed, the algorithmreplaces existing C(y) with C(��1

y

(· · · , y)).If the operation � on y is irreversible, we onlyremove constraints that contain y from theconstraintSet.

As a concrete example, Table 2 shows the details of theexecution audition on invoking map(3, 5, context).

The implementation of function map is shown in Listing. 2.The execution log is listed in Table 1. The second column inthe table tracks the statement to be simulated (started withstmt), along with the corresponding log records, includinginput logs (in), output logs (out), invoke logs (invoke), branchlogs (bra), etc. Based on the input, the audition determinesthe case that applies and performs the corresponding ac-tions, shown in the third column. The fourth column indi-cates the resulting constraints after the action.

3.5.2 Speed up with probabilistic checks

Function audition removes the cohesion between the func-tion caller and the callee, which serializes the applicationexecution into separated function executions. The user canrandomly choose function executions to audit. For eachfunction execution, we can perform Algorithm 2 with a

Page 9: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

TABLE 2The execution audition details of invoking map(3, 5, context)

No. Input Action Constraints 1

1 in:{key = 3, value = 5} C1 k = 3 ^ v = 52 stmt:{counter = value;} C7(c) k = 3 ^ v = 5 ^ c = v3 stmt:{sum = 0;} C7(c) k = 3 ^ v = 5 ^ c = v ^ s = 04 stmt:{fib1 = key;} C7(c) k = 3 ^ v = 5 ^ c = v ^ s = 0 ^ f1 = k5 stmt:{fib2 = fib1+1;} C7(c) k = 3 ^ v = 5 ^ c = v ^ s = 0 ^ f1 = k ^ f2 = f1 + 1

6 bra:{counter = 5},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c = v ^ s = 0 ^ f1 = k ^ f2 = f1 + 1 ^ c > 0

7 stmt:{sum = fib1 + fib2;} C7(a) k = 3 ^ v = 5 ^ c = v ^ f1 = k ^ f2 = f1 + 1 ^ c > 0 ^ s = f1 + f28 stmt:{fib1 = fib2;} C7(a) k = 3 ^ v = 5 ^ c = v ^ f2 = k + 1 ^ c > 0 ^ s = k + f2 ^ f1 = f29 stmt:{fib2 = sum;} C7(a) k = 3 ^ v = 5 ^ c = v ^ c > 0 ^ s = 2k + 1 ^ f1 = k + 1 ^ f2 = s10 stmt:{counter = counter � 1;} C7(b) k = 3 ^ v = 5 ^ c+ 1 = v ^ c+ 1 > 0 ^ s = 2k + 1 ^ f1 = k + 1 ^ f2 = s

11 bra:{counter = 4},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c+ 1 = v ^ s = 2k + 1 ^ f1 = k + 1 ^ f2 = s ^ c > 0

12 stmt:{sum = fib1 + fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 1 = v ^ f1 = k + 1 ^ f2 = 2k + 1 ^ c > 0 ^ s = f1 + f213 stmt:{fib1 = fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 1 = v ^ f2 = 2k + 1 ^ c > 0 ^ s = k + 1 + f2 ^ f1 = f214 stmt:{fib2 = sum;} C7(a) k = 3 ^ v = 5 ^ c+ 1 = v ^ c > 0 ^ s = 3k + 2 ^ f1 = 2k + 1 ^ f2 = s15 stmt:{counter = counter � 1;} C7(b) k = 3 ^ v = 5 ^ c+ 2 = v ^ c+ 1 > 0 ^ s = 3k + 2 ^ f1 = 2k + 1 ^ f2 = s

16 bra:{counter = 3},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c+ 2 = v ^ s = 3k + 2 ^ f1 = 2k + 1 ^ f2 = s ^ c > 0

17 stmt:{sum = fib1 + fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 2 = v ^ f1 = 2k + 1 ^ f2 = 3k + 2 ^ c > 0 ^ s = f1 + f218 stmt:{fib1 = fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 2 = v ^ f2 = 3k + 2 ^ c > 0 ^ s = 2k + 1 + f2 ^ f1 = f219 stmt:{fib2 = sum;} C7(a) k = 3 ^ v = 5 ^ c+ 2 = v ^ c > 0 ^ s = 5k + 3 ^ f1 = 3k + 2 ^ f2 = s20 stmt:{counter = counter � 1;} C7(b) k = 3 ^ v = 5 ^ c+ 3 = v ^ c+ 1 > 0 ^ s = 5k + 3 ^ f1 = 3k + 2 ^ f2 = s

21 bra:{counter=2},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c+ 3 = v ^ s = 5k + 3 ^ f1 = 3k + 2 ^ f2 = s ^ c > 0

22 stmt:{sum = fib1 + fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 3 = v ^ f1 = 3k + 2 ^ f2 = 5k + 3 ^ c > 0 ^ s = f1 + f223 stmt:{fib1 = fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 3 = v ^ f2 = 5k + 3 ^ c > 0 ^ s = 3k + 2 + f2 ^ f1 = f224 stmt:{fib2 = sum;} C7(a) k = 3 ^ v = 5 ^ c+ 3 = v ^ c > 0 ^ s = 8k + 5 ^ f1 = 5k + 3 ^ f2 = s25 stmt:{counter = counter � 1;} C7(b) k = 3 ^ v = 5 ^ c+ 4 = v ^ c+ 1 > 0 ^ s = 8k + 5 ^ f1 = 5k + 3 ^ f2 = s

26 bra:{counter = 1},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c+ 4 = v ^ s = 8k + 5 ^ f1 = 5k + 3 ^ f2 = s ^ c > 0

27 stmt:{sum = fib1 + fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 4 = v ^ f1 = 5k + 3 ^ f2 = 8k + 5 ^ c > 0 ^ s = f1 + f228 stmt:{fib1 = fib2;} C7(a) k = 3 ^ v = 5 ^ c+ 4 = v ^ f2 = 8k + 5 ^ c > 0 ^ s = 5k + 3 + f2 ^ f1 = f229 stmt:{fib2 = sum;} C7(a) k = 3 ^ v = 5 ^ c+ 4 = v ^ c > 0 ^ s = 13k + 8 ^ f1 = 8k + 5 ^ f2 = s30 stmt:{counter = counter � 1;} C7(b) k = 3 ^ v = 5 ^ c+ 5 = v ^ c+ 1 > 0 ^ s = 13k + 8 ^ f1 = 8k + 5 ^ f2 = s

31 bra:{counter=0},stmt:{if(counter > 0)} C3 k = 3 ^ v = 5 ^ c+ 5 = v ^ s = 13k + 8 ^ f1 = 8k + 5 ^ f2 = s ^ c <= 0

32 invoke:{pre: sum=47,post: {shouldOutput=true, sum=47}} C4 k = 3 ^ v = 5 ^ c+ 5 = v ^ f1 = 8k + 5 ^ f2 = 13k + 8 ^ c <= 0 ^ so = 1 ^ s = 47

33 bra:{shouldOutput = true},stmt:{if(shouldOutput)} C3 k = 3 ^ v = 5 ^ c+ 5 = v ^ f1 = 8k + 5 ^ f2 = 13k + 8 ^ c <= 0 ^ s = 47 ^ so = 1

34 out:{key=3, value=47} C2 k = 3 ^ v = 5 ^ c+ 5 = v ^ f1 = 8k + 5 ^ f2 = 13k + 8 ^ c <= 0 ^ s = 47 ^ so = 11 Variables used in constraints are abbreviated from the original program code. The correspondence are as follows: k : key; v : value; c :counter; s : sum; f1 : fib1; f2 : fib2; shouldOutput : so.

function audition probability, marked as tf

. It offers a trade-off between the audition speed and the execution integrity.The quantitative security analysis of the execution auditionis performed in Section 4.2.

4 SECURITY ANALYSIS

In this section, we first perform security analysis on inputaudition (in Section 3.4) and execution audition (in Section3.5). After that, we discuss the security guarantee of theentire MapReduce application.

To facilitate discussion, we summarize all the parametersused to model the system in Table 3.

4.1 Security of the Input Audition

For each phase k of a MapReduce application, we need toperform input audition on phase input I

k

, using phase out-put O

k�1 as a baseline. We perform a probability analysis toreason the input integrity of each phase.

Since test (2) is performed on each record in a random-ized manner, the malicious worker does not know whichrecord will be tested. Therefore, we model the maliciousworker’s behavior as randomly choosing records in I

k

totamper their values. We define the integrity metrics of theinput audition as follows.

Definition 2 (Phase Input Tampering Detection Rate). Theprobability that the input audition can detect the record tamperingin a MapReduce phase is defined as the phase input integrity,marked as P

in

. We mark the phase input integrity as Pin

(M)when the malicious worker randomly chooses M input records totamper.

Theorem 1 (Phase Input Integrity Theorem). Suppose thenumber of input records in a phase is N

r

, the CBF test probabilityis t

r

, the CBF false positive rate is fb

, and the number ofrecords tampered by the malicious worker is M

r

, the phase inputtampering detection rate is

Page 10: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9

TABLE 3Parameters used in modeling the system

Notation Name Explanation Involved techniqueN

r

phase record number The number of input records of the current phase.

input auditionM

r

tampered record number The number of input record the malicious worker tam-pers in the current phase.

tr

CBF test probability The probability that test (2) is performed on each distinctinput record.

fb

CBF false positive rate The false positive rate of the employed CBF.Pin

(M) phase input tampering detection rate The probability of detecting the tampered input recordsin a phase when the malicious worker tampered Mrecords.

tf

function audition probability The probability that a function execution to be auditedwith the execution audition algorithm (Algorithm 2). execution audition

Nf

phase function execution number The number of function executions to be executed in thecurrent phase.

Mf

phase tampered function execution number The number of function executions to be tampered bythe malicious worker in the current phase.

Pout

(M) phase output tampering detection rate The probability of detecting the tampered output logs ina phase when malicious workers tampered M functionexecutions.

0 2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

Tampered Record Number Mr

Pha

se In

put T

ampe

ring

Det

ectio

n R

ate

Phase Input Tampering Detection Rates with different Tampered Record Numbers, where fb=0.01

tr=0.1tr=0.001tr=0.000001

Fig. 3. A Simulation of the phase input tampering detection rates withdifferent tampered record numbers and different CBF test probabilities

Pin

(Mr

) = 1� (1� tr

+ tr

fb

)Mr

Mr

!(4)

The proof of Theorem 1 can be found in Appendix AWe simulate the phase input tampering detection rate

with different tampered record number Mr

, shown in Fig.3. In the simulation, we set the CBF false positive rate f

b

as 0.01, set the CBF test probability tr

as 0.1, 0.001, and0.000001, respectively, and increase the value of M

r

. As thefigure shows, even if we set the CBF test probability asa low value such as 0.000001, the phase input tamperingdetection rate can achieve 1.0 when the tampered recordnumber exceeds 5. In other words, the input audition canlimit the number of tampered records to a very small value.The figure also shows that a higher value of t

r

can result ina slightly better phase input integrity, given the same valueof M

r

. However, such an effect is very limited. Therefore, ininput audition, setting a small number of t

r

can result in ahigh phase input integrity.

4.2 Security of the Execution Audition

The malicious worker can tamper the values of variablesin function executions to undermine the integrity. For eachfunction execution, the possible tampering target can be anyvariable in that function. This includes temporary variables,such as fib1 in line 9 in Listing. 2; variables used for afunction invocation, such as key and sum in line 18 inListing. 2; variables received from the caller function, suchas key, value and context in line 1 in Listing. 2.

We first discuss the security property of the basic al-gorithm introduced in Section 3.5.1. After that, we reasonthe integrity of each phase by considering the probabilisticchecks introduced in Section 3.5.2.

4.2.1 Security of the Basic Algorithm

Theorem 2 (Function Execution Integrity). Suppose the exe-cution audition is performed on an execution of a map/reducefunction and the functions directly or indirectly invoked by sucha map/reduce function. If the audition is passed, the statementsimulation sequence of each function execution is consistent to theprogram’s control flow, and the constraint set C in each functionexecution always reflects (i.e., contains and only contains) thecorrect runtime mathematical relationships (RMRs for short)among variables.

The proof of Theorem 2 can be found in Appendix B.

Corollary 2.1 (Result Integrity Theorem). Suppose l is anoutput log record generated by a statement s that invokes thefunction context.write. v is the variable corresponding to l.Tampering any variable that results in a change of value l will bedetected by the function execution audition algorithm (Algorithm2).

The proof of Corollary 2.1 can be found in Appendix C.

Corollary 2.2 (Phase Output Log Integrity). Suppose theinput log of a phase preserves computation integrity. If the func-tion execution auditions performed on each function executionhave passed, the output log of the current phase preserves thecomputation integrity.

The proof of Corollary 2.2 can be found in Appendix D.

Page 11: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

0 2 4 6 8 10

0.0

0.2

0.4

0.6

0.8

1.0

Tampered Function Execution Number Mf

Pha

se O

utpu

t Tam

perin

g D

etec

tion

Rat

e

Phase Output Tampering Detection Rate with Different Tampered Function Execution Numbers

tf=0.1tf=0.001tf=0.000001

Fig. 4. A Simulation of the phase output tampering detection rates withdifferent tampered function execution numbers and different functionaudition probabilities

In a phase, if a malicious cloud worker tampers theoutput record, but keeps the output logs correspondingto those records correct, The tampered output record willbe used as the input for the next phase, which generatestampered input log. The input audition in the next phasewill check the tampered input log against the correct outputlog of last phase and detects the error. Using the job input asa baseline, the auditions are performed one phase after an-other, forming a trust chain, eventually protects the integrityof the entire job. We will discuss details in Section 4.3.

4.2.2 Security of the Probabilistic Checks

We define the integrity metrics for evaluating the executionaudition in each phase.

Definition 3 (Phase Output Tampering Detection Rate). Theprobability of detecting the tampered output logs in a MapReducephase is defined as the phase output tampering detection rate,marked as P

out

. We mark the phase output tampering detectionrate as P

out

(Mf

) when the malicious worker randomly choosesM

f

function executions to tamper.

Theorem 3 (Phase Output Integrity Theorem). Suppose thetotal number of function executions in a phase is N

f

. Suppose thefunction audition probability is t

f

, and the number of functionexecutions to be tampered by the malicious worker is M

f

, thephase output tampering detection rate is

Pout

(Mf

) = 1� (1� tf

)Mf

Mf

!(5)

The proof of Theorem 3 can be found in Appendix E.We simulate the phase output tampering detection rate

with different tampered function execution numbers Mf

,shown in Fig. 4. In the simulation, we set the function audi-tion probability t

f

as 0.1, 0.001, and 0.000001, respectively,and increase the value of M

f

. As the figure shows, even ifwe set the function audition probability as a low value suchas 0.000001, the phase output tampering detection rate canachieve 1.0 when the phase function execution number M

f

exceeds 5. In other words, the execution audition can limitthe tampered function execution into a very small number.

The figure also shows that a higher value of tf

can resultin a slightly better phase output tampering detection rate,given the same value of M

f

. However, such an effect is verylimited. Therefore, in the execution audition, only settinga small number of t

f

can result in a high phase outputtampering detection rate.

4.3 Security of the MapReduce ApplicationIn this section, we analyze the security of the MapReduceapplication, witch consists of multiple MapReduce phases.We first analyze the security of basic input audition andexecution audition algorithms. After that, we perform quan-titative analysis on the security of the probabilistic checkalgorithms for the input audition and execution audition.

4.3.1 Security of Using the Basic Algorithms for Execution

Audition and Input Audition

Theorem 4 (Application Output Integrity). Suppose the basicalgorithms for the input audition and the execution auditionare employed. If all the input audition and execution auditionperformed on each phase have passed, the application output iscorrect.

The proof of Theorem 4 can be found in Appendix F.

4.3.2 Security of Using the Probabilistic Checks Algorithm

for Execution Audition and Input Audition

Theorem 5 (Application Output Integrity under Probabilis-tic Checks). Suppose the probabilistic checks algorithms for theinput audition and the execution audition are performed on aMapReduce application consists of n phases. If in phase i, where1 i n, the number of tampered input records is M

ri , thenumber of tampered function executions is M

fi , the CBF testprobability in the input audition is t

ri , the CBF false positive rateis f

bi and the function audition probability is tfi , the probability

of detecting the output error is

Papp

=1�nY

i=1

⇣ (1� tri + t

rifbi)Mri

Mri !

· (1� tfi)

Mfi

Mfi !

⌘(6)

The proof of Theorem 5 can be found in Appendix G

5 IMPLEMENTATION AND EXPERIMENTS

We have developed a prototype system, called MRAuditor,to perform runtime integrity audition on Apache Hadoop (amainstream MapReduce implementation) applications. Ourexperiments in Section 5.2 showed that our technique canbe applied to Hadoop applications directly. The applicationexecution on the public cloud incurs moderate performanceoverhead. In addition, MR Auditor gives users a flexibilityto make a tradeoff between the audition accuracy and theaudition time.

5.1 System ImplementationWe perform the log statement insertion and execution audi-tion on a typed, three-address, statement based intermediaterepresentation of the original program. Such a representa-tion removes complicated expression in each statement, thusmaking the program analysis and transformation easier.When generating execution logs for a statement, such a

Page 12: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11

representation removes the complicated expression in eachstatement, thus reduces the number of the variable val-ues to be logged. When performing execution audition, asimplified statement make it easy to derive mathematicalrelationships.

To implement the MR Auditor, we use Soot, an opensource Java-based compiler tool, to perform program anal-ysis and transformation. We use Symja, a computer algebrasystem, to perform constraints generation and the integrityverification . MR Auditor consists of two parts. The firstpart is called Log Inserter, which directly analyzes originalprogram’s .class files and generates the PWL with the .classformat. The second part is called Integrity Checker, whichperforms the input audition and the execution auditionbased on the original program’s .class file and the executionlogs returned from the public cloud.

The Log Inserter performs the statement analysis andthe statement insertion directly on the .class file. Thus, it candirectly transform the Hadoop applications implementedin Java language and does not need the source code. Theanalysis and transformation is performed in the gb phaseof Soot. For each function of the program, the log inserteriterates all its statements and inserts the logging statementsas described in Section 3.3. The logging statement is aninvocation of a predefined function that writes the corre-sponding variable values to the log files on the public cloud.To improve the efficiency of the execution audition, differentlogs are written to different log files based on the currenttask id, the log type and the current function’s signature. Forinstance, the branch logs generated from different functionsin different tasks are written to different files. During theexecution audition, the execution log can be accessed easilyby the task id, the log type and the current function’s sig-nature. Each logging statement generates one line to the logfile, recording the variable values involved in the log trace.In order to mark the correspondence relationship betweendifferent types of logs, we assign an invoke identificationnumber (invoke ID for short) for each function executionwhen the application is executed on the public cloud. Toachieve that, we maintain a static long integer value duringthe execution. Whenever a function is entered (invoked), thestatic value is incremented. The incremented value is thecurrent function invoke ID. All the logs generated in thatfunction execution will start with that invoke ID.

Integrity Checker performs audition on each phase se-quentially. The audition on each phase includes the inputaudition and the execution audition. In the input audition,we utilize Hadoop’s implementation of the counting bloomfilter to construct the CBF for the output of the previousphase and the input of the current phase, and check theinput integrity of the current phase according to the descrip-tion in Section 3.4.

For the execution audition, we perform the statementanalysis using the jtp phase of soot. For each function,we randomly pick a pair of the invoke log recording theinvocation of that function, collects its corresponding execu-tion logs through the invoke ID and performs the functionexecution audition according to Algorithm 2.

For each statement that we simulate, Soot represent thestatement as a V alue object. We translate such an objectinto a string-based arithmetic expressions and send them

to Symja. Symja can perform mathematical manipulation toeither evaluate the branch predicate value or to update theconstraint set. Because Symja can only perform arithmeticoperations, we only translate the arithmetic operator inthe Soot V alue object into a Symja operator and translateother operations that Symja does not recognize into Symjavariables. For instance, when we have a Soot expressionx + func(y), we translate it into x + funcLBRyRBR.Here, ( and ) are keywords in Symja script. They will bemisrepresented as the arithmetic operation priority sign inSymja, rather than a function call. Thus we replace themwith LBR and RBR, respectively.

Each constraint in the constraint set is stored in theformat of string. To search for the constraints that contains acertain variable, we simply search for the variable string ineach constraint string.

To determine the next statement for a branch statement(i.e., to perform operation in line 14 of Algorithm 2), In-tegrity Checker needs to evaluate the predicate value whenruntime variable values replaces the variables in the pred-icate. This evaluation can be handled by Symja. IntegrityChecker first assign variable values to variables in Symja.Then it invoke Symja’s function ExprEvaluator.evaluate()with the predicate as a parameter. True or False will bereturned depending on the evaluation result. This techniquecan also be used to verify the consistency between predicatevariable values and the constraint set (i.e., the check functionin Algorithm 2). To compute the inverse operation of thecurrent operation (i.e., the operation in line 36 of Algorithm2), Integrity Checker uses Symja’s built-in function Solve.The Solve function receives two parameters, the formulaand the variable that is solved for, and returns the solutionfor that variable. In the special format y = F (·, y), Weneed to pass the formula as y

new

= F (·, yold

) and solvefor y

old

. The solution will be a formula of ynew

. In theadded constraint, we replace the existing y (the y

old

) withthe solution for y

old

, but using variable name y instead ofynew

in the added constraint.

5.2 ExperimentsWe performed a set of experiments on Apache Hadoop 1.0.4to evaluate MR Auditor. The evaluation was measured withtwo groups of metrics. The first group evaluates the LogInserter. Specifically, we chose several Hadoop applicationsto measure the extra storage incurred by the generatedexecution logs and the extra execution time incurred dueto the generation of the execution logs. The second groupevaluates the relationship between the audition accuracyand the audition speed.

5.2.1 Experiment Setup

We used a cloud server (the dragon stack) in our universityas the public cloud to perform the experiments. The cloudserver was responsible for executing the Hadoop applica-tions (the PWL) and collecting the execution logs. The cloudserver was equipped with two Intel Xeon E5 CPU (8 cores oneach CPU), 128 GB of memory and 6 TB of hard disk storage.On this server, we deployed a Hadoop cluster consistingof five Linux virtual machines. Each virtual machine wasinstalled Ubuntu 14.04.1 and is configured with 2 GB ofmemory and 128 GB of storage.

Page 13: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

We deployed the private cloud on a Linux desktop. Theprivate cloud was responsible for performing the executionauditions. The Linux desktop was equipped with a 2-core1.8GHz Intel i5 processor, 4 GB of memory and 128 GB ofhard disk.

We selected four Hadoop applications to evaluate MRAuditor. These applications were either selected from thethe Hadoop 1.0.4 release package or from the Hadoopbenchmark suite HiBench 3.0.0 [10]. The specifications ofthe applications are listed in Fig. 4.

5.2.2 The Evaluation of the Log Inserter

We transformed each original application programs into thePWL with the Log Inserter. For each Hadoop application, wecompared the execution time of the original application andthe PWL. To obtain a more accurate execution time, for eachapplication, we executed both the original application andits PWL for ten times and computed the average executiontime. We also measured the size of the execution logsgenerated by the PWL for each application. The averageexecution time and the execution log size of each applicationare listed in Table 5.

Table 5 shows that when processing a relatively largeHadoop application where the map/reduce task numberis significantly large (e.g., more than 1,000 map task) andthe computing logic is fairly complex (e.g., the Quasi MonteCarlo and the Page Rank application), the performance over-head of the PWL applications are moderate (ranging from0.18% to 34.48%). However, the size of the execution logs isfairly large (ranging from 3.4 GB to 56 GB), which incursa large storage overhead and a large data transmissionoverhead.

5.2.3 The Evaluation of the Integrity Checker

We evaluated the performance of the Integrtiy Checkerusing the Word Count application in Table 4. Specifically, weperformed the input audition and the execution audition onthe execution logs generated by the Word Count applicationand recorded their execution time.

When evaluating the input audition, we recorded theexecution time of the input audition under different CBFtest probability t

r

. The input audition for the Word Countapplication includes checking the input of the map phaseagainst the job input (the record number for the job input is109,816), checking the input of the reduce phase against theoutput of the map phase (the record number for the mapphase output is 94,127,798), and checking the applicationoutput against the output of the reduce phase (the recordnumber for the output of the reduce phase is 1,554,929). Thetotal number of input records to be audited is 95,792,543 Weperformed the input audition for ten times under differenttr

and listed the average times under each tr

in Table 6.The result indicates that the efficiency of the input audi-

tion is fairly high. When tr

is set to 100%, the audition speedis 95792543/164 ⇡ 485k(records/second)

We also observe that the execution time of the inputaudition increases as the test probability t

r

increases. Theprocedure of the input audition can be broken down intotwo phases: the CFB construction phase and the test (2)execution phase. We expect that the increase of the execution

TABLE 6The Execution Time of the Input Audition for the Word

Count Application

CBF Test Probability tr

(%) Execution Time (s)1 10210 10450 132100 164

TABLE 7The Execution Time of the Execution Audition for the Word Count

Application

Audition Prob-ability t

f

(%)No. ofAuditedFunc.s

ExecutionTime (s)

Audition Effi-ciency (func./s)

0.1 7115 155 460.2 14299 214 661 71147 1174 612 142295 3400 4210 711475 13276 54

time should be proportional to the increase of the test prob-ability. The data shown in Table 6 confirms our expectation:Compared to the execution time (i.e., 102 seconds) when t

r

is 1%, when tr

increases from 50% to 100%, the increasedexecution times are 30 seconds and 62 seconds, respectively.The increase rate of the execution time is 107%, which isclose to the increase rate of t

r

(i.e., 102%).We further evaluated the performance of the execution

audition. Specifically, we changed the function auditionprobability t

f

to evaluate the execution time of the executionaudition. The execution audition will perform the functionexecution audition on each pair of the invoke log in theinvoke log files with probability t

f

. With each value of tf

,we performed the execution audition for ten times and listthe average times in the second 3rd column of Table 7.According to our statistics, the total number of functionexecutions that can be audited is 7,114,758. Combiningwith the function audition probability t

f

, we computed thenumber of function executions that were performed withthe function execution audition, listed in the second columnof Table 7. Based on the above information, we computedthe audition efficiencies, i.e., the number of function executionaudition performed in each seconds, and listed them inthe fourth column in Table 7. Data in the table indicatesthat the execution audition time is generally proportionalto the audition probability. The audition efficiencies arestable with different audition probabilities. On average,the audition efficiency is 53.8 functions/second. Note thatthe average audition efficiency depends on the programimplementation. In general, the application that containsa much complex program logic will decrease the auditionefficiency, since the time spend for each function executionaudition will be increased.

In summary, the experimental results showed that theLog Inserter incurs a modest performance overhead whilethe Hadoop applications are executed on the public cloud.Meanwhile, the Integrity Check provides users a freedomto trade-off the accuracy and the speed of the executionaudition by adjusting the skip probability and auditionprobability.

Page 14: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13

TABLE 4Specifications of Hadoop Applications

Name Description SpecificationWordCount Count the words in the input files Count the number of words from 1 GB of texts. The job consists

of 1024 map tasks and 1 reduce task.QuasiMonteCarlo Estimate ⇡ using the Monte-Carlo method Estimate the value of ⇡. The job consists of 100 map tasks,

where each task generates 100,000 samples.TeraSort Sort text strings in the alphabetical order. Sort 1 GB of texts. The job consists of 1024 map tasks and 1

reduce task.PageRank a Perform the PageRank algorithm implemented

in Pegasus, a Hadoop-based graph mining sys-tem

Rank 500,000 web pages. The application consists of six jobs,where each job contains up to 96 map tasks and 48 reducetasks.

a the HiBench application.

TABLE 5The Execution Time and the Execution Log Size of Each Application

Application Name Execution Time of the Orig-inal Program (s)

Execution Time of thePWL (s)

Performance Overhead(%)

execution log Size (GB)

Word Count 3100 3131 1.00 22.0Quasi Monte Carlo 327 333 0.18 3.4

Tera Sort 2926 3044 4.03 5.7Page Rank 3284 4413 34.38 56

6 RELATED WORKS

Protecting the runtime integrity of remote computation hasbeen studied for decades. Researchers has proposed a widerange of solutions from different perspectives.

Secure co-processor [11] provided a hardware-basedsolution towards the trusted outsourced computing. Thesecure hardware chip Trusted Platform Modules (TPMs) in-stalled on computers can verify the integrity of the worker’ssoftware stack ranging from the bios configuration at thebooting procedure to the application that executes the out-sourced task. By ensuring that the work’s environmentcompletely complies to the specification, TPM technologyguarantees the correctness of task result. However, sucha solution requires that the configuration of the worker’smachine to be exactly identical to the TPM’s expectation,thus lacks the flexibility. The Trusted Execution Environ-ment technology, such as Intel Software Guard Extension(SGX) [12] [13] offers an alternative option, which allowstrusted software to be executed in an trusted environment.Leveraging such a technology, Schuster [14] proposed asolution to protect the execution integrity of MapReduce.Such a solution still depends on a specific hardware con-figuration. In addition, the trusting base of such a systemusually is carefully tailored, which makes such a methoddifficult to be generalized.

Proof-based verifiable computation [15] searches for the-oretical solutions, utilizing cryptography and computingcomplexity theorem. The framework of this direction issimilar to our work. The outsourced program is firstly trans-formed into a circuit by the user (verifier) and is executedon the untrusted cloud (prover). While execution, proofs aregenerated as transcript. After (or during) the execution, theverifier will verify the transcript provided by the prover todecide whether to accept the computation result returnedby the prover. Despite a few systems are proposed to realizesuch a framework, including Pepper [16], Pinocchio [17],and Buffet [18], this direction is still in the phase of theoret-ical research. The existing solutions are not practical due tolow performance and low program expressiveness.

Practical solutions towards such a problem falls to thedirection of task replication and task verification, including[19] [20] [21] [22] [23], [24], [25], [26] [27] and [28]. Such aclass of solutions works on the task level. Thus it usuallyincurs a higher performance overhead.

Remote attestation provides solution from the code level.By statically modeling the program, collecting the programexecution trace, and comparing the consistency betweenthe model and the program execution trace, this type ofmethods can verify the runtime integrity. Many previousworks [4] [5] focus on the integrity of the program’s controlflow while ignoring the integrity of its data flow. Such aclass of methods can be easily bypassed if the attacker onlytampers the values of the variables during the execution.Shu et al. [6] proposed a method combining dynamic tracingand machine learning in order to detect abnormal assemblelanguage instruction sequence. Such a method has certainfalse positive and false negative rate. Moreover, the attackercan still perform the above attack to bypass the check. Someresearchers [7] [8] proposed to check the data flow integrityof the parameters involved in system calls. Such a class ofmethods can only provides limited protections on the dataflow integrity. The attacker can still tamper the variables thatare not the system call parameters.

An orthogonal research direction is to protect the run-time control flow integrity from the machine code’s level[29] [30]. By enforcing checks on the destination addresses ofjump instructions, such a class of works defends the controlflow hijacking attack on the local machine. However, such aclass of solutions does not meet our requirement where thecomputation is performed remotely.

The earlier research outcomes about this paper havebeen published in a poster paper [31]. Compared to theposter paper, this paper provides more details about the sys-tem design, implementation, evaluation, and experimentalresults.

Page 15: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14

7 CONCLUSION

In this paper, we propose the Runtime Integrity Audition(RIA) technique, a method to remotely verify the runtimeintegrity of MapReduce applications. RIA records the exe-cution trace of the MapReduce applications on the publiccloud and checks the execution traces against the appli-cation input and the application’s program on the privatecloud. By doing so, RIA protects the runtime integrity ofMapReduce applications. We developed a prototype systemcalled MR Auditor based on the idea of RIA, and testedits applicability and the performance with several Hadoopapplications. Our experimental results show that our tool iscompatible with all the applications that we tested, incursa moderate performance overhead (with a 0.18% to 34% ofextra execution time with an unoptimized implementation)and provides users a flexibility to trade-off the auditionaccuracy and the audition speed.

APPENDIX APROOF OF THEOREM 1

Proof. Suppose the malicious worker randomly choose onerecord to tamper, the probability of not being detected is

1� tr

+ tr

fb

If a malicious worker randomly chooses M out of Nrecords to tamper, the probability that the input auditiondoes not detect is

P(one case of avoid detection)

=1

N(1� t

r

+ tr

fb

)

· 1

N � 1(1� t

r

+ tr

fb

)

· · ·

· 1

N � (M � 2)(1� t

r

+ tr

fb

)

· 1

N � (M � 1)(1� t

r

+ tr

fb

)

=(1� tr

+ tr

fb

)Mr · (Nr

�Mr

)!

Nr

!

The number of combinations for the malicious workerto choose M

r

out of Nr

records is�Nr

Mr

�. Thus the overall

probability for the malicious worker to avoid the inputaudition in the current phase is

P(avoid detection)

=

✓N

r

Mr

◆· P(one case of avoid detection)

=N

r

!

Mr

!(Nr

�Mr

)!· (1� t

r

+ tr

fb

)Mr · (Nr

�Mr

)!

Nr

!

=(1� t

r

+ tr

fb

)Mr

Mr

!

Therefore, the phase input detection rate for tamperingM

r

records is

Pin

(Mr

) =1� P(avoid detection)

=1� (1� tr

+ tr

fb

)Mr

Mr

!

APPENDIX BPROOF OF THEOREM 2Proof. We prove it by induction. To facilitate our proof, wedefine the sequence of the simulated statements to be exactlythe same as the execution sequence. In other words, whenthe statements in a function are simulated sequentially, ifthe next statement is a function call, we will recursivelysimulate the statements of the callee function first. Whenthe callee function returns, we resume the simulation ofthe statements after the function call statement. Under thissequence, for any statement s to be simulated, all the logsgenerated before the execution of s have passed the check.

Base case: Before the entry point of a map/reduce func-tion, no variable is involved in the execution. Therefore, noRMR is available. The correct RMR is thus an empty set,corresponding to setting C as an empty set in line 1 ofAlgorithm 2. At the entry point of a map/reduce function,nothing is executed, the statement simulation sequence istrivially consistent to the control flow of the map/reducefunction.

Induction step: We assume that, before simulating state-ment s, the statement simulation sequence before statements is consistent to the program’s control flow and the con-straint set C

s�1 reflects the correct RMRs. We prove thatfor each type of statement, after simulating statement s, thenew constraint set C

s

still reflects the correct RMRs amongvariables. Meanwhile, the updated simulation sequence isstill consistent to the program’s control flow.

We discuss the new Cs

on all seven types of statements.We discuss the control flow on the branch statement (case3), since other types of statements do not have branches.

case 1 (s is the entry point of a map/reduce function): Inthis case, the values of parameters key, value (i.e.,k and v) are verified according to the input auditionof the current phase. Therefore adding key = k ^value = v to C will be correctly reflect the changeof the runtime mathematical relationships.

case 2 (s is an invocation of the context.write func-tion): In such a statement, no variable is updated.Therefore, C does not have to be updated and stillreflects the correct runtime relationships.

case 3 (s is a branch statement): Since we assume Cs�1

reflects the correct RMRs before executing s, a checkof the branch log record r on C

s�1 (i.e., line 13of the algorithm) will guarantee a correct runtimevalue of the variables in the condition and thus willdetermine a correct branch (line 14), thus making itconsistent to the program control flow. Based on thecondition and its evaluation result, a new RMR isgenerated and added to C in line 16. Thus the newC reflects the correct RMRs.

Page 16: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 15

case 4 (s is an invocation statement invoking function f ):Since we assume C

s�1 reflects the correct RMRsbefore executing s, a check of the pre-invoke logrecord r

pre

on Cs�1 (i.e., line 20 of the algorithm)

will guarantee the correctness of the runtime valueof the parameters passed to the invoked function.It will be used in case 5 (see below). According tothe specified audition sequence, the audition of theexecution of function f is performed after checkingrpre

(line 20) and before adding rpost

to C (line 21).Before adding r

post

to C , the correctness of rpost

has been audited (see case 6). Therefore, adding theconstraint generated from r

post

will guarantee thatupdated C reflects the correct RMRs.

case 5 (s is an entry point of a non-map/reduce functionf ): In this case, the preceding statement s � 1 isthe invocation statement that invokes f . Since C

s�1

reflects the correct RMRs and the check of the pre-invoke log r

pre

against Cs�1 (line 20, case 4) has

passed, we assure that rpre

is the correct runtimevalue. Therefore, adding it to C

s

reflects the correctRMRs.

case 6 (s is a return statement of a non-map/reduce func-tion f ): Since no variable is updated, C

s

will bethe same as C

s�1, still reflecting the correct RMRs.Since we assume C

s�1 reflects the correct RMRsbefore the function return, r

post

representing thereturned value and the updated parameters will bechecked against C

s�1 to ensure the correctness ofrpost

, which will be used in case 4.case 7 (s is an assignment statement): Since we assume the

existing Cs�1 reflects the correct RMRs, any update

of the variable involved in Cs�1 will be updated in

Cs

and any update of variable not involved in Cs�1

will be added to Cs

, reflecting the correct RMRs.

Conclusion: By the principle of induction, if all checksin Algorithm 2 pass, the statement execution sequence isconsistent to the program’s control flow, and the constraintset C always reflects the correct runtime mathematical rela-tionships.

APPENDIX CPROOF OF COROLLARY 2.1Proof. According to theorem 2, when statement s is simu-lated, the corresponding constraint C will reflect the correctruntime mathematical relationships. According to case 2 ofalgorithm 2, the output log record l will be checked againstthe constraint set C when s is simulated. Thus a tamperingthat results in a change of output log record l will fail thecheck.

APPENDIX DPROOF OF COROLLARY 2.2Proof. The computation in a phase consists of multiple func-tion executions. Suppose the input log in a phase is correct,according to Corollary 2.1, a passed function executionaudition indicates correct output logs.

APPENDIX EPROOF OF THEOREM 3Proof. Suppose the malicious worker randomly chooses onefunction execution to tamper, the probability of not beingdetected is

1� tf

If the probability that a malicious worker randomlychoose M

f

out of Nf

records to tamper, the probability thatthe execution audition does not detect is

P(one case of avoid detection)

=1

Nf

(1� tf

)

· 1

Nf

� 1(1� t

f

)

· · ·

· 1

Nf

� (Mf

� 2)(1� t

f

)

· 1

Nf

� (Mf

� 1)(1� t

f

)

=(1� tf

)Mf · (Nf

�Mf

)!

Nf

!

The number of combinations for the malicious worker tochoose M

f

out of Nf

executions is�Nf

Mf

�. Thus the overall

probability for the malicious worker to avoid the executionaudition in the current phase is

P(avoid detection)

=

✓N

f

Mf

◆· P(one case of avoid detection)

=N

f

!

Mf

!(Nf

�Mf

)!· (1� t

f

)Mf · (Nf

�Mf

)!

Nf

!

=(1� t

f

)Mf

Mf

!

Therefore, the probability of detecting execution tamper-ing is

Pout

(Mf

) =1� P(avoid detection)

=1� (1� tf

)Mf

Mf

!

APPENDIX FPROOF OF THEOREM 4Proof. We first prove the correctness of the output log ofeach phase. After that, we prove the consistency betweenthe output record and the output log of the last phase.

Suppose the MapReduce application consists of nphases, marked as {1, 2, · · · , n}, respectively. We prove thecorrectness of the output log for each phase with induction.

Base case: According to Algorithm 1, in phase 1, the phaseinput log will be checked against the job input I with input

Page 17: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 16

audition. If this input audition has passed, the input log ofphase 1 is correct. If the execution audition on each functionexecution passed, according to corollary 2.2, the output logof phase 1 is correct.

Induction step: We assume that the output log of phasei� 1 is correct. If the input audition of phase i is passed, theinput log of phase i is correct. If the execution audition oneach function execution passed, according to corollary 2.2,the output log of phase i is correct.

Conclusion: By the principle of induction, the output logof each phase (from phase 1 to phase n) is correct.

According to Line 13 of Algorithm 1, the output of thelast phase will be checked against the output log of thelast phase. Such an input audition is passed only when theoutput records are consistent to the output log. Since theoutput log is correct, the output record is correct.

APPENDIX GPROOF OF THEOREM 5

Proof. According to Theorem 1, for a phase i, the probabilityfor the malicious worker to avoid the detection of thetampering is

(1� Pin

(Mri)) · (1� P

out

(Mfi))

=(1� t

ri + trifbi)

Mri

Mri !

· (1� tfi)

Mfi

Mfi !

The probability for the malicious worker to avoid thedetection of tampering in all the n phases are

nY

i=1

(1� Pin

(Mri)) · (1� P

out

(Mfi))

=nY

i=1

⇣ (1� tri + t

rifbi)Mri

Mri !

· (1� tfi)

Mfi

Mfi !

Therefore, the probability of detecting any tampering inall n phases are

1�nY

i=1

(1� Pin

(Mri)) · (1� P

out

(Mfi))

=1�nY

i=1

⇣ (1� tri + t

rifbi)Mri

Mri !

· (1� tfi)

Mfi

Mfi !

ACKNOWLEDGMENTS

This paper is supported in part by the Open Fund of the Chi-nese Key Laboratory of the Grain Information Processingand Control (No. KFJJ-2015-202), the Fundamental ResearchFunds for the Central Universities (No. XJS16042, JB160312and BDY131419), and the NSFC (No. U1536202, 61571352,61373173, 61602364 and 61602365).

REFERENCES

[1] “Top 10 cloud fiascos,” http://www.networkcomputing.com/cloud/top-10-cloud -fiascos/96279858, accessed: 2016-05-21.

[2] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you,get off of my cloud: exploring information leakage in third-partycompute clouds,” in Proceedings of the 16th ACM conference onComputer and communications security. ACM, 2009, pp. 199–212.

[3] S. Bugiel, S. Nurnberger, T. Poppelmann, A.-R. Sadeghi, andT. Schneider, “Amazonia: when elasticity snaps back,” in Proceed-ings of the 18th ACM conference on Computer and communicationssecurity. ACM, 2011, pp. 389–400.

[4] J. T. Giffin, S. Jha, and B. P. Miller, “Efficient context-sensitiveintrusion detection.” in NDSS, 2004.

[5] ——, “Detecting manipulated remote call streams.” in USENIXSecurity Symposium, 2002, pp. 61–79.

[6] X. Shu, D. Yao, and N. Ramakrishnan, “Unearthing stealthyprogram attacks buried in extremely long execution paths,” inProceedings of the 22nd ACM SIGSAC Conference on Computer andCommunications Security. ACM, 2015, pp. 401–413.

[7] S. Bhatkar, A. Chaturvedi, and R. Sekar, “Dataflow anomaly de-tection,” in 2006 IEEE Symposium on Security and Privacy (S P’06),May 2006, pp. 15 pp.–62.

[8] P. Li, H. Park, D. Gao, and J. Fu, “Bridging the gap between data-flow and control-flow analysis for anomaly detection,” in Com-puter Security Applications Conference, 2008. ACSAC 2008. Annual,Dec 2008, pp. 392–401.

[9] J. Dean and S. Ghemawat, “Mapreduce: Simplified dataprocessing on large clusters,” Commun. ACM, vol. 51,no. 1, pp. 107–113, Jan. 2008. [Online]. Available:http://doi.acm.org/10.1145/1327452.1327492

[10] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, “The hibenchbenchmark suite: Characterization of the mapreduce-based dataanalysis,” in Data Engineering Workshops (ICDEW), 2010 IEEE 26thInternational Conference on, March 2010, pp. 41–51.

[11] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn, “Designand implementation of a tcg-based integrity measurementarchitecture,” in Proceedings of the 13th Conference on USENIXSecurity Symposium - Volume 13, ser. SSYM’04. Berkeley, CA,USA: USENIX Association, 2004, pp. 16–16. [Online]. Available:http://dl.acm.org/citation.cfm?id=1251375.1251391

[12] I. Anati, S. Gueron, S. Johnson, and V. Scarlata, “Innovative tech-nology for cpu based attestation and sealing,” in Proceedings of the2nd international workshop on hardware and architectural support forsecurity and privacy, vol. 13, 2013.

[13] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi,V. Shanbhogue, and U. R. Savagaonkar, “Innovative instructionsand software model for isolated execution.” in HASP@ ISCA, 2013,p. 10.

[14] F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado,G. Mainar-Ruiz, and M. Russinovich, “Vc3: Trustworthy dataanalytics in the cloud using sgx,” in Proceedings of the 2015 IEEESymposium on Security and Privacy, ser. SP ’15. Washington, DC,USA: IEEE Computer Society, 2015, pp. 38–54. [Online]. Available:http://dx.doi.org/10.1109/SP.2015.10

[15] M. Walfish and A. J. Blumberg, “Verifying computations withoutreexecuting them,” Commun. ACM, vol. 58, no. 2, pp. 74–84, Jan.2015. [Online]. Available: http://doi.acm.org/10.1145/2641562

[16] S. T. Setty, R. McPherson, A. J. Blumberg, and M. Walfish, “Makingargument systems for outsourced computation practical (some-times).” in NDSS, vol. 1, no. 9, 2012, p. 17.

[17] B. Parno, J. Howell, C. Gentry, and M. Raykova, “Pinocchio:Nearly practical verifiable computation,” in 2013 IEEE Symposiumon Security and Privacy, May 2013, pp. 238–252.

[18] R. Wahby, S. Setty, Z. Ren, A. Blumberg, and M. Walfish, “Efficientram and control flow in verifiable outsourced computation,” inNetwork and Distributed System Security Symposium (NDSS), 2015.

[19] L. Lamport, R. Shostak, and M. Pease, “The byzantinegenerals problem,” ACM Trans. Program. Lang. Syst.,vol. 4, no. 3, pp. 382–401, Jul. 1982. [Online]. Available:http://doi.acm.org/10.1145/357172.357176

[20] M. Castro, B. Liskov et al., “Practical byzantine fault tolerance,” inOSDI, vol. 99, 1999, pp. 173–186.

[21] W. Du, J. Jia, M. Mangal, and M. Murugesan, “Uncheatable gridcomputing,” in Distributed Computing Systems, 2004. Proceedings.24th International Conference on, 2004, pp. 4–11.

Page 18: RIA: an Audition-based Method to Protect the Runtime Integrity of …web.xidian.edu.cn/yzwang/files/20170426_110302.pdf · 2017. 4. 26. · Abstract—Public cloud vendors have been

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 17

[22] D. Szajda, B. Lawson, and J. Owen, “Toward an optimal redun-dancy strategy for distributed computations,” in Cluster Comput-ing, 2005. IEEE International, Sept 2005, pp. 1–11.

[23] ——, “Hardening functions for large scale distributed computa-tions,” in Security and Privacy, 2003. Proceedings. 2003 Symposiumon, May 2003, pp. 216–224.

[24] W. Wei, J. Du, T. Yu, and X. Gu, “Securemr: A service integrityassurance framework for mapreduce,” in Proceedings of the 2009Annual Computer Security Applications Conference, ser. ACSAC ’09.Washington, DC, USA: IEEE Computer Society, 2009, pp. 73–82.[Online]. Available: http://dx.doi.org/10.1109/ACSAC.2009.17

[25] Y. Wang and J. Wei, “Viaf: Verification-based integrity assuranceframework for mapreduce,” in IEEE CLOUD, 2011, pp. 300–307.

[26] Y. Wang, J. Wei, and M. Srivatsa, “Result integrity check formapreduce computation on hybrid clouds,” in IEEE CLOUD, 2013.

[27] ——, “Cross cloud mapreduce: A result integrity check frameworkon hybrid clouds,” International Journal of Cloud Computing, vol. 1,no. 1, pp. 26–39, 2013.

[28] Y. Wang, J. Wei, S. Ren, and Y. Shen, “Towardintegrity assurance of outsourced computing, a gametheoretic perspective,” Future Generation Computer Systems,vol. 55, pp. 87 – 100, 2016. [Online]. Available:http://www.sciencedirect.com/science/article/pii/S0167739X15002630

[29] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flowintegrity principles, implementations, and applications,” ACMTrans. Inf. Syst. Secur., vol. 13, no. 1, pp. 4:1–4:40, Nov. 2009.[Online]. Available: http://doi.acm.org/10.1145/1609956.1609960

[30] B. Niu and G. Tan, “Per-input control-flow integrity,” inProceedings of the 22Nd ACM SIGSAC Conference on Computerand Communications Security, ser. CCS ’15. New York,NY, USA: ACM, 2015, pp. 914–926. [Online]. Available:http://doi.acm.org/10.1145/2810103.2813644

[31] Y. Wang and Y. Shen, “Poster: Ria: An audition-based methodto protect the runtime integrity of mapreduce applications,”in Proceedings of the 2016 ACM SIGSAC Conference onComputer and Communications Security, ser. CCS ’16. NewYork, NY, USA: ACM, 2016, pp. 1799–1801. [Online]. Available:http://doi.acm.org/10.1145/2976749.2989042

Yongzhi Wang received his BS and MS de-grees in computer science from Xidian Univer-sity, China in 2004 and 2007, respectively. He re-ceived his PhD in computer science from FloridaInternational University, FL, USA, in 2015. SinceAugust 2015, he has been an Assistant Pro-fessor at Xidian University, China. His researchinterests include big data, cloud computing se-curity and outsourced computing security.

Yulong Shen received the B.S. and M.S. de-grees in Computer Science and Ph.D. degreein Cryptography from Xidian University, Xian,China, in 2002, 2005, and 2008, respectively.He is currently a Professor at the School ofComputer Science and Technology, Xidian Uni-versity, China. He is also an associate directorof the Shaanxi Key Laboratory of Network andSystem Security and a member of the StateKey Laboratory of Integrated Services networksXidian University, China. He has also served on

the technical program committees of several international conferences,including ICEBE, INCoS, CIS and SOWN. His research interests includeWireless network security and cloud computing security.

Xiaohong Jiang received his B.S., M.S. andPh.D degrees in 1989, 1992, and 1999 respec-tively, all from Xidian University, China. He iscurrently a full professor of Future UniversityHakodate, Japan. Before joining Future Univer-sity, Dr.Jiang was an Associate professor, To-hoku University, from Feb.2005 to Mar.2010. Dr.Jiangs research interests include computer com-munications networks, mainly wireless networksand optical networks, network security, router-s/switches design, etc. He has published over

260 technical papers at premium international journals and conferences,which include over 50 papers published in top IEEE journals and topIEEE conferences, like IEEE/ACM Transactions on Networking, IEEEJournal of Selected Areas on Communications, IEEE Transactions onParallel and Distributed Systems, IEEE INFOCOM. Dr. Jiang was thewinner of the Best Paper Award of IEEE HPCC 2014, IEEE WCNC 2012,IEEE WCNC 2008, IEEE ICC 2005-Optical Networking Symposium, andIEEE/IEICE HPSR 2002. He is a Senior Member of IEEE, a Member ofACM and IEICE.