
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3083

SPECIAL ISSUE PAPER

SaC-FRAPP: a scalable and cost-effective framework for privacy preservation over big data on cloud

Xuyun Zhang1,*,†, Chang Liu1, Surya Nepal2, Chi Yang1, Wanchun Dou3 and Jinjun Chen1

1 Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia
2 Information and Communication Technologies Centre, Commonwealth Scientific and Industrial Research Organisation, Sydney, Australia
3 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

SUMMARY

Big data and cloud computing are two disruptive trends nowadays, provisioning numerous opportunities to the current information technology industry and research communities while posing significant challenges on them as well. Cloud computing provides powerful and economical infrastructural resources for cloud users to handle ever-increasing data sets in big data applications. However, processing or sharing privacy-sensitive data sets on cloud probably engenders severe privacy concerns because of multi-tenancy. Data encryption and anonymization are two widely adopted ways to combat privacy breach. However, encryption is not suitable for data that are processed and shared frequently, and anonymizing big data and managing numerous anonymized data sets remain challenges for traditional anonymization approaches. As such, we propose a scalable and cost-effective framework for privacy preservation over big data on cloud in this paper. The key idea of the framework is that it leverages cloud-based MapReduce to conduct data anonymization and manage anonymous data sets before releasing data to others. The framework provides a holistic conceptual foundation for privacy preservation over big data. Further, a corresponding proof-of-concept prototype system is implemented. Empirical evaluations demonstrate that the framework can anonymize large-scale data sets and manage anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. Copyright © 2013 John Wiley & Sons, Ltd.

Received 28 May 2013; Accepted 4 June 2013

KEY WORDS: big data; cloud; privacy preservation; framework; anonymization

1. INTRODUCTION

Big data and cloud computing, two disruptive trends at present, offer a large number of business opportunities and likewise pose considerable challenges on the current information technology (IT) industry and research communities [1, 2]. Data sets in most big data applications such as social networks and sensor networks have become so large and complex that it is a considerable challenge for traditional data processing tools to handle the data processing pipeline (including collection, storage, processing, mining, sharing, etc.). Generally, such data sets are often from various sources and of different types (Variety), such as unstructured social media content and semi-structured medical records and business transactions, and are of large sizes (Volume) with fast

*Correspondence to: Xuyun Zhang, Faculty of Engineering and IT, University of Technology, Sydney, Sydney, Australia.
†E-mail: [email protected]


data in or out (Velocity). Cloud systems provide massive computation power and storage capacity that enable users to deploy applications without infrastructure investment. Because of these salient features, cloud is promising for users to handle the big data processing pipeline with its elastic and economical infrastructural resources. For instance, MapReduce [3], an extensively studied and widely adopted large-scale data processing paradigm, is incorporated with cloud infrastructure to provide more flexible, scalable, and cost-effective computation for big data processing. A typical example is the Amazon Elastic MapReduce service. Users can invoke Amazon Elastic MapReduce to conduct their MapReduce computations based on the powerful infrastructure offered by Amazon Web Services (e.g., Elastic Compute Cloud and Simple Storage Service) and are charged in proportion to the usage of the services. In this way, it is economical and convenient for companies and organizations to collect, store, analyze, and share big data to gain competitive advantages.

However, because cloud systems are multi-tenant, processing or sharing privacy-sensitive data sets on them, for example, 'other parties' mining healthcare data sets, probably engenders severe privacy concerns. Moreover, privacy concerns in MapReduce platforms are aggravated because the privacy-sensitive information scattered among various data sets can be recovered with more ease when data and computational power are considerably abundant. Although some privacy issues are not new, their importance is amplified by cloud computing and big data [2]. With the wide adoption of online cloud services and the proliferation of mobile devices, the privacy concern about processing and sharing sensitive personal information is increasing. For instance, HealthVault, an online health service provided by Microsoft (Redmond, Washington, USA), is deployed on the Windows Azure cloud platform. Although the data in such cloud services are usually deemed extremely privacy-sensitive, they can offer significant human benefits if analyzed and mined by organizations such as disease research centers.

Data encryption [4, 5] and anonymization [6, 7], having been extensively studied recently, are promising and widely adopted ways to combat privacy breach and violation on cloud. Mechanisms such as encryption [4], access control [8], and differential privacy [9] are exploited to protect data privacy. These mechanisms are well-known pillars of privacy protection, yet they still have open questions in the context of cloud computing and big data [2]. Usually, the data sets uploaded into cloud are not only for simple storage but also for online cloud applications; that is, the data sets are dynamic. If we encrypt these data sets, processing them efficiently becomes quite a challenging task, because most existing applications only run on unencrypted data sets. Although recent progress has been made in homomorphic encryption, which theoretically allows performing computation on encrypted data sets, applying current algorithms is rather expensive because of their inefficiency [10]. Moreover, data holders and data users in cloud are different parties in most applications, for example, cloud health service providers and pharmaceutical companies. In such cases, encryption or access control mechanisms alone fail to ensure both privacy preservation and data utility exposure. Data anonymization is a promising category of approaches to achieve such a goal [7]. However, most existing anonymization algorithms lack scalability over big data. Hence, how to leverage state-of-the-art cloud-based techniques to address the scalability issues of current anonymization approaches deserves considerable attention. Furthermore, anonymous data sets scattered on cloud probably compromise data privacy if they are not properly managed. These data sets can be highly related because they may be parts of one data set [11], anonymous versions of a data set at different levels [12], incremental versions of a data set [13], and so on. Adversaries are able to collect such a group of data sets from cloud and infer certain private information from them even if each data set individually satisfies a privacy requirement. As such, it is still a challenge to manage multiple anonymous data sets on cloud.

In this paper, we propose a scalable and cost-effective framework for privacy preservation (SaC-FRAPP) over big data on cloud by integrating our previous work in terms of the lifecycle of data anonymization. The key idea of the framework is that it leverages cloud-based MapReduce and HDFS (Hadoop Distributed File System) to conduct data anonymization and manage anonymous data sets, respectively, before releasing data sets to other parties. The framework is built on top of MapReduce and functions as a filter to preserve the privacy of data sets before these data sets are accessed and processed by MapReduce. Specifically, the framework provides interfaces for data holders to specify various privacy requirements based on different privacy models. Once


privacy requirements are specified, the framework launches MapReduce versions of anonymization algorithms to efficiently anonymize data sets for subsequent MapReduce tasks. Anonymous data sets are retained and reused to avoid re-computation cost. Thus, the framework also handles dynamic updates of data sets to maintain the privacy requirements of such data sets. Besides anonymization, SaC-FRAPP integrates encryption techniques to cost-effectively ensure the privacy of multiple data sets that are independently anonymized in terms of different privacy requirements. Finally, a corresponding proof-of-concept prototype system based on our cloud environment is implemented for the framework. The framework provides a holistic conceptual foundation for privacy preservation over big data and enables users to realize the full potential of the high scalability, elasticity, and cost-effectiveness of the cloud. We conduct extensive experiments on real-world data sets to evaluate the proposed framework. Empirical evaluations demonstrate that SaC-FRAPP can anonymize big data and manage the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion.

The contributions of this paper are threefold. Firstly, we propose SaC-FRAPP over big data on cloud by integrating our previous work in terms of the lifecycle of data anonymization. Secondly, a corresponding proof-of-concept prototype system is presented. Thirdly, extensive experimental evaluations on the modules of the framework are provided to demonstrate that SaC-FRAPP can anonymize big data and manage the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion.

The remainder of this paper is organized as follows. The next section reviews the related work on privacy protection in cloud computing, big data, and MapReduce. In Section 3, we briefly present the preliminary background knowledge about cloud systems and MapReduce. Section 4 describes the lifecycle of data anonymization on cloud and formulates the details of the proposed framework. The proof-of-concept prototype system is designed and implemented in Section 5. We present the empirical evaluation results of each module of the prototype system in Section 6. Finally, we conclude this paper and discuss our future work in Section 7.

2. RELATED WORK

We briefly review recent research on data privacy preservation and privacy protection in MapReduce and cloud computing environments.

The privacy-preserving data publishing research community has investigated privacy-preserving issues extensively and made fruitful progress with a variety of privacy models and preserving methods [7]. Privacy principles such as k-anonymity [14], l-diversity [15], and t-closeness [16] are put forth to model and quantify privacy. Privacy principles for multiple data sets have also been proposed, but they aim at specific scenarios such as continuous data publishing or sequential data releasing [7]. Several anonymizing operations are leveraged to anonymize data sets, including generalization [17–19], anatomization [20], slicing [21], disassociation [22], and so on. Roughly, there are four generalization schemes [7], namely, full-domain generalization (FG) [23], sub-tree generalization (SG) [19], multidimensional generalization (MG) [17], and cell generalization (CG) [18].

Encryption is widely adopted as a straightforward approach to ensure data privacy on cloud against malicious users. Special operations such as query or search on encrypted data sets stored on cloud have been extensively studied [4, 24, 25], although performing general operations on encrypted data sets is still quite challenging [10]. Instead of encrypting all data sets, Puttaswamy et al. [8] described a set of tools called Silverline that can separate all functionally encryptable data from other cloud application data, where the former is encrypted for privacy preservation while the latter is left unencrypted for application functions. However, the sensitivity of data is required to be labeled in advance. Zhang et al. [26] proposed a privacy leakage upper bound constraint-based approach to preserve the privacy of multiple data sets by encrypting only part of the data sets on cloud.

As SaC-FRAPP is proposed based on MapReduce, we review MapReduce-relevant research work in the succeeding texts. The Kerberos authentication mechanism [27] is integrated into the MapReduce framework of Hadoop after the 1.0.0 version. Basically, access control fails to preserve privacy because data users can infer the privacy-sensitive information if they access the unencrypted data. Roy et al. [28] investigated the data privacy problem caused by the MapReduce framework and


presented a system named Airavat, which incorporates mandatory access control with differential privacy. The mandatory access control is triggered when the privacy leakage exceeds a threshold, so that both privacy preservation and high data utility are ensured. However, the results produced in this system are mixed with certain noise, which is unsuitable for many applications that need data sets without noise, for example, medical experiment data mining and analysis. Based on MapReduce, several systems have been proposed to handle privacy concerns of computation and storage on cloud. Blass et al. [29] proposed a privacy-preserving scheme named PRISM for MapReduce on cloud to perform parallel word search over encrypted data sets. Nevertheless, many cloud applications require MapReduce to conduct tasks such as data mining and analytics over these data sets besides search. Ko et al. [30] proposed the HybrEx MapReduce model to provide a way in which sensitive and private data are processed within a private cloud, whereas other computation can be safely extended to a public cloud. Similarly, Zhang et al. [31] proposed a system named Sedic, which partitions MapReduce computing jobs in terms of the security labels of the data they work on and then assigns the computation without sensitive data to a public cloud. However, the sensitivity of data is also required to be acquired in advance in the above two systems. Wei et al. [32] proposed a service integrity assurance framework named SecureMR for the MapReduce framework, but we mainly focus herein on privacy-preserving issues in the proposed layer. Our privacy-preserving framework attempts to produce and retain anonymous data sets according to data holders' privacy requirements for subsequent MapReduce tasks.

3. CLOUD SYSTEMS AND MAPREDUCE PRELIMINARY

3.1. Cloud systems

Cloud computing is one of the most hyped IT innovations at present, having sparked plenty of interest in both the IT industry and academia. Recently, IT giants such as Amazon (Seattle, Washington, US), Google (Mountain View, California, US), IBM (Armonk, New York, US), and Microsoft have invested huge sums of money in building up their public cloud products, and indeed they have developed their own products, for example, Amazon's Web Services, Google's App Engine and Compute Engine, and Microsoft's Azure. Meanwhile, several corresponding open-source cloud computing solutions have also been developed, such as Hadoop, Eucalyptus, OpenNebula, and OpenStack. The cloud computing definition published by the US National Institute of Standards and Technology comprehensively covers the commonly agreed aspects of cloud computing [33]. In terms of the definition, the cloud model consists of five essential characteristics, three service delivery models, and four deployment models. The five key features encompass on-demand self-service, broad network access, resource pooling (multi-tenancy), rapid elasticity, and measured services. The three service delivery models are cloud software as a service, for example, Google Docs; cloud platform as a service, for example, Google App Engine; and cloud infrastructure as a service, for example, Amazon Elastic Compute Cloud and Amazon Simple Storage Service. The four deployment models include private cloud, community cloud, public cloud, and hybrid cloud.

Technically, cloud computing can be regarded as an ingenious combination of a series of developed or developing ideas and technologies, establishing a novel business model by offering IT services using economies of scale. In general, the basic ideas encompass service computing, grid computing, distributed computing, and so on. The core technologies on which cloud computing is principally built include web service technologies and standards, virtualization, novel distributed programming models like MapReduce, and cryptography. All the participants in cloud computing can benefit from this new business model. Giant IT enterprises can not only run their own core businesses but also make profit by delivering their spare infrastructure services to others. Small and medium-sized businesses are able to focus on their own core businesses by outsourcing the tedious and complicated IT management to cloud service providers, usually at a fairly low cost. Especially, cloud computing facilitates start-ups considerably, enabling them to build up their business with low upfront IT investments as well as cheap ongoing costs. Moreover, because of the flexibility of cloud computing, companies can adapt their business readily and swiftly by enlarging or shrinking the business scale dynamically without concerns about losing anything.


3.2. MapReduce basics

MapReduce is a scalable and fault-tolerant data processing framework that enables processing huge volumes of data in parallel on many low-end commodity computers [3]. MapReduce was first introduced by Google in 2004, with similar concepts in functional languages dating back as early as the 1960s. It has been widely adopted and received extensive attention from both academia and industry because of its promising capability. In the context of cloud computing, the MapReduce framework becomes more scalable and cost-effective because infrastructure resources can be provisioned on demand. Simplicity, scalability, and fault tolerance are the three main salient features of the MapReduce framework. Therefore, it is convenient and beneficial for companies and organizations to utilize MapReduce services, such as Amazon Elastic MapReduce, to process big data and obtain core competitiveness.

Basically, a MapReduce task consists of two primitive functions, map and reduce, defined over a data structure named the key-value pair (key, value). Specifically, the map function can be formalized as map: (k1, v1) → (k2, v2); that is, the map function takes a pair (k1, v1) as input and outputs another intermediate key-value pair (k2, v2). These intermediate pairs are consumed by the reduce function as input. Formally, the reduce function can be represented as reduce: (k2, list(v2)) → (k3, v3); that is, the reduce function takes an intermediate key k2 and all its corresponding values list(v2) as input and outputs another pair (k3, v3). Usually, the list of (k3, v3) pairs is the result that MapReduce users attempt to get. Both map and reduce functions are specified by data users in terms of their specific applications.
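For concreteness, the following minimal Python sketch (our illustration, not part of the framework) simulates the map and reduce semantics above, including the implicit shuffle phase that groups intermediate values by key, using word count as the application:

```python
from collections import defaultdict

def map_fn(k1, v1):
    # map: (k1, v1) -> list of (k2, v2); here k1 is a line offset, v1 a line of text.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # reduce: (k2, list(v2)) -> (k3, v3); here it sums the counts of each word.
    return (k2, sum(values))

def run_mapreduce(records):
    groups = defaultdict(list)
    # Map and shuffle: group all intermediate values by intermediate key k2.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce: one call per intermediate key.
    return [reduce_fn(k2, v2s) for k2, v2s in groups.items()]

print(run_mapreduce([(0, "big data on cloud"), (1, "big data anonymization")]))
# [('big', 2), ('data', 2), ('on', 1), ('cloud', 1), ('anonymization', 1)]
```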

To make such a simple programming model work effectively and efficiently, MapReduce implementations provide a variety of fundamental mechanisms such as data replication and data sorting. Besides, distributed file systems such as the Hadoop distributed file system [34] are substantially crucial to make the MapReduce framework run in a highly scalable and fault-tolerant fashion. Recently, the standard MapReduce framework has been extensively revised into many variations in order to handle data in different scenarios. For instance, Incoop [35] is proposed for incremental MapReduce computation; it detects changes to the input and automatically updates the output by employing an efficient, fine-grained result reuse mechanism. Several novel techniques, namely, a storage system, a contraction phase for reduce tasks, and an affinity-based scheduling algorithm, are utilized to achieve efficiency without sacrificing transparency. As the standard MapReduce framework lacks built-in support for iterative programming, which arises naturally in many applications including data mining, web ranking, graph analysis, model fitting, and so on, HaLoop [36] and Twister [37] are designed to support iterative MapReduce computation. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations, including a loop-aware task scheduler, loop-invariant data caching, and caching for efficient fixpoint verification. Twister is a distributed in-memory MapReduce runtime optimized for iterative MapReduce computations.

4. SCALABLE AND COST-EFFECTIVE FRAMEWORK FOR PRIVACY PRESERVATION

In this section, we mainly present the design of SaC-FRAPP. Section 4.1 first describes the lifecycle of data anonymization on cloud and specifies corresponding system requirements for the lifecycle. An overview of the framework is presented in Section 4.2. Then, we elaborate the details of each module of the framework in Sections 4.3 to 4.6.

4.1. Lifecycle of data anonymization on cloud

As a large number of big data applications are deployed on cloud platforms, most of the data in such applications are collected, stored, processed, and shared on cloud. A typical lifecycle of big data anonymization on cloud is described in Figure 1, in which all participants are illustrated. We describe the phases in Figure 1 as follows.

Figure 1. Lifecycle of data anonymization on cloud. [The figure depicts five phases, namely, I Data Collection, II Data Anonymization, III Anonymous Data Management, IV Data Update, and V Data Consumption, together with the participants: individual owners, group owners, the data holder, query users, data mining users, and adversaries.]

Phase I represents the data collection phase, where data owners submit their privacy-sensitive data into services or applications provided by data holders. A data owner is the individual who is associated with the data record she or he submitted. Usually, a data owner is an end user of some cloud-based online service, for example, an online healthcare service. Also, group data owners such as institutes place data sets containing records from a group of individuals into cloud; for example, hospitals may put patients' diagnosis information into online healthcare services. A data holder is usually the service or application owner; for example, Microsoft is the data holder of its cloud-based healthcare service, HealthVault. Data holders are responsible for the privacy protection of data owners. Data owners can specify personalized privacy requirements to data holders, but mostly, data holders develop privacy protection strategies according to governments' privacy policies and regulations.

To make full use of the collected data sets, data holders would like to release the data sets to other parties for analysis or data mining. However, given the privacy concerns, the data holder anonymizes the data sets first. Unlike the traditional case in which data holders anonymize data sets themselves, data anonymization is conducted by dedicated data anonymization services, that is, anonymization as a service. Phase II illustrates such a process. We assume that anonymization services are trusted. Privacy-preserving requirements are specified as parameters to anonymization services. After anonymization, the anonymous data sets are released to cloud or to certain data recipients. In general, there will be multiple anonymous data sets because of different data usages or data recipients. Hence, Phase III chiefly manages such anonymous data sets to ensure privacy preservation. Because some big data applications are online services, for example, Facebook and HealthVault, the data sets in these applications are dynamic and incremental. Consequently, Phase IV is responsible for updating anonymous data sets when new data records are added. To anonymize the newly added data, the already anonymized data sets should be taken into account, and the whole anonymized data sets are then offered to users. Phases II to IV will be elaborated in the following sections.

In general, anonymous data sets are consumed by other parties. Phase V in Figure 1 shows such a process. Various data users access different anonymous data sets for diverse purposes. For instance, research centers train data models such as decision trees or association rules on the anonymous data sets to reveal certain trends such as disease infection. Meanwhile, malicious data users deliberately attempt to infer privacy-sensitive information from the released data sets. Note that privacy can also be breached incidentally when a legitimate data user processes a data set.

In big data and cloud environments, data anonymization encounters several challenges because of the new features that come with cloud systems and big data applications. In order to comply with such features specific to cloud computing and big data, we identify several system requirements that should be satisfied when designing a framework for privacy preservation over big data on cloud, as follows. Note that scalability and cost-effectiveness are the two most important issues in such a framework for big data privacy preservation.

• Flexible. The framework should provide a user interface through which data owners can specify various privacy requirements. Usually, data sets will be accessed and processed by different data users for different applications.


• Scalable. Scalability is necessary for current privacy-preserving approaches because the scale of data sets is too large for existing serial algorithms to process. So the privacy-preserving framework should also be scalable to handle data anonymization and management. Concretely, the data-intensive or computation-intensive operations in the framework should be executed efficiently in a parallel and scalable fashion.

• Dynamic. In cloud computing, most applications accumulate data sets over time; for example, cloud healthcare services receive a large amount of information from users. The scale of such data sets becomes larger and larger, forming big data. Hence, the privacy-preserving framework should handle dynamic data sets so that the privacy and utility of such data sets can still be ensured when updates occur.

• Cost-effective. Given the pay-as-you-go feature of cloud computing, saving IT cost is one of the core enablers. Thus, it is also desired for the privacy-preserving framework to save the expense of privacy preservation as much as possible while privacy preservation and data utility can still be ensured.

4.2. Framework overview

In terms of the four system requirements formulated earlier in the text, we designed SaC-FRAPP over big data on cloud. Accordingly, the framework consists of four main modules, namely, privacy specification interface (PSI), data anonymization (DA), data update (DU), and anonymous data sets management (ADM). Based on the four modules, the privacy-preserving framework can achieve the four system requirements. The framework overview is depicted systematically in Figure 2, where the submodules (dashed rectangles) of each module are illustrated and the relationships among the aforementioned four modules, the big data application layer, MapReduce or HDFS, and the cloud infrastructure are described as well. Note that although the framework consists of only the four modules, we also present in Figure 2 its potential users, that is, end users and big data applications or tools like Mahout, as well as its infrastructure and drivers, that is, MapReduce, HDFS, and the cloud infrastructure.

As shown in Figure 2, DA, DU, and ADM are the three main functional modules. They conduct concrete operations on data sets in terms of the privacy models specified in the PSI module. The DA and DU modules take advantage of MapReduce and HDFS to anonymize data sets or adjust anonymized data sets when updates occur. The ADM module is responsible for managing anonymized data sets in order to save expense by avoiding re-computation. Thus, ADM utilizes the cloud infrastructure directly to accomplish its tasks. We will discuss the four proposed modules and their submodules in detail in the following sections.

Figure 2. Framework overview of the scalable and cost-effective framework for privacy preservation.

Unlike traditional MapReduce platforms, the MapReduce platform in our research is deployed on top of the cloud infrastructure to gain high scalability and elasticity. To preserve the privacy of data sets processed by data users using MapReduce, a privacy-preserving layer is introduced between the original data sets and user-specified MapReduce tasks. Basically, this layer, including the DA and DU modules, is the engine of SaC-FRAPP, as it conducts most of the computation in the framework. Figure 3 depicts the privacy-preserving layer based on MapReduce and HDFS.

As shown in Figure 3, original data sets are stored confidentially on cloud by data owners and can never be accessed directly by data users. Data holders then specify privacy requirements and submit them to the privacy-preserving layer. The layer is responsible for anonymizing original data sets according to the privacy requirements. Certain anonymous data sets are retained to avoid re-computation. Thus, the layer is also responsible for updating anonymous data when new data arrive. Data users can then specify their application tasks as MapReduce jobs and run these jobs on the anonymized data sets. The results are stored in HDFS or further stored in cloud storage. The privacy-preserving layer itself exploits MapReduce jobs to conduct the computation required in data anonymization. It is both plausible and necessary to use MapReduce to accomplish the computation for anonymizing and managing these data sets because data sets are usually of huge volume and complexity in the context of cloud computing and big data.

Figure 3. Privacy-preserving layer based on MapReduce and Hadoop Distributed File System.

4.3. Privacy specification interface

As to privacy models and protection techniques, the privacy-preserving data publishing research community has investigated these issues extensively and made fruitful progress with a variety of privacy models, privacy-preserving methods, and algorithms [7]. Usually, original data sets can be accessed and processed by different data users for different purposes, leading to various privacy risks in these scenarios. Moreover, the privacy requirements of data owners possibly vary over time. As such, a systematic and flexible privacy specification model is proposed to frame privacy requirements. We have the following definition of a privacy specification.

Definition 1 (Privacy Specification). The privacy requirements specified by a data owner are defined as a Privacy Specification (PS). A PS is formally represented by a vector of parameters, that is, PS = <PMN, Thr, AT, Alg, Gra, Uti>. The parameters in the vector are elaborated subsequently.

The name of a privacy model is represented by PMN. Out of recently proposed privacy models, we employ three widely adopted ones in the privacy-preserving framework, namely, k-anonymity, l-diversity, and t-closeness. The three privacy principles provide different degrees of privacy protection. The privacy principle k-anonymity means that the number of anonymized records that correspond to a quasi-identifier is required to be larger than a threshold. Otherwise, once certain quasi-identifiers are so specific that only a small group of people is linked to them, these individuals are linked to sensitive information with high confidence, resulting in privacy breach. Here, quasi-identifiers represent the groups of anonymized data records. Based on k-anonymity, l-diversity requires that the number of distinct sensitive values corresponding to a quasi-identifier be no less than a threshold, and it is therefore stricter than k-anonymity. The strictest is t-closeness, which requires the distribution of sensitive values corresponding to a quasi-identifier to be close to that of the original data set.
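To make these principles concrete, the following Python sketch (our illustration; the attribute names are toy assumptions) checks whether an anonymized table satisfies k-anonymity and distinct l-diversity for a chosen set of quasi-identifier attributes:

```python
from collections import defaultdict

def check_privacy(records, qi_attrs, sensitive_attr, k, l):
    """records: list of dicts. Returns (satisfies_k_anonymity, satisfies_l_diversity)."""
    groups = defaultdict(list)
    for rec in records:
        qid = tuple(rec[a] for a in qi_attrs)  # quasi-identifier group key
        groups[qid].append(rec[sensitive_attr])
    k_ok = all(len(vals) >= k for vals in groups.values())       # group sizes
    l_ok = all(len(set(vals)) >= l for vals in groups.values())  # distinct sensitive values
    return k_ok, l_ok

table = [
    {"Age": "[20-30)", "Sex": "M", "Disease": "flu"},
    {"Age": "[20-30)", "Sex": "M", "Disease": "cold"},
    {"Age": "[30-40)", "Sex": "F", "Disease": "flu"},
    {"Age": "[30-40)", "Sex": "F", "Disease": "asthma"},
]
print(check_privacy(table, ["Age", "Sex"], "Disease", k=2, l=2))  # (True, True)
```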



Parameter Thr is the threshold of the specified privacy model, that is, k, l, and t in the three previously mentioned privacy principles.

Parameter AT denotes the application type. Data owners can specify the goals of anonymized data sets, for example, classification, clustering, or general use. If the use of the anonymized data sets is known in advance, anonymization algorithms can produce anonymized data sets with higher data utility while privacy is still preserved. Without knowing the types of applications that consume the anonymized data sets, the DA module produces anonymized data sets for general use. Different information metrics are utilized for different application types [7].

The anonymization algorithm is indicated by the parameter Alg. A variety of algorithms have been developed for different privacy principles and application types. Details will be described in Section 4.4.

Parameter Gra represents the granularity of the privacy specification. It determines the scope of the privacy preservation. Usually, only part of the original data sets is shared with data users, and different data users are possibly interested in different parts of big data. Moreover, only part of the attributes of a data record is considered in the process of anonymization. Thus, the granularity parameter is quite useful.

The data utility parameter Uti is an optional one. Data owners can specify how much data utility they allow to be exposed to data users. In general, privacy and data utility are two roughly opposite aspects of privacy preservation. When privacy thresholds are given, most anonymization algorithms usually expose as much data utility as possible. On the contrary, we can make the data sets as anonymous as possible if the data utility is fixed.
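To make the specification concrete, here is a minimal Python sketch of a PS record (our illustration; the field names follow Definition 1, while the value types and defaults are assumptions):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PrivacySpecification:
    """PS = <PMN, Thr, AT, Alg, Gra, Uti> from Definition 1."""
    pmn: str                     # privacy model name: 'k-anonymity', 'l-diversity', 't-closeness'
    thr: float                   # threshold: k, l, or t for the chosen model
    at: str = "general"          # application type: 'classification', 'clustering', 'general'
    alg: str = "SG"              # generalization scheme: 'FG', 'SG', 'MG', or 'CG'
    gra: Tuple[str, ...] = ()    # granularity: attributes in the scope of anonymization
    uti: Optional[float] = None  # optional bound on the data utility to expose

ps = PrivacySpecification(pmn="k-anonymity", thr=25, alg="MG",
                          gra=("Age", "Sex", "Native country"))
```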

In summary, the SaC-FRAPP framework systematically and comprehensively provides diverse privacy specifications to achieve the flexibility and diversity of privacy preservation.

4.4. Data anonymization

Several categories of anonymization techniques have been proposed, including generalization, slicing, disassociation, and anatomization. In SaC-FRAPP, we utilize generalization for anonymization because it is widely investigated and adopted in existing algorithms. Specifically, four generalization schemes have been proposed, namely, full-domain generalization (FG), sub-tree generalization (SG), multidimensional generalization (MG), and cell generalization (CG). Their corresponding submodules are depicted in Figure 2. Roughly speaking, the data utility exposed by these four schemes increases in the order FG < SG < MG < CG when the same privacy requirement is given. But note that the anonymized data sets produced by MG and CG suffer from the data exploration problem, that is, the anonymized data sets contain inconsistent data. For instance, one original attribute value can be generalized into two different higher-level values. The Alg parameter in a privacy specification indicates which generalization scheme is used to anonymize data sets.
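To illustrate how generalization walks an attribute taxonomy tree (our sketch; the tree content is a toy assumption), the function below replaces a value with its ancestor at a higher anonymization level:

```python
# Toy taxonomy for an Education attribute: child -> parent; 'ANY' is the root.
TAXONOMY = {
    "9th": "Junior-Secondary", "10th": "Junior-Secondary",
    "11th": "Senior-Secondary", "12th": "Senior-Secondary",
    "Bachelors": "University", "Masters": "University",
    "Junior-Secondary": "Secondary", "Senior-Secondary": "Secondary",
    "Secondary": "ANY", "University": "ANY",
}

def generalize(value, levels):
    """Climb 'levels' steps up the taxonomy tree, stopping at the root."""
    for _ in range(levels):
        if value == "ANY":
            break
        value = TAXONOMY[value]
    return value

print(generalize("Masters", 1))  # 'University'
print(generalize("9th", 2))      # 'Secondary'
```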

However, most existing anonymization algorithms are centralized or serial, meaning that these algorithms fail to handle big data. Usually, big data are so large that they fail to fit in the memory of one normal cloud computation node. Hence, they are usually stored across a number of nodes. Therefore, it is a challenging problem for existing anonymization algorithms to anonymize large-scale data sets on cloud. Hence, we revise existing algorithms into MapReduce versions in order to exploit MapReduce to anonymize data sets efficiently in a scalable and parallel fashion.

The data anonymization (DA) module consists of a series of MapReduce versions of anonymization algorithms. Basically, each anonymization algorithm has a MapReduce driver program and several pairs of map and reduce programs. Usually, these map and reduce programs, constituting a MapReduce job, will be executed iteratively because anonymization is an iterative process. Although the standard MapReduce implementation mainly supports one-pass data processing rather than iterative processing, we design our MapReduce algorithms carefully to make full use of MapReduce. For instance, we improve the degree of parallelization by partitioning data in advance and launching multiple MapReduce jobs simultaneously, as sketched below.
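A hedged sketch of this iterative driver pattern (our illustration of the idea, not the actual implementation; the job and check callables are assumptions) is as follows:

```python
def iterative_anonymization_driver(partitions, run_job, satisfies_privacy):
    """Run rounds of MapReduce jobs, one per pre-partitioned data chunk,
    until every partition meets the privacy requirement."""
    round_no = 0
    while not all(satisfies_privacy(p) for p in partitions):
        # Submit one job per partition; in a real deployment these jobs
        # run concurrently, raising the degree of parallelization.
        partitions = [run_job(p, round_no) for p in partitions]
        round_no += 1
    return partitions
```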

4.5. Data update

The anonymized data sets are retained on cloud for different data users, so these data sets are persisted every time they are generated. However, the data sets in applications on cloud are dynamic and increase dramatically over time, resulting in big data. Hence, we have to update both the original data sets and the anonymized data sets.

A straightforward way is to anonymize the whole updated data set from scratch. From the efficiency perspective, it is usually unacceptable to anonymize all data sets once an update occurs. Furthermore, privacy preservation fails to be ensured according to the analysis in [13]. Therefore, a better way is to anonymize just the updated part and adjust the already anonymized data sets. For anonymous data sets, the anonymization level is used to describe the degree of privacy preservation.

Usually, the anonymization level of the already anonymized data sets satisfies the given privacy requirements, so new data can simply be anonymized to the current anonymization level when updates occur. But the newly anonymized data sets possibly violate the privacy requirements because they are likely to be too specific. In such a case, we have to adjust the anonymization level of the whole anonymized data sets to ensure privacy preservation for all data.

Another aspect of privacy preservation is to expose as much data utility as possible to data users while the privacy requirements are satisfied. For data anonymization, an interesting phenomenon is that, for a given privacy requirement, the more data are anonymized, the lower the anonymization level will be. A lower anonymization level means more data utility can be produced because the anonymized values are more specific. Consequently, just anonymizing new data to the current anonymization level is not sufficient, even though it definitely satisfies the privacy requirements. To expose more data utility, we need to lower the anonymization level of the whole anonymized data sets.

Above all, three basic operations are provided in the DU module, namely, update, generalization, and specialization. Generalization is utilized to raise the anonymization level, whereas specialization lowers it.
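The following Python sketch (ours; all helper callables are assumptions, not the paper's API) shows how the three operations can fit together when an update batch arrives:

```python
def handle_update(anon_data, new_records, level,
                  anonymize_to, violates_privacy, still_private_at):
    """Update: anonymize the new records to the current anonymization level.
    Generalization: raise the level if the merged data violate privacy.
    Specialization: lower the level while a lower level is still private,
    exposing more data utility."""
    anon_data = anon_data + anonymize_to(new_records, level)      # update
    while violates_privacy(anon_data, level):                     # generalization
        level += 1
        anon_data = anonymize_to(anon_data, level)
    while level > 0 and still_private_at(anon_data, level - 1):   # specialization
        level -= 1
        anon_data = anonymize_to(anon_data, level)
    return anon_data, level
```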

4.6. Anonymous data management

As described in the last section, anonymous data sets are retained for data sharing, mining, and analytics. Another consideration is to save IT expense. In the context of cloud computing, both computation and storage resources are charged in proportion to their usage in terms of the pay-as-you-go feature. In this sense, it is beneficial to store certain intermediate data sets rather than re-compute them repeatedly. Yuan et al. [38] have extensively investigated the trade-offs between data storage and re-computation and proposed a series of strategies. In SaC-FRAPP, anonymous data sets are stored to save cost and are managed systematically. We exploit data provenance [39] to manage the data generation relationships among these data sets.

Retaining a large number of independently anonymized data sets potentially suffers from privacy breach problems. Because the PSI module provides flexible privacy specifications, a data set can possibly be anonymized into different anonymous data sets. As a result, privacy-sensitive information can be recovered by combining different anonymous data sets. To address this inference problem over multiple data sets, we have proposed an approach that incorporates encryption to ensure privacy preservation [26]. Basically, we can encrypt all the anonymous data sets and share them with specific users. However, encrypting all data sets will incur expensive overhead because these anonymous data sets are usually accessed or processed frequently by many data users, so we propose to encrypt only part of these data sets to save privacy-preserving cost while privacy preservation can still be ensured.

In the privacy-preserving framework, the privacy of all anonymous data sets is quantified according to [26] or differential privacy [9]. Then, we carefully select part of the anonymous data sets to encrypt via our proposed approaches. In this way, SaC-FRAPP can achieve cost-effective and efficient privacy preservation.
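As a minimal sketch of this selection step (ours; we assume leakage scores are additive, which simplifies the constraint in [26]), a greedy pass can encrypt the data sets with the highest leakage per unit of cost until the remaining disclosure drops below the threshold ϵd:

```python
def select_datasets_to_encrypt(datasets, leakage, cost, eps_d):
    """datasets: ids; leakage[d]: disclosure if d stays unencrypted;
    cost[d]: encryption expense. Returns the ids chosen for encryption."""
    unencrypted = set(datasets)
    encrypted = []
    # Prefer data sets that remove the most leakage per unit of encryption cost.
    for d in sorted(datasets, key=lambda d: leakage[d] / cost[d], reverse=True):
        if sum(leakage[u] for u in unencrypted) <= eps_d:
            break  # remaining disclosure is already under the threshold
        unencrypted.remove(d)
        encrypted.append(d)
    return encrypted

print(select_datasets_to_encrypt(
    ["D1", "D2", "D3"],
    leakage={"D1": 0.5, "D2": 0.3, "D3": 0.1},
    cost={"D1": 2.0, "D2": 3.0, "D3": 1.0},
    eps_d=0.4))
# ['D1']: encrypting D1 alone bounds the remaining disclosure at 0.4
```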

5. PROTOTYPE SYSTEM DESIGN

We have developed a proof-of-concept prototype system for the privacy-preserving framework SaC-FRAPP based on Hadoop, an open-source implementation of MapReduce, and the OpenStack cloud platform. The basic system design of SaC-FRAPP is depicted as a class diagram in Figure 4. In general, we implement the prototype system according to the formulation of the framework in the last section. The four main modules are designed as classes or interfaces, and the submodules as their child classes or aggregate components. With these organized components, the developed prototype system achieves the four system requirements described in Section 4.1. As a framework, SaC-FRAPP is also extensible: more privacy models and MapReduce-based anonymization algorithms can be included.

Figure 4. Basic system design (class diagram) of the scalable and cost-effective framework for privacy preservation.

The Privacy Specification Interface class is responsible for front-end interaction with users and invokes the three functional back-end modules (DA, DU, and ADM). So far, we have developed several MapReduce-based anonymization algorithms for the DA class. Such MapReduce programs can anonymize large-scale data sets in an efficient and scalable fashion. For the SG scheme, we develop three MapReduce algorithms, that is, TDS, BUG, and their combination. The major parts of top-down specialization (TDS) and bottom-up generalization (BUG) are MapReduce jobs that dominate the computation in the anonymization process. The MapReduce drivers, mappers, and reducers are illustrated in Figure 4. For the multidimensional scheme, we develop the median and info median MapReduce algorithms for general purposes and classification workloads, respectively. Note that finding the median of an attribute dominates the computation in multidimensional anonymization, so the MapReduce jobs are designed to conduct such computation, as sketched below. Because the execution process of multidimensional anonymization is recursive, we design the recursion control class to determine whether multiple nodes are launched or only one is utilized to do the recursive computation. For the local recoding anonymization scheme, we model it as the k-member clustering problem [40] and adopt clustering techniques to address it. Similarity computation, which dominates the clustering performance, is conducted by MapReduce jobs. The DU class leverages most functions provided by the DA class to carry out data update. For the ADM class, several components are aggregated. Specifically, the sensitive intermediate data tree (SIT) and sensitive intermediate data graph (SIG) classes are leveraged to manage anonymous data generation relationships; the encryption and key management classes are responsible for encrypting data sets that are selected to be hidden. The encryption decision class, based on the privacy quantification and cost estimation classes, determines which anonymous data sets should be encrypted and which should not, in order to ensure cost-effective privacy preservation globally. Besides the classes that correspond to the components in the framework, other classes are also necessary for the system. The taxonomy forest class is an essential data structure for most functional classes in Figure 4. Note that some components are not presented because of space limits, but this does not affect the discussion of our design.
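As one concrete illustration of the median computation (our sketch; a real implementation would run the counting as a MapReduce job over HDFS splits), the median of an attribute can be found from a frequency histogram, with mappers emitting (value, 1), reducers summing counts, and the driver scanning the cumulative histogram:

```python
from collections import Counter

def median_via_histogram(values):
    """Median from a value->count histogram, MapReduce style: the histogram
    is what the reducers would produce; the driver scans the sorted keys."""
    hist = Counter(values)   # reduce: value -> count
    half = (len(values) + 1) // 2
    seen = 0
    for v in sorted(hist):   # driver-side scan of sorted keys
        seen += hist[v]
        if seen >= half:
            return v

print(median_via_histogram([34, 21, 21, 50, 34, 34, 67]))  # 34
```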


6. EXPERIMENTAL EVALUATION

In this section, we describe the deployment environment of the prototype system and summarize the experimental evaluation results of the main components of SaC-FRAPP. A series of experiments are conducted on both real-world data sets and synthesized data sets. Specifically, we compare the performance of the DA, DU, and ADM components with their corresponding existing approaches. We describe the experiment results of the three components in the following subsections.

6.1. Deployment environment and experiment settings

We develop and deploy the privacy-preserving framework on the basis of our cloud environment U-Cloud. U-Cloud is a cloud computing environment at the University of Technology, Sydney. The system overview of U-Cloud is depicted in Figure 5. The computing facilities of this system are located in several labs of the Faculty of Engineering and IT, University of Technology, Sydney. On top of the hardware and the Linux operating system (Ubuntu), we install the KVM virtualization software, which virtualizes the infrastructure and provides unified computing and storage resources. To create virtualized data centers, we install the OpenStack open-source cloud environment, which is responsible for virtual machine management, resource scheduling, task distribution, and interaction with users. Furthermore, Hadoop clusters are built on the basis of OpenStack to facilitate the MapReduce computing paradigm and big data processing. All the experiments are conducted in such a cloud environment.

We use the Adult data set [41], a public data set commonly used as a de facto benchmark for testing data anonymization for privacy preservation. We also generate enlarged data sets based on the Adult data set. After pre-processing the Adult data set, the sanitized data set consists of 30 162 records. In this data set, each record has 14 attributes, of which we utilize eight in our experiments. The basic information about the attribute taxonomy trees is described in Table I, in which the number of domain values and the number of levels of each tree are listed. The size of the original Adult data set is blown up to generate a series of data sets, following the approach adopted in [19]. Specifically, for each

Figure 5. System overview of U-Cloud.

Table I. Description of the Adult data set.

Attribute       Age   Education   Marital status   Occupation   Relationship   Race   Sex   Native country
Levels            5           4                3            4              3      3     2                4
Domain values    99          26                9           23              8      7     3               50


original record r, α − 1 'variations' of r are created, where α > 1 is the blowup scale. As a result, an enlarged data set is α times as large as the original one.
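A small Python sketch of this blowup (ours; the exact perturbation used to create a 'variation' is an assumption, here a re-sampled Age value):

```python
import random

def enlarge(records, alpha, variate, seed=0):
    """For each original record r, keep r and append alpha - 1 variations of it,
    so the enlarged data set is alpha times as large as the original."""
    rng = random.Random(seed)
    out = []
    for r in records:
        out.append(r)
        out.extend(variate(r, rng) for _ in range(alpha - 1))
    return out

vary_age = lambda r, rng: {**r, "Age": rng.randint(17, 90)}  # toy variation
print(len(enlarge([{"Age": 39}, {"Age": 50}], alpha=3, variate=vary_age)))  # 6
```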

6.2. Summary of experiment results

In this section, we briefly summarize the experiment results of the three main functional modules, that is, DA, DU, and ADM, from the perspective of scalability and cost-effectiveness. We have already evaluated these modules in our previous work. Interested readers can refer to [6, 26, 42, 43].

6.2.1. Scalability of data anonymization. We compare our approach to data anonymization with state-of-the-art centralized approaches proposed in [19, 44]. We run both approaches on data sets varying from 50 MB to 2.5 GB. We check whether both approaches can scale over large-scale data sets and measure execution time to evaluate the efficiency. The change of execution time with respect to data set size is depicted in Figure 6. The execution time of our approach is denoted as TMR, while that of the centralized approach is denoted as TCent.

From Figure 6(a), it can be seen that TCent surges when the data size increases, whereas TMR increases only slightly even though it has a higher starting value. The centralized approach suffers from memory insufficiency when the data size is larger than 500 MB, whereas Figure 6(b) shows that our approach can still scale over much larger data sets. Hence, the proposed privacy-preserving framework can significantly improve scalability and efficiency compared with existing state-of-the-art anonymization approaches.

Figure 6. Change of execution time with respect to data size.

6.2.2. Scalability of data update. We compare our approach to anonymous data update with a state-of-the-art approach proposed in [45]. We run both approaches with the number of records in a data update batch ranging from 2000 to 20 000. We measure the update time to evaluate the efficiency. The execution time of our approach is denoted as tI, while that of the existing approach is denoted as tE. Figure 7 illustrates how the difference between tI and tE changes with respect to the number of data records in an update batch when K is fixed, where K is the user-specified k-anonymity parameter.

When K is fixed, it can be seen from Figure 7 that the difference between tI and tE grows as the number of data records increases. This trend demonstrates that the privacy-preserving framework can significantly improve the efficiency of privacy preservation on large-volume incremental data sets over existing approaches.

Figure 7. Change of update time with respect to the number of update records.

6.2.3. Cost-effectiveness of anonymous data management. We compare our approach of retaining anonymous data sets with the existing approach, which encrypts all data sets [4, 25]. We run both approaches with the number of data sets varying from 100 to 1000. The privacy-preserving monetary cost is measured to evaluate the cost-effectiveness. The costs of our approach and the existing one are denoted as CHEU and CALL, respectively. The change of the cost with respect to the number of data sets is illustrated in Figure 8 when ϵd is fixed. The parameter ϵd is a user-specified privacy requirement threshold, meaning that the degree of privacy disclosure must be under ϵd.

Figure 8. Change of privacy-preserving cost with respect to the number of data sets.



We can see from Figure 8 that the difference between CALL and CHEU grows as the number of intermediate data sets increases; that is, more expense can be saved when the number of data sets becomes larger. Thus, this trend means that the privacy-preserving framework can significantly reduce the privacy-preserving cost of retaining anonymous data sets over the existing approach in real-world big data applications.

In conclusion, the evaluation results of the experiments described earlier demonstrate that the proposed privacy-preserving framework SaC-FRAPP can anonymize big data and manage the anonymous data sets in a highly scalable, efficient, and cost-effective fashion.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed SaC-FRAPP, a scalable and cost-effective framework for privacy preservation over big data on cloud. We have analyzed the lifecycle of data anonymization and formulated four basic system requirements for a privacy-preserving framework in the context of cloud computing and big data. To fulfil these requirements, we have presented four modules for SaC-FRAPP, namely PSI, DA, DU, and ADM. We leverage cloud-based MapReduce and HDFS to conduct data anonymization and manage anonymous data sets, respectively, before releasing data sets to other parties. A corresponding proof-of-concept prototype system has been implemented and deployed in our cloud environment. Empirical evaluations have demonstrated that SaC-FRAPP can anonymize large-scale data sets and manage the anonymous data sets in a highly flexible, scalable, efficient, and cost-effective fashion. The framework provides a holistic conceptual foundation for privacy preservation over big data and enables users to realize the full potential of cloud platforms.

Privacy concerns over big data on cloud have attracted attention from researchers in different research communities, but ensuring privacy preservation of big data sets still requires extensive investigation.




Building on the contributions of the proposed framework, we plan to integrate the privacy-preserving framework with other data processing frameworks that employ MapReduce as the computation engine, for example, the Apache Mahout project, a data mining library built atop MapReduce. Further, we will investigate privacy-aware data distribution and scheduling in cloud environments.

REFERENCES

1. Borkar V, Carey MJ, Li C. Inside "Big Data Management": Ogres, Onions, or Parfaits? Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12), 2012; 3–14.

2. Chaudhuri S. What Next?: A Half-Dozen Data Management Research Goals for Big Data and the Cloud. Proceedings of the 31st Symposium on Principles of Database Systems (PODS'12), 2012; 1–4.

3. Dean J, Ghemawat S. MapReduce: a flexible data processing tool. Communications of the ACM 2010; 53(1):72–77. DOI: 10.1145/1629175.1629198.

4. Cao N, Wang C, Li M, Ren K, Lou W. Privacy-Preserving Multi-Keyword Ranked Search over Encrypted Cloud Data. Proceedings of the 31st Annual IEEE International Conference on Computer Communications (INFOCOM'11), 2011; 829–837.

5. Liu C, Zhang X, Yang C, Chen J. CCBKE—session key negotiation for fast and secure scheduling of scientific applications in cloud computing. Future Generation Computer Systems 2013; 29(5):1300–1308. DOI: 10.1016/j.future.2012.07.001.

6. Zhang X, Yang LT, Liu C, Chen J. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Transactions on Parallel and Distributed Systems 2013; in press. DOI: 10.1109/TPDS.2013.48.

7. Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: a survey of recent developments. ACM Computing Surveys 2010; 42(4):1–53. DOI: 10.1145/1749603.1749605.

8. Puttaswamy KPN, Kruegel C, Zhao BY. Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications. Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; Article 10.

9. Dwork C. Differential Privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP'06), 2006; 1–12.

10. Gentry C. Fully Homomorphic Encryption Using Ideal Lattices. Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC'09), 2009; 169–178.

11. Nergiz ME, Clifton C, Nergiz AE. Multirelational k-anonymity. IEEE Transactions on Knowledge and Data Engineering 2009; 21(8):1104–1117.

12. Iwuchukwu T, Naughton JF. K-anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07), 2007; 746–757.

13. Pei J, Xu J, Wang Z, Wang W, Wang K. Maintaining k-anonymity Against Incremental Updates. Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM'07), 2007; Article 5.

14. Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2002; 10(5):557–570. DOI: 10.1142/s0218488502001648.

15. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 2007; 1(1):Article 3. DOI: 10.1145/1217299.1217302.

16. Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Transactions on Knowledge and Data Engineering 2010; 22(7):943–956. DOI: 10.1109/TKDE.2009.139.

17. LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional k-anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), 2006; 25–25.

18. Xu J, Wang W, Pei J, Wang X, Shi B, Fu AWC. Utility-based Anonymization Using Local Recoding. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006; 785–790.

19. Mohammed N, Fung B, Hung PCK, Lee CK. Centralized and distributed anonymization for high-dimensional healthcare data. ACM Transactions on Knowledge Discovery from Data 2010; 4(4):Article 18.

20. Xiao X, Tao Y. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), 2006; 139–150.

21. Li T, Li N, Zhang J, Molloy I. Slicing: a new approach for privacy preserving data publishing. IEEE Transactions on Knowledge and Data Engineering 2012; 24(3):561–574.

22. Terrovitis M, Liagouris J, Mamoulis N, Skiadopoulos S. Privacy preservation by disassociation. Proceedings of the VLDB Endowment 2012; 5(10):944–955.

23. LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: Efficient Full-Domain k-anonymity. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD'05), 2005; 49–60.

24. Hu H, Xu J, Ren C, Choi B. Processing Private Queries over Untrusted Data Cloud through Privacy Homomorphism. Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE'11), 2011; 601–612.

25. Li M, Yu S, Cao N, Lou W. Authorized Private Keyword Search over Encrypted Data in Cloud Computing. Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS'11), 2011; 383–392.


26. Zhang X, Liu C, Nepal S, Pandey S, Chen J. A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud. IEEE Transactions on Parallel and Distributed Systems 2013; 24(6):1192–1202. DOI: 10.1109/tpds.2012.238.

27. Neuman BC, Ts'o T. Kerberos: an authentication service for computer networks. IEEE Communications Magazine 1994; 32(9):33–38.

28. Roy I, Setty STV, Kilzer A, Shmatikov V, Witchel E. Airavat: Security and Privacy for MapReduce. Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10), 2010; 297–312.

29. Blass E-O, Pietro RD, Molva R, Önen M. PRISM—Privacy-Preserving Search in MapReduce. Proceedings of the 12th International Conference on Privacy Enhancing Technologies (PETS'12), 2012; 180–200.

30. Ko SY, Jeon K, Morales R. The HybrEx Model for Confidentiality and Privacy in Cloud Computing. Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'11), 2011; Article 8.

31. Zhang K, Zhou X, Chen Y, Wang X, Ruan Y. Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds. Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS'11), 2011; 515–526.

32. Wei W, Juan D, Ting Y, Xiaohui G. SecureMR: A Service Integrity Assurance Framework for MapReduce. Proceedings of the Annual Computer Security Applications Conference (ACSAC'09), 2009; 73–82.

33. Mell P, Grance T. The NIST Definition of Cloud Computing (Version 15). U.S. National Institute of Standards and Technology, Information Technology Laboratory, 2009.

34. Shvachko K, Hairong K, Radia S, Chansler R. The Hadoop Distributed File System. Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST'10), 2010; 1–10.

35. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R. Incoop: MapReduce for Incremental Computations. Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11), 2011; 1–14.

36. Bu Y, Howe B, Balazinska M, Ernst MD. The HaLoop approach to large-scale iterative data analysis. The VLDB Journal 2012; 21(2):169–190. DOI: 10.1007/s00778-012-0269-7.

37. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G. Twister: A Runtime for Iterative MapReduce. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC'10), 2010; 810–818.

38. Yuan D, Yang Y, Liu X, Chen J. On-demand minimum cost benchmarking for intermediate dataset storage in scientific cloud workflow systems. Journal of Parallel and Distributed Computing 2011; 71(2):316–332. DOI: 10.1016/j.jpdc.2010.09.003.

39. Davidson SB, Khanna S, Milo T, Panigrahi D, Roy S. Provenance Views for Module Privacy. Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'11), 2011; 175–186.

40. Byun J-W, Kamra A, Bertino E, Li N. Efficient k-anonymization Using Clustering Techniques. Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA'07), 2007; 188–200.

41. UCI Machine Learning Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/ [Accessed on April 01, 2013].

42. Zhang X, Liu C, Nepal S, Chen J. An efficient quasi-identifier index based approach for privacy preservation over incremental data sets on cloud. Journal of Computer and System Sciences 2013; 79(5):542–555. DOI: 10.1016/j.jcss.2012.11.008.

43. Zhang X, Liu C, Nepal S, Yang C, Dou W, Chen J. Combining Top-down and Bottom-up: Scalable Sub-tree Anonymization over Big Data Using MapReduce on Cloud. Proceedings of the 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom-13), 2013; accepted.

44. Fung BCM, Wang K, Yu PS. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering 2007; 19(5):711–725.

45. Doka K, Tsoumakos D, Koziris N. KANIS: Preserving k-anonymity over Distributed Data. Proceedings of the 5th International Workshop on Personalized Access, Profile Management, and Context Awareness in Databases (PersDB'11), 2011.
