Data Security and Anonymization for Collaborative Data Publishing


  • International Journal of Advance Foundation and Research in Computer (IJAFRC)

    Volume 2, Issue 5, May - 2015. ISSN 2348 4853

    24 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

Data Security and Anonymization for Collaborative Data Publishing

Sumeet Kalaskar, Madhuri Rathod, Snehal Godse, Shruti Dikonda

Department of Computer Engineering, PVG's College of Engineering & Technology

    University of Pune, India.

ABSTRACT

In this paper we consider the collaborative data publishing problem and a new type of insider attack by colluding data providers, who may use the overall data to infer the data records contributed by other providers. Formally, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies given privacy constraints against any group of up to m colluding data providers. We then present algorithms that are provably secure against such colluding providers and that guarantee high utility and high privacy of the anonymized data. For experimental purposes we use hospital patients' data sets. The baseline algorithms are not as efficient as our slicing approach in achieving utility and efficiency, and our experiments also demonstrate the difference in computation time between the encryption algorithms.

Index Terms: SMC, TTP, Anonymization, Bucketization, Distributed Database, Security

    I. INTRODUCTION

When data is shared and released to the public, privacy-preserving techniques are applied, mainly to reduce the leakage of information about particular individuals. There is an increasing need for sharing data that contain personal information from distributed databases. For example, in the healthcare domain there is a national agenda to develop the Nationwide Health Information Network (NHIN) to share information among hospitals and other providers, and to support appropriate use of health information beyond direct patient care, with privacy protection.

Privacy-preserving data analysis and data publishing have received considerable attention in recent years as promising approaches for sharing data while preserving individual privacy. The main goal is to publish an anonymized view of the integrated data, T, which will be immune to attacks (Fig. 1.1). An attacker, i.e., a single external or internal entity or a group of such entities, tries to breach the privacy of the data using background knowledge. Collaborative data publishing is carried out successfully with the help of a trusted third party (TTP), which guarantees that information about any particular individual is not disclosed anywhere, i.e., it maintains privacy. Here it is assumed that the data providers are semi-honest. A more desirable approach for collaborative data publishing is to first aggregate and then anonymize. T1, T2, T3, and T4 are databases whose data is supplied by the corresponding providers; for example, provider P1 provides the data for database T1. These distributed data coming from different providers are aggregated by the TTP or by using an SMC protocol, and the aggregated data is then anonymized by some anonymization technique. P0 is an authenticated user, while P1 tries to breach the privacy of the data provided by other users with the help of background knowledge (BK). We call this type of attack an insider attack, and the system must be protected against such attacks.

    II. LITERATURE SURVEY

In this section we present the different methods previously used for anonymization and discuss some of their advantages and limitations. C. Dwork surveys results on differential privacy [2]. The survey of privacy-preserving data publishing (PPDP) by B. C. M. Fung et al. [3] evaluates and summarizes different approaches to PPDP, studies the practical challenges of publishing data, clarifies related problems that are distinct from PPDP and the requirements that set PPDP apart, and proposes future research directions. They identify research directions in PPDP such as privacy-preserving tools for individual privacy protection in emerging technologies and the incorporation of privacy protection into the engineering process.

  • International Journal of Advance Foundation and Research in Computer (IJAFRC)

    Volume 2, Issue 5, May - 2015. ISSN 2348 4853

    25 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee proposed the LKC-privacy model for high-dimensional relational data in healthcare systems [5]. The LKC model gives better results than the traditional k-anonymization model, but it considers only relational data, whereas healthcare data is complex and may be a combination of relational, transactional, and textual data. A two-party protocol, DPP2GA, was presented by W. Jiang and C. Clifton for privacy-preserving distributed k-anonymity [6]. Its major disadvantage is that it may not produce precise data when the data are not well partitioned, and it is only a privacy-preserving protocol rather than a secure SMC protocol, because it introduces certain inference problems.

W. Jiang and C. Clifton also presented a two-party framework, DkA [7], which helps to maintain the benefits of partitioned data while generating integrated k-anonymous data. It is proven to generate a k-anonymous dataset and to satisfy the security definition of SMC, but DkA is not a multiparty framework. A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam proposed l-diversity, which provides security beyond k-anonymity: an attacker can attack a k-anonymized system with the help of background knowledge (BK), and l-diversity helps to overcome this problem. A newer anonymization technique is slicing [10], which is very useful for high-dimensional data, although there can be some loss of data utility.

    III. RELATED WORK

Privacy-preserving data analysis and publishing have received considerable attention in recent years. Most work has focused on a single-data-provider setting and considered the data recipient as an attacker. A large body of literature assumes limited background knowledge of the attacker and defines privacy accordingly. A few recent works have modeled instance-level background knowledge as corruption and studied perturbation techniques under these syntactic privacy notions. In the distributed setting that we study, since each data holder knows its own records, the corruption of records is an inherent element.

Some works have focused on the anonymization of distributed data, studying distributed anonymization for vertically partitioned data using k-anonymity. Zhong et al. studied classification on data collected from individual data owners (each record is contributed by one data owner) while maintaining k-anonymity. Jurczyk et al. proposed a notion called l-site diversity to ensure anonymity for data providers in addition to privacy of the data subjects. Mironov et al. studied SMC techniques to achieve differential privacy. Mohammed et al. proposed SMC techniques for anonymizing distributed data using the notion of LKC-privacy to address high-dimensional data. Gal et al. proposed a new way of anonymizing multiple sensitive attributes, which could be used to implement m-privacy w.r.t. l-diversity with providers as one of the sensitive attributes.

    IV. PROPOSED WORK

The first category considers that a privacy threat occurs when an attacker is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkage we assume that the attacker knows the QID of the victim. In record and attribute linkage we further assume that the attacker knows that the victim's record is in the released table, and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attacker seeks to determine the presence or absence of the victim's record in the released table. A data table is considered privacy preserving if it can effectively prevent the attacker from successfully performing these linkages. In a record linkage attack, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group; in this case the attacker faces only a small number of possibilities. The second category aims at achieving the uninformative principle: the published table should provide the attacker with little additional information beyond the background knowledge. If the attacker gains a large variation between the prior and posterior beliefs, we call it a probabilistic attack.
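The record-linkage attack described above can be illustrated with a small sketch. This is not code from the paper; the table contents, generalized QID values, and attribute names are hypothetical, chosen only to show how a known QID narrows the attacker's candidate set.

```python
# Hypothetical released table: the QID attributes (Age, Zip) have been
# generalized; Disease is the sensitive attribute.
released_table = [
    {"Age": "3*", "Zip": "4767*", "Disease": "Flu"},
    {"Age": "3*", "Zip": "4767*", "Disease": "Cancer"},
    {"Age": "2*", "Zip": "4790*", "Disease": "Flu"},
]

def linkage_group(table, qid):
    """Records whose QID values match the victim's known QID."""
    return [r for r in table if all(r[k] == v for k, v in qid.items())]

# The attacker is assumed to know the victim's (generalized) QID.
victim_qid = {"Age": "3*", "Zip": "4767*"}
group = linkage_group(released_table, victim_qid)
print(len(group))  # 2 -> the victim is linked to only 2 candidate records
```

A privacy-preserving table keeps every such group large (k-anonymity) and diverse in the sensitive attribute (l-diversity), so that neither record linkage nor attribute linkage succeeds.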


Many privacy models in this family do not explicitly classify attributes in a data table into QID and Sensitive Attributes, but some of them could also thwart the sensitive linkages in the first category, so the two categories overlap.

Table I summarizes the attack models addressed by the privacy models. We consider a new type of insider attack by colluding data providers, who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The project addresses this new threat and makes several contributions. The insider attack was not addressed till now, which is the reason an innovative solution was needed. One of the primary drawbacks that restrict some companies from getting fully involved in database marketing is its high cost.

First, we formally introduce the notion of m-privacy. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of the anonymized data with efficiency. Collaborative data publishing introduces a new attack that has not been studied so far: compared to the attack by an external recipient, each provider has additional knowledge of its own records, which can help with the attack, and the issue is further worsened when multiple data providers collude with each other. In the social network or recommendation setting, a user may attempt to infer private information about other users using the anonymized data or recommendations, assisted by some background knowledge and her own account information. Malicious users may collude or even create artificial accounts as in a shilling attack. Finally, we propose secure multi-party computation protocols for collaborative data publishing with m-privacy. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while satisfying m-privacy.
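The m-privacy check behind the second contribution can be sketched as a brute-force loop: for every coalition of up to m colluding providers, subtract the coalition's own records and test whether the remaining records still satisfy the privacy constraint. The record layout, attribute names, and the use of l-diversity as the constraint are illustrative assumptions; the paper's heuristic algorithms avoid this exhaustive enumeration by exploiting monotonicity.

```python
from itertools import combinations

def l_diverse(records, l):
    """Privacy constraint: at least l distinct sensitive values."""
    return len({r["Disease"] for r in records}) >= l

def is_m_private(records, m, constraint):
    """Naive m-privacy check: for every coalition of up to m data
    providers, the records they did NOT contribute must still satisfy
    the privacy constraint (a coalition can subtract its own data)."""
    providers = {r["Provider"] for r in records}
    for size in range(m + 1):  # coalitions of 0..m providers
        for coalition in combinations(sorted(providers), size):
            remaining = [r for r in records if r["Provider"] not in coalition]
            if not constraint(remaining):
                return False
    return True

group = [
    {"Provider": "P1", "Disease": "Flu"},
    {"Provider": "P2", "Disease": "Cancer"},
    {"Provider": "P3", "Disease": "Asthma"},
]
print(is_m_private(group, 1, lambda rs: l_diverse(rs, 2)))  # True
print(is_m_private(group, 2, lambda rs: l_diverse(rs, 2)))  # False
```

For a monotonic constraint it suffices to check coalitions of size exactly m: removing fewer providers leaves a superset of the remaining records, which then satisfies the constraint automatically.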

    V. BLOCK DIAGRAM




    VI. ALGORITHMS

A. Anonymization by slicing: Slicing is based on attribute and tuple partitioning. In attribute partitioning (vertical partitioning) we partition the data as {Name}, {Age, Zip}, and {Disease}; in tuple partitioning (horizontal partitioning) we partition it as {t1, t2, t3, t4, t5, t6}. In attribute partitioning, Age and Zip are grouped together because they are highly correlated quasi-identifiers (QI), which may be known to an attacker. During tuple partitioning, the system should check L-diversity for the sensitive attribute (SA) column. The algorithm runs as follows.

1. Initialize bucket k=n, int i=rowcount, columncount=C, Q={D}; // D = data in the database; Arraylist a[i]

2. While Q is not empty

3. If i <= n-m+1, or the data fulfills the privacy constraint

Fscore=1;

Insert value in the row;

i++;

Else

Add element to arraylist a[i];

4. Exit.

First initialize L=m and the row count i. If i <= n-m+1, i.e., if k=n=6 and L=m=4 then i=3, so data up to the third row does not need to be checked for Fscore; add this data as it comes from Q (steps 1 and 2). For further data from Q, check the data against the privacy constraint: if the data fulfills L-diversity, then Fscore=1; if it does not (Fscore is not 1), the element is added to array list a[i] (step 3).

3. Permutation: Permutation means rearrangement of the records of data. In our project we use a permutation step to rearrange the quasi-identifiers, i.e., {Zip, Age}.

4. Fscore: Fscore is the privacy fitness score, i.e., the level of fulfillment of the privacy constraint C. If Fscore=1, then C(D*) = true.

5. Constraint C: C is a privacy constraint under which D* should fulfill the slicing condition with L-diversity, as explained above. Consider the value of L-diversity to be 4; Fscore should be 1 when the system fulfills the L-diversity condition. Some verification processes are carried out.
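Before turning to the verification steps, the slicing procedure above (attribute partitioning into {Age, Zip} and {Disease}, tuple partitioning into buckets, then permutation) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the column names, bucket size, and fixed random seed are assumptions.

```python
import random

def slice_table(rows, column_groups, bucket_size, seed=0):
    """Slicing sketch: vertically partition columns into groups,
    horizontally partition tuples into buckets, then independently
    permute each column group inside every bucket, which breaks the
    link between the QID columns and the sensitive attribute."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        permuted = [dict() for _ in bucket]
        for group in column_groups:
            order = list(range(len(bucket)))
            rng.shuffle(order)  # one permutation per column group
            for dst, src in enumerate(order):
                for col in group:
                    permuted[dst][col] = bucket[src][col]
        sliced.append(permuted)
    return sliced

rows = [
    {"Age": 22, "Zip": 47906, "Disease": "Flu"},
    {"Age": 33, "Zip": 47677, "Disease": "Cancer"},
    {"Age": 52, "Zip": 47602, "Disease": "Asthma"},
    {"Age": 54, "Zip": 47905, "Disease": "Flu"},
]
buckets = slice_table(rows, [("Age", "Zip"), ("Disease",)], bucket_size=2)
# Within each bucket the multiset of (Age, Zip) pairs and the multiset
# of diseases are preserved, but their pairing is randomized.
```

Because each column group is permuted independently, an attacker who matches a (Age, Zip) pair inside a bucket cannot tell which disease in that bucket belongs to it.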

1) Verification of L-diversity: For verification of L-diversity we use the fitness score function. To test it, generate continuous similar values of the SA, i.e., insert the same disease repeatedly, and check whether Fscore=1. If L=m is satisfied, return Fscore. If the anonymized view accepts the data as inserted, privacy is breached; D* should accept only data that fulfills L=m.

1. Generate continuous similar values of SA

2. Check for privacy constraint and Fscore=1;

3. If privacy breach

Then early stop;

Else

Return (Fscore);

4. Exit.
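The fitness-score check used in this verification can be sketched as a simple predicate over one bucket; the attribute name and bucket layout are illustrative assumptions, not the paper's exact data model.

```python
def fscore(bucket, sensitive_attr, l):
    """Privacy fitness score for one bucket: 1 when the sensitive
    column holds at least l distinct values (L-diversity satisfied),
    else 0, which signals a privacy breach and an early stop."""
    return 1 if len({row[sensitive_attr] for row in bucket}) >= l else 0

diverse = [{"Disease": d} for d in ("Flu", "Cancer", "HIV", "Asthma")]
skewed = [{"Disease": "Flu"}] * 4  # continuous similar SA values
print(fscore(diverse, "Disease", 4))  # 1 -> constraint fulfilled
print(fscore(skewed, "Disease", 4))   # 0 -> privacy breach, early stop
```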

2) Verification of the strength of the system against the number of providers: For verification against the number of providers, add one more attribute, the provider, to the anonymized output. This verification shows that our anonymization technique does not depend on the number of providers, whereas the existing provider-aware anonymization algorithm depends on the database as well as the providers.

1. Generate values of SA by providers = 1..n

2. Check for privacy constraint and Fscore=1 with respect to number of providers

3. If privacy breach

Then early stop;

Else

Return (Fscore);

4. Exit.
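One hedged way to express this provider-count verification, in the spirit of the l-site diversity idea from the literature survey, is to extend the fitness score so that a bucket must also contain records from at least m distinct providers; then no coalition of m-1 providers holds an entire bucket. The attribute names and the exact combination rule are assumptions, not the paper's algorithm.

```python
def provider_aware_fscore(bucket, sensitive_attr, l, m):
    """Extended fitness score: the bucket must be l-diverse in the
    sensitive attribute AND contain records from at least m distinct
    providers, so no m-1 colluding providers contribute the whole
    bucket. (Illustrative; combines l-diversity with l-site diversity.)"""
    diverse = len({r[sensitive_attr] for r in bucket}) >= l
    spread = len({r["Provider"] for r in bucket}) >= m
    return 1 if (diverse and spread) else 0

bucket = [
    {"Provider": "P1", "Disease": "Flu"},
    {"Provider": "P2", "Disease": "Cancer"},
    {"Provider": "P1", "Disease": "Asthma"},
]
print(provider_aware_fscore(bucket, "Disease", 3, 2))  # 1 -> passes
print(provider_aware_fscore(bucket, "Disease", 3, 3))  # 0 -> too few providers
```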


    VII. DATA SET

We conducted extensive workload experiments. Our results confirm that slicing preserves data utility much better than generalization. In workloads involving the sensitive attribute, slicing is also more effective than bucketization, and in some classification experiments slicing shows better performance than using the original data.

    VIII. EXPECTED OUTCOME

The output data table demonstrates secure data publishing that provides privacy for the sensitive attribute. The comparison shows the difference in computation time among the encryption algorithm, the provider-aware algorithm, and the slicing-with-data-privacy algorithm.

    IX. CONCLUSION

We considered a potential attack on collaborative data publishing. We used the slicing algorithm with l-diversity for anonymization and verified it for security and privacy using a binary data-privacy algorithm. The slicing algorithm is very useful for high-dimensional data, since it divides the data in both the vertical and horizontal fashion, and encryption further increases security; the limitation is that there can be some loss of data utility. The above system can be used in many applications: hospital management systems; industrial settings where sensitive data such as employee salaries must be protected; pharmaceutical companies, where the sensitive data may be the combination of ingredients of a medicine; and the banking sector, where the sensitive data is the customer's account number. It can also be used in the military domain, where data gathered from different sources needs to be secured from each other to maintain privacy. The proposed system helps to improve data privacy and security when data is gathered from different sources and the output should be produced in a collaborative fashion.

X. REFERENCES

[1] S. Goryczka, L. Xiong, and B. C. M. Fung, "m-Privacy for collaborative data publishing," in Proc. of the 7th Intl. Conf. on Collaborative Computing: Networking, Applications and Worksharing, 2011.

[2] C. Dwork, "Differential privacy: a survey of results," in Proc. of the 5th Intl. Conf. on Theory and Applications of Models of Computation, 2008, pp. 1-19.

[3] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surv., vol. 42, pp. 14:1-14:53, June 2010.

[4] C. Dwork, "A firm foundation for private data analysis," Commun. ACM, vol. 54, pp. 86-95, January 2011.

[5] N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data," ACM Trans. on Knowl. Discovery from Data, vol. 4, no. 4, pp. 18:1-18:33, October 2010.

[6] W. Jiang and C. Clifton, "Privacy-preserving distributed k-anonymity," in DBSec, vol. 3654, 2005.

[7] W. Jiang and C. Clifton, "A secure distributed framework for achieving k-anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.

[8] O. Goldreich, Foundations of Cryptography: Volume 2, 2004.

[9] Y. Lindell and B. Pinkas, "Secure multiparty computation for privacy-preserving data mining," The Journal of Privacy and Confidentiality, vol. 1, no. 1, pp. 59-98, 2009.

[10] P. Samarati, "Protecting respondents' identities in microdata release," IEEE TKDE, vol. 13, no. 6, pp. 1010-1027, 2001.