Data Security and Anonymization for Collaborative Data Publishing


  • International Journal of Advance Foundation and Research in Computer (IJAFRC)

    Volume 2, Issue 5, May - 2015. ISSN 2348 4853

    24 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

Data Security and Anonymization for Collaborative Data Publishing

Sumeet Kalaskar, Madhuri Rathod, Snehal Godse, Shruti Dikonda

Department of Computer Engineering, PVG's College of Engineering & Technology

    University of Pune, India.

ABSTRACT

In this paper we consider the collaborative data publishing problem and a new type of insider attack by colluding data providers, who may use the overall data to infer the data records contributed by other providers. Formally, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies given privacy constraints against any group of up to m colluding data providers. We then present algorithms that are provably secure against such colluding providers and that guarantee high utility and high privacy of the anonymized data. For experimental purposes we use hospital patients' data sets. The baseline algorithms are not as efficient as our slicing approach in achieving utility and efficiency, and our experiments also demonstrate the difference in computation time between the encryption algorithms.

Index Terms: SMC, TTP, Anonymization, Bucketization, Distributed Database, Security

    I. INTRODUCTION

When data is shared and released to the public, privacy-preserving techniques are applied, mainly to reduce the leakage of information about particular individuals. There is an increasing need for sharing data that contain personal information from distributed databases. For example, in the healthcare domain there is a national agenda to develop the Nationwide Health Information Network (NHIN) to share information among hospitals and other providers, and to support appropriate use of health information beyond direct patient care, with privacy protection.

Privacy-preserving data analysis and data publishing have received considerable attention in recent years as promising approaches for sharing data while preserving individual privacy. The main goal is to publish an anonymized view of the integrated data, T, which will be immune to attacks (Fig. 1.1). An attacker, i.e., a single external or internal entity or a group of such entities, tries to breach the privacy of the data using background knowledge. Collaborative data publishing is carried out successfully with the help of a trusted third party (TTP), which guarantees that information about any particular individual is not disclosed anywhere, i.e., it maintains privacy. Here it is assumed that the data providers are semi-honest. A more desirable approach for collaborative data publishing is to first aggregate and then anonymize. T1, T2, T3, and T4 are databases whose data is supplied by the corresponding providers; for example, provider P1 provides the data for database T1. These distributed data coming from different providers are aggregated by the TTP or by using an SMC protocol, and the aggregated data is then anonymized by some anonymization technique. P0 is an authenticated user, while P1 tries to breach the privacy of the data provided by other users with the help of background knowledge (BK). We call this type of attack an insider attack, and the system must be protected against such attacks.

    II. LITERATURE SURVEY

In this section we present the different methods previously used for anonymization and discuss some of their advantages and limitations. C. Dwork surveys results on differential privacy [2]. The survey of privacy-preserving data publishing (PPDP) by B. C. M. Fung et al. [3] evaluates and summarizes different approaches to PPDP, studies the practical challenges of publishing data, clarifies related problems that are distinct from PPDP and the requirements that set PPDP apart, and proposes future research directions. They identify research directions in PPDP such as privacy-preserving tools for individual privacy protection in emerging technologies and the incorporation of privacy protection into the engineering process.

  • International Journal of Advance Foundation and Research in Computer (IJAFRC)

    Volume 2, Issue 5, May - 2015. ISSN 2348 4853

    25 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee proposed the LKC-privacy model for high-dimensional relational data in healthcare systems [5]. The LKC model gives better results than the traditional k-anonymization model, but it considers only relational data, whereas healthcare data is complex and may be a combination of relational, transactional, and textual data. A two-party protocol, DPP2GA, was presented by W. Jiang and C. Clifton for privacy-preserving distributed k-anonymity [6]. Its major disadvantage is that it may not produce precise data when the data are not well partitioned, and it is only a privacy-preserving protocol rather than a secure SMC protocol, because it introduces certain inference problems.

W. Jiang and C. Clifton also presented a two-party framework, DkA [7], which helps to maintain the benefits of partitioned data while generating integrated k-anonymous data. It is proven to generate a k-anonymous dataset and to satisfy the security definition of SMC, but DkA is not a multiparty framework. A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam proposed l-diversity, which provides security beyond k-anonymity: an attacker can attack a k-anonymized system with the help of background knowledge (BK), and l-diversity helps to overcome this problem. A newer anonymization technique is slicing [10], which is very useful for high-dimensional data, although there can be some loss of data utility.

    III. RELATED WORK

Privacy-preserving data analysis and publishing have received considerable attention in recent years. Most work has focused on a single-data-provider setting and considered the data recipient as an attacker. A large body of literature assumes limited background knowledge of the attacker and defines privacy accordingly. A few recent works have modeled instance-level background knowledge as corruption and studied perturbation techniques under these syntactic privacy notions. In the distributed setting that we study, since each data holder knows its own records, the corruption of records is an inherent element.

Some works have focused on the anonymization of distributed data, studying distributed anonymization for vertically partitioned data using k-anonymity. Zhong et al. studied classification on data collected from individual data owners (each record is contributed by one data owner) while maintaining k-anonymity. Jurczyk et al. proposed a notion called l-site diversity to ensure anonymity for data providers in addition to privacy of the data subjects. Mironov et al. studied SMC techniques to achieve differential privacy. Mohammed et al. proposed SMC techniques for anonymizing distributed data using the notion of LKC-privacy to address high-dimensional data. Gal et al. proposed a new way of anonymizing multiple sensitive attributes, which could be used to implement m-privacy w.r.t. l-diversity with providers as one of the sensitive attributes.

    IV. PROPOSED WORK

The first category considers that a privacy threat occurs when an attacker is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkage we assume that the attacker knows the QID of the victim. In record and attribute linkage we further assume that the attacker knows that the victim's record is in the released table, and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attacker seeks to determine the presence or absence of the victim's record in the released table. A data table is considered privacy preserving if it can effectively prevent the attacker from successfully performing these linkages. In a record linkage attack, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group; in this case the attacker faces only a small number of possibilities. The second category aims at achieving the uninformative principle: the published table should provide the attacker with little additional information beyond the background knowledge. If the attacker gains a large variation between the prior and posterior beliefs, we call it a probabilistic attack.
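The record-linkage attack described above can be illustrated with a small sketch. This is not code from the paper; the table contents, generalized QID values, and attribute names are hypothetical, chosen only to show how a known QID narrows the attacker's candidate set.

```python
# Hypothetical released table: the QID attributes (Age, Zip) have been
# generalized; Disease is the sensitive attribute.
released_table = [
    {"Age": "3*", "Zip": "4767*", "Disease": "Flu"},
    {"Age": "3*", "Zip": "4767*", "Disease": "Cancer"},
    {"Age": "2*", "Zip": "4790*", "Disease": "Flu"},
]

def linkage_group(table, qid):
    """Records whose QID values match the victim's known QID."""
    return [r for r in table if all(r[k] == v for k, v in qid.items())]

# The attacker is assumed to know the victim's (generalized) QID.
victim_qid = {"Age": "3*", "Zip": "4767*"}
group = linkage_group(released_table, victim_qid)
print(len(group))  # 2 -> the victim is linked to only 2 candidate records
```

A privacy-preserving table keeps every such group large (k-anonymity) and diverse in the sensitive attribute (l-diversity), so that neither record linkage nor attribute linkage succeeds.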


Many privacy models in this family do not explicitly classify attributes in a data table into QID and Sensitive Attributes, but some of them could also thwart the sensitive linkages in the first category, so the two categories overlap.

Table I summarizes the attack models addressed by the privacy models. We consider a new type of insider attack by colluding data providers, who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The project addresses this new threat and makes several contributions. The insider attack was not addressed till now, which is the reason an innovative solution was needed. One of the primary drawbacks that restrict some companies from getting fully involved in database marketing is its high cost.

First, we formally introduce the notion of m-privacy. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of the anonymized data with efficiency. Collaborative data publishing introduces a new attack that has not been studied so far: compared to the attack by an external recipient, each provider has additional knowledge of its own records, which can help with the attack, and the issue is further worsened when multiple data providers collude with each other. In the social network or recommendation setting, a user may attempt to infer private information about other users using the anonymized data or recommendations, assisted by some background knowledge and her own account information. Malicious users may collude or even create artificial accounts as in a shilling attack. Finally, we propose secure multi-party computation protocols for collaborative data publishing with m-privacy. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while satisfying m-privacy.
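The m-privacy check behind the second contribution can be sketched as a brute-force loop: for every coalition of up to m colluding providers, subtract the coalition's own records and test whether the remaining records still satisfy the privacy constraint. The record layout, attribute names, and the use of l-diversity as the constraint are illustrative assumptions; the paper's heuristic algorithms avoid this exhaustive enumeration by exploiting monotonicity.

```python
from itertools import combinations

def l_diverse(records, l):
    """Privacy constraint: at least l distinct sensitive values."""
    return len({r["Disease"] for r in records}) >= l

def is_m_private(records, m, constraint):
    """Naive m-privacy check: for every coalition of up to m data
    providers, the records they did NOT contribute must still satisfy
    the privacy constraint (a coalition can subtract its own data)."""
    providers = {r["Provider"] for r in records}
    for size in range(m + 1):  # coalitions of 0..m providers
        for coalition in combinations(sorted(providers), size):
            remaining = [r for r in records if r["Provider"] not in coalition]
            if not constraint(remaining):
                return False
    return True

group = [
    {"Provider": "P1", "Disease": "Flu"},
    {"Provider": "P2", "Disease": "Cancer"},
    {"Provider": "P3", "Disease": "Asthma"},
]
print(is_m_private(group, 1, lambda rs: l_diverse(rs, 2)))  # True
print(is_m_private(group, 2, lambda rs: l_diverse(rs, 2)))  # False
```

For a monotonic constraint it suffices to check coalitions of size exactly m: removing fewer providers leaves a superset of the remaining records, which then satisfies the constraint automatically.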

    V. BLOCK DIAGRAM




    VI. ALGORITHMS

A. Anonymization by slicing: Slicing is based on attribute and tuple partitioning. In attribute partitioning (vertical partitioning) we partition the data as {Name}, {Age, Zip}, and {Disease}; in tuple partitioning (horizontal partitioning) we partition it as {t1, t2, t3, t4, t5, t6}. In attribute partitioning, Age and Zip are grouped together because they are highly correlated quasi-identifiers (QI), which may be known to an attacker. During tuple partitioning, the system should check L-diversity for the sensitive attribute (SA) column. The algorithm runs as follows.

1. Initialize bucket k=n, int i=rowcount, columncount=C, Q={D}; // D = data in the database; Arraylist a[i]

2. While Q is not empty

3. If i <= n-m+1, or the data fulfills the privacy constraint

Fscore=1;

Insert value in the row;

i++;

Else

Add element to arraylist a[i];

4. Exit.

First initialize L=m and the row count i. If i <= n-m+1, i.e., if k=n=6 and L=m=4 then i=3, so data up to the third row does not need to be checked for Fscore; add this data as it comes from Q (steps 1 and 2). For further data from Q, check the data against the privacy constraint: if the data fulfills L-diversity, then Fscore=1; if it does not (Fscore is not 1), the element is added to array list a[i] (step 3).

3. Permutation: Permutation means rearrangement of the records of data. In our project we use a permutation step to rearrange the quasi-identifiers, i.e., {Zip, Age}.

4. Fscore: Fscore is the privacy fitness score, i.e., the level of fulfillment of the privacy constraint C. If Fscore=1, then C(D*) = true.

5. Constraint C: C is a privacy constraint under which D* should fulfill the slicing condition with L-diversity, as explained above. Consider the value of L-diversity to be 4; Fscore should be 1 when the system fulfills the L-diversity condition. Some verification processes are carried out.
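Before turning to the verification steps, the slicing procedure above (attribute partitioning into {Age, Zip} and {Disease}, tuple partitioning into buckets, then permutation) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the column names, bucket size, and fixed random seed are assumptions.

```python
import random

def slice_table(rows, column_groups, bucket_size, seed=0):
    """Slicing sketch: vertically partition columns into groups,
    horizontally partition tuples into buckets, then independently
    permute each column group inside every bucket, which breaks the
    link between the QID columns and the sensitive attribute."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        permuted = [dict() for _ in bucket]
        for group in column_groups:
            order = list(range(len(bucket)))
            rng.shuffle(order)  # one permutation per column group
            for dst, src in enumerate(order):
                for col in group:
                    permuted[dst][col] = bucket[src][col]
        sliced.append(permuted)
    return sliced

rows = [
    {"Age": 22, "Zip": 47906, "Disease": "Flu"},
    {"Age": 33, "Zip": 47677, "Disease": "Cancer"},
    {"Age": 52, "Zip": 47602, "Disease": "Asthma"},
    {"Age": 54, "Zip": 47905, "Disease": "Flu"},
]
buckets = slice_table(rows, [("Age", "Zip"), ("Disease",)], bucket_size=2)
# Within each bucket the multiset of (Age, Zip) pairs and the multiset
# of diseases are preserved, but their pairing is randomized.
```

Because each column group is permuted independently, an attacker who matches a (Age, Zip) pair inside a bucket cannot tell which disease in that bucket belongs to it.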

1) Verification of L-diversity: For verification of L-diversity we use the fitness score function. To test it, generate continuous similar values of the SA, i.e., insert the same disease repeatedly, and check whether Fscore=1. If L=m is satisfied, return Fscore. If the anonymized view accepts the data as inserted, privacy is breached; D* should accept only data that fulfills L=m.

1. Generate continuous similar values of SA

2. Check for privacy constraint and Fscore=1;

3. If privacy breach

Then early stop;

Else

Return (Fscore);

4. Exit.
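The fitness-score check used in this verification can be sketched as a simple predicate over one bucket; the attribute name and bucket layout are illustrative assumptions, not the paper's exact data model.

```python
def fscore(bucket, sensitive_attr, l):
    """Privacy fitness score for one bucket: 1 when the sensitive
    column holds at least l distinct values (L-diversity satisfied),
    else 0, which signals a privacy breach and an early stop."""
    return 1 if len({row[sensitive_attr] for row in bucket}) >= l else 0

diverse = [{"Disease": d} for d in ("Flu", "Cancer", "HIV", "Asthma")]
skewed = [{"Disease": "Flu"}] * 4  # continuous similar SA values
print(fscore(diverse, "Disease", 4))  # 1 -> constraint fulfilled
print(fscore(skewed, "Disease", 4))   # 0 -> privacy breach, early stop
```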

2) Verification of the strength of the system against the number of providers: For verification against the number of providers, add one more attribute, the provider, to the anonymized output. This verification shows that our anonymization technique does not depend on the number of providers, whereas the existing provider-aware anonymization algorithm depends on the database as well as the providers.

1. Generate values of SA by providers = 1..n

2. Check for privacy constraint and Fscore=1 with respect to number of providers

3. If privacy breach

Then early stop;

Else

Return (Fscore);

4. Exit.
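One hedged way to express this provider-count verification, in the spirit of the l-site diversity idea from the literature survey, is to extend the fitness score so that a bucket must also contain records from at least m distinct providers; then no coalition of m-1 providers holds an entire bucket. The attribute names and the exact combination rule are assumptions, not the paper's algorithm.

```python
def provider_aware_fscore(bucket, sensitive_attr, l, m):
    """Extended fitness score: the bucket must be l-diverse in the
    sensitive attribute AND contain records from at least m distinct
    providers, so no m-1 colluding providers contribute the whole
    bucket. (Illustrative; combines l-diversity with l-site diversity.)"""
    diverse = len({r[sensitive_attr] for r in bucket}) >= l
    spread = len({r["Provider"] for r in bucket}) >= m
    return 1 if (diverse and spread) else 0

bucket = [
    {"Provider": "P1", "Disease": "Flu"},
    {"Provider": "P2", "Disease": "Cancer"},
    {"Provider": "P1", "Disease": "Asthma"},
]
print(provider_aware_fscore(bucket, "Disease", 3, 2))  # 1 -> passes
print(provider_aware_fscore(bucket, "Disease", 3, 3))  # 0 -> too few providers
```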


    VII. DATA SET

We conducted extensive workload experiments. Our results confirm that slicing preserves data utility much better than generalization. In workloads involving the sensitive attribute, slicing is also more effective than bucketization, and in some classification experiments slicing shows better performance than using the original data.

    VIII. EXPECTED OUTCOME

The output data table demonstrates secure data publishing that provides privacy for the sensitive attribute. The comparison shows the difference in computation time among the encryption algorithm, the provider-aware algorithm, and the slicing-with-data-privacy algorithm.

    IX. CONCLUSION

We considered a potential attack on collaborative data publishing. We used the slicing algorithm with l-diversity for anonymization and verified it for security and privacy using a binary data-privacy algorithm. The slicing algorithm is very useful for high-dimensional data, since it divides the data in both the vertical and horizontal fashion, and encryption further increases security; the limitation is that there can be some loss of data utility. The above system can be used in many applications: hospital management systems; industrial settings where sensitive data such as employee salaries must be protected; pharmaceutical companies, where the sensitive data may be the combination of ingredients of a medicine; and the banking sector, where the sensitive data is the customer's account number. It can also be used in the military domain, where data gathered from different sources needs to be secured from each other to maintain privacy. The proposed system helps to improve data privacy and security when data is gathered from different sources and the output should be produced in a collaborative fashion.

X. REFERENCES

[1] S. Goryczka, L. Xiong, and B. C. M. Fung, "m-Privacy for collaborative data publishing," in Proc. of the 7th Intl. Conf. on Collaborative Computing: Networking, Applications and Worksharing, 2011.

[2] C. Dwork, "Differential privacy: a survey of results," in Proc. of the 5th Intl. Conf. on Theory and Applications of Models of Computation, 2008, pp. 1-19.

[3] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surv., vol. 42, pp. 14:1-14:53, June 2010.

[4] C. Dwork, "A firm foundation for private data analysis," Commun. ACM, vol. 54, pp. 86-95, January 2011.

[5] N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data," ACM Trans. on Knowl. Discovery from Data, vol. 4, no. 4, pp. 18:1-18:33, October 2010.

[6] W. Jiang and C. Clifton, "Privacy-preserving distributed k-anonymity," in DBSec, vol. 3654, 2005.

[7] W. Jiang and C. Clifton, "A secure distributed framework for achieving k-anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.

[8] O. Goldreich, Foundations of Cryptography: Volume 2, 2004.

[9] Y. Lindell and B. Pinkas, "Secure multiparty computation for privacy-preserving data mining," The Journal of Privacy and Confidentiality, vol. 1, no. 1, pp. 59-98, 2009.

[10] P. Samarati, "Protecting respondents' identities in microdata release," IEEE TKDE, vol. 13, no. 6, pp. 1010-1027, 2001.