
L. Chen et al. (Eds.): DASFAA 2009 Workshops, LNCS 5667, pp. 246-260, 2009.
© Springer-Verlag Berlin Heidelberg 2009

    Privacy FP-Tree

    Sampson Pun and Ken Barker

    University of Calgary

    2500 University Drive NW

    Calgary, Alberta, Canada

    T2N 1N4

    Tel.: (403) 220-5110

    {szypun,kbarker}@ucalgary.ca

Abstract. Current technology has made the publication of people's private information a common occurrence. The implications for individual privacy and security are still poorly understood by the general public, but the risks

    are undeniable as evidenced by the increasing number of identity theft cases

    being reported recently. Two new definitions of privacy have been developed

    recently to help understand the exposure and how to protect individuals from

privacy violations, namely, anonymized privacy and personalized privacy. This

    paper develops a methodology to validate whether a privacy violation exists for

    a published dataset. Determining whether privacy violations exist is a non-

    trivial task. Multiple privacy definitions and large datasets make exhaustive

searches ineffective and computationally costly. We develop a compact tree structure called the Privacy FP-Tree to reduce the costs. This data structure

    stores the information of the published dataset in a format that allows for

    simple, efficient traversal. The Privacy FP-Tree can effectively determine the

    anonymity level of the dataset as well as identify any personalized privacy

violations. This algorithm is O(n log n), which has acceptable characteristics

    for this application. Finally, experiments demonstrate the approach is scalable

    and practical.

    Keywords: Privacy, Database, FP-Tree, Anonymized privacy, Personalized

    privacy.

    1 Introduction

    Increasing identity theft frequency throughout the world has made privacy a major

    concern for individuals. We are asked on a nearly daily basis to provide personal data

    in exchange for goods and services, particularly online. Credit card histories, phone

    call histories, medical histories, etc. are stored in various computers often without our

knowledge. This data is often released to the public either through the failings of security protocols or purposefully by companies undertaking data mining for reasons

(possibly) beyond the individual's knowledge. One goal of privacy research is to

ensure that an individual's privacy is not violated even when sensitive data is made

available to the public. This paper provides an effective method for validating

whether an individual's privacy has been (or is potentially exposed to be) violated.


    Over the last 10 years there have been numerous breaches of privacy on publicly

available data. Sweeney [6] showed that by cross-referencing a medical database in

Massachusetts with voter registration lists, private medical data could be exposed. As

    a result, patient data thought to be private, by anonymization, could be linked to

specific individuals. More recently in 2006, AOL [11] was forced to make a public apology after releasing the search data of over 650,000 individuals. AOL employees

    thought the data was private and contained no identifiable information. AOL only

    removed the data from their website once users demonstrated that the data could be

used to identify specific individuals. In the same year, the Netflix Prize1 was

    announced to encourage research into improving the search algorithm for its

    recommender system to assist subscribers when selecting movies they might be

    interested in based on past preferences. Narayanan and Shmatikov [3] showed that

    through their de-anonymization algorithms the Netflix dataset exposed individually

identifiable information. Each time such a violation is identified, work is undertaken to remove the

    vulnerability but this retroactive approach does not prevent the damage caused by the

    initial violation so new algorithms need to be developed to identify potential

    vulnerabilities. Statistical databases were the focus of much of the privacy research in

the late 1980s and early 1990s. These databases provided statistical information (sum,

    average, max, min, count, etc.) on the dataset without violating the privacy of its

    members. The research on statistical databases itself revolved mainly around query

restriction [13] and data perturbation [4, 14]. However, with the current growth of data

    mining, researchers are demanding more user specific data for analysis. Unfortunately,

data perturbation techniques utilized by statistical databases often left the tuples in a state

that was inappropriate for data mining [6, 7]. To address this issue, two new privacy

    classes have emerged: anonymized privacy and personalized privacy. These two

privacy definitions allow the data collector to publish tuple-level information for data

    analysis while still guaranteeing some form of privacy to its members.

    Anonymized privacy is a privacy guarantee made by the data collector. When

publishing user-specific data, a member's tuple will be anonymized so that it cannot

    be identified with a high degree of confidence. Many properties have been defined

within anonymized privacy. These include k-anonymity [6], l-diversity [2], and (α, k)-anonymity [10], among many others. Xiao and Tao [12] have also proposed the notion of personalized privacy. This notion allows users to define their own level of privacy

    to be used when data is published. If the data being published provides more

    information than the user is willing to release, then their privacy has been violated.

    Both privacy concepts will be discussed in further detail in Section 2.

    In this paper, we have developed a novel approach that identifies the amount of

    privacy released within a published dataset. Using the concepts of anonymized and

    personalized privacy, we determine the privacy properties exhibited within an

    arbitrary dataset. If the privacy requirements are correctly and explicitly specified in

the meta-data, then by comparing the exhibited privacy properties to those stated in the meta-data, we can expose discrepancies between the specification and the actual

    exposure found in the data. Thus, given a privacy requirement (specification), we can

validate a dataset's claim of conforming to that specification. This paper contributes

    1 http://www.netflixprize.com/ - Accessed September 17, 2008.


    by characterizing the dataset, and leaves as future work the development of a suitable

    meta-data specification that can be used for comparison and/or validation. However,

    to help motivate the utility of the approach, consider the following simple privacy

specification: "Anonymity is guaranteed to be at least 5-anonymous." The dataset can

now be tested, using the tool developed in this paper, to ensure that there exist at least 5 tuples in the dataset with the same quasi-identifier. Obviously, this is a simple

    motivational example so the policy specifications are expected to be much more

    complex in a real-world data set. The contribution of this paper is a tool to analyze the

    data set using an efficient algorithm with respect to several anonymization criteria.

    The remainder of this paper is organized as follows. In Section 2, we describe the

    properties of anonymized privacy and personalized privacy. We present a new data

    structure called the privacy FP-Tree in Section 3. Section 4 explains how we use the

    privacy FP-tree to validate the privacy of the database. Section 5 describes scalability

experiments to demonstrate the algorithm's practicality and provides a complexity analysis. In Section 6, we discuss future work and draw conclusions about the privacy

    FP-tree.

    2 Background

    For anonymized and personalized privacy, the definition of privacy itself relies on

four key concepts, which must be defined first. These are identifiers, quasi-identifiers,

sensitive attributes, and non-sensitive attributes. All data values from a

    dataset must be categorized into one of these groups.

    2.1 Identifiers and Quasi-Identifiers

    An adversary interested in compromising data privacy may know either identifiers or

    quasi-identifiers. These values provide hints or insight to the adversary about which

    individuals are members of a particular dataset. Clearly some values reveal more

information than others. Identifiers are pieces of information that, if published, will

immediately identify an individual in the database. Social Insurance Numbers, Birth

Certificate Numbers, and Passport Numbers are examples of such identifiers. Quasi-identifiers are sets of information that, when combined, can explicitly or implicitly

    identify a person. Addresses, gender, and birthdays are examples of quasi-identifiers

    because each individual value provides insufficient identifying information but

collectively could identify an individual. This paper uses the generic term identifier to

    reference both types and only uses the more specific terminology when necessary due

    to context.

    2.2 Sensitive and Non-sensitive Attributes

Attributes that are not identifiers can be considered either sensitive or non-sensitive attributes. These attributes are assumed to be unknown by an adversary attempting to

    gain knowledge about a particular individual. The sensitive attributes are those that

we must keep secret from an adversary. Information collected on an individual's

    health status, income, purchase history, and/or criminal past would be examples of

  • 8/8/2019 Privacy FP Tree

    4/15

    Privacy FP-Tree 249

    sensitive information. Non-sensitive attributes are those which are unknown by the

adversary but which the user would not find problematic if the information is released as

    general knowledge. It is difficult to define where the dividing line is between these

two types of attributes because each individual has their own preference: some may

consider all the information they provide as sensitive, while others do not mind if all such information is released. Drawing this line is the task of privacy policy specifications, which is

beyond the scope of this paper. Thus, the provider must identify which attributes are

considered sensitive, and these are the only ones considered in the balance of this

paper; non-sensitive attributes are not considered further.

    2.3 Anonymized Privacy

    2.3.1 K-Anonymity

K-anonymity [6] is a privacy principle where any row within the dataset cannot be identified with a confidence greater than 1/k. To facilitate this property, each unique

combination of identifiers must occur in the dataset at least k times. Table 1 provides

a 2-anonymous table. The first three columns form the table's identifiers, and each

    unique combination of the identifiers occurs within the table at least 2 times.

    While k-anonymity protects each row of the table from identification, it fails to

fully protect all sensitive values. Each Q-block2 is linked to a set of sensitive values.

If all values in this set are identical, every row of the Q-block contains the same

sensitive value. In this situation the adversary does not need to predict the victim's

specific row. The adversary would still know the sensitive value for the victim with a

confidence of 100%. This type of problem is called a homogeneity attack [2].
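To make the definition concrete, the following is a minimal Java sketch (Java is also the language used for the experiments in Section 5) of the brute-force check that Section 3 will replace with the Privacy FP-tree: group rows by their identifier combination and report the smallest group size. The row layout and class name are ours, for illustration only.

import java.util.*;

public class NaiveKAnonymity {

    // k of a table = size of the smallest Q-block, i.e. the smallest set of rows
    // sharing one combination of identifier values.
    static int kAnonymity(List<String[]> rows, int[] identifierColumns) {
        Map<String, Integer> qBlockSizes = new HashMap<>();
        for (String[] row : rows) {
            StringBuilder key = new StringBuilder();
            for (int c : identifierColumns) {
                key.append(row[c]).append('|');    // concatenated identifier combination
            }
            qBlockSizes.merge(key.toString(), 1, Integer::sum);
        }
        return Collections.min(qBlockSizes.values());
    }

    public static void main(String[] args) {
        // The seven rows of Table 1: Address, Date of Birth, SIN, Income.
        List<String[]> table1 = Arrays.asList(
            new String[]{"Alberta", "1984", "1234*", "80k"},
            new String[]{"Alberta", "1984", "1234*", "85k"},
            new String[]{"Ontario", "19**", "5****", "120k"},
            new String[]{"Ontario", "19**", "5****", "123k"},
            new String[]{"Manitoba", "*", "5****", "152k"},
            new String[]{"Manitoba", "*", "5****", "32k"},
            new String[]{"Manitoba", "*", "5****", "32k"});
        // Columns 0-2 are the identifiers; this prints k = 2, matching Table 1.
        System.out.println("k = " + kAnonymity(table1, new int[]{0, 1, 2}));
    }
}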

2.3.2 L-Diversity

Machanavajjhala et al. [2] describe a solution to the homogeneity attack called l-

    diversity. The l-diversity principle is defined as:

A Q-block is l-diverse if it contains at least l well represented values for each sensitive attribute. A table is l-diverse if every Q-block is l-diverse. (1)

    The key to this definition is the term well represented. Machanavajjhala et al.

provide three different interpretations for this term [2]. Distinct l-diversity is the first interpretation of the term. Distinct l-diversity states that for each q-block there must

    be l unique sensitive values. Distinct l-diversity can guarantee that the sensitive value

    is predicted correctly by the adversary at a rate of:

(Q − (l − 1)) / Q, where Q is the number of rows in the Q-block. (2)

Distinct l-diversity cannot provide a stronger privacy guarantee because there is no

way to ensure the distribution among data values. It is feasible that a distinct 2-diverse

table has a q-block containing 100 rows where one row contains a positive

result while the other 99 contain negative results. An adversary would be able to predict with 99% accuracy that the victim has a negative sensitive value. This

led Machanavajjhala et al. [2] to define two other interpretations of well represented

l-diversity, namely entropy l-diversity and recursive (c, l)-diversity.

    2 Each set of rows corresponding to a unique combination of identifiers is known as a Q-Block.


Entropy l-diversity ensures that the distribution of sensitive values within each q-block conforms to:

−Σs p(q, s) * log(p(q, s)) ≥ log(l), where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. (3)

    Therefore, to be entropy l-diverse a dataset must contain a relatively even distribution

    among the sensitive values (dependent on the l value chosen). Conversely, recursive

    (c, l)-diversity does not aim to have an even distribution among the values. Recursive

    diversity attempts to ensure that the most frequent sensitive value of a q-block is not

    too frequent. By counting the frequency of each sensitive value within a q-block and

then sorting it, we are left with a sequence r1, r2, ..., rm where r1 is the most frequent

    sensitive value. A Q-block satisfies recursive (c, l)-diversity if:

r1 < c * (rl + rl+1 + ... + rm), where r1 is the count of the most frequent value. (4)
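Read as code, definitions (2)-(4) amount to the following per-Q-block computations. This is a minimal Java sketch under our own naming; the Q-block is reduced to a map from sensitive value to count, and the logarithm base (natural log here) is a choice the definitions leave open.

import java.util.*;

public class QBlockDiversity {

    // counts: sensitive value -> number of rows in the Q-block carrying that value.

    // Distinct l-diversity: the number of distinct sensitive values in the Q-block.
    static int distinctL(Map<String, Integer> counts) {
        return counts.size();
    }

    // Entropy of the Q-block: -sum p(q,s) * log(p(q,s)); entropy l-diverse if this >= log(l).
    static double entropy(Map<String, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;        // p(q, s) = S / Q
            h -= p * Math.log(p);
        }
        return h;
    }

    // The c of the Q-block for a given l: r1 / (r_l + ... + r_m); the block satisfies
    // recursive (c, l)-diversity when r1 < c * (r_l + ... + r_m), i.e. formula (4).
    static double recursiveRatio(Map<String, Integer> counts, int l) {
        List<Integer> r = new ArrayList<>(counts.values());
        r.sort(Collections.reverseOrder());       // r1 is the count of the most frequent value
        int tail = 0;
        for (int i = l - 1; i < r.size(); i++) tail += r.get(i);
        return tail == 0 ? Double.POSITIVE_INFINITY : (double) r.get(0) / tail;
    }

    public static void main(String[] args) {
        // The Manitoba Q-block of Table 1: income 152k once and 32k twice.
        Map<String, Integer> manitoba = Map.of("152k", 1, "32k", 2);
        System.out.println("distinct l   = " + distinctL(manitoba));         // 2
        System.out.println("entropy      = " + entropy(manitoba));           // about 0.64
        System.out.println("c with l = 2 = " + recursiveRatio(manitoba, 2)); // 2.0
    }
}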

2.3.3 (α, k)-Anonymity

(α, k)-anonymity [10] is a privacy principle similar to l-diversity. Simply stated, there

are two parts to (α, k)-anonymity. The k portion is the same as previously described,

and α is the maximum percentage that any sensitive value within a Q-block can

represent. Using (α, k)-anonymity can prevent a homogeneity attack by limiting the

sensitive values within a Q-block. Formally, (α, k)-anonymity is defined as:

A q-block fulfills (α, k)-anonymity if p(q, s) ≤ α, where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. A table is (α, k)-anonymous if every Q-block fulfills the (α, k)-anonymity requirement. (5)

Table 1. Anonymized Table Containing Individual Incomes in Different Provinces of Canada
(k = 2, α = 0.667, c = 2, l = 2, entropy l = 3)

Address     Date of Birth   SIN     Income
Alberta     1984            1234*   80k
Alberta     1984            1234*   85k
Ontario     19**            5****   120k
Ontario     19**            5****   123k
Manitoba    *               5****   152k
Manitoba    *               5****   32k
Manitoba    *               5****   32k

    2.4 Personalized Privacy Preservation

Xiao et al. [12] introduce a different concept for protecting privacy called personalized

privacy. In personalized privacy, the data collector must collect a guarding node along

with the information of interest. This guarding node is a point on a taxonomy tree at


    which the data provider is willing to release information. When publishing the dataset,

each row is checked against the data provider's guarding node. A data provider's

sensitive value cannot be published at a level of greater detail than the provider feels

comfortable with, as indicated by their guard node. Figure 1 provides an example of a

taxonomy tree drawn from the education domain. While data is collected at the lowest level (representing the most detailed or specific data), a person's guarding node may be

    at any point ranging from exact data (found at the leaves) up to the root node. For

    example, an individual may have completed Grade 9 but does not want this level of

detail released to others. By setting their guarding node to Junior High School, data

    can only be published if the public cannot know with high confidence that this

    individual only completed Grade 9.

[Figure 1 depicts the taxonomy tree of the education domain: ANY_EDU is the root, with children High School and University; High School branches into Jr. High (Grades 7-9) and Sr. High (Grades 10-12), while University branches into Undergrad and Graduate (Masters, Doctoral).]

Fig. 1. Taxonomy Tree of the Education Domain

    3 Privacy FP-Tree

    Given the growing size of datasets, efficiency and capacity must be considered when

attempting to protect privacy. Han et al. developed a compact data

structure called the FP-tree [5]. They show that by storing a dataset in the form of an

FP-tree, file sizes differ by orders of magnitude. The main purpose of the FP-tree was to identify frequent patterns in transactional datasets. Instead, we use this

    functionality to identify the frequencies of each Q-block in a privacy context. As

    such, only columns of the dataset that are considered identifiers (recall Section 2) are

    used to create the FP-Tree.

    3.1 FP-Tree Construction

Creating an FP-tree requires two scans of the dataset. The first scan retrieves the

frequency of each unique item and sorts that list in descending order. The second scan of the database orders the identifiers of each row according to their frequency and

then appends each identifier to the FP-tree. A sketch of this algorithm is shown in

    Algorithm 1 below.


Input: A database DB.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier. Sort F in occurrence-descending order as FList, the list of frequent attributes.
2. Create the root of an FP-tree, T, and label it as null.
3. For each row ROW in DB do:
     Select the frequent identifiers in ROW and sort them according to the order of FList.
     Let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list.
     Call insert_tree([p | P], T).

Algorithm 1. Algorithm for FP-Tree Construction [5]

The function insert_tree([p | P], T) is performed as follows: if T has a child N such

that N.item-name = p.item-name, then increment N's count by 1; else create a new

node N with its count initialized to 1, its parent link set to T, and its node-link

linked to the nodes with the same item-name via the node-link structure. If P is

nonempty, call insert_tree(P, N) recursively as indicated.
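The following is a minimal Java sketch of the node structure and of insert_tree as just described. For brevity it keeps children in a map and omits the node-link pointers that the FP-tree of [5] maintains for pattern mining; those links are not needed for the traversals in Section 4. All names are ours.

import java.util.*;

public class FPTree {

    static class Node {
        final String item;                          // identifier value, e.g. "SIN_5****"
        int count;                                  // occurrences of the path ending at this node
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(String item) { this.item = item; }
    }

    final Node root = new Node(null);               // the root labelled null in Algorithm 1

    // insert_tree([p | P], T): walk or extend one path for a row's sorted identifiers.
    void insertTree(List<String> sortedIdentifiers) {
        Node current = root;
        for (String item : sortedIdentifiers) {
            Node child = current.children.get(item);
            if (child == null) {                    // no such child: create it, count starts at 0
                child = new Node(item);
                current.children.put(item, child);
            }
            child.count++;                          // new or existing child: increment its count
            current = child;
        }
    }

    public static void main(String[] args) {
        FPTree tree = new FPTree();
        // The two Alberta rows of Table 1, already sorted in FList order.
        tree.insertTree(Arrays.asList("Add_Alberta", "DOB_1984", "SIN_1234*"));
        tree.insertTree(Arrays.asList("Add_Alberta", "DOB_1984", "SIN_1234*"));
        System.out.println(tree.root.children.get("Add_Alberta").count);   // prints 2
    }
}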

    3.1.1 Example of FP-Tree Construction

    To create the FP-Tree of Table 1, the first three columns are labeled as the identifiers

    and the last column is considered the sensitive attribute. The database used to create

    the FP-Tree only includes the portion of Table 1 containing identifying columns.

Table 2 contains the sorted list of items based on their frequency within the dataset. Following step three of Algorithm 1 results in the FP-tree shown in Figure 2.

Table 2. Frequency of Each Identifier within Table 1

Identifier          Frequency
SIN_5****           5
Address_Manitoba    3
DOB_*               3
Address_Alberta     2
DOB_1984            2
SIN_1234*           2
Address_Ontario     2
DOB_19**            2


[Figure 2 depicts the FP-tree built from Table 1. The root has two children: Add_Alberta:2, whose path continues through DOB_1984:2 to SIN_1234*:2, and SIN_5****:5, which branches into Add_Ontario:2 → DOB_19**:2 and Add_Manitoba:3 → DOB_*:3.]

Fig. 2. FP-Tree of Table 1

    3.2 Privacy FP-Tree Construction

    This paper extends Algorithm 1 to develop Privacy FP-Trees. Using the FP-tree

allows us to find privacy properties related to identifiers. However, sensitive values

    are a crucial part of any privacy definition. To account for this, sensitive values must

    be appended to the FP-tree. It was observed in Figure 2 that each leaf node of the FP-

    Tree represents one unique q-block within the dataset. Appending a list of sensitive

    values to the end of each leaf node allows the sensitive values to be associated with

    the correct q-block. In cases where the dataset contains multiple sensitive values, each

    column of sensitive values is stored in its own linked list at the end of each leaf node.

    The amendment to Algorithm 1 is as follows:

Input: A database DB, the columns of identifiers, and the columns of sensitive values.
Output: Privacy FP-tree, the frequent-pattern tree of DB with its associated sensitive values.
Method: The Privacy FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier. Sort F in occurrence-descending order as FList, the list of frequent attributes.
2. Create the root of an FP-tree, T, and label it as null.
3. For each row ROW in DB do:
     Select the frequent identifiers in ROW and sort them according to the order of FList.
     Let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list.
     If P is null -- p will represent the leaf node of ROW:
       Let the sensitive values of ROW be [s | S], where s is the first and S are the remaining sensitive values.
       Call insert_sensitive(p, [s | S]).
     Call insert_tree([p | P], T).

Algorithm 2. Algorithm for Privacy FP-Tree Construction


The function insert_sensitive(p, [s | S]) is performed as follows: if p has a linked-list of

type3 s, search through that linked-list for an element N such that N.item-name =

s.item-name; if such an N exists, increment its count by 1, else create a new node N with its count

initialized to 1 and append node N to the end of the linked-list. If no linked-list is

found, create a new node N with its count initialized to 1 and create a new linked-list for p of type s. If S is nonempty, call insert_sensitive(p, S) recursively, as indicated.
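The leaf-side bookkeeping of insert_sensitive can be sketched the same way. The paper stores one linked list of (value, count) nodes per sensitive attribute at each leaf; the LinkedHashMap below is our stand-in for that list, and the method name follows the pseudocode rather than any published implementation.

import java.util.*;

public class LeafSensitiveValues {

    // One entry per sensitive attribute ("type" in footnote 3); each entry maps a
    // sensitive value to its count within the Q-block that ends at this leaf.
    final Map<String, LinkedHashMap<String, Integer>> listsByAttribute = new LinkedHashMap<>();

    // insert_sensitive for one row that ends at this leaf: values.get(i) is the row's
    // value for the sensitive attribute attributeNames.get(i).
    void insertSensitive(List<String> attributeNames, List<String> values) {
        for (int i = 0; i < attributeNames.size(); i++) {
            listsByAttribute
                .computeIfAbsent(attributeNames.get(i), a -> new LinkedHashMap<>())
                .merge(values.get(i), 1, Integer::sum);   // increment, or create with count 1
        }
    }

    public static void main(String[] args) {
        // The Manitoba leaf of Figure 3 accumulates the income values 152k, 32k, 32k.
        LeafSensitiveValues leaf = new LeafSensitiveValues();
        for (String income : new String[]{"152k", "32k", "32k"}) {
            leaf.insertSensitive(List.of("Income"), List.of(income));
        }
        System.out.println(leaf.listsByAttribute);        // {Income={152k=1, 32k=2}}
    }
}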

3.2.1 Example of a Privacy FP-Tree Construction

Input for Algorithm 2 is a table, so we illustrate it with Table 1 by providing address,

date of birth, and Social Insurance Number as the identifying columns, and

    income as the sensitive column. The resulting Privacy FP-Tree is shown in Figure 3.

    4 Determining Privacy

    4.1 Anonymized Privacy

    4.1.1 Finding K-Anonymity

To determine the k-anonymity of a dataset, the q-block with the minimum number of

rows must be located. Using the privacy FP-tree, we represent each q-block by a leaf

node. Within each leaf node is a frequency, which is the number of occurrences of the

path between the leaf node and the root node. For example, node SIN_1234* has a

frequency of 2. The value 2 is the number of times that SIN_1234*,

DOB_1984, and Address_Alberta appear together within the dataset. Using this

property of the tree, we traverse through all the leaf nodes. By identifying the

minimum value among all the leaf nodes, k-anonymity is determined for the dataset.

This minimum is the k of the dataset since no q-block will have fewer than k rows.
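In code, this step is a single pass over the leaf counts; a minimal sketch, assuming the leaf frequencies have already been gathered while traversing the privacy FP-tree:

import java.util.*;

public class FindK {

    // leafFrequencies holds the count stored in each leaf node, one entry per Q-block.
    static int kAnonymity(List<Integer> leafFrequencies) {
        return Collections.min(leafFrequencies);
    }

    public static void main(String[] args) {
        // Leaf counts of Figure 3: SIN_1234*:2, DOB_19**:2 and DOB_*:3.
        System.out.println(kAnonymity(Arrays.asList(2, 2, 3)));   // prints 2
    }
}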

    4.1.2 Finding l-Diversity

    To find the distinct l-diversity of a dataset, the q-block that contains the fewest unique

    sensitive values must be located. Using the privacy FP-tree, the number of unique

    sensitive values of a q-block is represented by the depth of the linked-list stored in the

    leaf node. Traversing through each leaf node and storing the minimum depth of the

    linked-lists will identify the distinct l of the dataset. Entropy of a q-block was defined

above (3). Within each node of the linked-list is the sensitive value's name and count.

    p (q, s) is determined by accessing the count of the sensitive value and dividing it by

    the frequency within the leaf node. Traversing the linked-list of sensitive values for a

    q-block will determine p (q, s) for all sensitive values s in that q-block. Finally, to

calculate the entropy of the q-block we sum −p(q, s) * log(p(q, s)) over all sensitive values s.

The entropy of a dataset is the entropy of the q-block with the lowest entropy. Once again we can

    determine this by traversing each leaf node to identify the q-block with the lowest

    entropy.

    3 Values of the same type belong to the same sensitive domain. Examples provided in this

    document assume that values of the same type are within the same column of a dataset.


    The l within (c, l)-diversity is the same l as the one found using the distinct l-

    diversity method. To calculate c of a q-block the most frequent sensitive value (i.e.

max) must be determined. This can be accomplished by going through the linked-list

of a q-block. Recall formula (4) above captured the properties of (c, l)-diversity.

The frequency of the leaf node is equal to the sum of all the counts of the sensitive values. Subtracting the counts of the l−1 most frequent sensitive values from this frequency gives (rl + rl+1 + ... + rm). The c of a q-block can then be determined as r1 / (rl + rl+1 + ... + rm).

Traversing the leaf nodes to find the c of each q-block will determine the c for the

dataset as a whole. The greatest c among the q-blocks is the c value for the dataset.

[Figure 3 depicts the privacy FP-tree of Table 1: the identifier portion is the FP-tree of Figure 2, and each leaf node carries a linked list of sensitive (Income) values with their counts: SIN_1234*:2 → {80k:1, 85k:1}, DOB_19**:2 → {120k:1, 123k:1}, and DOB_*:3 → {152k:1, 32k:2}.]

Fig. 3. Privacy FP-Tree of Table 1

4.1.3 Finding (α, k)-Anonymity

The method used to find k was described in Section 4.1.1. α can be determined by

calculating the max p(q, s) within the privacy FP-tree. To find the max p(q, s) of

a q-block, the sensitive value with the maximum count, max, is returned. Max is then

divided by the frequency within the leaf node. Once all the leaf nodes have been

traversed, the q-block with the maximum p(q, s) will be known and that value is the α of the

    dataset.
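Taken together, Sections 4.1.1-4.1.3 reduce to one traversal of the leaves. The sketch below aggregates the dataset-level values from the per-leaf sensitive-value counts: the distinct l and the entropy are the minima over Q-blocks, while c and α are the maxima. Representing each leaf as a plain count map is our simplification of its linked list, and the names are ours.

import java.util.*;

public class AnonymityProperties {

    // qBlocks: one sensitive-value count map per leaf node of the privacy FP-tree.
    // l is the diversity parameter used for the recursive (c, l) ratio.
    static void report(List<Map<String, Integer>> qBlocks, int l) {
        int distinctL = Integer.MAX_VALUE;
        double minEntropy = Double.MAX_VALUE;
        double maxC = 0.0;
        double maxAlpha = 0.0;

        for (Map<String, Integer> counts : qBlocks) {
            int total = counts.values().stream().mapToInt(Integer::intValue).sum();
            List<Integer> sorted = new ArrayList<>(counts.values());
            sorted.sort(Collections.reverseOrder());          // r1, r2, ..., rm

            distinctL = Math.min(distinctL, counts.size());
            maxAlpha = Math.max(maxAlpha, (double) sorted.get(0) / total);  // max p(q, s)

            double entropy = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / total;                // p(q, s) = S / Q
                entropy -= p * Math.log(p);
            }
            minEntropy = Math.min(minEntropy, entropy);

            int tail = 0;                                     // r_l + r_{l+1} + ... + r_m
            for (int i = l - 1; i < sorted.size(); i++) tail += sorted.get(i);
            if (tail > 0) maxC = Math.max(maxC, (double) sorted.get(0) / tail);
        }

        System.out.printf("distinct l = %d, entropy = %.3f, c = %.3f, alpha = %.3f%n",
                distinctL, minEntropy, maxC, maxAlpha);
    }

    public static void main(String[] args) {
        // The three Q-blocks of Table 1 (income counts at the leaves of Figure 3);
        // prints the dataset values l = 2, c = 2 and alpha = 0.667 quoted with Table 1.
        report(Arrays.asList(
                Map.of("80k", 1, "85k", 1),          // Alberta
                Map.of("120k", 1, "123k", 1),        // Ontario
                Map.of("152k", 1, "32k", 2)),        // Manitoba
               2);
    }
}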

    4.1.4 Multiple Sensitive Values

While the examples and explanations have only involved datasets with a single sensitive value, multiple sensitive values within a dataset are common.

Machanavajjhala et al. [2] define a multi-sensitive table to be l-diverse

if the table is l-diverse when each sensitive attribute is treated as the sole sensitive

    attribute. This is easily implemented in our privacy FP-tree. Each sensitive attribute is


    given its own linked-list within each q-block. By comparing the values calculated

    from each linked-list within the same q-block and returning only the value required

(i.e., min or max), we can determine the correct anonymized privacy values on multi-

    attribute tables by iterating our algorithm appropriately.

    4.2 Personalized Privacy

    Prior to finding whether or not a dataset preserves personalized privacy, a mechanism

    to represent the taxonomy tree is required. Each node within a level of the taxonomy

    tree is assigned a unique number. The sequence of numbers from the root to the node

    is then used as the representation. The conversion of the taxonomy in Figure 1 is

shown in Table 3. Null is included to account for the possibility of an individual who

has no preference for the amount of information released.

Table 3. Numeric Representation of the Taxonomy Tree in Figure 1

Node          Representation
ANY_EDU       1
High School   1,1
University    1,2
Jr. High      1,1,1
Sr. High      1,1,2
Undergrad     1,2,1
Graduate      1,2,2
Grade 7       1,1,1,1
Grade 8       1,1,1,2
Grade 9       1,1,1,3
Grade 10      1,1,2,1
Grade 11      1,1,2,2
Grade 12      1,1,2,3
Masters       1,2,2,1
Doctoral      1,2,2,2
Null          Null
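The encoding of Table 3 can be generated mechanically: a node's representation is its parent's representation followed by the node's 1-based position among its siblings. A small Java sketch, with the tree literal mirroring Figure 1 and all names ours:

import java.util.*;

public class TaxonomyEncoding {

    // Children of each internal node of Figure 1; leaves have no entry.
    static final Map<String, List<String>> CHILDREN = Map.of(
        "ANY_EDU",     List.of("High School", "University"),
        "High School", List.of("Jr. High", "Sr. High"),
        "University",  List.of("Undergrad", "Graduate"),
        "Jr. High",    List.of("Grade 7", "Grade 8", "Grade 9"),
        "Sr. High",    List.of("Grade 10", "Grade 11", "Grade 12"),
        "Graduate",    List.of("Masters", "Doctoral"));

    // Builds the "1,2,2,1"-style paths of Table 3 by a depth-first walk from the root.
    static Map<String, String> buildPaths() {
        Map<String, String> paths = new LinkedHashMap<>();
        walk("ANY_EDU", "1", paths);
        return paths;
    }

    static void walk(String node, String path, Map<String, String> paths) {
        paths.put(node, path);
        List<String> children = CHILDREN.getOrDefault(node, List.of());
        for (int i = 0; i < children.size(); i++) {
            walk(children.get(i), path + "," + (i + 1), paths);
        }
    }

    public static void main(String[] args) {
        Map<String, String> paths = buildPaths();
        System.out.println(paths.get("Grade 9"));    // 1,1,1,3
        System.out.println(paths.get("Masters"));    // 1,2,2,1
    }
}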

    For datasets using personalized privacy, there are at least two sensitive columns.

One column contains the sensitive values, which are going to be published, while the

    other contains the guarding nodes of each row. After building the privacy FP-tree, leaf

    nodes within the tree will contain two linked-lists corresponding to the two columns.

    By analyzing these two linked-lists we can determine if privacy is violated on a

    q-block.

To analyze the sensitive values and the guarding nodes we first convert both lists to

their numerical representations. The sensitive values and guarding nodes are then passed

    to Algorithm 3. If any q-block violates the privacy of an individual then the table

    itself violates the personalized privacy constraint.


Input: List of sensitive values S for a q-block; list of guarding nodes G for the same q-block.
Output: Boolean indicating whether the q-block preserves privacy.
Method: The guards are checked as follows.
for each guard in G
  for each data in S
    if (data.length() < guard.length())
      guard satisfied
    else if (data != guard) (for the length of the guard)
      guard satisfied
    else guard is violated
if all guards are satisfied
  then the q-block preserves privacy
else privacy is violated.

Algorithm 3. Algorithm for determining personalized privacy violations

4.2.1 Discussion of Algorithm 3

A guard node is satisfied if a sensitive value within the q-block is higher on the

taxonomy tree than the guard. In this situation an adversary cannot predict with high

confidence4 at the same level of detail as the guard node. For example, suppose a guard node was set at High School and within this q-block a sensitive value existed which

    was ANY_EDU. The length of the guard node (High School: 1, 1) would be 2 and

    the length of the sensitive value (ANY_EDU: 1) would be 1. In this situation the

    guard node would be satisfied because an adversary would not be able to predict with

high confidence this education level. Secondly, a guard node would be satisfied if

there was a sensitive value that does not share a common path to the root node with the guard. For

    example, if a guard node was Grade 9 and there was a sensitive value Masters

    being published, then their respective numerical representations would be 1,1,1,3 and

1,2,2,1. Any difference in the values will prevent an adversary from predicting the sensitive value with high (100%) confidence.
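Following this reading of Algorithm 3 (one sufficiently general or diverging value is enough to break 100% confidence), a guard check over the comma-separated paths of Table 3 can be sketched as follows; the class and method names are ours.

import java.util.*;

public class PersonalizedPrivacyCheck {

    // Paths are the comma-separated encodings of Table 3, e.g. "1,1,1,3" for Grade 9.
    // A guard is satisfied if some published value in the Q-block is more general than
    // the guard (shorter path) or lies on a different branch of the taxonomy.
    static boolean guardSatisfied(String guardPath, List<String> publishedPaths) {
        String[] guard = guardPath.split(",");
        for (String valuePath : publishedPaths) {
            String[] value = valuePath.split(",");
            if (value.length < guard.length) return true;        // value is higher in the tree
            if (!Arrays.equals(Arrays.copyOf(value, guard.length), guard)) {
                return true;                                      // value diverges from the guard's path
            }
        }
        return false;   // every published value pins the individual at or below the guard
    }

    // A Q-block preserves personalized privacy only if every guard in it is satisfied.
    static boolean qBlockPreservesPrivacy(List<String> guards, List<String> publishedPaths) {
        for (String guard : guards) {
            if (!guard.equals("Null") && !guardSatisfied(guard, publishedPaths)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Guard "High School" (1,1) with ANY_EDU (1) published: satisfied.
        System.out.println(guardSatisfied("1,1", List.of("1")));             // true
        // Guard "Grade 9" (1,1,1,3) with only Grade 9 itself published: violated.
        System.out.println(guardSatisfied("1,1,1,3", List.of("1,1,1,3")));   // false
    }
}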

    5 Experiments

    5.1 Environment

Experiments were completed using a Java JRE 1.6 implementation. The

hardware used was a Quad Core Q6600 @ 2.40 GHz with 1024 MB of memory

allocated exclusively to the Eclipse platform for running the Java virtual machine. The datasets used for the experiments were variations of the Adult dataset provided by

    the UC Irvine Machine Learning Repository. In order to create a larger dataset for

analysis, we analyzed multiple copies of the Adult dataset as one single file.

    4 High confidence is defined as 100% in Hansell [11].


    Fig. 4. Time to Process Dataset of Varying Sizes

    5.2 Scalability

    The first experiment was to determine the time required to analyze the privacy of

    datasets of different sizes. In this experiment the number of identifiers and sensitive

columns remained constant while only the number of rows increased. Maintaining

    the same number of unique values meant the size of the privacy FP-tree would remain

    constant. Only the counts within each node would differ. Figure 4 shows the results of

    the experiment.

Figure 4 shows that there was linear growth in the time to create the privacy FP-tree and determine its anonymized privacy properties versus the

number of rows within the dataset. The linear growth was a result of the increasing

    cost to scan through datasets of larger sizes. Since the privacy FP-tree was the same

size among all the datasets, determining the anonymized privacy properties took

constant time.

    The second experiment investigated the time required to determine the privacy

    properties of varying privacy FP-Trees. This experiment analyzed datasets that varied

    from one unique q-block to 1000 unique q-blocks. The size of the dataset remained

    constant at 1000 rows. The results of the experiment showed a negligible cost in

running time for the various privacy FP-trees. Each dataset took approximately 5 seconds to complete and there was less than a 1% difference between the times. The

    initial overhead for file I/O and class instantiation required the most resources for the

    entire process.

    These two experiments have shown that the privacy FP-tree is a practical solution

    capable of verifying and determining the privacy characteristics of a dataset. While no

    experiments are reported here to verify the personalized privacy aspect of the paper,

    the cost for determining such a property is similar to the cost of determining

    anonymized privacy.

    5.3 Complexity

    This approach is readily split into two sections for complexity analysis. The first

    section is the creation of the privacy FP-tree and the second is the cost to determine

    privacy properties of the dataset. The creation of the privacy FP-tree requires exactly


two scans of the database, or O(n). Prior to the second database scan, the frequencies

of n nodes must be sorted in descending order. The sorting algorithm implemented

was a simple merge sort with O(n log n). Thus, the cost of creating the privacy FP-

tree is O(n) + O(n log n), so the total complexity is O(n log n). To determine the

privacy properties, our algorithm must traverse all q-blocks. At each q-block it calculates the privacy properties by looking at the sensitive values. In the worst-case

scenario, the cost is O(n) where each row is a unique q-block. Therefore, the overall

cost to complete our approach is O(n log n).

    6 Comparison to Some Directly Related Work

    The work by Atzori et al. [9] focused on identifying k-anonymity violations in the

    context of pattern discovery. The authors present an algorithm for detecting inference

    channels. In essence this algorithm identifies frequent item sets within a dataset

consisting of only quasi-identifiers. This algorithm has exponential complexity and is

    not scalable. The authors acknowledge this and provide an alternate algorithm that

    reduces the dataset that needs to be analyzed. This optimization reduces the running

    time of their algorithm by an order of magnitude. In comparison, our approach

    provides the same analysis with a reduced running cost and more flexibility by

    allowing sensitive values to be incorporated into the model. Friedman et al. [1]

    expanded on the existing k-anonymity definition beyond a release of a data table.

    They define k-anonymity based on the tuples released by the resulting model instead

of the intermediate anonymized data tables. They provide an algorithm to induce a k-anonymous decision tree. This algorithm includes a component to maintain

    k-anonymity within the decision tree consistent with our definition of k-anonymity.

    To determine whether k-anonymity is breached in the decision tree, the authors have

    chosen to pre-scan the database and store the frequencies of all possible splits. Our

    algorithm would permit on demand scanning of the privacy FP-tree to reduce the cost

    of this step.

7 Future Work and Conclusions

The main purpose of the privacy FP-tree is as a verification tool. This approach can

    be extended to ensure that datasets are stored in a manner that preserves privacy.

Rather than storing information in tables, it can be automatically stored in a privacy

    FP-tree format. Thus, permissions can be developed to only allow entries into the

    privacy FP-tree if privacy is preserved after insertion. We intend to explore the

    possibility of implementing a privacy database system in which the privacy FP-tree

    data structure is used to store all information.

    Research must also be done on the effects of adding, removing, and updating rows

to the published dataset. Since altering the dataset also alters the structure of the privacy FP-tree, it requires us to rerun our algorithm to determine the privacy of the

    new dataset. Methods must be developed such that altering the dataset will not require

    the privacy FP-Tree to be rebuilt. Rebuilding the privacy FP-tree is the costliest


    portion of our algorithm, as shown through our experiments, so further efficiencies

    will prove important.

    In this paper we have shown that the privacy afforded in a database can be

    determined in an efficient and effective manner. The privacy FP-tree allows the

storage of the database to be compressed while considering privacy principles. We have shown how k-anonymity, l-diversity, and (α, k)-anonymity can be correctly

    determined using the privacy FP-tree. We do acknowledge that many other similar

    algorithms exist that merit future consideration. Furthermore, this approach can be

used to verify whether a dataset will preserve personalized privacy. Through our

    experiments and complexity analysis we have shown that this approach is practical

    and an improvement on current methods.

References

1. Friedman, A., Wolff, R., Schuster, A.: Providing k-anonymity in data mining. The VLDB Journal, 789-804 (2008)
2. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: Proc. 22nd Intl. Conf. on Data Engineering (ICDE), p. 24 (2006)
3. Narayanan, A., Shmatikov, V.: Robust De-anonymization of Large Datasets, February 5 (2008)
4. Dwork, C.: An Ad Omnia Approach to Defining and Achieving Private Data Analysis. In: Proceedings of the First SIGKDD International Workshop on Privacy, Security, and Trust in KDD
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Chen, W., et al. (eds.) Proc. Intl. Conf. on Management of Data, pp. 1-12 (2000)
6. Sweeney, L.: K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557-570 (2002)
7. Sweeney, L.: Weaving technology and policy together to maintain confidentiality. J. of Law, Medicine and Ethics 25(2-3), 98-110 (1997)
8. Willenborg, L., De Waal, T.: Statistical Disclosure Control in Practice. Springer, Heidelberg (1996)
9. Atzori, M., Bonchi, F., Giannotti, F., Pedreschi, D.: Anonymity preserving pattern discovery. The VLDB Journal, 703-727 (2008)
10. Wong, R., Li, J., Fu, A., Wang, K.: (α, k)-Anonymity: An Enhanced k-Anonymity Model for Privacy Preserving Data Publishing. In: KDD (2006)
11. Hansell, S.: AOL removes search data on vast group of web users. New York Times (August 8, 2006)
12. Xiao, X., Tao, Y.: Personalized Privacy Preservation. In: SIGMOD (2006)
13. Chin, F.Y., Ozsoyoglu, G.: Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng. SE-8(6), 113-139 (1982)
14. Liew, K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM TODS 10(3), 395-411 (1985)