
L. Chen et al. (Eds.): DASFAA 2009 Workshops, LNCS 5667, pp. 246-260, 2009.
© Springer-Verlag Berlin Heidelberg 2009

    Privacy FP-Tree

    Sampson Pun and Ken Barker

    University of Calgary

    2500 University Drive NW

    Calgary, Alberta, Canada

    T2N 1N4

    Tel.: (403) 220-5110

    {szypun,kbarker}@ucalgary.ca

Abstract. Current technology has made the publication of people's private information a common occurrence. The implications for individual privacy and security are still poorly understood by the general public, but the risks

    are undeniable as evidenced by the increasing number of identity theft cases

    being reported recently. Two new definitions of privacy have been developed

    recently to help understand the exposure and how to protect individuals from

privacy violations, namely, anonymized privacy and personalized privacy. This

    paper develops a methodology to validate whether a privacy violation exists for

    a published dataset. Determining whether privacy violations exist is a non-

    trivial task. Multiple privacy definitions and large datasets make exhaustive

searches ineffective and computationally costly. We develop a compact tree structure called the Privacy FP-Tree to reduce the costs. This data structure

    stores the information of the published dataset in a format that allows for

    simple, efficient traversal. The Privacy FP-Tree can effectively determine the

    anonymity level of the dataset as well as identify any personalized privacy

violations. This algorithm is O(n log n), which has acceptable characteristics

    for this application. Finally, experiments demonstrate the approach is scalable

    and practical.

    Keywords: Privacy, Database, FP-Tree, Anonymized privacy, Personalized

    privacy.

    1 Introduction

    Increasing identity theft frequency throughout the world has made privacy a major

    concern for individuals. We are asked on a nearly daily basis to provide personal data

    in exchange for goods and services, particularly online. Credit card histories, phone

    call histories, medical histories, etc. are stored in various computers often without our

knowledge. This data is often released to the public either through the failings of security protocols or purposefully by companies undertaking data mining for reasons

(possibly) beyond the individual's knowledge. One goal of privacy research is to

ensure that an individual's privacy is not violated even when sensitive data is made

available to the public. This paper provides an effective method for validating

whether an individual's privacy has been (or is potentially exposed to be) violated.


    Over the last 10 years there have been numerous breaches of privacy on publicly

available data. Sweeney [6] showed that by cross-referencing a medical database in

Massachusetts with voter registration lists, private medical data could be exposed. As

    a result, patient data thought to be private, by anonymization, could be linked to

specific individuals. More recently in 2006, AOL [11] was forced to make a public apology after releasing the search data of over 650,000 individuals. AOL employees

    thought the data was private and contained no identifiable information. AOL only

    removed the data from their website once users demonstrated that the data could be

used to identify specific individuals. In the same year, the Netflix Prize1 was

    announced to encourage research into improving the search algorithm for its

    recommender system to assist subscribers when selecting movies they might be

    interested in based on past preferences. Narayanan and Shmatikov [3] showed that

    through their de-anonymization algorithms the Netflix dataset exposed individually

identifiable information. Each time such a violation is identified, work is undertaken to remove the

    vulnerability but this retroactive approach does not prevent the damage caused by the

    initial violation so new algorithms need to be developed to identify potential

    vulnerabilities. Statistical databases were the focus of much of the privacy research in

the late 1980s and early 1990s. These databases provided statistical information (sum,

    average, max, min, count, etc.) on the dataset without violating the privacy of its

    members. The research on statistical databases itself revolved mainly around query

restriction [13] and data perturbation [4, 14]. However, with the current growth of data

    mining, researchers are demanding more user specific data for analysis. Unfortunately,

data perturbation techniques utilized by statistical databases often left the tuples in a state

that was inappropriate for data mining [6, 7]. To address this issue, two new privacy

    classes have emerged: anonymized privacy and personalized privacy. These two

privacy definitions allow the data collector to publish tuple-level information for data

    analysis while still guaranteeing some form of privacy to its members.

    Anonymized privacy is a privacy guarantee made by the data collector. When

publishing user-specific data, a member's tuple will be anonymized so that it cannot

    be identified with a high degree of confidence. Many properties have been defined

within anonymized privacy. These include k-anonymity [6], l-diversity [2], and (α, k)-anonymity [10], among many others. Xiao and Tao [12] have also proposed the notion of personalized privacy. This notion allows users to define their own level of privacy

    to be used when data is published. If the data being published provides more

    information than the user is willing to release, then their privacy has been violated.

    Both privacy concepts will be discussed in further detail in Section 2.

    In this paper, we have developed a novel approach that identifies the amount of

    privacy released within a published dataset. Using the concepts of anonymized and

    personalized privacy, we determine the privacy properties exhibited within an

    arbitrary dataset. If the privacy requirements are correctly and explicitly specified in

the meta-data, then by comparing the exhibited privacy properties to those stated in the meta-data, we can expose discrepancies between the specification and the actual

    exposure found in the data. Thus, given a privacy requirement (specification), we can

validate a dataset's claim of conforming to that specification. This paper contributes

    1 http://www.netflixprize.com/ - Accessed September 17, 2008.


    by characterizing the dataset, and leaves as future work the development of a suitable

    meta-data specification that can be used for comparison and/or validation. However,

    to help motivate the utility of the approach, consider the following simple privacy

specification: "Anonymity is guaranteed to be at least 5-anonymous." The dataset can

now be tested, using the tool developed in this paper, to ensure that there exist at least 5 tuples in the dataset with the same quasi-identifier. Obviously, this is a simple

    motivational example so the policy specifications are expected to be much more

    complex in a real-world data set. The contribution of this paper is a tool to analyze the

    data set using an efficient algorithm with respect to several anonymization criteria.

    The remainder of this paper is organized as follows. In Section 2, we describe the

    properties of anonymized privacy and personalized privacy. We present a new data

    structure called the privacy FP-Tree in Section 3. Section 4 explains how we use the

    privacy FP-tree to validate the privacy of the database. Section 5 describes scalability

experiments to demonstrate the algorithm's practicality and provides a complexity analysis. In Section 6, we discuss future work and draw conclusions about the privacy

    FP-tree.

    2 Background

    For anonymized and personalized privacy, the definition of privacy itself relies on

four key concepts, which must be defined first. These are identifiers, quasi-identifiers,

sensitive attributes, and non-sensitive attributes. All data values from a

    dataset must be categorized into one of these groups.

    2.1 Identifiers and Quasi-Identifiers

    An adversary interested in compromising data privacy may know either identifiers or

    quasi-identifiers. These values provide hints or insight to the adversary about which

    individuals are members of a particular dataset. Clearly some values reveal more

information than others. Identifiers are pieces of information that, if published, will

immediately identify an individual in the database. Social Insurance Numbers, Birth

Certificate Numbers, and Passport Numbers are examples of such identifiers. Quasi-identifiers are sets of information that, when combined, can explicitly or implicitly

    identify a person. Addresses, gender, and birthdays are examples of quasi-identifiers

    because each individual value provides insufficient identifying information but

collectively could identify an individual. This paper uses the generic term identifier to

    reference both types and only uses the more specific terminology when necessary due

    to context.

    2.2 Sensitive and Non-sensitive Attributes

Attributes that are not identifiers can be considered either sensitive or non-sensitive attributes. These attributes are assumed to be unknown by an adversary attempting to

    gain knowledge about a particular individual. The sensitive attributes are those that

we must keep secret from an adversary. Information collected on an individual's

    health status, income, purchase history, and/or criminal past would be examples of

  • 8/8/2019 Privacy FP Tree

    4/15

    Privacy FP-Tree 249

    sensitive information. Non-sensitive attributes are those which are unknown by the

adversary but which the user would not find problematic if the information is released as

    general knowledge. It is difficult to define where the dividing line is between these

two types of attributes because each individual has their own preference: some may

consider all the information they provide as sensitive, while others do not mind if all such information is released. Drawing this line is the task of privacy policy specifications, which is

beyond the scope of this paper. Thus, the provider must identify which attributes are

considered sensitive, and these are the only ones considered in the balance of this

paper; non-sensitive attributes are not considered further.

    2.3 Anonymized Privacy

    2.3.1 K-Anonymity

K-anonymity [6] is a privacy principle where any row within the dataset cannot be identified with a confidence greater than 1/k. To facilitate this property, each unique

combination of identifiers must occur in the dataset at least k times. Table 1 provides

a 2-anonymous table. The first three columns form the table's identifiers, and each

    unique combination of the identifiers occurs within the table at least 2 times.

    While k-anonymity protects each row of the table from identification, it fails to

fully protect all sensitive values. Each Q-block2 is linked to a set of sensitive values.

If all values in this set are identical, every row of the Q-block contains the same

sensitive value. In this situation the adversary does not need to predict the victim's

specific row. The adversary would still know the sensitive value for the victim with a

confidence of 100%. This type of problem is called a homogeneity attack [2].
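To make the definition concrete, the following is a minimal Java sketch (Java is also the language used for the experiments in Section 5) of the brute-force check that Section 3 will replace with the Privacy FP-tree: group rows by their identifier combination and report the smallest group size. The row layout and class name are ours, for illustration only.

import java.util.*;

public class NaiveKAnonymity {

    // k of a table = size of the smallest Q-block, i.e. the smallest set of rows
    // sharing one combination of identifier values.
    static int kAnonymity(List<String[]> rows, int[] identifierColumns) {
        Map<String, Integer> qBlockSizes = new HashMap<>();
        for (String[] row : rows) {
            StringBuilder key = new StringBuilder();
            for (int c : identifierColumns) {
                key.append(row[c]).append('|');    // concatenated identifier combination
            }
            qBlockSizes.merge(key.toString(), 1, Integer::sum);
        }
        return Collections.min(qBlockSizes.values());
    }

    public static void main(String[] args) {
        // The seven rows of Table 1: Address, Date of Birth, SIN, Income.
        List<String[]> table1 = Arrays.asList(
            new String[]{"Alberta", "1984", "1234*", "80k"},
            new String[]{"Alberta", "1984", "1234*", "85k"},
            new String[]{"Ontario", "19**", "5****", "120k"},
            new String[]{"Ontario", "19**", "5****", "123k"},
            new String[]{"Manitoba", "*", "5****", "152k"},
            new String[]{"Manitoba", "*", "5****", "32k"},
            new String[]{"Manitoba", "*", "5****", "32k"});
        // Columns 0-2 are the identifiers; this prints k = 2, matching Table 1.
        System.out.println("k = " + kAnonymity(table1, new int[]{0, 1, 2}));
    }
}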

2.3.2 L-Diversity

Machanavajjhala et al. [2] describe a solution to the homogeneity attack called l-

    diversity. The l-diversity principle is defined as:

A Q-block is l-diverse if it contains at least l well represented values for each sensitive attribute. A table is l-diverse if every Q-block is l-diverse. (1)

    The key to this definition is the term well represented. Machanavajjhala et al.

provide three different interpretations for this term [2]. Distinct l-diversity is the first interpretation of the term. Distinct l-diversity states that for each q-block there must

    be l unique sensitive values. Distinct l-diversity can guarantee that the sensitive value

    is predicted correctly by the adversary at a rate of:

(Q − (l − 1)) / Q, where Q is the number of rows in the Q-block. (2)

Distinct l-diversity cannot provide a stronger privacy guarantee because there is no

way to ensure the distribution among data values. It is feasible that a distinct 2-diverse

table has a q-block containing 100 rows where one row contains a positive

result while the other 99 contain negative results. An adversary would be able to predict with 99% accuracy that the victim has a negative sensitive value. This

led Machanavajjhala et al. [2] to define two other interpretations of well represented

l-diversity, namely entropy l-diversity and recursive (c, l)-diversity.

    2 Each set of rows corresponding to a unique combination of identifiers is known as a Q-Block.


Entropy l-diversity ensures that the distribution of sensitive values within each q-block conforms to:

−Σs p(q, s) * log(p(q, s)) ≥ log(l), where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. (3)

    Therefore, to be entropy l-diverse a dataset must contain a relatively even distribution

    among the sensitive values (dependent on the l value chosen). Conversely, recursive

    (c, l)-diversity does not aim to have an even distribution among the values. Recursive

    diversity attempts to ensure that the most frequent sensitive value of a q-block is not

    too frequent. By counting the frequency of each sensitive value within a q-block and

then sorting it, we are left with a sequence r1, r2, ..., rm where r1 is the most frequent

    sensitive value. A Q-block satisfies recursive (c, l)-diversity if:

r1 < c * (rl + rl+1 + ... + rm), where r1 is the count of the most frequent value. (4)
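Read as code, definitions (2)-(4) amount to the following per-Q-block computations. This is a minimal Java sketch under our own naming; the Q-block is reduced to a map from sensitive value to count, and the logarithm base (natural log here) is a choice the definitions leave open.

import java.util.*;

public class QBlockDiversity {

    // counts: sensitive value -> number of rows in the Q-block carrying that value.

    // Distinct l-diversity: the number of distinct sensitive values in the Q-block.
    static int distinctL(Map<String, Integer> counts) {
        return counts.size();
    }

    // Entropy of the Q-block: -sum p(q,s) * log(p(q,s)); entropy l-diverse if this >= log(l).
    static double entropy(Map<String, Integer> counts) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;        // p(q, s) = S / Q
            h -= p * Math.log(p);
        }
        return h;
    }

    // The c of the Q-block for a given l: r1 / (r_l + ... + r_m); the block satisfies
    // recursive (c, l)-diversity when r1 < c * (r_l + ... + r_m), i.e. formula (4).
    static double recursiveRatio(Map<String, Integer> counts, int l) {
        List<Integer> r = new ArrayList<>(counts.values());
        r.sort(Collections.reverseOrder());       // r1 is the count of the most frequent value
        int tail = 0;
        for (int i = l - 1; i < r.size(); i++) tail += r.get(i);
        return tail == 0 ? Double.POSITIVE_INFINITY : (double) r.get(0) / tail;
    }

    public static void main(String[] args) {
        // The Manitoba Q-block of Table 1: income 152k once and 32k twice.
        Map<String, Integer> manitoba = Map.of("152k", 1, "32k", 2);
        System.out.println("distinct l   = " + distinctL(manitoba));         // 2
        System.out.println("entropy      = " + entropy(manitoba));           // about 0.64
        System.out.println("c with l = 2 = " + recursiveRatio(manitoba, 2)); // 2.0
    }
}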

2.3.3 (α, k)-Anonymity

(α, k)-anonymity [10] is a privacy principle similar to l-diversity. Simply stated, there

are two parts to (α, k)-anonymity. The k portion is the same as previously described,

and α is the maximum percentage that any sensitive value within a Q-block can

represent. Using (α, k)-anonymity can prevent a homogeneity attack by limiting the

sensitive values within a Q-block. Formally, (α, k)-anonymity is defined as:

A q-block fulfills (α, k)-anonymity if p(q, s) ≤ α, where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. A table is (α, k)-anonymous if every Q-block fulfills the (α, k)-anonymity requirement. (5)

Table 1. Anonymized Table Containing Individual Incomes in Different Provinces of Canada
(k = 2, α = 0.667, c = 2, l = 2, entropy l = 3)

Address     Date of Birth   SIN     Income
Alberta     1984            1234*   80k
Alberta     1984            1234*   85k
Ontario     19**            5****   120k
Ontario     19**            5****   123k
Manitoba    *               5****   152k
Manitoba    *               5****   32k
Manitoba    *               5****   32k

    2.4 Personalized Privacy Preservation

Xiao et al. [12] introduce a different concept for protecting privacy called personalized

privacy. In personalized privacy, the data collector must collect a guarding node along

with the information of interest. This guarding node is a point on a taxonomy tree at


    which the data provider is willing to release information. When publishing the dataset,

each row is checked against the data provider's guarding node. A data provider's

sensitive value cannot be published at a level of greater detail than the provider feels

comfortable with, as indicated by their guard node. Figure 1 provides an example of a

taxonomy tree drawn from the education domain. While data is collected at the lowest level (representing the most detailed or specific data), a person's guarding node may be

    at any point ranging from exact data (found at the leaves) up to the root node. For

    example, an individual may have completed Grade 9 but does not want this level of

detail released to others. By setting their guarding node to Junior High School, data

    can only be published if the public cannot know with high confidence that this

    individual only completed Grade 9.

[Figure 1 depicts the taxonomy tree of the education domain: ANY_EDU is the root, with children High School and University; High School branches into Jr. High (Grades 7-9) and Sr. High (Grades 10-12), while University branches into Undergrad and Graduate (Masters, Doctoral).]

Fig. 1. Taxonomy Tree of the Education Domain

    3 Privacy FP-Tree

    Given the growing size of datasets, efficiency and capacity must be considered when

attempting to protect privacy. Han et al. developed a compact data

structure called the FP-tree [5]. They show that by storing a dataset in the form of an

FP-tree, file sizes differ by orders of magnitude. The main purpose of the FP-tree was to identify frequent patterns in transactional datasets. Instead, we use this

    functionality to identify the frequencies of each Q-block in a privacy context. As

    such, only columns of the dataset that are considered identifiers (recall Section 2) are

    used to create the FP-Tree.

    3.1 FP-Tree Construction

Creating an FP-tree requires two scans of the dataset. The first scan retrieves the

frequency of each unique item and sorts that list in descending order. The second scan of the database orders the identifiers of each row according to their frequency and

then appends each identifier to the FP-tree. A sketch of this algorithm is shown in

    Algorithm 1 below.


Input: A database DB.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier. Sort F in occurrence-descending order as FList, the list of frequent attributes.
2. Create the root of an FP-tree, T, and label it as null.
3. For each row ROW in DB do:
     Select the frequent identifiers in ROW and sort them according to the order of FList.
     Let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list.
     Call insert_tree([p | P], T).

Algorithm 1. Algorithm for FP-Tree Construction [5]

The function insert_tree([p | P], T) is performed as follows: if T has a child N such

that N.item-name = p.item-name, then increment N's count by 1; else create a new

node N with its count initialized to 1, its parent link set to T, and its node-link

linked to the nodes with the same item-name via the node-link structure. If P is

nonempty, call insert_tree(P, N) recursively as indicated.
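The following is a minimal Java sketch of the node structure and of insert_tree as just described. For brevity it keeps children in a map and omits the node-link pointers that the FP-tree of [5] maintains for pattern mining; those links are not needed for the traversals in Section 4. All names are ours.

import java.util.*;

public class FPTree {

    static class Node {
        final String item;                          // identifier value, e.g. "SIN_5****"
        int count;                                  // occurrences of the path ending at this node
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(String item) { this.item = item; }
    }

    final Node root = new Node(null);               // the root labelled null in Algorithm 1

    // insert_tree([p | P], T): walk or extend one path for a row's sorted identifiers.
    void insertTree(List<String> sortedIdentifiers) {
        Node current = root;
        for (String item : sortedIdentifiers) {
            Node child = current.children.get(item);
            if (child == null) {                    // no such child: create it, count starts at 0
                child = new Node(item);
                current.children.put(item, child);
            }
            child.count++;                          // new or existing child: increment its count
            current = child;
        }
    }

    public static void main(String[] args) {
        FPTree tree = new FPTree();
        // The two Alberta rows of Table 1, already sorted in FList order.
        tree.insertTree(Arrays.asList("Add_Alberta", "DOB_1984", "SIN_1234*"));
        tree.insertTree(Arrays.asList("Add_Alberta", "DOB_1984", "SIN_1234*"));
        System.out.println(tree.root.children.get("Add_Alberta").count);   // prints 2
    }
}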

    3.1.1 Example of FP-Tree Construction

    To create the FP-Tree of Table 1, the first three columns are labeled as the identifiers

    and the last column is considered the sensitive attribute. The database used to create

    the FP-Tree only includes the portion of Table 1 containing identifying columns.

Table 2 contains the sorted list of items based on their frequency within the dataset. Following step three of Algorithm 1 results in the FP-tree shown in Figure 2.

Table 2. Frequency of Each Identifier within Table 1

Identifier          Frequency
SIN_5****           5
Address_Manitoba    3
DOB_*               3
Address_Alberta     2
DOB_1984            2
SIN_1234*           2
Address_Ontario     2
DOB_19**            2


[Figure 2 depicts the FP-tree built from Table 1. The root has two children: Add_Alberta:2, whose path continues through DOB_1984:2 to SIN_1234*:2, and SIN_5****:5, which branches into Add_Ontario:2 → DOB_19**:2 and Add_Manitoba:3 → DOB_*:3.]

Fig. 2. FP-Tree of Table 1

    3.2 Privacy FP-Tree Construction

    This paper extends Algorithm 1 to develop Privacy FP-Trees. Using the FP-tree

allows us to find privacy properties related to identifiers. However, sensitive values

    are a crucial part of any privacy definition. To account for this, sensitive values must

    be appended to the FP-tree. It was observed in Figure 2 that each leaf node of the FP-

    Tree represents one unique q-block within the dataset. Appending a list of sensitive

    values to the end of each leaf node allows the sensitive values to be associated with

    the correct q-block. In cases where the dataset contains multiple sensitive values, each

    column of sensitive values is stored in its own linked list at the end of each leaf node.

    The amendment to Algorithm 1 is as follows:

Input: A database DB, the columns of identifiers, and the columns of sensitive values.
Output: Privacy FP-tree, the frequent-pattern tree of DB with its associated sensitive values.
Method: The Privacy FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier. Sort F in occurrence-descending order as FList, the list of frequent attributes.
2. Create the root of an FP-tree, T, and label it as null.
3. For each row ROW in DB do:
     Select the frequent identifiers in ROW and sort them according to the order of FList.
     Let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list.
     If P is null -- p will represent the leaf node of ROW:
       Let the sensitive values of ROW be [s | S], where s is the first and S are the remaining sensitive values.
       Call insert_sensitive(p, [s | S]).
     Call insert_tree([p | P], T).

Algorithm 2. Algorithm for Privacy FP-Tree Construction


The function insert_sensitive(p, [s | S]) is performed as follows: if p has a linked-list of

type3 s, search through that linked-list for an element N such that N.item-name =

s.item-name; if such an N exists, increment its count by 1, else create a new node N with its count

initialized to 1 and append node N to the end of the linked-list. If no linked-list is

found, create a new node N with its count initialized to 1 and create a new linked-list for p of type s. If S is nonempty, call insert_sensitive(p, S) recursively, as indicated.
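The leaf-side bookkeeping of insert_sensitive can be sketched the same way. The paper stores one linked list of (value, count) nodes per sensitive attribute at each leaf; the LinkedHashMap below is our stand-in for that list, and the method name follows the pseudocode rather than any published implementation.

import java.util.*;

public class LeafSensitiveValues {

    // One entry per sensitive attribute ("type" in footnote 3); each entry maps a
    // sensitive value to its count within the Q-block that ends at this leaf.
    final Map<String, LinkedHashMap<String, Integer>> listsByAttribute = new LinkedHashMap<>();

    // insert_sensitive for one row that ends at this leaf: values.get(i) is the row's
    // value for the sensitive attribute attributeNames.get(i).
    void insertSensitive(List<String> attributeNames, List<String> values) {
        for (int i = 0; i < attributeNames.size(); i++) {
            listsByAttribute
                .computeIfAbsent(attributeNames.get(i), a -> new LinkedHashMap<>())
                .merge(values.get(i), 1, Integer::sum);   // increment, or create with count 1
        }
    }

    public static void main(String[] args) {
        // The Manitoba leaf of Figure 3 accumulates the income values 152k, 32k, 32k.
        LeafSensitiveValues leaf = new LeafSensitiveValues();
        for (String income : new String[]{"152k", "32k", "32k"}) {
            leaf.insertSensitive(List.of("Income"), List.of(income));
        }
        System.out.println(leaf.listsByAttribute);        // {Income={152k=1, 32k=2}}
    }
}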

3.2.1 Example of a Privacy FP-Tree Construction

Input for Algorithm 2 is a table, so we illustrate it with Table 1 by providing address,

date of birth, and Social Insurance Number as the identifying columns, and

    income as the sensitive column. The resulting Privacy FP-Tree is shown in Figure 3.

    4 Determining Privacy

    4.1 Anonymized Privacy

    4.1.1 Finding K-Anonymity

To determine the k-anonymity of a dataset, the q-block with the minimum number of

rows must be located. Using the privacy FP-tree, we represent each q-block by a leaf

node. Within each leaf node is a frequency, which is the number of occurrences of the

path between the leaf node and the root node. For example, node SIN_1234* has a

frequency of 2. The value 2 is the number of times that SIN_1234*,

DOB_1984, and Address_Alberta appear together within the dataset. Using this

property of the tree, we traverse through all the leaf nodes. By identifying the

minimum value among all the leaf nodes, k-anonymity is determined for the dataset.

This minimum is the k of the dataset since no q-block will have fewer than k rows.
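In code, this step is a single pass over the leaf counts; a minimal sketch, assuming the leaf frequencies have already been gathered while traversing the privacy FP-tree:

import java.util.*;

public class FindK {

    // leafFrequencies holds the count stored in each leaf node, one entry per Q-block.
    static int kAnonymity(List<Integer> leafFrequencies) {
        return Collections.min(leafFrequencies);
    }

    public static void main(String[] args) {
        // Leaf counts of Figure 3: SIN_1234*:2, DOB_19**:2 and DOB_*:3.
        System.out.println(kAnonymity(Arrays.asList(2, 2, 3)));   // prints 2
    }
}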

    4.1.2 Finding l-Diversity

    To find the distinct l-diversity of a dataset, the q-block that contains the fewest unique

    sensitive values must be located. Using the privacy FP-tree, the number of unique

    sensitive values of a q-block is represented by the depth of the linked-list stored in the

    leaf node. Traversing through each leaf node and storing the minimum depth of the

    linked-lists will identify the distinct l of the dataset. Entropy of a q-block was defined

above (3). Within each node of the linked-list is the sensitive value's name and count.

    p (q, s) is determined by accessing the count of the sensitive value and dividing it by

    the frequency within the leaf node. Traversing the linked-list of sensitive values for a

    q-block will determine p (q, s) for all sensitive values s in that q-block. Finally, to

calculate the entropy of the q-block we sum −p(q, s) * log(p(q, s)) over all sensitive values s.

The entropy of a dataset is the entropy of the q-block with the lowest entropy. Once again we can

    determine this by traversing each leaf node to identify the q-block with the lowest

    entropy.

    3 Values of the same type belong to the same sensitive domain. Examples provided in this

    document assume that values of the same type are within the same column of a dataset.


    The l within (c, l)-diversity is the same l as the one found using the distinct l-

    diversity method. To calculate c of a q-block the most frequent sensitive value (i.e.

max) must be determined. This can be accomplished by going through the linked-list

of a q-block. Recall formula (4) above captured the properties of (c, l)-diversity.

The frequency of the leaf node is equal to the sum of all the counts of the sensitive values. Subtracting the counts of the l−1 most frequent sensitive values from this frequency gives (rl + rl+1 + ... + rm). The c of a q-block can then be determined as r1 / (rl + rl+1 + ... + rm).

Traversing the leaf nodes to find the c of each q-block will determine the c for the

dataset as a whole. The greatest c among the q-blocks is the c value for the dataset.

[Figure 3 depicts the privacy FP-tree of Table 1: the identifier portion is the FP-tree of Figure 2, and each leaf node carries a linked list of sensitive (Income) values with their counts: SIN_1234*:2 → {80k:1, 85k:1}, DOB_19**:2 → {120k:1, 123k:1}, and DOB_*:3 → {152k:1, 32k:2}.]

Fig. 3. Privacy FP-Tree of Table 1

4.1.3 Finding (α, k)-Anonymity

The method used to find k was described in Section 4.1.1. α can be determined by

calculating the max p(q, s) within the privacy FP-tree. To find the max p(q, s) of

a q-block, the sensitive value with the maximum count, max, is returned. Max is then

divided by the frequency within the leaf node. Once all the leaf nodes have been

traversed, the q-block with the maximum p(q, s) will be known and that value is the α of the

    dataset.
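Taken together, Sections 4.1.1-4.1.3 reduce to one traversal of the leaves. The sketch below aggregates the dataset-level values from the per-leaf sensitive-value counts: the distinct l and the entropy are the minima over Q-blocks, while c and α are the maxima. Representing each leaf as a plain count map is our simplification of its linked list, and the names are ours.

import java.util.*;

public class AnonymityProperties {

    // qBlocks: one sensitive-value count map per leaf node of the privacy FP-tree.
    // l is the diversity parameter used for the recursive (c, l) ratio.
    static void report(List<Map<String, Integer>> qBlocks, int l) {
        int distinctL = Integer.MAX_VALUE;
        double minEntropy = Double.MAX_VALUE;
        double maxC = 0.0;
        double maxAlpha = 0.0;

        for (Map<String, Integer> counts : qBlocks) {
            int total = counts.values().stream().mapToInt(Integer::intValue).sum();
            List<Integer> sorted = new ArrayList<>(counts.values());
            sorted.sort(Collections.reverseOrder());          // r1, r2, ..., rm

            distinctL = Math.min(distinctL, counts.size());
            maxAlpha = Math.max(maxAlpha, (double) sorted.get(0) / total);  // max p(q, s)

            double entropy = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / total;                // p(q, s) = S / Q
                entropy -= p * Math.log(p);
            }
            minEntropy = Math.min(minEntropy, entropy);

            int tail = 0;                                     // r_l + r_{l+1} + ... + r_m
            for (int i = l - 1; i < sorted.size(); i++) tail += sorted.get(i);
            if (tail > 0) maxC = Math.max(maxC, (double) sorted.get(0) / tail);
        }

        System.out.printf("distinct l = %d, entropy = %.3f, c = %.3f, alpha = %.3f%n",
                distinctL, minEntropy, maxC, maxAlpha);
    }

    public static void main(String[] args) {
        // The three Q-blocks of Table 1 (income counts at the leaves of Figure 3);
        // prints the dataset values l = 2, c = 2 and alpha = 0.667 quoted with Table 1.
        report(Arrays.asList(
                Map.of("80k", 1, "85k", 1),          // Alberta
                Map.of("120k", 1, "123k", 1),        // Ontario
                Map.of("152k", 1, "32k", 2)),        // Manitoba
               2);
    }
}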

    4.1.4 Multiple Sensitive Values

While the examples and explanations have only involved datasets with a single sensitive value, multiple sensitive values within a dataset are common.

Machanavajjhala et al. [2] define a multi-sensitive table to be l-diverse

if the table is l-diverse when each sensitive attribute is treated as the sole sensitive

    attribute. This is easily implemented in our privacy FP-tree. Each sensitive attribute is


    given its own linked-list within each q-block. By comparing the values calculated

    from each linked-list within the same q-block and returning only the value required

(i.e., min or max), we can determine the correct anonymized privacy values on multi-

    attribute tables by iterating our algorithm appropriately.

    4.2 Personalized Privacy

    Prior to finding whether or not a dataset preserves personalized privacy, a mechanism

    to represent the taxonomy tree is required. Each node within a level of the taxonomy

    tree is assigned a unique number. The sequence of numbers from the root to the node

    is then used as the representation. The conversion of the taxonomy in Figure 1 is

shown in Table 3. Null is included to account for the possibility of an individual who

has no preference for the amount of information released.

Table 3. Numeric Representation of the Taxonomy Tree in Figure 1

Node          Representation
ANY_EDU       1
High School   1,1
University    1,2
Jr. High      1,1,1
Sr. High      1,1,2
Undergrad     1,2,1
Graduate      1,2,2
Grade 7       1,1,1,1
Grade 8       1,1,1,2
Grade 9       1,1,1,3
Grade 10      1,1,2,1
Grade 11      1,1,2,2
Grade 12      1,1,2,3
Masters       1,2,2,1
Doctoral      1,2,2,2
Null          Null
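The encoding of Table 3 can be generated mechanically: a node's representation is its parent's representation followed by the node's 1-based position among its siblings. A small Java sketch, with the tree literal mirroring Figure 1 and all names ours:

import java.util.*;

public class TaxonomyEncoding {

    // Children of each internal node of Figure 1; leaves have no entry.
    static final Map<String, List<String>> CHILDREN = Map.of(
        "ANY_EDU",     List.of("High School", "University"),
        "High School", List.of("Jr. High", "Sr. High"),
        "University",  List.of("Undergrad", "Graduate"),
        "Jr. High",    List.of("Grade 7", "Grade 8", "Grade 9"),
        "Sr. High",    List.of("Grade 10", "Grade 11", "Grade 12"),
        "Graduate",    List.of("Masters", "Doctoral"));

    // Builds the "1,2,2,1"-style paths of Table 3 by a depth-first walk from the root.
    static Map<String, String> buildPaths() {
        Map<String, String> paths = new LinkedHashMap<>();
        walk("ANY_EDU", "1", paths);
        return paths;
    }

    static void walk(String node, String path, Map<String, String> paths) {
        paths.put(node, path);
        List<String> children = CHILDREN.getOrDefault(node, List.of());
        for (int i = 0; i < children.size(); i++) {
            walk(children.get(i), path + "," + (i + 1), paths);
        }
    }

    public static void main(String[] args) {
        Map<String, String> paths = buildPaths();
        System.out.println(paths.get("Grade 9"));    // 1,1,1,3
        System.out.println(paths.get("Masters"));    // 1,2,2,1
    }
}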

    For datasets using personalized privacy, there are at least two sensitive columns.

One column contains the sensitive values, which are going to be published, while the

    other contains the guarding nodes of each row. After building the privacy FP-tree, leaf

    nodes within the tree will contain two linked-lists corresponding to the two columns.

    By analyzing these two linked-lists we can determine if privacy is violated on a

    q-block.

To analyze the sensitive values and the guarding nodes we first convert both lists to

their numerical representations. The sensitive values and guarding nodes are then passed

    to Algorithm 3. If any q-block violates the privacy of an individual then the table

    itself violates the personalized privacy constraint.


Input: List of sensitive values S for a q-block; list of guarding nodes G for the same q-block.
Output: Boolean indicating whether the q-block preserves privacy.
Method: The guards are checked as follows.
for each guard in G
  for each data in S
    if (data.length() < guard.length())
      guard satisfied
    else if (data != guard) (for the length of the guard)
      guard satisfied
    else guard is violated
if all guards are satisfied
  then the q-block preserves privacy
else privacy is violated.

Algorithm 3. Algorithm for determining personalized privacy violations

4.2.1 Discussion of Algorithm 3

A guard node is satisfied if a sensitive value within the q-block is higher on the

taxonomy tree than the guard. In this situation an adversary cannot predict with high

confidence4 at the same level of detail as the guard node. For example, suppose a guard node was set at High School and within this q-block a sensitive value existed which

    was ANY_EDU. The length of the guard node (High School: 1, 1) would be 2 and

    the length of the sensitive value (ANY_EDU: 1) would be 1. In this situation the

    guard node would be satisfied because an adversary would not be able to predict with

high confidence this education level. Secondly, a guard node would be satisfied if

there was a sensitive value that does not share a common path to the root node with the guard. For

    example, if a guard node was Grade 9 and there was a sensitive value Masters

    being published, then their respective numerical representations would be 1,1,1,3 and

1,2,2,1. Any difference in the values will prevent an adversary from predicting the sensitive value with high (100%) confidence.
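Following this reading of Algorithm 3 (one sufficiently general or diverging value is enough to break 100% confidence), a guard check over the comma-separated paths of Table 3 can be sketched as follows; the class and method names are ours.

import java.util.*;

public class PersonalizedPrivacyCheck {

    // Paths are the comma-separated encodings of Table 3, e.g. "1,1,1,3" for Grade 9.
    // A guard is satisfied if some published value in the Q-block is more general than
    // the guard (shorter path) or lies on a different branch of the taxonomy.
    static boolean guardSatisfied(String guardPath, List<String> publishedPaths) {
        String[] guard = guardPath.split(",");
        for (String valuePath : publishedPaths) {
            String[] value = valuePath.split(",");
            if (value.length < guard.length) return true;        // value is higher in the tree
            if (!Arrays.equals(Arrays.copyOf(value, guard.length), guard)) {
                return true;                                      // value diverges from the guard's path
            }
        }
        return false;   // every published value pins the individual at or below the guard
    }

    // A Q-block preserves personalized privacy only if every guard in it is satisfied.
    static boolean qBlockPreservesPrivacy(List<String> guards, List<String> publishedPaths) {
        for (String guard : guards) {
            if (!guard.equals("Null") && !guardSatisfied(guard, publishedPaths)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Guard "High School" (1,1) with ANY_EDU (1) published: satisfied.
        System.out.println(guardSatisfied("1,1", List.of("1")));             // true
        // Guard "Grade 9" (1,1,1,3) with only Grade 9 itself published: violated.
        System.out.println(guardSatisfied("1,1,1,3", List.of("1,1,1,3")));   // false
    }
}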

    5 Experiments

    5.1 Environment

Experiments were completed using a Java JRE 1.6 implementation. The

hardware used was a Quad Core Q6600 @ 2.40 GHz with 1024 MB of memory

allocated exclusively to the Eclipse platform for running the Java virtual machine. The datasets used for the experiments were variations of the Adult dataset provided by

    the UC Irvine Machine Learning Repository. In order to create a larger dataset for

analysis, we analyzed multiple copies of the Adult dataset as one single file.

    4 High confidence is defined as 100% in Hansell [11].


    Fig. 4. Time to Process Dataset of Varying Sizes

    5.2 Scalability

    The first experiment was to determine the time required to analyze the privacy of

    datasets of different sizes. In this experiment the number of identifiers and sensitive

columns remained constant while only the number of rows increased. Maintaining

    the same number of unique values meant the size of the privacy FP-tree would remain

    constant. Only the counts within each node would differ. Figure 4 shows the results of

    the experiment.

Figure 4 shows that there was linear growth in the time to create the privacy FP-tree and determine its anonymized privacy properties versus the

number of rows within the dataset. The linear growth was a result of the increasing

    cost to scan through datasets of larger sizes. Since the privacy FP-tree was the same

size among all the datasets, determining the anonymized privacy properties took

constant time.

    The second experiment investigated the time required to determine the privacy

    properties of varying privacy FP-Trees. This experiment analyzed datasets that varied

    from one unique q-block to 1000 unique q-blocks. The size of the dataset remained

    constant at 1000 rows. The results of the experiment showed a negligible cost in

running time for the various privacy FP-trees. Each dataset took approximately 5 seconds to complete and there was less than a 1% difference between the times. The

    initial overhead for file I/O and class instantiation required the most resources for the

    entire process.

    These two experiments have shown that the privacy FP-tree is a practical solution

    capable of verifying and determining the privacy characteristics of a dataset. While no

    experiments are reported here to verify the personalized privacy aspect of the paper,

    the cost for determining such a property is similar to the cost of determining

    anonymized privacy.

    5.3 Complexity

    This approach is readily split into two sections for complexity analysis. The first

    section is the creation of the privacy FP-tree and the second is the cost to determine

    privacy properties of the dataset. The creation of the privacy FP-tree requires exactly


two scans of the database, or O(n). Prior to the second database scan, the frequencies

of n nodes must be sorted in descending order. The sorting algorithm implemented

was a simple merge sort with O(n log n). Thus, the cost of creating the privacy FP-

tree is O(n) + O(n log n), so the total complexity is O(n log n). To determine the

privacy properties, our algorithm must traverse all q-blocks. At each q-block it calculates the privacy properties by looking at the sensitive values. In the worst-case

scenario, the cost is O(n) where each row is a unique q-block. Therefore, the overall

cost to complete our approach is O(n log n).

    6 Comparison to Some Directly Related Work

    The work by Atzori et al. [9] focused on identifying k-anonymity violations in the

    context of pattern discovery. The authors present an algorithm for detecting inference

    channels. In essence this algorithm identifies frequent item sets within a dataset

consisting of only quasi-identifiers. This algorithm has exponential complexity and is

    not scalable. The authors acknowledge this and provide an alternate algorithm that

    reduces the dataset that needs to be analyzed. This optimization reduces the running

    time of their algorithm by an order of magnitude. In comparison, our approach

    provides the same analysis with a reduced running cost and more flexibility by

    allowing sensitive values to be incorporated into the model. Friedman et al. [1]

    expanded on the existing k-anonymity definition beyond a release of a data table.

    They define k-anonymity based on the tuples released by the resulting model instead

of the intermediate anonymized data tables. They provide an algorithm to induce a k-anonymous decision tree. This algorithm includes a component to maintain

    k-anonymity within the decision tree consistent with our definition of k-anonymity.

    To determine whether k-anonymity is breached in the decision tree, the authors have

    chosen to pre-scan the database and store the frequencies of all possible splits. Our

    algorithm would permit on demand scanning of the privacy FP-tree to reduce the cost

    of this step.

7 Future Work and Conclusions

The main purpose of the privacy FP-tree is as a verification tool. This approach can

    be extended to ensure that datasets are stored in a manner that preserves privacy.

Rather than storing information in tables, it can be automatically stored in a privacy

    FP-tree format. Thus, permissions can be developed to only allow entries into the

    privacy FP-tree if privacy is preserved after insertion. We intend to explore the

    possibility of implementing a privacy database system in which the privacy FP-tree

    data structure is used to store all information.

    Research must also be done on the effects of adding, removing, and updating rows

to the published dataset. Since altering the dataset also alters the structure of the privacy FP-tree, it requires us to rerun our algorithm to determine the privacy of the

    new dataset. Methods must be developed such that altering the dataset will not require

    the privacy FP-Tree to be rebuilt. Rebuilding the privacy FP-tree is the costliest


    portion of our algorithm, as shown through our experiments, so further efficiencies

    will prove important.

    In this paper we have shown that the privacy afforded in a database can be

    determined in an efficient and effective manner. The privacy FP-tree allows the

storage of the database to be compressed while considering privacy principles. We have shown how k-anonymity, l-diversity, and (α, k)-anonymity can be correctly

    determined using the privacy FP-tree. We do acknowledge that many other similar

    algorithms exist that merit future consideration. Furthermore, this approach can be

used to verify whether a dataset will preserve personalized privacy. Through our

    experiments and complexity analysis we have shown that this approach is practical

    and an improvement on current methods.

References

1. Friedman, A., Wolff, R., Schuster, A.: Providing k-anonymity in data mining. The VLDB Journal, 789-804 (2008)
2. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: Proc. 22nd Intl. Conf. on Data Engineering (ICDE), p. 24 (2006)
3. Narayanan, A., Shmatikov, V.: Robust De-anonymization of Large Datasets, February 5 (2008)
4. Dwork, C.: An Ad Omnia Approach to Defining and Achieving Private Data Analysis. In: Proceedings of the First SIGKDD International Workshop on Privacy, Security, and Trust in KDD
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Chen, W., et al. (eds.) Proc. Intl. Conf. on Management of Data, pp. 1-12 (2000)
6. Sweeney, L.: K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557-570 (2002)
7. Sweeney, L.: Weaving technology and policy together to maintain confidentiality. J. of Law, Medicine and Ethics 25(2-3), 98-110 (1997)
8. Willenborg, L., De Waal, T.: Statistical Disclosure Control in Practice. Springer, Heidelberg (1996)
9. Atzori, M., Bonchi, F., Giannotti, F., Pedreschi, D.: Anonymity preserving pattern discovery. The VLDB Journal, 703-727 (2008)
10. Wong, R., Li, J., Fu, A., Wang, K.: (α, k)-Anonymity: An Enhanced k-Anonymity Model for Privacy Preserving Data Publishing. In: KDD (2006)
11. Hansell, S.: AOL removes search data on vast group of web users. New York Times (August 8, 2006)
12. Xiao, X., Tao, Y.: Personalized Privacy Preservation. In: SIGMOD (2006)
13. Chin, F.Y., Ozsoyoglu, G.: Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng. SE-8(6), 113-139 (1982)
14. Liew, K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM TODS 10(3), 395-411 (1985)