L. Chen et al. (Eds.): DASFAA 2009 Workshops, LNCS 5667, pp. 246–260, 2009.
© Springer-Verlag Berlin Heidelberg 2009
Privacy FP-Tree
Sampson Pun and Ken Barker
University of Calgary
2500 University Drive NW
Calgary, Alberta, Canada
T2N 1N4
Tel.: (403) 220-5110
{szypun,kbarker}@ucalgary.ca
Abstract. Current technology has made the publication of people's private information a common occurrence. The implications for individual privacy and security are still largely poorly understood by the general public, but the risks are undeniable, as evidenced by the increasing number of identity theft cases being reported recently. Two new definitions of privacy have been developed recently to help understand the exposure and how to protect individuals from privacy violations, namely, anonymized privacy and personalized privacy. This paper develops a methodology to validate whether a privacy violation exists for a published dataset. Determining whether privacy violations exist is a non-trivial task. Multiple privacy definitions and large datasets make exhaustive searches ineffective and computationally costly. We develop a compact tree structure called the Privacy FP-Tree to reduce these costs. This data structure stores the information of the published dataset in a format that allows simple, efficient traversal. The Privacy FP-Tree can effectively determine the anonymity level of the dataset as well as identify any personalized privacy violations. The algorithm is O(n log n), which has acceptable characteristics for this application. Finally, experiments demonstrate the approach is scalable and practical.
Keywords: Privacy, Database, FP-Tree, Anonymized privacy, Personalized
privacy.
1 Introduction
Increasing identity theft frequency throughout the world has made privacy a major concern for individuals. We are asked on a nearly daily basis to provide personal data in exchange for goods and services, particularly online. Credit card histories, phone call histories, medical histories, etc. are stored in various computers, often without our knowledge. This data is often released to the public, either through failures of security protocols or purposefully by companies undertaking data mining for reasons (possibly) beyond the individual's knowledge. One goal of privacy research is to ensure that an individual's privacy is not violated even when sensitive data is made available to the public. This paper provides an effective method for validating whether an individual's privacy has been (or is potentially exposed to be) violated.
Over the last 10 years there have been numerous breaches of privacy involving publicly available data. Sweeney [6] showed that by cross-referencing a medical database in Massachusetts with voter registration lists, private medical data could be exposed. As a result, patient data thought to be private, by anonymization, could be linked to specific individuals. More recently, in 2006, AOL [11] was forced to make a public apology after releasing the search data of over 650,000 individuals. AOL employees thought the data was private and contained no identifiable information. AOL only removed the data from their website once users demonstrated that the data could be used to identify specific individuals. In the same year, the Netflix¹ prize was announced to encourage research into improving the algorithm of its recommender system, which assists subscribers in selecting movies they might be interested in based on past preferences. Narayanan and Shmatikov [3] showed through their de-anonymization algorithms that the Netflix dataset exposed individually identifiable information.

Each time such a violation is identified, work is undertaken to remove the vulnerability, but this retroactive approach does not prevent the damage caused by the initial violation, so new algorithms need to be developed to identify potential vulnerabilities. Statistical databases were the focus of much of the privacy research in the late 1980s and early 1990s. These databases provided statistical information (sum, average, max, min, count, etc.) on the dataset without violating the privacy of its members. The research on statistical databases itself revolved mainly around query restriction [13] and data perturbation [4, 14]. However, with the current growth of data mining, researchers are demanding more user-specific data for analysis. Unfortunately, the data perturbation techniques utilized by statistical databases often left the tuples in a state that was inappropriate for data mining [6, 7]. To address this issue, two new privacy classes have emerged: anonymized privacy and personalized privacy. These two privacy definitions allow the data collector to publish tuple-level information for data analysis while still guaranteeing some form of privacy to its members.
Anonymized privacy is a privacy guarantee made by the data collector. When publishing user-specific data, a member's tuple will be anonymized so that it cannot be identified with a high degree of confidence. Many properties have been defined within anonymized privacy. These include k-anonymity [6], l-diversity [2], and (α, k)-anonymity [10], among many others. Xiao and Tao [12] have also proposed the notion of personalized privacy. This notion allows users to define their own level of privacy to be used when data is published. If the data being published provides more information than the user is willing to release, then their privacy has been violated. Both privacy concepts are discussed in further detail in Section 2.
In this paper, we develop a novel approach that identifies the amount of private information released within a published dataset. Using the concepts of anonymized and personalized privacy, we determine the privacy properties exhibited within an arbitrary dataset. If the privacy requirements are correctly and explicitly specified in the meta-data, then by comparing the exhibited privacy properties to those stated in the meta-data, we can expose discrepancies between the specification and the actual exposure found in the data. Thus, given a privacy requirement (specification), we can validate a dataset's claim to conform to that specification.
This paper contributes by characterizing the dataset, and leaves as future work the development of a suitable meta-data specification that can be used for comparison and/or validation. However, to help motivate the utility of the approach, consider the following simple privacy specification: "Anonymity is guaranteed to be at least 5-anonymous." The dataset can now be tested, using the tool developed in this paper, to ensure that there exist at least 5 tuples in the dataset with the same quasi-identifier. Obviously this is a simple motivational example; the policy specifications are expected to be much more complex in a real-world data set. The contribution of this paper is a tool to analyze the data set using an efficient algorithm with respect to several anonymization criteria.

¹ http://www.netflixprize.com/ - Accessed September 17, 2008.
The remainder of this paper is organized as follows. In Section 2, we describe the properties of anonymized privacy and personalized privacy. We present a new data structure called the privacy FP-Tree in Section 3. Section 4 explains how we use the privacy FP-tree to validate the privacy of a database. Section 5 describes scalability experiments to demonstrate the algorithm's pragmatics and provides a complexity analysis. Section 6 compares our approach to some directly related work. Finally, in Section 7, we discuss future work and draw conclusions about the privacy FP-tree.
2 Background
For anonymized and personalized privacy, the definition of privacy itself relies on four key concepts, which must be defined first. These are identifiers, quasi-identifiers, sensitive attributes, and non-sensitive attributes. All data values from a dataset must be categorized into one of these groups.
2.1 Identifiers and Quasi-Identifiers
An adversary interested in compromising data privacy may know either identifiers or quasi-identifiers. These values provide hints or insight to the adversary about which individuals are members of a particular dataset. Clearly some values reveal more information than others. Identifiers are pieces of information that, if published, will immediately identify an individual in the database. Social Insurance Numbers, Birth Certificate Numbers, and Passport Numbers are examples of such identifiers. Quasi-identifiers are sets of information that, when combined, can explicitly or implicitly identify a person. Addresses, gender, and birthdays are examples of quasi-identifiers because each individual value provides insufficient identifying information but collectively they could identify an individual. This paper uses the generic term identifier to reference both types and only uses the more specific terminology when necessary due to context.
2.2 Sensitive and Non-sensitive Attributes
Attributes that are not identifiers can be considered either sensitive or non-sensitive attributes. These attributes are assumed to be unknown by an adversary attempting to gain knowledge about a particular individual. The sensitive attributes are those that we must keep secret from an adversary. Information collected on an individual's health status, income, purchase history, and/or criminal past would be examples of sensitive information. Non-sensitive attributes are those which are unknown by the adversary but which the user would not find problematic if the information were released as general knowledge. It is difficult to define where the dividing line is between these two types of attributes because each individual has their own preference: some may consider all the information they provide sensitive, while others do not mind if all such information is released. This is the task of privacy policy specifications, which is beyond the scope of this paper. Thus, the provider must identify which attributes are considered sensitive; these are the only ones considered in the balance of this paper, and we do not consider non-sensitive attributes further.
2.3 Anonymized Privacy
2.3.1 K-Anonymity
K-anonymity [6] is a privacy principle requiring that any row within the dataset cannot be identified with a confidence greater than 1/k. To facilitate this property, each unique combination of identifiers must occur in the dataset at least k times. Table 1 provides a 2-anonymous table. The first three columns form the table's identifiers and each unique combination of the identifiers occurs within the table at least 2 times.

While k-anonymity protects each row of the table from identification, it fails to fully protect all sensitive values. Each Q-block, that is, each set of rows corresponding to a unique combination of identifiers, is linked to a set of sensitive values. If all the values in this set are identical, every row of the Q-block contains the same sensitive value. In this situation the adversary does not need to predict the victim's specific row. The adversary would still know the sensitive value for the victim with a confidence of 100%. This type of problem is called a homogeneity attack [2].
2.3.2 L-Diversity
Machanavajjhala et al. [2] describe a solution to the homogeneity attack called l-diversity. The l-diversity principle is defined as:

A Q-block is l-diverse if it contains at least l well-represented values for each sensitive attribute. A table is l-diverse if every Q-block is l-diverse. (1)
The key to this definition is the term well represented. Machanavajjhala et al. provide three different interpretations of this term [2]. Distinct l-diversity is the first interpretation. It states that for each q-block there must be l unique sensitive values. Distinct l-diversity can guarantee that the sensitive value is predicted correctly by the adversary at a rate of at most:

(Q − (l − 1)) / Q, where Q is the number of rows in the Q-block. (2)

Distinct l-diversity cannot provide a stronger privacy guarantee because there is no way to ensure the distribution among data values. It is feasible that a distinct 2-diverse table has a q-block containing 100 rows where one sensitive value is a positive result while the other 99 are negative results. An adversary would be able to predict with 99% accuracy that the victim has a negative sensitive value. This led Machanavajjhala et al. [2] to define two other interpretations of well represented, namely, entropy l-diversity and recursive (c, l)-diversity. Entropy l-diversity ensures that the distribution of sensitive values within each q-block conforms to:

−Σs p(q, s) · log(p(q, s)) ≥ log(l), where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. (3)
Therefore, to be entropy l-diverse a dataset must contain a relatively even distribution among the sensitive values (dependent on the l value chosen). Conversely, recursive (c, l)-diversity does not aim for an even distribution among the values. Recursive diversity attempts to ensure that the most frequent sensitive value of a q-block is not too frequent. By counting the frequency of each sensitive value within a q-block and then sorting these counts, we are left with a sequence r1, r2, ..., rm, where r1 is the count of the most frequent sensitive value. A Q-block satisfies recursive (c, l)-diversity if:

r1 < c · (rl + rl+1 + ... + rm), where r1 is the count of the most frequent value. (4)
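To make the entropy condition concrete, consider the skewed q-block described above, using logarithms base 2 (the definition holds for any fixed base, which the text leaves unstated). With 1 positive and 99 negative rows, p(q, positive) = 0.01 and p(q, negative) = 0.99, so the entropy is

−(0.01 · log2(0.01) + 0.99 · log2(0.99)) ≈ 0.066 + 0.014 = 0.081.

Since log2(l) ≤ 0.081 only for l ≤ 2^0.081 ≈ 1.06, such a q-block is entropy l-diverse only for l barely above 1, confirming that the skewed block offers essentially no diversity.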
2.3.3 (α, k)-Anonymity
(α, k)-anonymity [10] is a privacy principle similar to l-diversity. Simply stated, there are two parts to (α, k)-anonymity. The k portion is the same as previously described, and α is the maximum fraction that any sensitive value within a Q-block can represent. Using (α, k)-anonymity can prevent a homogeneity attack by limiting the sensitive values within a Q-block. Formally, (α, k)-anonymity is defined as:

A q-block fulfills (α, k)-anonymity if p(q, s) ≤ α for every sensitive value s, where p(q, s) = S / Q; S is the number of rows in the Q-block with sensitive value s, and Q is the number of rows in the Q-block. A table is (α, k)-anonymous if every Q-block fulfills the (α, k)-anonymity requirement. (5)
Table 1. Anonymized Table Containing Individual Incomes in different provinces of Canada (k = 2, α = 0.667, c = 2, l = 2, entropy l = 3)

Address    Date of Birth   SIN     Income
Alberta    1984            1234*   80k
Alberta    1984            1234*   85k
Ontario    19**            5****   120k
Ontario    19**            5****   123k
Manitoba   *               5****   152k
Manitoba   *               5****   32k
Manitoba   *               5****   32k
2.4 Personalized Privacy Preservation

Xiao and Tao [12] introduce a different concept for protecting privacy called personalized privacy. In personalized privacy, the data collector must collect a guarding node along with the information of interest. This guarding node is a point on a taxonomy tree at which the data provider is willing to release information. When publishing the dataset, each row is checked against the data provider's guarding node. A data provider's sensitive value cannot be published at a level of greater detail than the provider feels comfortable with, as indicated by their guarding node. Figure 1 provides an example of a taxonomy tree drawn from the education domain. While data is collected at the lowest level (representing the most detailed or specific data), a person's guarding node may be at any point ranging from exact data (found at the leaves) up to the root node. For example, an individual may have completed Grade 9 but does not want this level of detail released to others. By setting their guarding node to Junior High School, data can only be published if the public cannot know with high confidence that this individual only completed Grade 9.
[Figure 1 shows the education taxonomy tree: the root ANY_EDU has children High School and University; High School splits into Jr. High (Grades 7, 8, 9) and Sr. High (Grades 10, 11, 12); University splits into Undergrad and Graduate (Masters, Doctoral).]

Fig. 1. Taxonomy Tree of the Education Domain
3 Privacy FP-Tree
Given the growing size of datasets, efficiency and capacity must be considered when attempting to protect privacy. Han et al. developed a compact data structure called the FP-tree [5]. They show that storing a dataset in the form of an FP-tree can reduce file sizes by orders of magnitude. The main purpose of the FP-tree was to identify frequent patterns in transactional datasets. Instead, we use this functionality to identify the frequencies of each Q-block in a privacy context. As such, only columns of the dataset that are considered identifiers (recall Section 2) are used to create the FP-tree.
3.1 FP-Tree Construction
Creating an FP-tree requires two scans of the dataset. The first scan retrieves the frequency of each unique item and sorts that list in descending order. The second scan of the database orders the identifiers of each row according to their frequencies and then appends each identifier to the FP-tree. A sketch of this algorithm is shown in Algorithm 1 below.
Input: A database DB.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier.
2. Sort F in occurrence-descending order as FList, the list of frequent attributes.
3. Create the root of an FP-tree, T, and label it as null. For each row ROW in DB do:
   select the frequent identifiers in ROW and sort them according to the order of FList;
   let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list;
   call insert_tree([p | P], T).

Algorithm 1. Algorithm for FP-Tree Construction [5]
The function insert_tree([p | P], T) is performed as follows: if T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N with its count initialized to 1, its parent link set to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively, as indicated.
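To make the insertion step concrete, the following is a minimal Java sketch of the node structure and insert_tree operation described above. It is illustrative only: the names (FPNode, FPTree, insertTree, headerTable) are our own, not the paper's, and the node-link structure is simplified to a per-item list of nodes.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an FP-tree node: an item name, a count, a parent link, and children.
class FPNode {
    final String itemName;                       // identifier value, e.g. "SIN_5****"
    int count;                                   // occurrences of the path ending here
    final FPNode parent;
    final Map<String, FPNode> children = new HashMap<>();

    FPNode(String itemName, FPNode parent) {
        this.itemName = itemName;
        this.parent = parent;
    }
}

class FPTree {
    final FPNode root = new FPNode(null, null);  // labeled "null" in Algorithm 1
    // Simplified node-link structure: all nodes sharing an item name.
    final Map<String, List<FPNode>> headerTable = new HashMap<>();

    // insert_tree([p | P], T): extend or reuse one branch for a frequency-sorted row.
    void insertTree(List<String> sortedRow, FPNode t) {
        if (sortedRow.isEmpty()) return;
        String p = sortedRow.get(0);
        FPNode n = t.children.get(p);
        if (n == null) {                         // no child with this item-name yet
            n = new FPNode(p, t);
            t.children.put(p, n);
            headerTable.computeIfAbsent(p, k -> new ArrayList<>()).add(n);
        }
        n.count++;                               // shared prefixes accumulate counts
        insertTree(sortedRow.subList(1, sortedRow.size()), n);
    }
}

Inserting the seven rows of Table 1, each sorted by the frequencies in Table 2, would produce the tree of Figure 2.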
3.1.1 Example of FP-Tree Construction
To create the FP-tree of Table 1, the first three columns are labeled as the identifiers and the last column is considered the sensitive attribute. The database used to create the FP-tree only includes the portion of Table 1 containing the identifying columns. Table 2 contains the sorted list of items based on their frequencies within the dataset. Following step 3 of Algorithm 1 results in the FP-tree shown in Figure 2.
Table 2. Frequency of Each Identifier within Table 1

Identifier         Frequency
SIN_5****          5
Address_Manitoba   3
DOB_*              3
Address_Alberta    2
DOB_1984           2
SIN_1234*          2
Address_Ontario    2
DOB_19**           2
[Figure 2 shows the FP-tree built from Table 1: under the root, one branch Add_Alberta:2 → DOB_1984:2 → SIN_1234*:2, and a second branch SIN_5****:5 splitting into Add_Ontario:2 → DOB_19**:2 and Add_Manitoba:3 → DOB_*:3.]

Fig. 2. FP-Tree of Table 1
3.2 Privacy FP-Tree Construction
This paper extends Algorithm 1 to develop the Privacy FP-tree. Using the FP-tree allows us to find privacy properties related to identifiers. However, sensitive values are a crucial part of any privacy definition. To account for this, sensitive values must be appended to the FP-tree. Observe in Figure 2 that each leaf node of the FP-tree represents one unique q-block within the dataset. Appending a list of sensitive values to the end of each leaf node allows the sensitive values to be associated with the correct q-block. In cases where the dataset contains multiple sensitive values, each column of sensitive values is stored in its own linked list at the end of each leaf node. The amendment to Algorithm 1 is as follows:
Input: A database DB, the columns of identifiers, and the columns of sensitive values.
Output: Privacy FP-tree, the frequent-pattern tree of DB with its associated sensitive values.
Method: The Privacy FP-tree is constructed as follows.
1. Scan the database DB once to collect the set of frequent identifiers (F) and the number of occurrences of each frequent identifier.
2. Sort F in occurrence-descending order as FList, the list of frequent attributes.
3. Create the root of an FP-tree, T, and label it as null. For each row ROW in DB do:
   select the frequent identifiers in ROW and sort them according to the order of FList;
   let the sorted frequent-identifier list in ROW be [p | P], where p is the first element and P is the remaining list;
   if P is null (p then represents the leaf node of ROW), let the sensitive values of the row be [s | S], where s is the first and S are the remaining sensitive values, and call insert_sensitive(p, [s | S]);
   call insert_tree([p | P], T).

Algorithm 2. Algorithm for Privacy FP-Tree Construction
The function insert_sensitive(p, [s | S]) is performed as follows: if p has a linked-list of type³ s, search through that linked-list for an element N such that N.item-name = s.item-name, then increment N's count by 1; else create a new node N with its count initialized to 1 and append node N to the end of the linked-list. If no linked-list is found, create a new node N with its count initialized to 1 and create a new linked-list for p of type s. If S is nonempty, call insert_sensitive(p, S) recursively, as indicated.
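The following Java sketch shows one way to realize insert_sensitive at a leaf, as a companion to the hypothetical FPNode sketch above; the recursion over the remaining values S is unrolled into a loop, and one counted list is kept per sensitive column (the "type" of footnote 3). The names are again illustrative assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A counted sensitive value, one node of a leaf's linked-list.
class SensitiveEntry {
    final String value;    // e.g. "32k"
    int count;             // rows of this q-block carrying the value

    SensitiveEntry(String value) {
        this.value = value;
        this.count = 1;
    }
}

// Sensitive-value lists attached to one leaf node (one unique q-block).
class PrivacyLeaf {
    // One list per sensitive column ("type"), keyed by column name.
    final Map<String, List<SensitiveEntry>> sensitiveLists = new HashMap<>();

    // insert_sensitive(p, [s | S]): record one row's sensitive values at this leaf.
    void insertSensitive(List<String> types, List<String> values) {
        for (int i = 0; i < types.size(); i++) {
            List<SensitiveEntry> list =
                sensitiveLists.computeIfAbsent(types.get(i), k -> new ArrayList<>());
            boolean found = false;
            for (SensitiveEntry e : list) {
                if (e.value.equals(values.get(i))) {  // existing entry: increment
                    e.count++;
                    found = true;
                    break;
                }
            }
            if (!found) {
                list.add(new SensitiveEntry(values.get(i)));  // new entry, count 1
            }
        }
    }
}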
3.2.1 Example of a Privacy FP-Tree Construction
Input for Algorithm 2 is a table, so we illustrate it with Table 1 by providing address, date of birth, and Social Insurance Number as the identifying columns and income as the sensitive column. The resulting Privacy FP-tree is shown in Figure 3.
4 Determining Privacy
4.1 Anonymized Privacy
4.1.1 Finding K-Anonymity
To determine the k-anonymity of a dataset, the q-block with the minimum number of rows must be located. Using the privacy FP-tree, we represent each q-block by a leaf node. Within each leaf node is a frequency, which is the number of occurrences of the path between the leaf node and the root node. For example, node SIN_1234* has a frequency of 2. The value 2 is the number of times that SIN_1234*, DOB_1984, and Address_Alberta appeared together within the dataset. Using this property of the tree, we traverse through all the leaf nodes. By identifying the minimum value among all the leaf nodes, k-anonymity is determined for the dataset. This minimum is the k of the dataset since no other q-block has fewer than k common rows.
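Assuming the hypothetical FPNode sketch above, with leaves recognized as nodes that have no children, finding k is a single depth-first traversal taking the minimum count over all leaf nodes:

// Sketch: k-anonymity = minimum leaf frequency in the privacy FP-tree.
static int findK(FPNode node) {
    if (node.children.isEmpty()) {
        return node.count;                  // leaf: its count is the q-block size
    }
    int min = Integer.MAX_VALUE;
    for (FPNode child : node.children.values()) {
        min = Math.min(min, findK(child));  // smallest q-block in any subtree
    }
    return min;
}

Called as findK(root) on the tree of Figure 2, this would return 2, matching the k = 2 stated for Table 1.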
4.1.2 Finding l-Diversity
To find the distinct l-diversity of a dataset, the q-block that contains the fewest unique sensitive values must be located. Using the privacy FP-tree, the number of unique sensitive values of a q-block is represented by the length of the linked-list stored in the leaf node. Traversing through each leaf node and storing the minimum length of the linked-lists identifies the distinct l of the dataset. The entropy of a q-block was defined above (3). Within each node of the linked-list are the sensitive value's name and count. p(q, s) is determined by accessing the count of the sensitive value and dividing it by the frequency within the leaf node. Traversing the linked-list of sensitive values for a q-block determines p(q, s) for all sensitive values s in that q-block. Finally, to calculate the entropy of the q-block we sum −p(q, s) · log(p(q, s)) over all sensitive values s. The entropy l of a dataset is determined by the q-block with the lowest entropy. Once again we can determine this by traversing each leaf node to identify the q-block with the lowest entropy.
³ Values of the same type belong to the same sensitive domain. Examples provided in this document assume that values of the same type are within the same column of a dataset.
The l within recursive (c, l)-diversity is the same l as the one found using the distinct l-diversity method. To calculate the c of a q-block, the count of the most frequent sensitive value (i.e., max) must be determined. This can be accomplished by going through the linked-list of the q-block. Recall that formula (4) above captured the properties of recursive (c, l)-diversity. The frequency of the leaf node is equal to the sum of all the counts of the sensitive values. Subtracting the counts of the l − 1 most frequent sensitive values from this frequency yields (rl + rl+1 + ... + rm). The c of a q-block can then be determined as r1 / (rl + rl+1 + ... + rm). Traversing the leaf nodes to find the c of each q-block determines the c for the dataset as a whole. The greatest c among the q-blocks is the c value for the dataset.
[Figure 3 shows the Privacy FP-tree: the identifier tree of Figure 2 with a linked list of sensitive-value counts attached to each leaf. SIN_1234*:2 carries 80k:1 and 85k:1; DOB_19**:2 carries 120k:1 and 123k:1; DOB_*:3 carries 152k:1 and 32k:2.]

Fig. 3. Privacy FP-Tree of Table 1
4.1.3 Finding (α, k)-Anonymity
The method used to find k was described in Section 4.1.1. α can be determined by calculating the max(p(q, s)) within the privacy FP-tree. To find the max(p(q, s)) of a q-block, the sensitive value with the maximum count, max, is returned. Max is then divided by the frequency within the leaf node. Once all the leaf nodes have been traversed, the q-block with max(p(q, s)) is known and that value is the α of the dataset.
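A sketch of the per-q-block α computation under the same assumptions:

import java.util.List;

// Sketch: max p(q, s) for one q-block; the dataset's alpha is the maximum over all leaves.
static double alphaOfQBlock(List<Integer> counts, int leafFrequency) {
    int max = 0;
    for (int s : counts) {
        max = Math.max(max, s);          // count of the most frequent sensitive value
    }
    return (double) max / leafFrequency; // max p(q, s) = max / Q
}

On Table 1, the Manitoba q-block gives 2/3 ≈ 0.667, matching the α in the caption of Table 1.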
4.1.4 Multiple Sensitive Values
While the examples and explanations have only involved datasets with a single sensitive value, multiple sensitive values within a dataset are common. Machanavajjhala et al. [2] define a dataset to be l-diverse on a multi-sensitive table if the table is l-diverse when each sensitive attribute is treated as the sole sensitive attribute. This is easily implemented in our privacy FP-tree. Each sensitive attribute is given its own linked-list within each q-block. By comparing the values calculated from each linked-list within the same q-block and returning only the value required (i.e., min or max), we can determine the correct anonymized privacy values on multi-attribute tables by iterating our algorithm appropriately.
4.2 Personalized Privacy
Prior to determining whether a dataset preserves personalized privacy, a mechanism to represent the taxonomy tree is required. Each node within a level of the taxonomy tree is assigned a unique number. The sequence of numbers from the root to the node is then used as the representation. The conversion of the taxonomy in Figure 1 is shown in Table 3. Null is included to account for the possibility of an individual who has no preference for the amount of information released.
Table 3. Numeric Representation of the Taxonomy Tree in Figure 1

Node         Representation
ANY_EDU      1
High School  1,1
University   1,2
Jr. High     1,1,1
Sr. High     1,1,2
Undergrad    1,2,1
Graduate     1,2,2
Grade 7      1,1,1,1
Grade 8      1,1,1,2
Grade 9      1,1,1,3
Grade 10     1,1,2,1
Grade 11     1,1,2,2
Grade 12     1,1,2,3
Masters      1,2,2,1
Doctoral     1,2,2,2
Null         Null
For datasets using personalized privacy, there are at least two sensitive columns. One column contains the sensitive values, which are going to be published, while the other contains the guarding nodes of each row. After building the privacy FP-tree, leaf nodes within the tree will contain two linked-lists corresponding to the two columns. By analyzing these two linked-lists we can determine if privacy is violated on a q-block.

To analyze the sensitive values and the guarding nodes we first convert both lists to their numerical representations. The sensitive values and guarding nodes are then passed to Algorithm 3. If any q-block violates the privacy of an individual, then the table itself violates the personalized privacy constraint.
Input: A list of sensitive values S for a q-block and a list of guarding nodes G for the same q-block.
Output: Boolean indicating whether the q-block preserves privacy.
Method: Each guarding node is checked against the published sensitive values as follows.
for each guard in G
    for each data in S
        if (data.length() < guard.length())
            guard satisfied
        else if (data != guard for the length of the guard)
            guard satisfied
        else
            guard is violated
if all guards are satisfied
    then the q-block preserves privacy
else
    privacy is violated

Algorithm 3. Algorithm for determining personalized privacy violations
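The core of Algorithm 3 is the comparison of one published value against one guarding node. A minimal Java sketch follows, with taxonomy paths encoded as int arrays per Table 3 (for example, Jr. High as {1,1,1} and Grade 9 as {1,1,1,3}); the method name and the convention of encoding a Null guard as null are illustrative assumptions.

// Sketch: does one published sensitive value satisfy one guarding node?
static boolean guardSatisfied(int[] data, int[] guard) {
    if (guard == null) {
        return true;                 // Null guard: no privacy preference stated
    }
    if (data.length < guard.length) {
        return true;                 // data is more general than the guard
    }
    for (int i = 0; i < guard.length; i++) {
        if (data[i] != guard[i]) {
            return true;             // paths diverge: data lies outside the guard's subtree
        }
    }
    return false;                    // data equals or refines the guard: violated
}

For example, guardSatisfied(new int[]{1}, new int[]{1, 1}) is true (publishing ANY_EDU against a High School guard), while guardSatisfied(new int[]{1, 1, 1, 3}, new int[]{1, 1, 1}) is false (publishing Grade 9 against a Jr. High guard).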
4.2.1 Discussion of Algorithm 3
A guard node is satisfied if a sensitive value within the q-block is higher on the taxonomy tree than the guard. In this situation an adversary cannot predict, with high confidence⁴, at a level of detail the same as the guard node. For example, suppose a guard node was set at High School and within the q-block there existed a sensitive value ANY_EDU. The length of the guard node (High School: 1,1) would be 2 and the length of the sensitive value (ANY_EDU: 1) would be 1. In this situation the guard node would be satisfied because an adversary would not be able to predict this education level with high confidence. Secondly, a guard node is satisfied if there is a sensitive value that does not share a common path to the root node with the guard. For example, if a guard node was Grade 9 and there was a sensitive value Masters being published, then their respective numerical representations would be 1,1,1,3 and 1,2,2,1. Any difference in the values will prevent an adversary from predicting the sensitive value with high (100%) confidence.
5 Experiments
5.1 Environment
Experiments were completed using a Java JRE 1.6 implementation. The hardware was a Quad Core Q6600 @ 2.40 GHz with 1024 MB of memory allocated exclusively to the Eclipse platform for running the Java virtual machine. The datasets used for the experiments were variations of the Adult dataset provided by the UC Irvine Machine Learning Repository. In order to create a larger dataset for analysis we analyzed multiple copies of the Adult dataset as one single file.

⁴ High confidence is defined as 100% in Hansell [11].
Fig. 4. Time to Process Dataset of Varying Sizes
5.2 Scalability
The first experiment determined the time required to analyze the privacy of datasets of different sizes. In this experiment the number of identifiers and sensitive columns remained constant while only the number of rows increased. Maintaining the same number of unique values meant the size of the privacy FP-tree would remain constant; only the counts within each node would differ. Figure 4 shows the results of the experiment.

Figure 4 shows that there was linear growth in the time to create the privacy FP-tree and determine its anonymized privacy properties versus the number of rows within the dataset. The linear growth was a result of the increasing cost of scanning through datasets of larger sizes. Since the privacy FP-tree was the same size among all the datasets, the time to determine the anonymized privacy properties was constant.
The second experiment investigated the time required to determine the privacy properties of varying privacy FP-trees. This experiment analyzed datasets that varied from one unique q-block to 1000 unique q-blocks. The size of the dataset remained constant at 1000 rows. The results of the experiment showed a negligible cost in running time for the various privacy FP-trees. Each dataset took approximately 5 seconds to complete and there was less than a 1% difference between the times. The initial overhead for file I/O and class instantiation required the most resources in the entire process.

These two experiments have shown that the privacy FP-tree is a practical solution capable of verifying and determining the privacy characteristics of a dataset. While no experiments are reported here to verify the personalized privacy aspect of the paper, the cost of determining that property is similar to the cost of determining anonymized privacy.
5.3 Complexity
This approach is readily split into two sections for complexity analysis. The first section is the creation of the privacy FP-tree and the second is the cost to determine the privacy properties of the dataset. The creation of the privacy FP-tree requires exactly two scans of the database, or O(n). Prior to the second database scan, the frequencies of n items must be sorted in descending order. The sorting algorithm implemented was a simple merge sort with O(n log n). Thus, the cost of creating the privacy FP-tree is O(n) + O(n log n), so the total complexity is O(n log n). To determine the privacy properties, our algorithm must traverse all q-blocks. At each q-block it calculates the privacy properties by looking at the sensitive values. In the worst-case scenario, the cost is O(n), where each row is a unique q-block. Therefore, the overall cost to complete our approach is O(n log n).
6 Comparison to Some Directly Related Work
The work by Atzori et al. [9] focused on identifying k-anonymity violations in the context of pattern discovery. The authors present an algorithm for detecting inference channels. In essence this algorithm identifies frequent item sets within a dataset consisting of only quasi-identifiers. This algorithm has exponential complexity and is not scalable. The authors acknowledge this and provide an alternate algorithm that reduces the portion of the dataset that needs to be analyzed. This optimization reduces the running time of their algorithm by an order of magnitude. In comparison, our approach provides the same analysis with a reduced running cost and more flexibility by allowing sensitive values to be incorporated into the model. Friedman et al. [1] expanded the existing k-anonymity definition beyond the release of a data table. They define k-anonymity based on the tuples released by the resulting model instead of the intermediate anonymized data tables. They provide an algorithm to induce a k-anonymous decision tree. This algorithm includes a component to maintain k-anonymity within the decision tree consistent with our definition of k-anonymity. To determine whether k-anonymity is breached in the decision tree, the authors have chosen to pre-scan the database and store the frequencies of all possible splits. Our algorithm would permit on-demand scanning of the privacy FP-tree to reduce the cost of this step.
7 Future Work and Conclusions

The main purpose of the privacy FP-tree is as a verification tool. This approach can be extended to ensure that datasets are stored in a manner that preserves privacy. Rather than storing information in tables, it can be automatically stored in a privacy FP-tree format. Thus, permissions can be developed to only allow entries into the privacy FP-tree if privacy is preserved after insertion. We intend to explore the possibility of implementing a privacy database system in which the privacy FP-tree data structure is used to store all information.

Research must also be done on the effects of adding, removing, and updating rows in the published dataset. Since altering the dataset also alters the structure of the privacy FP-tree, it requires us to rerun our algorithm to determine the privacy of the new dataset. Methods must be developed such that altering the dataset will not require the privacy FP-tree to be rebuilt. Rebuilding the privacy FP-tree is the costliest portion of our algorithm, as shown through our experiments, so further efficiencies will prove important.
In this paper we have shown that the privacy afforded by a database can be determined in an efficient and effective manner. The privacy FP-tree allows the storage of the database to be compressed while accounting for privacy principles. We have shown how k-anonymity, l-diversity, and (α, k)-anonymity can be correctly determined using the privacy FP-tree. We acknowledge that many other similar algorithms exist that merit future consideration. Furthermore, this approach can be used to verify whether a dataset preserves personalized privacy. Through our experiments and complexity analysis we have shown that this approach is practical and an improvement on current methods.
References

1. Friedman, A., Wolff, R., Schuster, A.: Providing k-anonymity in data mining. The VLDB Journal, 789–804 (2008)
2. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: Proc. 22nd Intl. Conf. on Data Engineering (ICDE), p. 24 (2006)
3. Narayanan, A., Shmatikov, V.: Robust De-anonymization of Large Datasets, February 5 (2008)
4. Dwork, C.: An Ad Omnia Approach to Defining and Achieving Private Data Analysis. In: Proceedings of the First SIGKDD International Workshop on Privacy, Security, and Trust in KDD
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Chen, W., et al. (eds.) Proc. Intl. Conf. on Management of Data, pp. 1–12 (2000)
6. Sweeney, L.: k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5), 557–570 (2002)
7. Sweeney, L.: Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics 25(2–3), 98–110 (1997)
8. Willenborg, L., De Waal, T.: Statistical Disclosure Control in Practice. Springer, Heidelberg (1996)
9. Atzori, M., Bonchi, F., Giannotti, F., Pedreschi, D.: Anonymity preserving pattern discovery. The VLDB Journal, 703–727 (2008)
10. Wong, R., Li, J., Fu, A., Wang, K.: (α, k)-Anonymity: An Enhanced k-Anonymity Model for Privacy Preserving Data Publishing. In: KDD (2006)
11. Hansell, S.: AOL removes search data on vast group of web users. New York Times (August 8, 2006)
12. Xiao, X., Tao, Y.: Personalized Privacy Preservation. In: SIGMOD (2006)
13. Chin, F.Y., Ozsoyoglu, G.: Auditing and inference control in statistical databases. IEEE Trans. Softw. Eng. SE-8(6), 113–139 (1982)
14. Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM TODS 10(3), 395–411 (1985)