
CHAPTER II - Shodhganga (shodhganga.inflibnet.ac.in/bitstream/10603/6873/7/07_chapter 2.pdf)

CHAPTER II

REVIEW OF LITERATURE

Databases keep growing rapidly because powerful and affordable database systems are widely available. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area of increasing importance. Designing an effective data mining technique requires several issues to be taken into account, such as the types of data, the efficiency and scalability of data mining algorithms, the usefulness of the results, the different sources of data, and the protection of privacy and data security.

Privacy-preserving data mining has received considerable attention because of recent concerns about the privacy of the underlying data.

Privacy-preserving data mining considers the problem of running data mining algorithms on large databases of confidential data that is not supposed to be revealed, even to the party running the algorithm. The sharing of data and/or knowledge may come at a cost to privacy. If the data contains business (or organizational) information, then the disclosure of this data, or of any knowledge extracted from it, may reveal sensitive trade secrets that give competitors a significant advantage and cause the data holder to lose business.

When knowledge is shared, privacy can be achieved during association rule mining in different ways using various methods.

Several approaches have been proposed in the research literature to offer privacy in data mining. The majority of the proposed approaches can be classified along two principal research directions: (1) data hiding approaches and (2) knowledge hiding approaches. The data hiding approaches aim to remove confidential or private information from the original data prior to its disclosure by applying techniques such as perturbation, sampling, generalization or suppression, and transformation using cryptographic techniques. The knowledge hiding approaches aim to protect the sensitive data mining results, which are produced by applying data mining tools to the original database, rather than the raw data itself. This direction mainly deals with heuristic, reconstruction and blocking techniques.

Privacy-preserving distributed data mining is a multidisciplinary field

and requires close cooperation between researchers and practitioners from the

fields of cryptography, data mining, public policy and law. Now, the question is

how to compute the results without pooling the data in a way that reveals

nothing but the final results of the data mining computation. This question of

privacy-preserving data mining is actually a special case of a long-studied

problem in cryptography called secure multiparty computation. This problem

deals with a setting where a set of parties with private inputs wishes to jointly

compute some function of their inputs. This joint computation should have the

property that the parties learn the correct output and nothing else, even if some

of the parties maliciously collude to obtain more information. Clearly, a

protocol is needed to solve privacy-preserving distributed data mining

problems.

There are four basic cryptography-based methods for privacy-preserving distributed data mining: Secure Sum, Secure Set Union, Secure Size of Set Intersection, and Scalar Product. Secure sum is a simple example of secure multiparty computation and is used when three or more parties need to compute a sum securely, under the assumption that no collusion occurs. Secure set union methods are useful in data mining when each party needs to contribute rules, frequent item sets and so on without revealing their owner. When several parties each hold their own set of items from a common domain, the Secure Size of Set Intersection method is used to securely compute the cardinality of the intersection of the parties' local sets. Scalar product is a powerful component technique that can be used to solve data mining problems by securely computing the scalar product of two vectors. These techniques are mainly used with different data distribution models, such as horizontal and vertical data distributions, to obtain data mining results without violating privacy constraints.
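The secure sum idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: all parties are simulated inside one process, the ring order is simply the list order, and the function name `secure_sum` and the `modulus` parameter are hypothetical choices for this illustration. The masking scheme shown (the initiator adds a random value, every site adds its own value, and the initiator removes the mask at the end) is the commonly described ring-based variant and relies on the no-collusion assumption mentioned in the text.

```python
import secrets


def secure_sum(local_values, modulus=2**32):
    """Simulate a ring-based secure sum among three or more sites.

    Assumption: the true total is smaller than `modulus`, and no two
    sites collude to isolate a neighbour's value.
    """
    if len(local_values) < 3:
        raise ValueError("secure sum needs at least three parties")

    # Site 0 (the initiator) hides its value behind a random mask.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus

    # Each remaining site only ever sees a masked running total,
    # adds its own value, and passes the result on.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The total returns to the initiator, which removes the mask.
    return (running - mask) % modulus


if __name__ == "__main__":
    print(secure_sum([5, 12, 7]))
```

Because every intermediate total is offset by a uniformly random mask, a single site learns nothing about the values added before it; two neighbouring sites that collude could, however, recover the value of the site between them, which is why the method assumes no collusion.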

Association rule mining was first introduced in 1993. Since its

inception, association rule mining has become one of the core data-mining tasks

and has attracted tremendous interest among researchers and practitioners.

Privacy preserving association rule mining is one of the most popular pattern discovery methods in the new and rapidly emerging research area of privacy preserving data mining. Several privacy-preserving techniques for association rule mining have been proposed in recent years. Some approaches and algorithms have been developed for centralized data, while others address a distributed data scenario. Distributed data scenarios can be further classified as horizontal, vertical and mixed data distribution. The approaches for privacy preserving association rule mining fall into three categories: heuristic-based techniques, reconstruction-based techniques, and cryptography-based techniques.
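To make the horizontal and vertical distribution models concrete, the sketch below shows how the support count of an item set would be assembled in each case if privacy were ignored altogether. The function names and the toy data are assumptions made for this illustration only. Under horizontal distribution each site holds complete transactions, so the global count is a sum of local counts; under vertical distribution each site holds one attribute column for the same transactions, so the count of a two-item set is the scalar product of two binary columns.

```python
def horizontal_support_count(site_transactions, itemset):
    """Global support count when each site holds whole transactions."""
    itemset = frozenset(itemset)
    return sum(
        sum(1 for transaction in site if itemset <= transaction)
        for site in site_transactions
    )


def vertical_support_count(column_a, column_b):
    """Support count of {A, B} when one site holds the 0/1 column for A
    and another site holds the 0/1 column for B (same row order)."""
    return sum(a * b for a, b in zip(column_a, column_b))


if __name__ == "__main__":
    # Two sites, each with its own complete transactions.
    sites = [[{"x", "y"}, {"x"}], [{"x", "y", "z"}]]
    print(horizontal_support_count(sites, {"x", "y"}))

    # Two sites, each with one attribute of the same four transactions.
    print(vertical_support_count([1, 0, 1, 1], [1, 1, 0, 1]))
```

This is why a secure sum fits the horizontal case and a secure scalar product fits the vertical case: they compute exactly these two quantities without the sites exchanging their raw data.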

The following sections review the literature related to data mining, association rule mining, and privacy preserving data mining in centralized as well as distributed database environments, with emphasis on privacy preserving association rule mining in both centralized and distributed environments.

2.1 Data Mining

In [6], the requirements and challenges of data mining are studied, such as the handling of different types of data; the efficiency and scalability of data mining algorithms; the usefulness, certainty and expressiveness of data mining results; the expression of various kinds of data mining requests and results; interactive mining of knowledge at multiple abstraction levels; mining information from different sources of data; and the protection of privacy and data security. A comprehensive overview of recently developed data mining techniques is given in light of these requirements and challenges, with the aim of understanding user behavior, improving services and increasing business opportunities. A classification of the available data mining techniques and a comparative study of each technique are also discussed.

The different tasks of data mining and the various applications where data mining techniques can be used are addressed by the authors in [7].

In [8], the authors provided an overview of the tasks involved in a knowledge discovery system and the approaches to solving these tasks. They also described the software tools available for knowledge discovery tasks and proposed a feature classification scheme that can be used to study knowledge discovery and data mining software tools. Software tools are classified based on their general characteristics, database connectivity and data mining characteristics. The authors further investigated 43 software products, some of which are research prototypes and some commercial packages. From their analysis, they specified the features that knowledge discovery software should have in order to accommodate novice users as well as experienced analysts effectively, and discussed the issues that have not yet been addressed or solved.

A comprehensive review of classification techniques in data mining is presented in [9], where techniques such as decision tree induction, Bayesian networks and k-nearest neighbor are discussed along with their algorithms. The authors also presented case-based reasoning, genetic algorithms and fuzzy logic techniques with suitable examples.

Distributed data mining has attracted the attention of researchers because of its potential in dealing with distributed data and its performance advantages. A basic introduction to distributed data mining and the issues to be considered are presented in [10]. The author also addressed the various problems that exist in distributed data mining, its significance, and its progress with respect to classification, association and clustering techniques.

2.2 Association Rule Mining

R. Agrawal et al. proposed an algorithm [11] for finding association rules between items or item sets in a database of transactions, where each transaction consists of the items purchased by a customer in one visit. The authors incorporated buffer management and novel estimation and pruning techniques into the proposed algorithm to find significant association rules. They presented a formal definition of association rules and algorithms for finding large item sets as well as association rules. The results of applying the proposed algorithm to market basket data obtained from a large retailing company are presented, and the authors showed that the proposed algorithm is effective.
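The two standard measures behind an association rule X => Y, support and confidence, can be computed directly from a transaction list. The sketch below is an illustration only; the function names and the toy market basket data are assumptions and do not come from [11].

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    # Rule {bread} => {milk}
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule is usually reported only when both values exceed user-chosen minimum thresholds, which is what makes the search for large (frequent) item sets the expensive part of the task.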

Two new algorithms, Apriori and AprioriTid, are discussed for mining association rules; they are fundamentally different from the previously known algorithms. From empirical evaluation the authors showed that the new algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. They combined the features of the two algorithms into a new algorithm called AprioriHybrid, which scales linearly with the number of transactions and the number of items in the database [12]. The authors of [13] discussed generalization in association rule mining by defining hierarchies over items in order to capture interesting rules at all levels of multiple hierarchies. They developed a method for finding association rules in large databases with a generalization hierarchy, and conducted experiments with a supermarket dataset to show the method's effectiveness.
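The level-wise idea behind the Apriori family mentioned above can be sketched as follows. This is a minimal, unoptimized illustration and not the implementation evaluated in [12]: it omits AprioriTid's encoded transaction lists and the hybrid switch-over, and the function name `apriori` and the absolute threshold `min_count` are assumptions made here. The key step is that a candidate of size k is kept only if all of its (k-1)-item subsets were frequent in the previous pass.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for transaction in transactions:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        previous = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Count the surviving candidates with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, count in sorted(apriori(data, 3).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(itemset), count)
```

The pruning step is what keeps the number of candidates counted in each pass small compared with enumerating every possible item combination.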

In [14], the author presented a survey of large-scale parallel and distributed data mining algorithms and systems. The research issues and challenges that must be overcome in designing and implementing successful tools for large-scale data mining are also discussed.

A pattern decomposition algorithm is proposed in [15] to reduce the size of the dataset on each pass while mining all frequent patterns in a large database. This algorithm minimizes the cost of generating candidate sets and also saves a great amount of counting time. The empirical evaluation showed that the algorithm outperforms Apriori by one order of magnitude and is more scalable.

The authors in [16] discussed the basic concepts of association rule mining and existing association rule mining techniques. They also discussed ways of improving the efficiency of association rule mining algorithms, such as reducing the number of passes over the database, sampling the database, adding extra constraints on the structure of patterns, and parallelization. The authors also presented the categories of databases to which association rules are applied and the progress made in recent years in this area, such as redundant association rules, rare association rules, generalized association rules and negative association rules. Other measures of the interestingness of an association are also addressed in this article.

In [17], the authors studied association rule mining and presented an improved algorithm based on support, confidence and interestingness, in which the interestingness measure is used to filter out rules that are not useful. Discarding useless rules yields more reasonable association rules, including rules with negative items. Based on this, an algorithm was developed and applied to the 2002 student score list of the computer specialized field at Inner Mongolia University of Science and Technology to mine association rules.

2.3 Privacy Preserving Data Mining

In 1996, Clifton et al. [18] presented a number of ideas for protecting the privacy of individuals in a database. The authors provided examples indicating that applying data mining algorithms to a database can reveal critical information to business rivals. Clifton [19] presented a technique to prevent the disclosure of sensitive information by releasing only samples of the original data. This technique is applicable independently of the specific data mining algorithm to be used. Clifton et al. [20] introduced some definitions for PPDM and discussed some metrics for information disclosure in data mining. In [21], the authors defined privacy preserving data mining (PPDM) as data mining methods that have to meet two targets: (1) meeting privacy requirements and (2) providing valid data mining results. They described the problems in defining what information is private in data mining and discussed how privacy can be violated in data mining. Privacy preservation in data mining based on users' personal information and on information concerning their collective activity is also addressed.

The authors in [22] proposed a classification of privacy preserving data mining techniques along different dimensions: data distribution, data modification, data mining algorithm, data or rule hiding, and privacy preservation. They also discussed the various methods that exist in each class along each dimension. The existing methodologies are discussed for different privacy preserving data mining techniques such as classification, association rule mining and clustering. They evaluated the algorithms related to heuristic-based, cryptography-based and reconstruction-based techniques for different data mining techniques.

The article [23] shows how technology from the security community can change data mining for the better, providing all its benefits while still maintaining privacy. The authors discussed existing privacy preserving algorithms for centralized as well as distributed environments. For centralized database applications, they discussed existing methodologies such as de-identification, data perturbation, randomization and reconstruction-based techniques with suitable examples. The basic idea in distributed applications is that the parties hold their own data but cooperate to obtain the final result; the authors provided a solution using the secure multiparty computation concept. They discussed the advantages and drawbacks of various secure sum computation methods.

The state of the art in privacy preserving data mining (PPDM) techniques is discussed by the authors in [24]. The authors presented a classification of privacy preserving techniques along five dimensions: data distribution, data modification, data mining algorithm, data or rule hiding, and privacy preservation. They also discussed heuristic-based methodologies for classification, association rule mining and clustering, as well as cryptography-based techniques for association rule mining and classification over vertically and horizontally partitioned databases in a distributed environment. A solution to the privacy preserving clustering problem using the expectation-maximization algorithm is also discussed. They also studied reconstruction-based techniques for binary and categorical data.

An overview of the state-of-the-art in PPDM and some current

suggestions for proceeding towards standardization in PPDM are summarized in

[25]. This is followed by considerations of how PPDM could be improved

based on the European Directive 95/46/EC, additionally taking into account

procedural and process-related considerations.

The aim of PPDM algorithms is to extract relevant knowledge from large amounts of data while protecting sensitive data or information. Several existing data mining techniques that incorporate privacy protection mechanisms, such as association rule mining, classification and clustering, are discussed in [26]. An important aspect discussed is the choice of suitable algorithms, for the various data mining techniques, that protect sensitive data or information by modifying the original database before releasing it to the intended parties. The authors also presented a comprehensive set of criteria for existing PPDM algorithms, which helps the designer determine which algorithm meets specific requirements. They defined parameters such as efficiency, scalability, level of privacy, data quality and hiding failure, then evaluated a set of association rule hiding algorithms against these parameters and showed the quality and performance of each methodology.

The authors in [27] proposed a classification of PPDM techniques along different dimensions: data distribution, data modification, data mining algorithm, data or rule hiding, and privacy preservation. They also discussed the various methods that exist in each class along each dimension. The existing methodologies are discussed for different privacy preserving data mining techniques such as classification, association rule mining and clustering. They evaluated the algorithms related to heuristic-based, cryptography-based and reconstruction-based techniques for different data mining techniques.

The problem of protecting sensitive knowledge in large databases is addressed in [28]. The authors introduced an efficient algorithm, called the Sliding Window Algorithm (SWA), that improves the balance between protection of sensitive knowledge and pattern discovery. The experimental results revealed that SWA is effective and can achieve significant improvement over previous approaches.

In [29], the authors proposed a privacy-preserving distributed association rule mining protocol based on a new semi-trusted mixer model, which can protect the privacy of each distributed database against a coalition of up to n − 2 other data sites, or even against the mixer, provided the mixer does not collude with any data site.

The authors in [30] addressed PPDM technique and presented national

security applications where privacy is the main concern. They viewed the

privacy problem as a form of inference problem and introduced the notion of

privacy constraints. They described an approach for privacy constraint

processing. Finally, some directions for future research on privacy related to

data mining are presented.

The authors surveyed the current state of the art in Statistical Disclosure Control methods for protecting individual data (microdata) [31]. A classification of microdata protection methods into perturbative masking methods, non-perturbative masking methods and synthetic microdata generation is presented. They discussed several information loss and disclosure risk measures and then analyzed several ways of combining them to assess the performance of the various methods. Perturbative methods include additive noise, microaggregation, rank swapping, rounding, resampling and so on. Non-perturbative methods include sampling, global recoding, top coding, bottom coding and local suppression; they do not rely on distortion of the original data but on partial suppressions or reductions of detail, and are presented for categorical as well as continuous data.

The authors in [32] emphasized the important aspects such as

identification of suitable evaluation criteria and the development of related

benchmarks required in the design of privacy preserving data mining

algorithms. In this article, they also discussed issues related to recent research in

the privacy preserving data mining to balance the trade-off between the right to

privacy and the need of knowledge discovery. From their analysis, they pointed

out that no privacy preserving algorithm exists that outperforms all the others on

all possible criteria and therefore they provided a comprehensive view on a set

of metrics related to existing privacy preserving algorithms. These metrics can

be used to evaluate the privacy preserving techniques.

The state of the art in privacy preserving data mining techniques was provided in [33]. The authors addressed methods for privacy preserving data mining such as randomization, k-anonymization and distributed privacy-preserving data mining. The computational and theoretical limits associated with privacy preservation over high-dimensional data sets were also presented. The need to sanitize the output of data mining applications for privacy preservation is also discussed.

In [34], the authors set out several privacy preserving data mining technologies clearly and then analyzed the merits and shortcomings of these technologies. They stated the concepts involved in discovering knowledge from large databases for various privacy preserving data mining techniques such as k-anonymity, the perturbation approach, cryptographic techniques, randomized response techniques and the condensation approach. They also illustrated the working of each method with a suitable database.
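Of the techniques listed above, randomized response is the easiest to show in a few lines. The sketch below follows the classical scheme in which each respondent reports a sensitive yes/no value truthfully with probability p and reports the opposite otherwise; the miner never sees true values but can still estimate the overall proportion. The function names, the parameter `p` and the simulated population are assumptions for this illustration and are not taken from [34].

```python
import random


def randomize(bit, p, rng):
    """Report the true bit with probability p, otherwise its complement."""
    return bit if rng.random() < p else 1 - bit


def estimate_proportion(reports, p):
    """Recover the proportion of true 1s from the randomized reports.

    Observed rate = p * pi + (1 - p) * (1 - pi), solved here for pi.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information about the data")
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


if __name__ == "__main__":
    rng = random.Random(7)
    # Simulated population in which 30% hold the sensitive attribute.
    truth = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
    reports = [randomize(bit, 0.8, rng) for bit in truth]
    print(round(estimate_proportion(reports, 0.8), 3))
```

The choice of p controls the trade-off the chapter returns to repeatedly: values close to 0.5 give each individual stronger deniability but make the estimate noisier, while values close to 1 do the reverse.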

Aris Gkoulalas-Divanis et al. [35] discussed the two broad categories of privacy preserving data mining, which prevents leakage of private and sensitive information when data or information is shared with many people. The authors discussed the privacy issues with microdata and presented existing methodologies such as data modification approaches and synthetic data generation approaches. Existing methodologies for privacy preserving data mining in distributed and collaborative environments are discussed using secure multiparty computation schemes, particularly for horizontal and vertical data distribution. The second category, which protects sensitive mined results in association rule mining, is termed association rule hiding and is studied along three principal directions: heuristic approaches, border-based approaches, and exact approaches. The merits and demerits of each approach are also analyzed, which encourages researchers to study them further.

The authors of [36] presented the goals of privacy preserving data mining and discussed its main classes: privacy-preserving association rule mining, privacy-preserving classification mining and privacy-preserving clustering mining.

Y. Lindell et al. studied the basic paradigms and notions of secure multiparty computation and discussed their relevance to the field of privacy-preserving data mining [37]. They described some simple protocols that are often used as basic building blocks, or primitives, of secure computation protocols, such as oblivious transfer and oblivious polynomial evaluation, which are two-party protocols, and homomorphic encryption, which is an encryption system with special properties. The authors discussed the relationship between secure multiparty computation and privacy-preserving data mining, and showed which problems it solves and which it does not. They also addressed generic protocols that implement secure computation for any probabilistic polynomial-time function, and explained that the protocols differ between the two-party scenario and the multiparty scenario with m > 2 parties. They also highlighted common errors that may occur in secure protocols, to inform researchers who design such protocols.

The wide availability of personal data has made privacy preserving data mining an important issue, especially in finding association rules. The authors of [38] addressed the issue of protecting the data before it is published and categorized the existing methodologies into two groups: k-anonymity and probability-based methodologies. They analyzed several existing privacy preserving data mining techniques and set out the merits and demerits of each one.

Privacy preserving association rule mining is widely used in many real

applications. The following section discusses the earlier work related to privacy

preserving association rule mining in different database environments like

centralized and distributed.

2.4 Privacy Preserving Association Rule Mining

Data hiding and knowledge hiding are two research directions that investigate how the privacy of raw data, or information, can be maintained either before or after the course of mining association rules. Focusing on the knowledge hiding thread, the authors presented a taxonomy and surveyed recent approaches that have been applied to the association rule hiding problem [39]. They also provided a thorough comparison of the surveyed approaches, including those used for other data mining tasks, focusing on their metrics. The metrics used to evaluate the performance of the approaches are also presented in this article.

Privacy preserving data mining is a novel research area that aims to protect sensitive knowledge from disclosure. The authors presented a detailed overview and classification of approaches that can be applied to knowledge hiding in the context of association rule mining [40]. Evaluation metrics that can be used to assess the performance of various hiding algorithms are also presented in this article.
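As a rough illustration of what knowledge hiding means in practice, the sketch below lowers the support of one sensitive item set until it falls under the mining threshold, by deleting a single item from supporting transactions. This is a deliberately naive support-reduction heuristic written for this illustration; it is not one of the surveyed algorithms, and the function names, the choice of victim item and the absolute threshold `min_count` are all assumptions.

```python
def count_support(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for transaction in transactions if itemset <= transaction)


def hide_itemset(transactions, sensitive, min_count):
    """Return a sanitized copy in which `sensitive` is no longer frequent.

    One item of the sensitive set is removed from supporting transactions,
    one transaction at a time, until the support drops below `min_count`.
    """
    sensitive = frozenset(sensitive)
    sanitized = [set(t) for t in transactions]  # leave the original intact
    victim = sorted(sensitive)[0]  # arbitrary but deterministic choice

    for transaction in sanitized:
        if count_support(sanitized, sensitive) < min_count:
            break  # already hidden, stop modifying data
        if sensitive <= transaction:
            transaction.discard(victim)
    return sanitized


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
    clean = hide_itemset(db, {"a", "b"}, min_count=2)
    print(count_support(db, frozenset({"a", "b"})))
    print(count_support(clean, frozenset({"a", "b"})))
```

Every deletion also lowers the support of non-sensitive item sets that contained the victim item, which is the side effect that evaluation metrics for hiding algorithms are designed to measure.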

Various proposed methodologies and algorithms for privacy preserving association rule mining are summarized in [41], together with an analysis of the advantages and disadvantages of each methodology. The authors classified the methodologies into three groups, namely heuristic-based, reconstruction-based and cryptography-based techniques, and discussed each one. In [42], an overview of knowledge hiding methodologies related to classification, clustering and sequence discovery is given.

After the events of September 11, 2001, considerable attention has been paid in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, together with an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. The authors in [43] claimed that prospective data mining could be used to find the “signature” of terrorist cells embedded in larger networks. In this article they focused on the matching problem across databases, the concept of “selective revelation”, and their confidentiality implications.

The authors in [44] presented a survey of the existing privacy preserving data mining approaches, along with details showing how to develop specific solutions within each. They studied privacy preserving data mining techniques for both predictive and descriptive models. Different data partitioning methods, namely homogeneous and heterogeneous partitioning, are addressed. The protocols for privacy preservation in different data mining techniques are broadly classified into two categories, data perturbation and secure multiparty computation, each of which may be further divided into techniques such as cryptographic techniques, k-anonymity, blocking, randomization and so on. For each protocol, the authors analyzed the performance to show its effectiveness in terms of security, computation and communication complexity.

A survey of privacy protection in distributed environments covering four
cryptography-based techniques is given in [45]. The authors described a
privacy-preserving technique for learning Bayesian networks over a database
vertically partitioned between two sites. Three privacy-preserving data mining
techniques for a fully distributed setting are also presented.

An overview of distributed data mining applications and algorithms for
peer-to-peer environments is given in [46]. The authors addressed the problems
of existing privacy-preserving multi-party data mining techniques, offered a
more realistic formulation of the PPDM problem as a multi-party game, and
discussed some recent results.

Earlier work related to association rule mining in centralized as well as
distributed database environments is reviewed in the following sections.

2.5 Privacy Preserving Association Rule Mining in Centralized Database Environment

The approaches to privacy-preserving association rule mining are heuristic,
border-based, and exact. Earlier research on heuristic approaches is presented
below:

Extensive research in statistics [47,48] has addressed how to provide
statistical information without compromising sensitive information about
individuals, which is also the concern of privacy-preserving association rule
mining. Evfimievski et al. presented a framework for preserving privacy using
randomization [49]. They analyzed the randomization technique and the privacy
breaches that can occur under it, proposed a class of randomization operators,
and showed that these operators limit breaches more effectively than uniform
randomization. Experimental results on real datasets are also given in the
article. In [50], the authors generalized privacy-preserving association rule
mining by allowing different attributes to have different levels of privacy.
Different randomization factors are used for different attributes, and an
efficient algorithm called Recursive Estimation is developed to estimate the
support of an item set under this framework. They also showed that non-uniform
randomization factors improve accuracy compared with uniform randomization
approaches.
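
The idea behind randomization can be illustrated with a minimal sketch. It assumes the simplest possible scheme, in which every item's presence bit is kept with probability keep_prob and flipped otherwise; this is a generic illustration, not the operators of [49] or the Recursive Estimation algorithm of [50]. The function names and parameters are hypothetical.

```python
import random


def randomize(db, items, keep_prob, rng):
    """Return a distorted copy of db: each item's presence bit is kept with
    probability keep_prob and flipped with probability 1 - keep_prob."""
    noisy = []
    for t in db:
        # new bit == old bit when the draw says "keep", otherwise the opposite
        noisy.append({i for i in items if (i in t) == (rng.random() < keep_prob)})
    return noisy


def estimate_support(observed, keep_prob):
    """Invert observed = p*s + (1 - p)*(1 - s) for the true support s.
    Assumes keep_prob != 0.5, otherwise the distorted data carries no signal."""
    return (observed - (1 - keep_prob)) / (2 * keep_prob - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # 30% of the transactions contain item "a" (illustrative data only)
    db = [{"a"} if k < 6000 else set() for k in range(20000)]
    noisy = randomize(db, ["a"], keep_prob=0.9, rng=rng)
    observed = sum(1 for t in noisy if "a" in t) / len(noisy)
    print("observed:", observed, "estimated:", estimate_support(observed, 0.9))
```

The miner sees only the distorted transactions, yet the support of a single item can be estimated from the observed frequency because the distortion probability is public. Supports of larger item sets need a more involved estimator, which is where the cited works differ.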

In [51], the authors proposed a privacy-preserving association rule mining
algorithm called DDIL, based on data disturbance and inquiry limitation. The
method disturbs and hides the original data to achieve a high degree of
privacy; in particular, an efficient method of generating frequent item sets
from the transformed data is proposed. Their experimental study indicated that
the method is effective in balancing privacy and accuracy.

The term “association rule hiding” was first mentioned in 1999 by Atallah et
al. [52]. Data sanitization, that is, reducing the support of a sensitive item
set until it is hidden from disclosure, was first proposed in that work to
solve the association rule hiding problem. They also proved that optimal
sanitization is an NP-hard problem.

A key problem that is still not sufficiently investigated is the need to
balance the confidentiality of the disclosed data with the legitimate needs of
the data users. Dasseni et al. developed three strategies [53] to hide
sensitive association rules in the mining process. The strategies work on
either the support or the confidence of a rule, decreasing one of them until
the sensitive rule is hidden. The proposed algorithms are single-rule heuristic
hiding approaches that act on the support or the confidence through the
antecedent or the consequent of the sensitive rule.
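
The following sketch shows what hiding a rule by lowering its support or confidence can look like. It is a simplified greedy illustration under stated assumptions, not any of the three strategies of [53]: transactions are Python sets, one consequent item is removed from supporting transactions in database order, and the names hide_rule, min_sup and min_conf are hypothetical.

```python
def support(db, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in db if items <= t) / len(db)


def confidence(db, lhs, rhs):
    """Confidence of the rule lhs -> rhs, i.e. supp(lhs U rhs) / supp(lhs)."""
    s_lhs = support(db, lhs)
    return support(db, set(lhs) | set(rhs)) / s_lhs if s_lhs else 0.0


def hide_rule(db, lhs, rhs, min_sup, min_conf):
    """Remove one consequent item from supporting transactions until the rule
    falls below min_sup or min_conf. Returns a sanitized copy of db."""
    out = [set(t) for t in db]          # leave the original database intact
    lhs, rhs = set(lhs), set(rhs)
    victim = sorted(rhs)[0]             # assumption: always drop the same item
    for t in out:
        hidden = (support(out, lhs | rhs) < min_sup
                  or confidence(out, lhs, rhs) < min_conf)
        if hidden:
            break
        if lhs | rhs <= t:
            t.discard(victim)           # lowers supp(lhs U rhs), keeps supp(lhs)
    return out


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    clean = hide_rule(db, {"a"}, {"b"}, min_sup=0.4, min_conf=0.6)
    print(confidence(db, {"a"}, {"b"}), confidence(clean, {"a"}, {"b"}))
```

Removing a consequent item from a transaction that supports the whole rule lowers the numerator of the confidence while leaving the support of the antecedent unchanged, which is why few modifications are often enough. The sketch ignores side effects on non-sensitive rules, which the heuristics in the literature try to limit.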

A new informative rule set is defined in [54]; it generates prediction
sequences equal to those generated by the association rule set under confidence
priority. The authors presented an algorithm that generates the informative
rule set directly, without first generating all frequent item sets, and that
requires fewer database accesses than unconstrained direct methods. Their
experimental results showed that the informative rule set is smaller than both
the association rule set and the non-redundant association rule set.

Oliveira et al. introduced an approach for hiding multiple association rules
that requires only two scans of the database [55]. The first scan creates an
index file that speeds up the identification and retrieval of sensitive
transactions. In the second scan, the algorithms sanitize the database by
selectively removing the smallest number of individual items needed to hide the
sensitive knowledge. A novel feature of these approaches is that they take into
account not only the impact of sanitization on hiding the sensitive patterns
but also its impact on non-sensitive knowledge.

Privacy is a major concern for many people when knowledge is discovered from
large databases using data mining techniques. In [56], the authors presented a
scheme based on probabilistic distortion of user data that can simultaneously
provide a high degree of privacy to the user and retain a high level of
accuracy in the mining results. They evaluated their algorithm on real and
synthetic datasets and reported that it preserves privacy while providing
accurate results to the users without generating spurious rules.

Two sanitizing algorithms, the round robin algorithm and the random algorithm,
are proposed in [57] for balancing privacy and knowledge discovery in
privacy-preserving association rule mining. These algorithms require only two
scans regardless of the database size and the number of restrictive association
rules that must be protected: the first scan builds an index (inverted file)
that speeds up the sanitization process, and the second scan sanitizes the
original database. The authors compared their algorithms with previously
proposed algorithms in terms of effectiveness and scalability. The analysis
showed that the proposed sanitizing algorithms significantly improve on the
previous algorithms for hiding sensitive association rules. The authors also
argued that their sanitization methods are robust in the sense that no
de-sanitization is possible.
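
A two-scan sanitization of this kind can be sketched as follows. This is a minimal illustration assuming that every sensitive item set must be hidden completely and that the victim item simply rotates among the items of the pattern; it is not the published round robin or random algorithm, and build_index and sanitize_two_scan are hypothetical names.

```python
from collections import defaultdict


def build_index(db, sensitive_itemsets):
    """Scan 1: inverted index mapping each sensitive item set (by position)
    to the IDs of the transactions that contain it."""
    index = defaultdict(list)
    for tid, t in enumerate(db):
        for k, pattern in enumerate(sensitive_itemsets):
            if pattern <= t:
                index[k].append(tid)
    return index


def sanitize_two_scan(db, sensitive_itemsets):
    """Scan 2: visit only the indexed transactions and drop one item of the
    sensitive pattern from each, taking victims in round-robin order."""
    index = build_index(db, sensitive_itemsets)
    out = [set(t) for t in db]                  # sanitized copy
    for k, tids in index.items():
        items = sorted(sensitive_itemsets[k])
        for turn, tid in enumerate(tids):
            out[tid].discard(items[turn % len(items)])
    return out


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
    print(sanitize_two_scan(db, [{"a", "b"}]))
```

Rotating the victim item spreads the loss of support over the items of the pattern instead of removing one item everywhere. A practical algorithm would typically also allow a fraction of the sensitive transactions to stay untouched, which this sketch does not model.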

In [58], the authors proposed two algorithms, ISL (Increase Support of LHS) and
DSR (Decrease Support of RHS), to automatically hide informative association
rule sets without pre-mining and selection of hidden rules. An analysis
illustrates the effectiveness of the proposed algorithms. The authors also
recommended which algorithm to use depending on the characteristics of the
database.

The authors of [59] proposed two distortion-based heuristic algorithms to
selectively hide sensitive association rules while accepting limited side
effects. The first algorithm, called Priority-based Distortion Algorithm (PDA),
reduces the confidence of a sensitive association rule by changing 1s to 0s for
items belonging to the rule's consequent. The second algorithm, called
Weight-based Sorting Distortion Algorithm (WDA), concentrates on optimizing the
hiding process in an attempt to achieve the fewest side effects and the minimum
complexity. They showed that both PDA and WDA produce hiding solutions of
better quality in terms of computational complexity and privacy.

Menon et al. studied the hiding of frequent item sets for privacy-preserving
association rule mining while maximizing the accuracy of the shared database
[60]. The authors presented an optimal approach for hiding sensitive item sets
while keeping the number of modified transactions to a minimum. They also
showed that the approach works well on databases with millions of transactions
and presented experimental results on real as well as synthetic data.

Chih-Chia Weng et al. proposed an efficient algorithm called Fast Hiding
Sensitive Association Rules (FHSAR) for hiding sensitive association rules
[61]. The algorithm can completely hide any given sensitive association rules
by scanning the database only once, which significantly reduces the execution
time. Experimental results showed that FHSAR outperforms previous work in terms
of execution time and side effects. The number of new rules generated in the
hiding process is minimized and is independent of the size of the database. In
addition to the FHSAR algorithm, the authors proposed two heuristics for
improving performance. First, a heuristic function assigns a prior weight to
each transaction, which determines the order in which transactions are
modified. Second, the correlations between the sensitive association rules and
each transaction in the original database are analyzed.

For privacy-preserving association rule mining in a centralized database, a
new algorithm is presented in [62] in which a filter is applied after the
mining phase to weed out or hide the restricted association rules. Before the
algorithm is applied, the data structure of the database and the set of
sensitive association rules are analyzed to build a more effective model. The
algorithm can be used to balance privacy preservation and knowledge discovery
in association rule mining.

A new association rule hiding algorithm is proposed and described in detail in
[63]; it aims to hide simple rules, including single rules and composed rules.
Weak and strong association rules are distinguished so that sanitization can be
done easily and effectively while avoiding side effects. Four item modification
methods are designed for updating the selected weak association transactions.
Only a small number of transactions need to be updated, which preserves the
original features of the mined dataset. A modification factor is used to
achieve a hiding rate of 100% and to reduce the number of lost rules and newly
generated rules. The authors reported that the approach is robust for
privacy-preserving association rule mining.

In the approach proposed in [64], an efficient data structure (FCET) is used to
store maximal frequent item sets to support scalability. The authors also
proposed a framework, called the greedy approximation approach, that combines
efficient techniques for hiding sensitive rules, founded on four lemmas, with a
transaction retrieval engine based on the FCET index tree. The items in the
transactions are modified according to the four lemmas and their associated
strategies.

The authors of [65] proposed a variant of the greedy approximation algorithm,
called the greedy exhausted algorithm, which also hides sensitive rules by
reducing their confidence or support below a user-specified threshold. Their
experimental results showed that both methods hide sensitive rules completely,
but the latter algorithm also considers the cost of side effects and, based on
that cost, chooses modifications to the database that reduce the side effects
further.

A new approach called ISSRH (Increase Support Sensitive Rule Hiding) was
proposed in [66] to hide sensitive association rules that contain sensitive
items. The approach has six steps, each performing a specific task; one of the
steps uses clustering to group similar items related to the specified sensitive
rules. The approach considers different characteristics while hiding sensitive
rules to increase the efficiency of the algorithm. The authors presented an
algorithm for the approach, illustrated it with examples, and analyzed its
effectiveness in terms of privacy, computational cost, number of database
scans, and minimal modifications.

The authors of [67] investigated the issue of exact knowledge hiding and
proposed three schemes that are suitable for identifying exact solutions of
high quality. They also introduced a structural decomposition that partitions
the original constraint satisfaction problem (CSP) into numerous independent
components, together with a parallelization framework that can be applied to
all three schemes and dramatically improves the runtime of the hiding
algorithm. This framework for decomposition and parallel solving of the hiding
problems handled by the exact hiding approaches is efficient for large
databases and significantly decreases runtime. The authors conducted
experiments that demonstrated the effectiveness of the approaches in providing
high-quality knowledge hiding solutions.

Yuhong Guo investigated efficient reconstruction-based techniques for
association rule hiding and proposed a frequent pattern tree based method for
inverse frequent item set mining, which is used in a reconstruction-based
framework for privacy-preserving association rule mining [68]. The proposed
model has three phases. The first phase generates frequent item sets from the
original database. The second phase runs a sanitization algorithm over the
frequent item sets by selecting a hiding strategy and identifying the sensitive
frequent item sets according to the sensitive association rules. The third
phase generates the sanitized database using the inverse frequent item set
mining algorithm and then releases it. Hiding effects, data utility, and time
complexity are considered as performance measures for the proposed protocol and
are analyzed in the article.

A real example of the individual identifiability problem in privacy-preserving
data mining is given in [69]. Suppose medical data is disclosed without names
and addresses; linking it with publicly available voter registration records
using birth date, gender, and postal code may still reveal the name and address
corresponding to each medical record. The key point is that the absence of an
individual's identity in the data is not sufficient, since joining the data
with other sources may reveal that identity. The authors proposed an approach
using concepts from [70] and introduced the quasi-identifier to solve this
problem. The approach uses k-anonymity, which requires that no record be unique
in its quasi-identifiers: there must be at least k records with the same
quasi-identifier values.
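
The k-anonymity condition can be checked mechanically. The sketch below assumes records are dictionaries and that the caller names the quasi-identifier attributes; the attribute names and sample values are made up for illustration, and the function only tests the property, it does not generalize or suppress values to achieve it.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs in
    records is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


if __name__ == "__main__":
    qi = ["birth_year", "gender", "zip"]
    released = [
        {"birth_year": 1970, "gender": "F", "zip": "021**", "diagnosis": "flu"},
        {"birth_year": 1970, "gender": "F", "zip": "021**", "diagnosis": "asthma"},
        {"birth_year": 1982, "gender": "M", "zip": "100**", "diagnosis": "flu"},
    ]
    # The third record is unique in its quasi-identifiers, so k = 2 fails.
    print(is_k_anonymous(released, qi, 2))
```

A record that is alone in its group is exactly the record that a join with an external source, such as a voter list, could re-identify.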

Y. Saygin et al. developed two algorithms that generate a sanitized database
from the original database by replacing the values of items in selected
transactions with unknown values in order to hide the sensitive association
rules [71]. The first algorithm hides the rules by reducing the minimum support
of the item sets that generate the sensitive rules, and the second reduces the
minimum confidence of the rules in two different ways. In the first method, the
confidence of a rule is reduced by replacing 1s with ?s, while the second
method replaces 0s with ?s. The authors analyzed each algorithm and showed that
the algorithms are effective in hiding sensitive frequent item sets in order to
hide sensitive association rules.
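
Once unknowns are introduced, the support of an item set is no longer a single number but an interval. The sketch below illustrates this under the assumption that each transaction is a dictionary mapping items to 1, 0 or "?"; the helper names are hypothetical and the code is not the algorithm of [71].

```python
def support_interval(db, itemset):
    """Return (min_support, max_support) of itemset.
    The minimum counts only transactions where every item is a certain 1;
    the maximum also counts transactions where some items are unknown."""
    lo = sum(1 for t in db if all(t.get(i, 0) == 1 for i in itemset))
    hi = sum(1 for t in db if all(t.get(i, 0) in (1, "?") for i in itemset))
    return lo / len(db), hi / len(db)


def block(db, tid, item):
    """Return a copy of db in which one value is replaced by an unknown."""
    out = [dict(t) for t in db]
    out[tid][item] = "?"
    return out


if __name__ == "__main__":
    db = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0}, {"a": 0, "b": 1}]
    print(support_interval(db, ["a", "b"]))       # no unknowns yet
    step1 = block(db, 0, "b")                     # a 1 becomes ?
    print(support_interval(step1, ["a", "b"]))
    step2 = block(step1, 2, "b")                  # a 0 becomes ?
    print(support_interval(step2, ["a", "b"]))
```

Replacing a 1 lowers the minimum support, and replacing a 0 raises the maximum support. Hiding by blocking works on the lower end of such intervals, while the widened upper end adds uncertainty for an observer of the released data.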

The authors of [72] studied various data altering techniques for hiding
association, classification, and clustering rules. Usually the entire data
mining process needs to be executed to find the rules to be hidden. The authors
proposed two algorithms, ISL (Increase Support of LHS) and DSR (Decrease
Support of RHS), which replace data by unknowns in the database so that
sensitive predictive rules containing specified items on the left-hand side of
the rule cannot be inferred through association rule mining. They analyzed the
performance of the algorithms in terms of privacy, number of database scans,
and number of hidden rules pruned. Compared with the approach in [73], this
approach hides all the rules containing hidden items on the left-hand side.

Pontikakis et al. argued that the main disadvantage of blocking is that an
adversary can disclose the hidden association rules simply by identifying the
generating item sets that contain question marks and that can lead to rules
with a maximum confidence above the minimum confidence threshold. The authors
proposed a blocking algorithm that avoids disclosing sensitive patterns by
generating rules that did not exist in the original dataset. To balance the
trade-off between the level of privacy and data utility, the proposed algorithm
incorporates a safety margin [74].

Data reconstruction methods put the original data aside and start by sanitizing
the so-called “knowledge base”. The data to be released is then reconstructed
from the sanitized knowledge base. This idea was first described in [75], but
the proposed approach is still incomplete and limited in several respects: it
gives no concrete guidance on how to sanitize the item set lattice according to
the sensitive association rules, and the data reconstruction process is
feasible only when the knowledge sanitization process produces an item set
lattice with a consistent configuration of support values. The method cannot
guarantee that a consistent configuration is found in polynomial time.

A study on finding an appropriate balance between the need for privacy and
information discovery on frequent patterns is presented in [76]. The authors
proposed an innovative technique for hiding sensitive patterns in which a
sanitization matrix is defined. Multiplying the original transaction database
by the sanitization matrix yields a new database, called the sanitized
database, in which the sensitive item sets are protected. En Tzu Wang et al.
also studied the same problem in [74] and proposed a method based on the
sanitization concept. The method additionally adopts a probability policy
against the recovery of sensitive patterns, which avoids forward-inference
attacks completely when the confidence level given by the user approaches 1.
They also discussed the efficiency of the proposed method.

Some relevant works that use border-based approaches are as follows:

The first frequent item set hiding methodology based on border revision, which
uses the border of the non-sensitive frequent item sets to track the impact of
altering transactions in the original database, was proposed by Sun and Yu in
[77].
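
The border of the frequent item sets can be made concrete with a small brute-force sketch. It assumes the usual definitions (the positive border holds the maximal frequent item sets, the negative border holds the infrequent item sets all of whose proper subsets are frequent) and enumerates candidates exhaustively, so it is only suitable for toy data. It illustrates the concept and is not the algorithm of [77]; the function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(db, min_count):
    """All item sets contained in at least min_count transactions."""
    items = sorted(set().union(*db))
    freq = set()
    for r in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, r)
                 if sum(1 for t in db if set(c) <= t) >= min_count}
        if not level:      # no frequent r-sets implies no larger frequent sets
            break
        freq |= level
    return freq


def borders(db, min_count):
    """Return (positive_border, negative_border) of the frequent item sets."""
    items = sorted(set().union(*db))
    freq = frequent_itemsets(db, min_count)
    positive = {f for f in freq if not any(f < g for g in freq)}
    negative = set()
    for r in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, r)):
            # for singletons the only proper subset is the empty set
            if c not in freq and all(c - {x} in freq for x in c if len(c) > 1):
                negative.add(c)
    return positive, negative


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
    print(borders(db, min_count=2))
```

Because every frequent item set is a subset of some element of the positive border, a hiding method can watch how a modification moves the border instead of re-mining the whole database.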

The hiding of sensitive frequent item sets in privacy-preserving association
rule mining is studied in [78]. Transactions in the database are modified while
the quality of the sanitized database is taken into account, especially the
preservation of the non-sensitive frequent item sets. To preserve the
non-sensitive frequent item sets, the authors proposed a border-based approach
that efficiently evaluates the impact of any modification to the database
during the hiding of sensitive frequent item sets, so that the quality of the
database can be maintained by greedily selecting the modifications with minimal
side effects. The authors also analyzed the performance of the approach in
terms of privacy and cost and showed that, using border revision, it satisfies
the privacy-preserving goals effectively.

The authors of [79] presented a new algorithm for sanitizing raw data from
sensitive knowledge in the context of association rule mining. The approach
relies on the maxmin criterion, a method in decision theory for maximizing the
minimum gain, and builds upon the border theory of frequent item sets. Their
experiments showed that the proposed method is efficient.

Earlier works in privacy-preserving association rule mining based on the exact
approach are presented as follows:

A novel methodology based on the exact approach is proposed in [80] to find an
optimal solution for association rule hiding without producing side effects,
using the border revision concept and integer programming. The authors
formulated the hiding process as a constraint satisfaction problem, so that
solving the association rule hiding problem amounts to determining a sanitized
database that satisfies the constraints. The main idea in finding the optimal
solution is to minimize the distance between the original database and the
sanitized database. They also demonstrated the effectiveness of the algorithm
on suitable databases.

The authors of [81] proposed a novel methodology based on the exact approach
that provides an optimal solution for hiding sensitive frequent item sets. The
approach minimally extends the original database by a synthetically generated
database, called the extended database, and formulates the construction of the
extended database as a constraint satisfaction problem, which is then solved
using Binary Integer Programming (BIP). They also showed that this hybrid
approach provides an approximate solution close to the optimal one when an
ideal solution, that is, one without side effects, does not exist.

A. Gkoulalas-Divanis et al. proposed a novel exact approach that finds an
optimal solution to the association rule hiding problem while avoiding side
effects [82]. The approach adopts the border revision concept and integer
programming to find an optimal solution in the form of a generated database
called the extended database. It first determines a minimally extended database
based on the item sets of the positive and negative borders and their threshold
values. The hiding problem is then formulated as a constraint satisfaction
problem, and BIP is applied to find the solution for the extended database. The
authors showed that this approach finds optimal solutions of higher quality
than previously developed exact approaches.

The authors of [83] proposed the two-phase iterative algorithm, an extension of
the inline algorithm for hiding sensitive association rules, which computes the
original and revised borders in a transactional database for a given minimum
support threshold. The approach has two phases that iterate until either an
exact solution of the given problem instance is identified or a prespecified
number of iterations has taken place. The two-phase iterative algorithm is
consistently superior to the inline algorithm in the sense that its worst-case
performance equals the performance of the inline algorithm. The experimental
results indicate that there are several settings in which the two-phase
iterative algorithm finds an optimal hiding solution. It can also capture all
the exact solutions that can be identified by the inline approach.

The authors of [84] investigated the issue of exact knowledge hiding and
proposed three schemes that are suitable for identifying exact solutions of
high quality. They also introduced a structural decomposition that partitions
the original CSP into numerous independent components, together with a
parallelization framework that can be applied to all three schemes and
dramatically improves the runtime of the hiding algorithm. A novel framework
for decomposition and parallel solving of the hiding problems handled by the
exact hiding approaches is also presented. This framework is efficient for
large databases and significantly decreases runtime. The authors conducted
experiments that demonstrated the effectiveness of the approaches in providing
high-quality knowledge hiding solutions.

A. Gkoulalas-Divanis et al. addressed many issues related to privacy-preserving
data mining, association rule hiding, and the classes of association rule
hiding methodologies, as well as rule hiding in classification,
privacy-preserving clustering, and sequence hiding [85]. They also presented
the goals of association rule hiding. Several border-based and exact approaches
are covered, together with their algorithms: BBA and the Max-Min algorithm,
which use the border revision concept, and Menon's, Inline, Two-Phase
Iterative, and Hybrid algorithms, which are based on the exact approach. The
authors conducted several experiments to demonstrate the effectiveness of each
algorithm with examples and also discussed the difficulties that arise in some
situations.

2.6 Privacy Preserving Association Rule Mining in Distributed Database Environment

In a distributed database environment, the database may be partitioned
horizontally, vertically, or in mixed mode. Some of the relevant works on
privacy-preserving association rule mining over horizontally partitioned data
are presented as follows:

The problem of determining who is richer without either party disclosing their
wealth is known as the two millionaires' problem and belongs to secure
multi-party computation. The authors of [86] proposed protocols for the two
millionaires' problem and also for the multi-party case.

Yao first postulated the two-party computation problem and developed a provably
secure solution [87]. The author proposed a new tool for controlling the
knowledge transfer process in cryptographic protocol design. It is shown how
two parties A and B can interactively generate a random integer N = p·q such
that its secret, that is, the prime factors (p, q), is hidden from either party
individually but is recoverable jointly if desired. Using this concept, a
two-party protocol is proposed in which parties with private values i and j
compute any polynomially computable functions f(i,j) and g(i,j) with minimal
knowledge transfer. A framework for secure multiparty computation is developed
in [88], where it is shown that computing a function privately is equivalent to
computing it securely. The protocol was extended to multiparty computations by
Goldreich et al. [89].

It is important to investigate efficient methods for distributed mining of
association rules when the databases are large, since this calls for highly
scalable distributed systems with easy partitioning. The study in [90]
discloses some interesting relationships between locally large and globally
large item sets and proposes a distributed association rule mining algorithm,
FDM (Fast Distributed Mining of association rules), which generates a small
number of candidate sets and substantially reduces the number of messages
passed when mining association rules. The authors analyzed the performance of
the algorithm and showed that FDM performs better than the direct application
of a sequential algorithm.
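
One relationship between locally and globally large item sets follows directly from the definitions: an item set that meets the relative support threshold over the whole database must meet it at one site at least, since otherwise the local counts could not add up. The sketch below checks this on toy data. It assumes horizontally partitioned data held as lists of Python sets and a common relative threshold min_sup; it illustrates the pruning idea only and is not the FDM algorithm of [90]. All names are hypothetical.

```python
def count(db, itemset):
    """Number of transactions in db that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def locally_large_sites(sites, itemset, min_sup):
    """Indices of the sites at which itemset reaches the local threshold."""
    return [k for k, db in enumerate(sites)
            if count(db, itemset) >= min_sup * len(db)]


def globally_large(sites, itemset, min_sup):
    """True if itemset reaches the threshold over the union of all sites."""
    total = sum(len(db) for db in sites)
    return sum(count(db, itemset) for db in sites) >= min_sup * total


if __name__ == "__main__":
    sites = [
        [{"a", "b"}, {"a", "b"}, {"c"}],          # site 0
        [{"a"}, {"b"}, {"c"}, {"a", "b"}],        # site 1
    ]
    for candidate in (frozenset({"a", "b"}), frozenset({"c"})):
        print(sorted(candidate),
              locally_large_sites(sites, candidate, 0.4),
              globally_large(sites, candidate, 0.4))
```

A site therefore only needs to propose candidates that are large in its own partition, which is one way the number of candidates and exchanged messages can be kept small.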

Naor et al. addressed the Oblivious Transfer protocol, which provides secure
communication between a sender and a receiver [91]. In this protocol one party,
the sender, transmits part of its inputs to another party, the receiver, in a
way that protects both of them. The sender is assured that the receiver does
not receive more information than it is entitled to, while the receiver is
assured that the sender does not learn which part of the inputs it received.
The protocol is a key component in many applications of cryptography. The
authors analyzed the performance of the protocol and presented its merits and
demerits.

In [92], the authors showed that two previously proposed private scalar product
protocols are insecure. They described a private scalar product protocol based
on homomorphic encryption and demonstrated its efficiency on massive datasets.

In many data mining applications, data is represented as attribute vectors, and
the scalar (dot) product is one of the fundamental operations. The authors of
[93] presented an efficient and practical secure scalar product protocol for
horizontally partitioned databases. They compared it with the most common
scalar product protocols and demonstrated its efficiency on a real dataset.

Clifton et al. proposed a toolkit of components that can be combined for
specific privacy-preserving data mining applications [94]. They showed how the
components of the toolkit can be used to solve different privacy-preserving
data mining problems. They presented four efficient protocols for
privacy-preserving computations that can be used to support data mining: secure
sum, secure set union, secure size of set intersection, and scalar product.
They also demonstrated the use of some of the protocols for privacy-preserving
data mining problems in a distributed environment.
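
Of the four protocols, secure sum is the easiest to illustrate. The sketch below simulates the commonly described ring version in a single process: the initiating site adds a random mask to its value, every other site adds its own value to the running total it receives, and the initiator removes the mask at the end. It assumes honest-but-curious sites that do not collude and a modulus larger than the true sum; the function name and parameters are hypothetical, and a real deployment would pass the running total over the network.

```python
import random


def secure_sum(values, modulus, rng=None):
    """Simulate a ring-based secure sum over the sites' private values.
    values[0] belongs to the initiating site. Assumes sum(values) < modulus."""
    rng = rng or random.Random()
    mask = rng.randrange(modulus)              # known only to the initiator
    running = (mask + values[0]) % modulus     # message sent to site 1
    for v in values[1:]:
        # each site sees only a uniformly masked total, then adds its value
        running = (running + v) % modulus
    return (running - mask) % modulus          # initiator removes the mask


if __name__ == "__main__":
    local_support_counts = [5, 7, 11]          # illustrative private inputs
    print(secure_sum(local_support_counts, modulus=1000))
```

In distributed association rule mining the private values would typically be local support counts, so the sites learn whether an item set is globally frequent without revealing their individual counts. Two colluding neighbours of a site can still recover that site's value, which is why the non-collusion assumption matters.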

In [95], the authors proposed an approach for mining knowledge from grid-scale
systems while ensuring that the data is cryptographically safe, using a
third-party model. The proposed algorithm, Private-Majority-Rule, is a
k-private distributed association rule mining algorithm that involves no global
communication patterns and dynamically adjusts to changes in the data or to the
failure and recovery of resources. The architecture adopts Majority-Rule, a
highly scalable distributed association rule mining algorithm [96]. Through
simulations with thousands of resources, the authors showed that the algorithm
quickly converges to the correct result while using reasonable communication.

In many real-life applications, data is split between multiple organizations
that wish to use all of the data to build more accurate predictive models while
revealing neither their training data nor the instances to be classified. To
address this issue, the authors of [97] presented a privacy preserving Naive
Bayes classifier for horizontally partitioned data.

Most privacy preserving distributed data mining algorithms are based on
perturbation and secure multi-party computation, and they accept some reduction
in accuracy and some overhead. Alex et al. offered a new approach that performs
privacy preserving distributed data mining without using secure computation or
perturbation. They introduced two new entities, termed the miner and the
calculator, which do not possess databases, and developed three algorithms
based on this approach for three cases: horizontally partitioned data,
vertically partitioned data, and any data mining method in a distributed
database environment. The authors compared each algorithm, in terms of
computation and communication overheads, with approaches based on secure
computation or perturbation [98].

The article [99] makes primary contributions on two different grounds.

First, it explores independent component analysis as a possible tool for

breaching privacy in deterministic multiplicative perturbation-based models

such as random orthogonal transformation and random rotation. Then, it

proposes an approximate random projection-based technique to improve the

level of privacy protection while still preserving certain statistical

characteristics of the data. An extensive theoretical analysis and experimental

results are also presented in this paper. The authors proved that the proposed

technique is effective and can be successfully used for different types of

privacy-preserving data mining applications.

In many data mining applications, data comes from different parties with
different schemas. In [100], the authors addressed the problem of
privacy-preserving frequent pattern mining in many-to-many schemas across two
dimension sites, where the sites are not trusted and are semi-honest. They
proposed a method based on the concept of semi-join, which does not involve the
data encryption that is commonly used. Experimental results are presented to
study the efficiency of the proposed methods.

A new protocol for finding privacy preserving association rules in horizontally
partitioned databases was proposed in [101]. The same method offers the
additional benefit of discovering the association rules privately. The authors
also showed that the protocol is more efficient than previous methods. The
protocol supports privacy goals such as the following: every party can access
only its own data, no party is able to learn the links between other parties
and their data, and no party learns any transactions of the other parties'
databases.

The authors in [102] proposed a new algorithm for the semi-honest model with
negligible collision probability. It is a modification of the privacy
preserving association rule mining algorithm for distributed homogeneous
databases. The new algorithm adopts a public key cryptosystem, which avoids the
overheads involved in running the algorithm with a commutative encryption
system. The authors showed that the new algorithm is faster than the old one
while preserving privacy and the accuracy of the results. One important feature
of this algorithm is scalability: the same algorithm can be extended to any
number of sites. From the experimental results, the authors concluded that the
new algorithm performs better than the previous algorithms in computation,
communication, time and accuracy, because its total bit-communication cost is a
function of N, where N is the number of sites.

The author studied the problem of privacy preserving association rule mining
when the data is partitioned and placed at different locations and no site
owner is willing to provide its data or information to any other site.
Algorithms for privacy preserving association rule mining over horizontally,
vertically and mixed partitioned databases are presented in this thesis work
[103]. Several experiments were conducted for each partitioning method to
analyze the performance and to identify the limitations of each method.

Kantarcioglu proposed methods to mine horizontally partitioned data without
violating privacy and discussed how to use the data mining results while
preserving privacy. The proposed methods incorporate cryptographic techniques
to minimize the information shared, while adding as little overhead as possible
to the mining and processing task [104].

Secure mining of association rules over horizontally partitioned databases,
using cryptographic techniques that minimize the information shared at the cost
of some added overhead to the mining process, is presented in [105]. Using
cryptographic tools, the authors proposed two protocols to securely mine
distributed association rules on horizontally partitioned data in the
semi-honest model. The communication and computation costs of mining with the
two protocols are discussed.

In [106], the authors proposed a new solution for privacy preserving
association rule mining that integrates the advantages of two approaches:
protecting private data with an extended role-based access control approach,
and applying cryptographic techniques when sensitive information is to be
preserved. They classified the data into sensitive and non-sensitive objects.
Sensitive objects are encrypted and stored, and a permitted user is allowed to
access them only after decryption, which ensures privacy. The authors showed
that, with these techniques, the new algorithm minimizes information loss and
privacy loss. The cryptographic technique protects the stored sensitive data,
and access to it is granted based on an individual’s role, which ensures that
the data is safe from privacy breaches.

Support vector machine classification is one of the most widely used
classification methodologies in data mining and machine learning. It is based
on solid theoretical foundations and has wide practical application. A
privacy-preserving algorithm for nonlinear support vector machine
classification in horizontally partitioned databases was developed in [107].
The algorithm uses secure set intersection cardinality to compute the Gram
matrix securely.

Earlier research on privacy preserving association rule mining when the
database is partitioned vertically or in mixed mode is reviewed below:

In [108], the authors addressed the problem of association rule mining in
vertically partitioned databases using a cryptography-based approach. Each site
holds some attributes of each transaction, and the sites wish to collaborate to
identify globally valid association rules. The authors defined global frequent
item sets for the case of a vertically partitioned database, where every site
possesses a different set of attributes for a common set of tuples. Based on a
secure scalar product protocol, they developed a two-party algorithm that finds
global frequent item sets and their supports efficiently without violating any
party’s privacy constraints. An analysis of the security and communication cost
of the proposed algorithm is presented, and it shows that the algorithm is
efficient.
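
The link between support counting and the scalar product can be sketched as
follows. The example data, the attribute names and the helper functions are
hypothetical, and the scalar product is computed in the clear as a stand-in for
a secure two-party protocol.

```python
from math import prod


def local_indicator(columns):
    """1 for each transaction that contains all of this site's items, else 0."""
    return [prod(row) for row in zip(*columns)]


def scalar_product(u, v):
    # Placeholder computed in the clear. A real deployment would replace this
    # single call with a secure two-party scalar product protocol.
    return sum(a * b for a, b in zip(u, v))


# Site A holds the attributes "bread" and "butter"; site B holds "milk".
# Row i at both sites describes the same transaction i.
bread = [1, 1, 0, 1, 1]
butter = [1, 0, 0, 1, 1]
milk = [1, 1, 1, 0, 1]

# Support count of the item set {bread, butter, milk} across both sites.
support = scalar_product(
    local_indicator([bread, butter]),
    local_indicator([milk]),
)
```

Here support is 2, because only the first and the last transactions contain all
three items. Each site contributes only a 0/1 vector, so securing the scalar
product step is what keeps the individual attribute values private.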

The authors in [109] examined several private scalar product protocols and
analyzed their insecurity. They proposed a two-party scalar product protocol
that uses an untrusted third party and algebraic computations. They also
analyzed the proposed protocol and showed that it is secure and practical, with
low communication and computation costs.

The privacy preserving association rule mining problem with one data miner and
multiple data providers is addressed in [110]. The proposed approach, based on
an algebraic technique combined with a randomization technique, mines
association rules accurately while disclosing less private information.

The problem of finding association rules for vertically

partitioned databases while preserving the confidentiality of each database in all

sites is addressed in [111]. The authors proposed two algorithms for discovering

frequent item sets and for calculating the confidence of the rules. They also

analyzed the algorithms considering privacy properties, and compared them to

existing algorithms.

In their study [112], the authors developed a privacy preserving version of the
popular clustering algorithm DBSCAN, which is based on a density-based notion
of clustering that allows clusters of arbitrary shape to be discovered. DBSCAN
uses R-trees to support efficient associative queries, requires only two input
parameters, and offers some support in determining appropriate values for them.
They also showed that privacy preserving DBSCAN requires privacy preserving
R-trees, which the developed algorithm provides.

In [113], a simple technique is discussed for transforming categorical and
numeric sensitive data using a mapping table and a graded grouping technique,
respectively, while treating distributed data as centralized. The authors also
applied the proposed technique to data mining tasks such as classification,
clustering and association rule mining, and analyzed the results.

The secure scalar product protocol is an important fundamental protocol in
secure multi-party computation. Based on an additively homomorphic public key
cryptosystem, the authors developed a new secure scalar product protocol under
the semi-honest model with low communication complexity [114]. Furthermore,
they applied it to privacy preserving decisions about the positional
relationship of space vectors.

In medical domains, practitioners use many data mining techniques to make the
right decisions. Because medical data holds patients’ personal information,
revealing all of it is undesirable, and private information such as a patient’s
personal identification must be protected from disclosure. In [115], the
authors addressed the preservation of privacy in classification techniques for
two cases: centralized and distributed database environments. They proposed an
architecture for privacy preservation in classification for a mixed partitioned
distributed database model, which combines vertical and horizontal
partitioning, using a breast cancer dataset.

Databases in distributed applications are commonly partitioned in one of two
ways, horizontally or vertically, but many distributed applications also use
mixed partitioning. The authors in [116] addressed privacy preserving
association rule mining in mixed partitioned databases and developed an
algorithm, a modification of the algorithm in [90], based on cryptographic
techniques. The algorithm was evaluated against a set of metrics and shown to
find global results efficiently without violating privacy constraints.

In [117], the authors discussed broad areas of privacy preserving data mining
and the underlying algorithms and methodologies, such as randomization and the
k-anonymity model. They also discussed existing methodologies for privacy
preserving association rule mining in distributed environments under different
partitioning methods, such as horizontal and vertical models, and showed the
limitations of each method. The authors proposed a novel approach to preserve
privacy in association rule mining that applies a secure hash function to
semi-anonymized sensitive attributes to eliminate the possibility of
reconstructing the original data.
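
The idea of hashing sensitive attributes before mining can be sketched as
follows. This is only an illustration and not the scheme of [117]: the use of a
keyed hash (HMAC-SHA-256), the function name mask_sensitive and the record
layout are all assumptions.

```python
import hashlib
import hmac


def mask_sensitive(record, sensitive_fields, key):
    """Return a copy of record with the sensitive fields replaced by digests."""
    masked = dict(record)
    for field in sensitive_fields:
        value = str(record[field]).encode("utf-8")
        # A keyed hash is used (an assumption of this sketch) so that digests
        # cannot be reversed by hashing a dictionary of likely values.
        masked[field] = hmac.new(key, value, hashlib.sha256).hexdigest()
    return masked


# Hypothetical transaction record and secret key.
record = {"customer": "Alice", "zip": "560001", "items": "bread,milk"}
masked = mask_sensitive(record, ["customer", "zip"], key=b"site-secret")
```

Equal attribute values map to equal digests, so item sets can still be counted
over the masked records, while the original values are not stored in them.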

Danfeng Yao et al. presented a private distributed scalar product protocol in
[118], which can be used for obtaining trust values from private
recommendations. They proposed a credential-based trust model in which the
trustworthiness of a user is computed from his or her affiliations and role
assignments. This trust model was shown to be simple to compute and scalable to
many users.

Privacy preserving data mining is receiving increasing attention from
researchers seeking effective solutions that do not produce side effects.
Privacy preserving data mining techniques are used in many real applications,
including surveillance, which is naturally regarded as a “privacy-violating”
application. A number of techniques that use privacy preserving data mining
algorithms have been discussed for facial de-identification, bio-surveillance,
and identity theft [119, 120 & 121].