
2013 IEEE Conference on Information & Communication Technologies (ICT), Thuckalay, Tamil Nadu, India, 11-12 April 2013


A Novel Approach: CART Algorithm for Vertically Partitioned Database in Multi-Party Environment

Jayanti Dansana, Debadutta Dey, Raghvendra Kumar
School of Computer Engineering, KIIT University, Bhubaneswar, INDIA
[email protected], [email protected], [email protected]

Abstract—Advances in technology and the use of distributed databases have raised increasing concern about the privacy of private data. Collaboration and teamwork bring substantial, sought-after results, but organizations are unwilling to participate in data mining because of the risk of data leakage. To obtain those results, data must be used and collected from the participating parties in a secure way, that is, without data leakage, and with a decision tree algorithm that models the data correctly and gives accurate results. This paper presents how the CART algorithm can be used by multiple parties in a vertically partitioned environment. For security, the secure sum and secure size of set intersection protocols are used.

Keywords—privacy preserving; CART; decision tree

I. INTRODUCTION

Consider a scenario in which more than two parties are willing to cooperate and run a data mining algorithm over the union of their databases, being well aware of the results such collaboration brings. The parties nevertheless do not collaborate, because the privacy of their private data is poorly preserved. We show how the CART algorithm can be implemented for multiple parties in a vertically partitioned environment in a secure way.

II. RELATED WORKS

Data mining [1], often called knowledge discovery in databases (KDD), is the process of extracting useful data and hidden predictive information from large databases. Data mining takes relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams and the World Wide Web as its data repositories, and demands an integration of techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, pattern recognition and neural networks. The knowledge discovered by running a data mining algorithm can be used for decision making, process control, information management and query processing. The data mining step may discover multiple groups in the data, which can be used to obtain the most accurate prediction, the outcome generated by a decision support system.

Classification rule mining [2] is one of the most popular data mining tasks. Classification is a data mining technique that assigns data or transactions to groups or classes. The goal of classification is to accurately predict the class attribute for each transaction in the data.

Decision tree classification [2] is one of the most popular classification techniques for predicting the class attribute. It is called supervised learning because the decision tree is first built on a training dataset and test tuples are then introduced for classification of the test data. In a decision tree the internal nodes are tests, the leaf nodes are class labels, and the branches form the paths from the tests to a class label. Different decision trees use different attribute selection measures for choosing the best splitting node: the most popular decision trees, ID3 [2], C4.5 [3] and CART [4], use information gain, gain ratio and the gini index respectively.

Advances in technology have naturally led to distributed databases. A distributed database [5] is a database that is partitioned and stored on different systems which logically belong to the same system. A distributed database can be achieved by two methods. The first is replication, where the system maintains several identical replicas of a database and synchronizes all the data at regular intervals; when multiple users must access the same data, it ensures that updates and deletes performed on the data at one location are automatically reflected in the data stored on the other systems. The second method is fragmentation, where the database is divided into several partitions so that update and delete operations can be done on a single partition without having to update the other partitions. Fragmentation can be achieved by horizontal partitioning, where the database is partitioned row-wise; by vertical partitioning, where the database is partitioned column-wise; and by mixed partitioning, where the database is first partitioned horizontally and then vertically, or vice versa.

Proceedings of 2013 IEEE Conference on Information and Communication Technologies (ICT 2013), 978-1-4673-5758-6/13/$31.00 © 2013 IEEE
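As a hedged illustration of the two fragmentation styles (the relation, its attribute names and the site split below are invented for this sketch, not taken from the paper):

```python
# Sketch of horizontal vs. vertical fragmentation of one relation.
# The sample rows and the two-site splits are illustrative assumptions.
rows = [
    {"TID": "T1", "Outlook": "Sunny", "Windy": "No"},
    {"TID": "T2", "Outlook": "Rainy", "Windy": "Yes"},
    {"TID": "T3", "Outlook": "Overcast", "Windy": "No"},
]

# Horizontal partitioning: every site has the same attributes but a
# disjoint subset of the rows.
site_a = rows[:2]
site_b = rows[2:]

# Vertical partitioning: every site keeps the join key (TID) plus a
# disjoint subset of the attributes for all rows.
site_1 = [{"TID": r["TID"], "Outlook": r["Outlook"]} for r in rows]
site_2 = [{"TID": r["TID"], "Windy": r["Windy"]} for r in rows]
```

Keeping the TID at every vertical fragment is what lets the sites later match up their pieces of the same transaction.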

Since classification is a centralized algorithm, all the data or transactions must be present at a central place in order to mine the data. Classification algorithms have been developed for distributed data mining [5] with efficiency, not security, as the point of view.

Privacy preservation has been very important since distributed databases came into use. A solution to the privacy preserving data mining problem was initiated in [6] using the oblivious transfer protocol [7]; this solution, however, only deals with classification of horizontally partitioned data using the ID3 algorithm. Another approach to privacy preserving data mining was proposed in [8], where each transaction is perturbed and the distributions of all the data are reconstructed at an aggregate level; in this approach probability-based data mining algorithms are used instead of algorithms that operate on transactional records. The research on privacy preservation has since widened. In [9] a scalar dot product protocol was proposed and it was demonstrated how privacy is preserved using the ID3 decision tree. A secure sum protocol that runs efficiently in a multi-party environment was developed in [10]. Secure set intersection cardinality was proposed in [11] for better security. In [12] a solution to the privacy preserving distributed association rule mining problem was described. Later, a solution for a privacy preserving ID3 decision tree in a vertically partitioned environment was proposed in [13]. Extensive research on privacy preserving C4.5 in distributed databases was done in [14], and [15] studied C4.5 classification and introduced the PPC4.5 algorithm over a vertically distributed database for two parties. The C4.5 algorithm was later extended to multiple parties in a vertically partitioned environment in [16].

III. PRIVACY PRESERVING CART DECISION TREE CLASSIFICATION

Privacy preserving data mining depends on the privacy preserving techniques and the data mining algorithms used, and above all on the input sources, which are broadly divided into three categories:

i) Centralized database: a database is centralized when all the data are stored in only one (central) place.

ii) Horizontal databases: databases are horizontal when all the parties involved in data mining have the same attributes but different data records, because the centralized data are horizontally partitioned, that is, divided according to the rows.

iii) Vertical databases: databases are vertical when all the parties involved in data mining have different attributes and hold a part of each data record, because the centralized data are vertically partitioned, that is, divided according to the columns.

In this paper privacy preserving data mining relies on the secure sum protocol and the secure size of set intersection protocol as privacy preserving techniques, on the CART decision tree algorithm, and on a vertically partitioned database.

CART produces an improved decision tree compared with ID3 and C4.5: it creates a binary tree, closely models a more balanced binary tree, and provides highly accurate results. It uses the gini index to select its splitting attributes. The CART algorithm is explained in [4]; here it has been modified for a vertically distributed environment.
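The gini index that CART uses for splitting can be sketched as follows (a minimal stand-alone illustration; the helper names are ours, not the paper's):

```python
def gini_index(counts):
    """Gini impurity 1 - sum(p_j^2) for a list of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    """Weighted gini of a split: sum over values a of |P_a|/|P| * gini(P_a),
    where each partition is the list of per-class counts for one value."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini_index(p) for p in partitions)
```

For example, a node with nine Yes and five No transactions has gini_index([9, 5]) = 90/196, roughly 0.459; the attribute whose split minimizes the weighted gini is chosen.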

A. Privacy Preserving CART Algorithm in a Vertically Partitioned Environment

Consider a setting in which N sites want to collaborate and compute a data mining task using the CART algorithm in a secure way.

In the algorithm, A_empty() checks whether the attribute set A is empty. If the result is 0 then no attributes are left, and at the end of this step a node is constructed with its ID. A constraint set (Con) is created which lets each site keep track of only those attributes and values that satisfy Con and build their path to the class attribute; the values of the remaining attributes are changed to the not-required value (#). Initially every attribute holds the not-required (#) value, so the constraint simply acts as a filter. Constr.S(attribute, value) is a function that sets the attribute to the appropriate value. Attributes that satisfy Con retain their values and the remaining attributes hold the not-required value. A transaction x satisfies Con if and only if it satisfies the following condition:

∀J : (AttrJ(x) = v ⇔ Con(AttrJ) = v) ∨ Con(AttrJ) = '#'        (1)

TransSet(Con) returns the local transactions satisfying Con. DistCount() calculates the class distribution. Two privacy preserving protocols are used: the secure sum protocol checks whether any attributes are left, and the cardinality of set intersection protocol determines the majority class of a node. CheckSameClass() checks whether all the transactions belong to the same class value. D_Gini() calculates the gini index of all the attributes in A. Build_Tree() builds the tree based on NodeId but reveals only the leaf node when classifying a transaction.
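Under our reading of Eq. (1), the constraint filter and TransSet() can be sketched like this (the '#' sentinel follows the paper; the function names and sample data are our own):

```python
NOT_REQUIRED = "#"

def satisfies(transaction, con):
    # A transaction satisfies the constraint set iff, for every attribute
    # in Con, either the constraint is "not required" (#) or the
    # transaction's value agrees with the constrained value (Eq. 1).
    return all(v == NOT_REQUIRED or transaction.get(attr) == v
               for attr, v in con.items())

def trans_set(transactions, con):
    # TransSet(Con): ids of the local transactions satisfying Con.
    return {tid for tid, t in transactions.items() if satisfies(t, con)}
```

A site holding only Outlook would call trans_set with its local column; a fully not-required constraint passes every transaction, matching the claim that the constraint "just acts as a filter".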

Algorithm 1: PPCART()

/*
1. There are N sites participating in the process, holding attribute set A.
2. Each site has a flag F; it is 0 if the site has no remaining attribute and 1 if attributes remain.
3. Each site contains a local constraint set (Con) which keeps the values of those attributes that lead to the class attribute and changes the values of the other attributes to not required (#).
4. The transaction set T is partitioned between sites S1 ... SN.
5. Site SN holds the class attribute with n values C1, C2, ..., Cn.
*/

Begin
Step 1: if A is empty then
   A_empty()
   {
      S1 chooses a random integer R uniformly from 0 ... M - 1
      S1 sends R + F1 mod M to S2
      for J = 2 ... N - 1 do
         site SJ receives R' from SJ-1
         SJ sends R' + FJ mod M to SJ+1
      end for
      site SN receives R' from SN-1
      R' ← R' + FN mod M        (2)
      S1 and SN create secure keyed commutative hash keys E1 and EN
      S1 sends E1(R) to SN
      SN receives E1(R) and sends EN(E1(R)) and EN(R') to S1
      S1 tests E1(EN(R')) = EN(E1(R))
      { R' = R  ⇔  Σ J=1..N FJ = 0  ⇔  0 attributes remain }
   }

   continue at site SN up to the return
   (Cont1, ..., Contn) ← DistCount()
   DistCount()
   {
      for all sites SJ except SN
         at SJ: XJ ← TransSet(ConJ)
         TransSet(Con)
         {
            X ← Ø
            for all transaction ids J ∈ T
               if tJ satisfies Con
                  X ← X ∪ {J}
               end if
            end for
            return X
         }
      end for
      for each class C1, ..., Cn
         at SN: Constr.SN(CR, CJ)   /* to include the class restriction */
         at SN: XN ← TransSet(ConN)
         ContJ ← |X1 ∩ X2 ∩ ... ∩ XN| using the cardinality of set intersection protocol
      end for
      return (Cont1, Cont2, ..., Contn)
   }
   build a leaf node with distribution (Cont1, ..., Contn)
   { class ← arg max J=1...n ContJ }
   return the ID of the constructed node

Step 2: else if ClassNode ← CheckSameClass() (at SN)
   CheckSameClass()
   {
      (Cont1, ..., Contn) ← DistCount()
      if ∃ I s.t. ContI ≠ 0 ∧ ∀ J ≠ I, ContJ = 0
         /* if only one of the counts is non-zero */
         build a leaf node with distribution (Cont1, ..., Contn)
         return the ID of the constructed node
      else
         return false
      end if
   }
   return leaf NodeId ← ClassNode

Step 3: else
   Best_Site ← DiffGini()
   DiffGini()
   {
      for all sites SJ
         BestGiniJ ← -1
         for each attribute AttrJI at site SJ
            Gini ← D_Gini(AttrJI)
            D_Gini(AttrJI)
            {
               P ← DistCount()   /* class distribution at this node */
               GiniIndex(P) = 1 - Σ J=1..m PJ²        (3)
               for each attribute value aJ
                  Constr.S(Attr, aJ)   /* update the local constraint tuple */
                  PaJ ← DistCount()
                  Gini ← GiniIndex(P) - GiniIndex(PaJ) · |PaJ| / |P|        (4)
                  where |P| = Σ J=1..m ContJ
               end for
               Constr.S(Attr, '#')   /* reset the local constraint tuple */
               return Gini
            }
            if Gini > BestGiniJ
               BestGiniJ ← Gini
               BestAttributeJ ← AttrJI
            end if
         end for
      end for
      return max I BestGiniI
   }

   continue execution at Best_Site
   create interior node No with attribute No.Attr ← Best_AttributeBest_Site
   for each attribute value VJ ∈ No.Attr
      Constr.S(No.Attr, VJ)   /* update the local constraint set */
      NodeId ← PPCART()
      No.VJ ← NodeId   /* add the appropriate branch to the interior node */
   end for
   Constr.S(Attr, '#')   /* returning to the parent; transactions should no longer be filtered by Attr */
   store No locally, keyed with Node_ID
   return the Node_ID of the interior node No   /* execution continues at the site owning the parent node */
Step 4: Build_Tree(TId, NodeId)
   {
      if NodeId is a leaf node   /* the starting site and the root node are known */
         return the class or distribution saved in NodeId
      else
         No ← the local node with id NodeId
         Val ← the value of attribute No.Attr for transaction TId
         ChildId ← No.Val
         return ChildId.Site.Build_Tree(TId, ChildId)
      end if
   }
end if
End
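The secure sum ring of Step 1 can be simulated in a single process as follows. This is an illustrative, insecure stand-in: the real protocol passes R' from site to site, and S1 and SN compare commutative hashes of R and R' rather than the values themselves.

```python
import random

def a_empty_check(flags, M=1_000_000):
    """Simulate A_empty()'s secure sum over the sites' 0/1 flags.

    S1 masks a random R; each site SJ adds its flag FJ mod M. In the
    end the test R' == R holds iff the sum of the flags is 0 (mod M),
    i.e. no site has any attribute remaining. With 0/1 flags and far
    fewer sites than M, no wrap-around false positive can occur.
    """
    R = random.randrange(M)
    r_prime = R
    for f in flags:              # each site adds its own flag mod M
        r_prime = (r_prime + f) % M
    return r_prime == R          # True <=> all flags are zero
```

Because every intermediate site only ever sees a uniformly masked partial sum, no site learns another site's individual flag.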

B. Privacy Preservation of Private Data among Sites

The privacy of the private data of all the participating sites is fully preserved. The function A_empty() reveals no data at any site; it only checks whether any attribute is left at a site. Since the cardinality of set intersection protocol is used in DistCount(), that function discloses only the counts for the combination of constraint sets for each class. CheckSameClass() discloses only the class distribution, by discovering whether or not all transactions are of the same class. D_Gini() computes the gini index of all attributes and exposes only the counts for the different attribute values, and DiffGini() reveals only the gini values of the different sites. Build_Tree() exposes only the leaf node when classifying a transaction.
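The intersection counting behind DistCount() can be shown in the clear (the secure protocol of [11] computes the same cardinality without revealing the id sets themselves; the ids below are invented):

```python
def dist_count(local_sets):
    # Each site SJ contributes XJ, the ids of its local transactions
    # that satisfy the current constraint set; the count for one class
    # is |X1 ∩ X2 ∩ ... ∩ XN|. Only this size would be disclosed.
    inter = set(local_sets[0])
    for s in local_sets[1:]:
        inter &= set(s)
    return len(inter)
```

Repeating this once per class value yields the class distribution (Cont1, ..., Contn) used to label leaf nodes.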

IV. EXPERIMENTATION AND EXAMPLE

Consider a sample dataset with five attributes (Outlook, Temperature, Humidity, Windy and Play Tennis) and partition it vertically across three sites, so that site 1 holds the transaction id (TID) and the Outlook attribute, site 2 holds TID, Temperature and Humidity, and site 3 holds TID, Windy and Play Tennis (the class attribute).

TABLE I. VERTICAL DATASET OF SITE 1

TID  Outlook
T1   Sunny
T2   Sunny
T3   Overcast
T4   Rainy
T5   Rainy
T6   Rainy
T7   Overcast
T8   Sunny
T9   Sunny
T10  Rainy
T11  Sunny
T12  Overcast
T13  Overcast
T14  Rainy


TABLE II. VERTICAL DATASET OF SITE 2

TID  Temperature  Humidity
T1   >75    >75
T2   >75    >75
T3   >75    >75
T4   <=75   >75
T5   <=75   >75
T6   <=75   <=75
T7   <=75   <=75
T8   <=75   >75
T9   <=75   <=75
T10  <=75   >75
T11  <=75   <=75
T12  <=75   >75
T13  >75    <=75
T14  <=75   >75

TABLE III. VERTICAL DATASET OF SITE 3

TID  Windy  Play Tennis
T1   No     No
T2   Yes    No
T3   No     Yes
T4   No     Yes
T5   No     Yes
T6   Yes    No
T7   Yes    Yes
T8   No     No
T9   No     Yes
T10  No     Yes
T11  Yes    Yes
T12  Yes    Yes
T13  No     Yes
T14  Yes    No

After applying the PPCART algorithm we obtain the following decision tree.

[Figure: root OUTLOOK (S1 Label1); branch V1: Rainy leads to WINDY (S3 Label2), whose branches V1: No and V2: Yes lead to YES (S3 Label3) and NO (S3 Label4); branch V2: Sunny leads to HUMIDITY (S2 Label5), whose branches V1: >75 and V2: <=75 lead to NO (S3 Label6) and YES (S3 Label7).]

Fig. 1. CART decision tree

After calculating the gini it is clear that the root node is at site 1, that is, Outlook (S1 Label1). If Outlook is Rainy then an attribute from site 3 is considered, that is, Windy (S3 Label2). If the day is not windy the conclusion is Yes (S3 Label3), as the class attribute resides at site 3; if the day is windy, that is, the value is Yes, then the class value is No (S3 Label4). Similarly, if Outlook is Sunny then the Humidity attribute (S2 Label5) from site 2 is selected; if Humidity is >75 then the class value is No (S3 Label6), and if Humidity is <=75 then the class value is Yes (S3 Label7).
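As a check of the walk-through above, the weighted gini of each candidate attribute on the 14 sample transactions can be recomputed in the clear (this recomputation is ours; the paper's protocol would obtain the same counts securely without pooling the columns):

```python
# Play Tennis labels for T1..T14 and each attribute's values,
# transcribed from Tables I-III.
play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
attrs = {
    "Outlook":     ["Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                    "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"],
    "Temperature": [">75",">75",">75","<=75","<=75","<=75","<=75",
                    "<=75","<=75","<=75","<=75","<=75",">75","<=75"],
    "Humidity":    [">75",">75",">75",">75",">75","<=75","<=75",
                    ">75","<=75",">75","<=75",">75","<=75",">75"],
    "Windy":       ["No","Yes","No","No","No","Yes","Yes",
                    "No","No","No","Yes","Yes","No","Yes"],
}

def gini(labels):
    # Gini impurity 1 - sum(p_c^2) of a list of class labels.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels):
    # Weighted gini of splitting on one attribute's values.
    n = len(labels)
    return sum(
        values.count(v) / n * gini([l for v2, l in zip(values, labels) if v2 == v])
        for v in set(values)
    )

scores = {a: split_gini(vals, play) for a, vals in attrs.items()}
best = min(scores, key=scores.get)  # the attribute with the lowest weighted gini
```

Outlook attains the lowest weighted gini (24/70, about 0.343), which is consistent with the tree in Fig. 1 placing the root at site 1 on Outlook.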

V. CONCLUSION AND FUTURE WORK

CART was originally a centralized algorithm, but the use of distributed databases has forced the algorithm to be modified so that it can run over a distributed database. We have implemented the CART algorithm on a vertically partitioned dataset. Privacy preserving techniques have been used to minimize data leakage. The algorithm can securely compute the gini index of all the attributes of all the sites participating in the data mining process. It can be applied in various data mining fields, such as predicting the future use of books in a library, academics, medicine, etc.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, New York, Elsevier, 2009.
[2] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[3] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[4] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.
[5] S. Ceri and G. Pelagatti, Distributed Databases: Principles and Systems, International Edition, Singapore, 1984.
[6] Y. Lindell and B. Pinkas, "Privacy preserving data mining," Journal of Cryptology, vol. 15, pp. 177-206, 2002.
[7] M. Naor and B. Pinkas, "Oblivious transfer and polynomial evaluation," in Proc. 31st ACM Symposium on Theory of Computing, pp. 245-254, May 1999.
[8] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proc. 2000 ACM SIGMOD on Management of Data, pp. 439-450, 2000.
[9] W. Du and Z. Zhan, "Building decision tree classifier on private data," in Proc. IEEE International Conference on Privacy, Security and Data Mining, pp. 1-8, 2002.
[10] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 1026-1037, 2002.
[11] J. Vaidya and C. Clifton, "Secure set intersection cardinality with application to association rule mining," Journal of Computer Security, IOS Press, Amsterdam, vol. 13, pp. 593-622, 2005.
[12] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, 2002.
[13] J. Vaidya, C. Clifton, M. Kantarcioglu and A. Patterson, "Privacy-preserving decision trees over vertically partitioned data," ACM Transactions on Knowledge Discovery from Data (TKDD), New York, USA, vol. 2, pp. 1-27, Oct. 2008.
[14] Y. Shen, H. Shao and L. Yang, "Privacy preserving C4.5 algorithm over vertically distributed datasets," in Proc. 2009 International Conference on Networks Security, Wireless Communications and Trusted Computing, pp. 446-448, May 2009.
[15] Y. Shen, H. Shao and J. Jianzhong, "Research on privacy preserving distributed C4.5 algorithm," in Proc. 2009 Third International Symposium on Intelligent Information Technology Application Workshops, pp. 216-218, Nov. 2009.
[16] A. Gangrade and R. Patel, "Building privacy-preserving C4.5 decision tree classifier on multiparties," International Journal on Computer Science and Engineering, vol. 1, pp. 199-205, 2009.
