A Novel Approach: CART Algorithm for
Vertically Partitioned Database in Multi-Party
Environment
Jayanti Dansana, Debadutta Dey, Raghvendra Kumar
School of Computer Engineering, KIIT University, Bhubaneswar, INDIA
[email protected] [email protected] [email protected]
Abstract—The advancement of technology and the use of
distributed databases have raised increasing concern about
the privacy of private data. Collaboration and teamwork
bring large and valuable results, but organizations are
unwilling to participate in data mining due to the risk of
data leakage. To obtain these results, data must be collected
and used from the different participating parties in a secure
way, that is, without data leakage, together with a decision
tree algorithm that correctly models the data and produces
accurate results. This paper presents how the CART
algorithm can be used by multiple parties in a vertically
partitioned environment. For security, the secure sum and
secure size of set intersection protocols are used.
Keywords- Privacy Preserving; CART; decision tree
I. INTRODUCTION
Let us consider a scenario where more than two
parties are willing to cooperate and run a data
mining algorithm on the union of their databases, as
they are well aware of the benefit such collaboration
brings. However, parties often do not collaborate
because the privacy of their private data is poorly
preserved. We show how the CART algorithm can be
implemented for multiple parties in a vertically
partitioned environment in a secure way.
II. RELATED WORKS
Data mining [1], often called knowledge discovery in
databases (KDD), is the process of extracting useful
data and hidden predictive information from large
databases. Data mining takes relational databases, data
warehouses, transactional databases, advanced database
systems, flat files, data streams and the World Wide
Web as its data repositories, and demands an
integration of techniques from multiple disciplines such
as database and data warehouse technology, statistics,
machine learning, pattern recognition and neural
networks. The knowledge discovered by running a data
mining algorithm can be used for decision making,
process control, information management and query
processing. The data mining step may discover
multiple groups in the data, which can be used to
produce accurate predictions as the outcome of a
decision support system.
Classification rule mining [2] is one of the most
popular data mining tasks. Classification is a data
mining algorithm that assigns data or transactions to
groups or classes. The goal of classification is to
accurately predict the class attribute for each
transaction in the data.
Decision tree classification [2] is one of the most
popular classification techniques for predicting the
class attribute. It is called supervised learning
because the decision tree is first built on a training
dataset and then test attributes are introduced for
classifying the test data. In a decision tree, internal
nodes are called tests, leaf nodes are called class
labels, and the edges show the path between tests
that leads to a class label. Different decision trees use
different attribute selection measures for choosing the
best splitting node. The most popular decision trees,
ID3 [2], C4.5 [3] and CART [4], use the attribute
selection measures information gain, gain ratio and
Gini index respectively.
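The Gini-index splitting measure used by CART can be sketched as follows. This is the plain centralized computation, not the privacy preserving version developed later in the paper; the function names are ours:

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(rows, attr, class_attr):
    """Reduction in Gini impurity obtained by splitting `rows` on `attr`.
    The attribute with the largest gain is chosen as the splitting node."""
    total = [r[class_attr] for r in rows]
    gain = gini_index(total)
    for value in {r[attr] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attr] == value]
        gain -= gini_index(subset) * len(subset) / len(rows)
    return gain
```

A pure node has impurity 0, and a 50/50 binary node has impurity 0.5, so an attribute that separates the classes perfectly attains the maximum gain.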
The advancement of technology has naturally led to
distributed databases. A distributed database [5] is a
database that is partitioned and stored on different
systems which logically belong to the same system.
A distributed database can be achieved by two
methods. The first is replication, where the system
maintains several identical replicas of a database,
synchronizes all the data at regular intervals and,
when multiple users must access the same data,
ensures that updates and deletes performed on the
data at one location are automatically reflected in the
data stored on the other systems. The second method
is fragmentation, where the database is fragmented
into several partitions so that update and delete
operations can be done on a single partition without
updating the other partitions. Fragmentation can be
achieved by horizontal partitioning, where the
database is partitioned row wise; by vertical
partitioning, where the database is partitioned
column wise; and by mixed partitioning, where the
database is first partitioned horizontally and then
vertically, or vice versa.
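The two fragmentation schemes can be illustrated with a small sketch; the helper names and the record layout are our assumptions:

```python
def horizontal_partition(rows, n_sites):
    """Row wise: each site receives complete records for a subset of the rows."""
    return [rows[i::n_sites] for i in range(n_sites)]

def vertical_partition(rows, column_groups, key="TID"):
    """Column wise: each site receives all rows but only its own columns,
    plus the join key (TID) so records can be re-linked across sites."""
    return [[{key: r[key], **{c: r[c] for c in cols}} for r in rows]
            for cols in column_groups]

rows = [{"TID": 1, "Outlook": "Sunny", "Windy": "No"},
        {"TID": 2, "Outlook": "Rainy", "Windy": "Yes"}]
site1, site2 = vertical_partition(rows, [["Outlook"], ["Windy"]])
# site1 holds TID and Outlook only; site2 holds TID and Windy only
```

This is exactly the layout used in the experiment of Section IV, where each site keeps the TID column alongside its own attributes.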
Since classification is a centralized algorithm, all the
data or transactions have to be present in a central
place for mining. Classification algorithms have been
developed for distributed data mining [5] with
efficiency, not security, as the main concern.
Privacy preservation has been very important since
distributed databases came into use. A solution to the
privacy preserving data mining problem was first
proposed in [6] using the oblivious transfer protocol
[7]. This solution, however, only deals with
classification of horizontally partitioned data using
the ID3 algorithm. Another approach to the privacy
preserving data mining problem was proposed in [8].
Here each transaction is perturbed and the
distributions of all the data are reconstructed at an
aggregate level. In this approach, probability-based
data mining algorithms are used instead of
algorithms operating on transactional records.
Research on privacy preservation has since widened.
In [9] a scalar dot product protocol was proposed and
it was demonstrated how privacy is preserved using
the ID3 decision tree. A secure sum protocol [10]
has also been developed that runs efficiently in a
multi-party environment. Secure set intersection
cardinality [11] was proposed for better security. In
[12] a solution to the privacy preserving distributed
association rule mining problem is described. Later, a
solution for a privacy preserving ID3 decision tree in
a vertically partitioned environment was proposed in
[13]. Extensive research on a privacy preserving
C4.5 algorithm over distributed databases was done
in [14], and [15] studied C4.5 classification and
introduced the PPC4.5 algorithm over a vertically
distributed database for two parties. Later, the C4.5
algorithm was extended to multiple parties in a
vertically partitioned environment in [16].
III. PRIVACY PRESERVING CART DECISION
TREE CLASSIFICATION
Privacy preserving data mining depends on privacy
preserving techniques and data mining algorithms,
and above all on the input sources, which are broadly
divided into three categories:
i) Centralized Database: a database is centralized
when all the data are stored in only one place (a
central place).
ii) Horizontal Databases: databases are horizontal
when all the parties involved in data mining have the
same attributes but different data records, because
the centralized data are horizontally partitioned, that
is, divided according to the rows.
iii) Vertical Databases: databases are vertical when
all the parties involved in data mining have different
attributes and only a part of each data record,
because the centralized data are vertically
partitioned, that is, divided according to the columns.
In this paper, privacy preserving data mining relies
on the secure sum and secure size of set intersection
protocols as privacy preserving techniques, on the
CART decision tree algorithm, and on a vertically
partitioned database.
CART is an improved decision tree compared with
ID3 and C4.5. It creates a binary tree, closely models
a balanced binary tree, and provides highly accurate
results. It uses the Gini index to select its splitting
attributes. The CART algorithm is explained in [4];
here it is modified for a vertically distributed
environment.
A. Privacy Preserving CART Algorithm in Vertically
Partitioned Environment
Let us consider a setting where N sites want to
collaborate and compute a data mining task using the
CART algorithm in a secure way.
In the algorithm, A_empty() checks whether the
attribute set A is empty. If the result is 0 then no
attributes are left, and at the end of this step a node is
constructed with its ID. A constraint set (Con) is
created which helps the sites keep track of only those
attributes and values that satisfy Con and build the
path to the class attribute; the values of the remaining
attributes are changed to the not-required (#) value.
Initially all attributes contain the not-required (#)
value. The constraint simply acts as a filter.
Constr.S(attribute, value) is a function which sets the
attribute to the appropriate value. Attributes that
satisfy Con retain their values and the remaining
attributes have the not-required value. An attribute
satisfies Con if and only if the following condition
holds:
∀J (AttrJ(x) = v ⇔ Con(AttrJ) = v) ∨ Con(AttrJ) = '#'   (1)
TransSet(Con) returns the local transactions
satisfying Con. DistCount() calculates the class
distribution. Two privacy preserving protocols are
used: the secure sum protocol checks whether any
attributes are left, and the cardinality of set
intersection protocol determines the majority class of
a node. CheckSameClass() checks whether all the
transactions belong to the same class value. D_Gini()
calculates the Gini index of all the attributes in A.
Built_Tree() builds the tree based on NodeId but
reveals only the leaf node when classifying a
transaction.
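The local filtering done by TransSet(Con) can be sketched as follows, where '#' marks a not-required attribute. This is a minimal single-site sketch; the dictionary data layout and the function name are our assumptions:

```python
NOT_REQUIRED = "#"

def trans_set(transactions, con):
    """Return the ids of local transactions matching every constrained
    attribute; attributes set to '#' in `con` are ignored, so the
    constraint acts purely as a filter."""
    ids = set()
    for tid, record in transactions.items():
        if all(v == NOT_REQUIRED or record.get(a) == v for a, v in con.items()):
            ids.add(tid)
    return ids
```

Each site runs this on its own partition; only the resulting id sets enter the cardinality of set intersection protocol, never the attribute values themselves.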
Algorithm 1: PPCART ()
/*
1. There are N sites participating in the process,
holding the attribute set A between them.
2. Each site has a flag F. It is 0 if there is no
remaining attribute and 1 if there are attributes
remaining.
3. Each site contains a local constraint set (Con)
which keeps the values of those attributes that lead
to the class attribute and changes the values of other
attributes to not required (#).
4. The transaction set T is partitioned between sites
S1....SN.
5. Site SN holds the class attribute with values C1,
C2, ......, Cn.
*/
Begin
Step 1: if A is empty then
A_empty()
{
S1 chooses a random integer R uniformly
from 0 .... M - 1.
S1 sends (R + F1) mod M to S2
for J = 2 .... N - 1 do
site SJ receives R' from SJ-1.
SJ sends (R' + FJ) mod M to SJ+1
end for
site SN receives R' from SN-1.
R' ← (R' + FN) mod M   (2)
S1 and SN create secret keyed commutative
hash keys E1 and EN
S1 sends E1(R) to SN
SN receives E1(R) and sends EN(E1(R)) and
EN(R') to S1
S1 tests E1(EN(R')) = EN(E1(R))
/* R' = R ⇔ Σ(J=1..N) FJ = 0 ⇔ 0 attributes
remain */
}
continue at site SN up to the return
(Cont1 .... Contn) ← DistCount()
DistCount()
{
for all sites SJ except SN
at SJ : XJ ← TransSet(ConJ)
TransSet(Con)
{
X = Ø
for all transaction ids J ∈ T
if tJ satisfies Con
X ← X ∪ {J}
end if
end for
return X
}
end for
for each class C1, ....., Cn
at SN: Constr.SN(CR, CJ) /* to include the
class restriction */
at SN: XN ← TransSet(ConN)
ContJ ← |X1 ∩ X2 ∩ ...... ∩ XN| using the
cardinality of set intersection protocol.
end for
return (Cont1, Cont2, ......., Contn)
}
build a leaf node with distribution (Cont1, ....,
Contn)
{class ← arg max J=1..n ContJ}
return ID of the constructed node
Step 2: else if ClassNode (at SN) ← CheckSameClass()
CheckSameClass()
{
(Cont1 .... Contn) ← DistCount()
if ∃ I s.t. ContI != 0 ∧ ∀ J != I, ContJ = 0
/* if only one of the counts is non-zero */
build a leaf node with distribution
(Cont1 .... Contn)
return ID of the constructed node
else
return false
end if
}
return leaf NodeId ← ClassNode
Step 3: else
Best_Site ← DiffGini()
DiffGini()
{
for all sites SJ
BestGiniJ ← -1
for each attribute AttrJI at site SJ
Gini ← D_Gini(AttrJI)
D_Gini(AttrJI)
{
P ← DistCount() /* class distribution of the
transactions at this node */
GiniIndex(P) = 1 − Σ(J=1..m) PJ²   (3)
Gini ← GiniIndex(P)
for each attribute value aJ
Constr.S(Attr, aJ) /* update local
constraints tuple */
PaJ ← DistCount()
Gini ← Gini − GiniIndex(PaJ) · |PaJ| / |P|   (4)
where |P| = Σ(J=1..m) ContJ
end for
Constr.S(Attr, '#') /* update local tuple */
return Gini
}
if Gini > BestGiniJ
BestGiniJ ← Gini
BestAttributeJ ← AttrJI
end if
end for
end for
return maxI, BestGiniI
}
continue execution at Best_Site
create interior node No with attribute
No.Attr ← Best_AttrBest_Site
for each attribute value VJ ∈ No.Attr
Constr.S(No.Attr, VJ) /* updates local
constraints set */
NodeId ← PPCART()
No.AttrJ ← NodeId /* add the appropriate
branch to the interior node */
end for
Constr.S(Attr, '#') /* returning to parent;
should no longer filter transactions
with Attr */
store No locally keyed with Node_ID
return Node_ID of interior node No
/* execution continues at the site
owning the parent node */
Step 4: Build_Tree(TId, NodeId)
{
if NodeId is a leaf node /* starting site and
root node are known */
return the class or distribution saved in
NodeId
else
No ← local node with Id NodeId
Val ← the value of attribute No.Attr for
transaction TId
ChildId ← No.Val
return ChildId.Site.Build_Tree(TId, ChildId)
end if
}
End
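The secure sum round inside A_empty() can be sketched in isolation as follows. This is a plain single-machine simulation; in the protocol the final comparison of R' with R is done through the commutative hashes E1 and EN so that R itself is never revealed. The function name is ours:

```python
import random

def secure_sum(flags, M=1 << 32):
    """Simulate one ring round of the secure sum protocol: S1 masks its
    flag with a secret random R, every later site adds its own flag
    mod M, and SN ends with R' = R + sum(flags) mod M. Whether R'
    equals R leaks only whether the total (here, the number of sites
    with attributes remaining) is zero."""
    R = random.randrange(M)          # chosen by S1 and kept secret
    running = (R + flags[0]) % M     # S1 sends this to S2
    for f in flags[1:]:              # S2 .. SN each add their own flag
        running = (running + f) % M
    return running, R

r_prime, r = secure_sum([0, 0, 0, 0])
# r_prime == r here, so no site has attributes remaining
```

No intermediate site learns another site's flag, since each value it sees is uniformly distributed thanks to the random mask R.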
B. Privacy Preservation of Private Data among Sites
The private data of all the participating sites are
fully secured. The function A_empty() reveals no
data at any site; it only checks whether any attribute
is left. Since the cardinality of set intersection
protocol is used in the DistCount() function, it
discloses only the count for the combination of
constraint sets for each class. CheckSameClass()
discloses only the class distribution, by revealing
whether all transactions are of the same class.
D_Gini() computes the Gini index of all attributes
and exposes only the counts for the different
attributes, and DiffGini() reveals only the Gini
values of the different sites. The function
Build_Tree() exposes only the leaf node when
classifying the transactions.
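The cardinality of set intersection step relies on commutative encryption: equal transaction ids end up as equal cipher-texts once every site's key has been applied, so the sites can count matches without seeing raw ids. A toy sketch using modular exponentiation as the commutative function; the tiny public prime and the key values are for illustration only and are not secure parameters:

```python
# Toy commutative "encryption": x -> x^k mod P. Exponentiation with a
# shared modulus commutes across keys: (x^k1)^k2 = (x^k2)^k1 (mod P).
P = 2_147_483_647  # Mersenne prime 2^31 - 1

def enc(x, key):
    return pow(x, key, P)

def intersection_cardinality(sets, keys):
    """Each site's id set is encrypted under every site's key (in the
    real protocol each site applies its key to the others' already
    encrypted sets, which is where commutativity matters). Equal ids
    map to equal final values, so the sites can compute
    |X1 ∩ X2 ∩ ... ∩ XN| without learning the raw ids."""
    encrypted = []
    for s in sets:
        values = set(s)
        for k in keys:              # apply every site's key in turn
            values = {enc(v, k) for v in values}
        encrypted.append(values)
    return len(set.intersection(*encrypted))
```

The keys must be chosen coprime to P - 1 so that each encryption is a bijection and distinct ids cannot collide.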
IV. EXPERIMENTATION AND EXAMPLE
Let us take a sample dataset having five attributes
(Outlook, Temperature, Humidity, Windy and play
Tennis) and partition it vertically into 3 sites such
that site 1 will have two attributes, that is, transaction
(TID) and Outlook attribute, Site 2 will have TID,
temperature and humidity attributes and site 3 will
have TID, windy and Play Tennis (class) attributes.
TABLE I. VERTICAL DATASET OF SITE 1
TID Outlook
T1 Sunny
T2 Sunny
T3 Overcast
T4 Rainy
T5 Rainy
T6 Rainy
T7 Overcast
T8 Sunny
T9 Sunny
T10 Rainy
T11 Sunny
T12 Overcast
T13 Overcast
T14 Rainy
TABLE II. VERTICAL DATASET OF SITE 2
TID Temperature Humidity
T1 >75 >75
T2 >75 >75
T3 >75 >75
T4 <= 75 >75
T5 <= 75 >75
T6 <= 75 <= 75
T7 <= 75 <= 75
T8 <= 75 >75
T9 <= 75 <= 75
T10 <= 75 >75
T11 <= 75 <= 75
T12 <= 75 >75
T13 >75 <= 75
T14 <= 75 >75
TABLE III. VERTICAL DATASET OF SITE 3
TID Windy Play Tennis
T1 No No
T2 Yes No
T3 No Yes
T4 No Yes
T5 No Yes
T6 Yes No
T7 Yes Yes
T8 No No
T9 No Yes
T10 No Yes
T11 Yes Yes
T12 Yes Yes
T13 No Yes
T14 Yes No
After applying the PPCART algorithm we obtain the
following decision tree.
[Fig. 1 shows the CART decision tree: root Outlook
(S1 Label1); branch V1: Rainy → Windy (S3 Label2),
with V1: No → YES (S3 Label3) and V2: Yes → NO
(S3 Label4); branch V2: Sunny → Humidity (S2
Label5), with V1: >75 → NO (S3 Label6) and
V2: <=75 → YES (S3 Label7).]
Fig.1. CART decision tree
After the calculation of the Gini index it is clear that
the root node is in site 1, that is, Outlook (S1 Label1).
If Outlook is Rainy then an attribute from site 3 is
considered, that is, Windy (S3 Label2). If the day is
not windy then the class value is Yes (S3 Label3), as
the class attribute is present in site 3, and if the day is
windy, that is the value is Yes, then the class value is
No (S3 Label4). Similarly, if Outlook is Sunny then
the Humidity attribute from site 2 is selected (S2
Label5); if Humidity is >75 then the class value is No
(S3 Label6), and if Humidity is <=75 then the class
value is Yes (S3 Label7).
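The tree of Fig. 1 can be written out as a small classifier and checked against Tables I–III. The nested-dict encoding is ours, and it mirrors only the Rainy and Sunny branches shown in the figure (the Overcast branch is not drawn there):

```python
# Fig. 1 as nested dicts: an internal node is (attribute, branches),
# a leaf is a class value.
TREE = ("Outlook", {
    "Rainy": ("Windy", {"No": "Yes", "Yes": "No"}),
    "Sunny": ("Humidity", {">75": "No", "<=75": "Yes"}),
})

def classify(tree, record):
    """Walk from the root, following the branch for the record's value,
    until a leaf (class value) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[record[attr]]
    return tree

# T1 from the tables: Sunny, Humidity >75, not windy -> Play Tennis = No
print(classify(TREE, {"Outlook": "Sunny", "Humidity": ">75", "Windy": "No"}))
```

This reproduces the class values of the Rainy and Sunny transactions in Table III, e.g. T4 (Rainy, not windy) classifies as Yes.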
V. CONCLUSION AND FUTURE WORK
The CART algorithm was a centralized algorithm,
but the use of distributed databases has forced
modifications so that the algorithm can run over a
distributed database. We have implemented the
CART algorithm on a vertically partitioned dataset.
Privacy preserving techniques have been used to
minimize data leakage. The algorithm can securely
compute the Gini index of all the attributes of all the
sites participating in the data mining process. It can
be applied in various data mining application fields,
such as predicting the future use of books in a
library, academics, medicine, etc.
REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts
and Techniques, 2nd ed., Morgan Kaufmann,
New York, Elsevier, 2009.
[2] J. R. Quinlan, “Induction of decision trees,”
Machine Learning, vol. 1, pp. 81–106, 1986.
[3] J.R. Quinlan, C4.5: Programs for Machine
Learning, Morgan Kaufmann, San Mateo, 1993.
[4] L. Breiman, J. Friedman, R. Olshen and C. Stone,
Classification and Regression Trees, Wadsworth
International Group, 1984.
[5] S. Ceri and G. Pelagatti, Distributed Databases:
Principles and Systems, International Edition,
Singapore, 1984.
[6] Y. Lindell and B. Pinkas, “Privacy preserving
data mining”, Journal of Cryptology, Vol.15,
pp. 177-206, 2002
[7] M. Naor and B. Pinkas, "Oblivious transfer and
polynomial evaluation", in proc. 31st ACM
Symposium on theory of computing, pp 245-
254, May 1999.
[8] R. Agrawal, and R. Srikant, “Privacy-preserving
data mining,” in proc. 2000 ACM SIGMOD on
Management of Data, pp. 439–450, 2000.
[9] W. Du and Z. Zhan, “Building Decision Tree
Classifier on Private Data,” in proc: IEEE
International Conference on Privacy, Security
and Data Mining, pp. 1-8, 2002.
[10] M. Kantarcioglu and C. Clifton, “Privacy-
preserving distributed mining of association
rules on horizontally partitioned data”, IEEE
Transactions on Knowledge and Data
Engineering, NJ, USA, vol. 16, pp. 1026-1037,
2002.
[11] J. Vaidya and C. Clifton, “Secure set intersection
cardinality with application to association rule
mining”, Journal of computer security, IOS
Press, Amsterdam, vol. 13, pp. 593-622,
2005.
[12] J. Vaidya, and C. Clifton, “Privacy preserving
association rule mining in vertically partitioned
data,” in proc. ACM SIGKDD International
Conference on Knowledge Discovery and Data
Mining, pp 639-644, 2002.
[13] J. Vaidya, C. Clifton, M. Kantarcioglu, A.
Patterson, "Privacy-Preserving Decision Trees
over Vertically Partitioned Data," ACM
Transactions on Knowledge Discovery from
Data (TKDD), New York, USA, Vol. 2, pp. 1-
27, Oct. 2008.
[14] Y. Shen, H. Shao, L. Yang, “Privacy Preserving
C4.5 Algorithm over Vertically Distributed
Datasets,” in proc. 2009 International
Conference on Networks Security, Wireless
Communications and Trusted Computing, pp.
446-448, May 2009.
[15] Y. Shen, H. Shao, J. Jianzhong, “Research on
Privacy Preserving Distributed C4.5
Algorithm”, in proc. 2009 Third International
Symposium on Intelligent Information
Technology Application Workshops, pp 216-
218, Nov. 2009.
[16] A. Gangrade, R. Patel, “Building Privacy-
Preserving C4.5 Decision Tree Classifier on
Multiparties”, International Journal on
Computer Science and Engineering, Vol.1, pp.
199-205, 2009.