Privacy Preserving K-means Clustering on Vertically Partitioned Data
Presented by: Jaideep Vaidya
Joint work with: Prof. Chris Clifton
Overview
• Global Problem
  – Privacy Preserving Distributed Data Mining
• Specific Problem
  – Clustering (K-Means)
• For
  – Vertically Partitioned Data
• Using
  – Cryptographic Tools
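Vertical Partitioning of Data

Medical Records
TID   Brain Tumor?   Diabetes?
RPJ   Yes            Diabetic
CAC   No Tumor       No
PTR   No Tumor       Diabetic

Cell Phone Data
TID   Model   Battery
RPJ   5210    Li/Ion
CAC   none    none
PTR   3650    NiCd

Global Database View
TID   Brain Tumor?   Diabetes?   Model   Battery
RPJ   Yes            Diabetic    5210    Li/Ion
CAC   No Tumor       No          none    none
PTR   No Tumor       Diabetic    3650    NiCd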
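As a small illustration of vertical partitioning, the sketch below stores the two sites' tables from the slide as Python dictionaries keyed by TID and joins them into the global view. The names `medical_site`, `phone_site` and `global_view` are hypothetical, and in the privacy-preserving setting this join is exactly what must never be materialized at any single site.

```python
# Each site holds different attributes for the same entities, keyed by TID.
medical_site = {
    "RPJ": {"brain_tumor": "Yes", "diabetes": "Diabetic"},
    "CAC": {"brain_tumor": "No Tumor", "diabetes": "No"},
    "PTR": {"brain_tumor": "No Tumor", "diabetes": "Diabetic"},
}
phone_site = {
    "RPJ": {"model": "5210", "battery": "Li/Ion"},
    "CAC": {"model": "none", "battery": "none"},
    "PTR": {"model": "3650", "battery": "NiCd"},
}


def global_view(*sites):
    """Join the per-site tables on TID (illustration only)."""
    tids = sorted(set.intersection(*(set(s) for s in sites)))
    return {tid: {k: v for s in sites for k, v in s[tid].items()} for tid in tids}


view = global_view(medical_site, phone_site)
print(view["RPJ"])
```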
Privacy Preserving Data Mining
• Perturbation
  – Agrawal & Srikant, Agrawal & Aggarwal
  – Rizvi & Haritsa, Evfimievski et al.
• Cryptographic
  – Lindell & Pinkas, Du & Zhan
  – Vaidya & Clifton, Kantarcioglu & Clifton
Secure Multiparty Computation (SMC)
• Given a function f and n inputs $x_1, x_2, \ldots, x_n$, distributed at n sites, compute the result $y = f(x_1, x_2, \ldots, x_n)$ while revealing nothing to any site except its own input(s) and the result.
Results
• Cluster assignment for entities
  – Not private
• Cluster centers
  – Semi-private
[Figure: example attribute values 2.3, 34, 19, 15.5, 5210, Li/Ion, Piezo]
Secure K-means clustering
Arbitrarily select k starting points $\mu'_1, \mu'_2, \ldots, \mu'_k$
Repeat
  – Assign $\mu'_1, \mu'_2, \ldots, \mu'_k$ to $\mu_1, \mu_2, \ldots, \mu_k$ respectively
  – (Re)assign each object to the closest cluster, based on distance from the mean
  – Re-compute the cluster means $\mu'_1, \mu'_2, \ldots, \mu'_k$
Until no change
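For reference, the loop above can be sketched as ordinary, non-secure k-means in Python. This is a minimal sketch rather than the secure protocol: it assumes all data sits at one site, uses squared Euclidean distance (an assumption, since the slide does not fix the metric), and the function name `kmeans` and the `seed` parameter are illustrative.

```python
import random


def kmeans(points, k, seed=0):
    """Plain (non-secure) k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    new_means = rng.sample(points, k)  # arbitrarily select k starting points
    while True:
        means = new_means  # the newly computed means become the current means
        # (Re)assign each object to the closest cluster mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[dists.index(min(dists))].append(p)
        # Re-compute the cluster means; an empty cluster keeps its old mean.
        new_means = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else m
            for c, m in zip(clusters, means)
        ]
        if new_means == means:  # until no change
            return means, clusters


means, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(means))
```

The secure variant keeps this control flow and replaces only the assignment step and the stopping test with protocols that work on distributed data.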
K-means clustering
Assigning objects to closest cluster
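For every object/entity O, with distance components held by parties P1, P2, ..., Pr:
Compute $\arg\min_{i = 1, \ldots, k} \sum_{j = 1}^{r} x_{ij}$
where $x_{ij}$ is party Pj's component of the distance from O to cluster i.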
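To make the decomposition concrete, the sketch below computes each party's local distance components and then the argmin of their sums. It assumes squared Euclidean distance, so that per-party components simply add up, and it performs the sum in the clear, which the secure protocol is designed to avoid. All data values and the helper name `local_components` are hypothetical.

```python
def local_components(obj_part, mean_parts):
    """One party's components: squared distance from its slice of the object
    to its slice of each of the k cluster means."""
    return [sum((a - b) ** 2 for a, b in zip(obj_part, m)) for m in mean_parts]


# Hypothetical data: r = 3 parties, k = 2 clusters, one object O.
object_parts = [(1.0, 2.0), (3.0,), (0.0, 5.0)]
mean_parts = [
    [(1.0, 1.0), (4.0, 4.0)],  # party 1's slices of the two cluster means
    [(2.0,), (9.0,)],          # party 2's slices
    [(0.0, 4.0), (6.0, 6.0)],  # party 3's slices
]

# x[j][i] is party j's distance component for cluster i.
x = [local_components(o, m) for o, m in zip(object_parts, mean_parts)]
totals = [sum(x[j][i] for j in range(len(x))) for i in range(2)]
closest = min(range(2), key=totals.__getitem__)
print(totals, closest)
```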
Key Idea
• Disguise site components with random values
• Compare distances while revealing only comparison result
• Permute order of clusters to conceal meaning of comparison results
Closest Cluster Computation
• 3 special sites: P1, P2 and Pr
• P1 generates
  – r random vectors $V_i$ such that $\sum_{i=1}^{r} V_i = 0$
  – Permutation π (over 1 .. k)
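One way P1 could generate such values is sketched below: draw r - 1 vectors at random and set the last one so that every coordinate sums to zero. Working modulo a fixed `modulus` is an assumption made here so the masks are uniformly distributed; the function name `zero_sum_vectors` is illustrative.

```python
import random


def zero_sum_vectors(r, k, modulus, rng):
    """r random length-k vectors whose element-wise sum is 0 modulo `modulus`."""
    vectors = [[rng.randrange(modulus) for _ in range(k)] for _ in range(r - 1)]
    last = [(-sum(col)) % modulus for col in zip(*vectors)]
    return vectors + [last]


rng = random.Random(42)
r, k, modulus = 4, 3, 2**32
V = zero_sum_vectors(r, k, modulus, rng)

pi = list(range(k))
rng.shuffle(pi)  # secret permutation over the k cluster indices

# Every coordinate of the masks cancels when all r vectors are added.
column_sums = [sum(col) % modulus for col in zip(*V)]
print(column_sums)
```

Because the masks cancel, adding $V_i$ to each party's distance vector hides the individual components while leaving the summed distances unchanged.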
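Permutation Protocol (Du and Atallah ’01)
• A holds X and the encryption key E; B holds π and V
• A sends E(X) and E to B
• B returns π(E(X + V)) to A
• A decrypts to obtain π(X + V)
• Homomorphic encryption: $E_k(x) \cdot E_k(y) = E_k(x + y)$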
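The exchange can be simulated with any additively homomorphic scheme. The sketch below uses a toy Paillier cryptosystem as one such scheme; the slide does not name a specific scheme, so this choice is an assumption. The primes are far too small to be secure, the values of X, V and π are made up for readability, and the code needs Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random


def keygen(p=10007, q=10009):
    """Toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n
    return n, (lam, mu)


def encrypt(n, m, rng):
    """E(m) = (1 + m*n) * s^n mod n^2 for a random s coprime to n."""
    while True:
        s = rng.randrange(1, n)
        if math.gcd(s, n) == 1:
            break
    return (1 + m * n) * pow(s, n, n * n) % (n * n)


def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def permute(pi, vec):
    """Place vec[i] at position pi[i]."""
    out = [None] * len(vec)
    for i, target in enumerate(pi):
        out[target] = vec[i]
    return out


rng = random.Random(7)
n, priv = keygen()

# A's side: encrypt X and send the ciphertexts plus the public key n to B.
X = [12, 40, 7]
enc_X = [encrypt(n, x, rng) for x in X]

# B's side: add V under encryption (ciphertext product), then permute.
V = [1000, 2000, 3000]
pi = [2, 0, 1]
enc_sum = [cx * encrypt(n, v, rng) % (n * n) for cx, v in zip(enc_X, V)]
reply = permute(pi, enc_sum)

# A's side: decrypt the reply to obtain the permuted, masked vector.
result = [decrypt(n, priv, c) for c in reply]
print(result)
```

In an actual run V would be drawn at random over the full plaintext range, so the decrypted vector would tell A nothing about π or V.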
Closest Cluster Computation
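• Stage 1
  – P1 holds π and the vectors $V_i$, and runs the permutation protocol with each other party Pi (shown for P2 and Pr)
  – Pi sends $E_i(X_i)$ and $E_i$ to P1
  – P1 returns $\pi(E_i(X_i + V_i))$, which Pi decrypts to $\pi(X_i + V_i)$
• Stage 2
  – P1, P3, ..., Pr-1 send $\pi(X_1 + V_1)$, $\pi(X_3 + V_3)$, ..., $\pi(X_{r-1} + V_{r-1})$ to Pr
  – Pr computes $\sum_{i \neq 2} \pi(X_i + V_i)$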
Closest Cluster Computation
• Stage 3
  – P2 and Pr determine i, the index of the cluster with minimum distance
• Stage 4
  – P1 computes and broadcasts $\pi^{-1}(i)$
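The four stages can be traced end to end with a plain simulation. The sketch below is an assumption-laden illustration rather than the secure protocol: it skips the encryption in Stage 1 and performs the Stage 3 comparison in the clear, whereas the real protocol reveals only the comparison result. Integer distance components, arithmetic modulo `modulus`, and the function name `closest_cluster` are all choices made for this example.

```python
import random


def closest_cluster(X, rng, modulus=2**32):
    """X[j][i] is party j's distance component to cluster i. Party 0 plays P1,
    party 1 plays P2, the last party plays Pr. Returns the closest cluster index."""
    r, k = len(X), len(X[0])
    # P1: zero-sum masks V and a secret permutation pi.
    V = [[rng.randrange(modulus) for _ in range(k)] for _ in range(r - 1)]
    V.append([(-sum(col)) % modulus for col in zip(*V)])
    pi = list(range(k))
    rng.shuffle(pi)
    # Stage 1: every party ends up with its masked vector in permuted order.
    T = []
    for Xj, Vj in zip(X, V):
        masked = [None] * k
        for i in range(k):
            masked[pi[i]] = (Xj[i] + Vj[i]) % modulus
        T.append(masked)
    # Stage 2: Pr sums all vectors except P2's.
    Y = [sum(T[j][i] for j in range(r) if j != 1) % modulus for i in range(k)]
    # Stage 3: P2 and Pr find the position of the minimum of T[1] + Y.
    totals = [(a + b) % modulus for a, b in zip(T[1], Y)]
    permuted_index = min(range(k), key=totals.__getitem__)
    # Stage 4: P1 inverts the permutation and broadcasts the real index.
    return pi.index(permuted_index)


# Hypothetical components for r = 3 parties and k = 3 clusters.
X = [[5, 1, 9], [4, 2, 8], [6, 3, 7]]
print(closest_cluster(X, random.Random(0)))
```

The summed distances here are 15, 6 and 24, so the cluster at index 1 is closest regardless of which masks and permutation are drawn.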
When to stop?
• Locally compute difference in means
• Globally known threshold
• Use simple random-adding technique to disguise actual values
  – First party adds a random value to its distance and sends it to the next party
  – Each party adds its value to the total and sends it on
  – Last party compares the total with the first party’s random value + threshold
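The random-adding check can be sketched as a ring-style sum. This simulation makes several assumptions: the local differences are non-negative integers (for example, scaled fixed-point values), arithmetic is done modulo `modulus`, the test uses `<=`, and the final comparison is carried out in the clear by removing the random value, whereas in the protocol the last party and the first party would compare their values without revealing them. The name `should_stop` is illustrative.

```python
import random


def should_stop(local_diffs, threshold, rng, modulus=2**32):
    """Simulate the disguised sum of per-party mean differences and compare
    it against a globally known threshold."""
    R = rng.randrange(modulus)                # first party's secret random value
    running = (R + local_diffs[0]) % modulus  # first party disguises its value
    for d in local_diffs[1:]:                 # each party adds its value, passes on
        running = (running + d) % modulus
    # The last party holds `running`; comparing it with R + threshold is
    # equivalent to comparing the true total with the threshold.
    total = (running - R) % modulus
    return total <= threshold


print(should_stop([1, 2, 3], threshold=10, rng=random.Random(0)))
print(should_stop([5, 6, 7], threshold=10, rng=random.Random(0)))
```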
Communication Cost
• r parties, n data elements, m-bit distances
Method                Bits         Rounds
Basic Algorithm       O(knr)       O(r+k)
Optimized Algorithm   O(kmr)       O(r)
Generic Method        O(kmnr^3)    1
Non-Secure Method     O(n)         1