Randomization based Privacy Preserving Data Mining
Xintao Wu
University of North Carolina at Charlotte
August 30, 2012
2
Scope
3
Outline
Part I: Randomization for Numerical Data
• Additive noise
• Projection
• Modeling based

Part II: Randomization for Categorical Data
• Randomized Response
• Application to Market Basket Data Analysis
4
Additive Noise Randomization Example
     Bal   Income  ...  IntP
1    10k   85k     ...  2k
2    15k   70k     ...  18k
3    50k   120k    ...  35k
4    45k   23k     ...  134k
.    .     .       ...  .
N    80k   110k    ...  15k

Y = X + E (Perturbed = Original + Noise):

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y =
   17.334   19.199   59.199   51.208   89.048
   88.759   77.537  128.447   30.313  115.692
    2.099   25.939   38.678  135.939   21.318
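This step can be sketched in NumPy; the uniform noise range is an illustrative assumption (the slides do not fix a particular noise distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data X: one row per attribute (Bal, Income, IntP),
# one column per record, as in the slide's example.
X = np.array([[10, 15, 50, 45, 80],
              [85, 70, 120, 23, 110],
              [2, 18, 35, 134, 15]], dtype=float)

# Additive noise E drawn i.i.d. from a known distribution
# (uniform on [0, 10] here, purely illustrative).
E = rng.uniform(0, 10, size=X.shape)

# Published (perturbed) data
Y = X + E
```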
5
Additive Randomization (Z=X+Y)
[Figure: each record (e.g., 30 | 70K) passes through a randomizer before release; a random number is added to Age, so Alice's age 30 becomes 65 (30 + 35). From the randomized records (e.g., 65 | 20K, 25 | 60K), the server reconstructs the distribution of Age and the distribution of Salary and feeds them to a classification algorithm to build a model.]
• R. Agrawal and R. Srikant, SIGMOD '00
6
Reconstruction Problem
• Original values x_1, x_2, ..., x_n drawn from an unknown probability distribution X
• To hide these values, we use y_1, y_2, ..., y_n drawn from a known probability distribution Y
• Given x_1 + y_1, x_2 + y_2, ..., x_n + y_n and the probability distribution of Y, estimate the probability distribution of X.
7
Distribution Reconstruction Alg.
• Bootstrapping Algorithm:

  f_X^0 := uniform distribution
  j := 0
  repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} f_Y((x_i + y_i) − a) f_X^j(a) / ∫ f_Y((x_i + y_i) − z) f_X^j(z) dz
    j := j + 1
  until (stopping criterion met)

• Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS '01)
• Extension to the multivariate case (Domingo-Ferrer et al., PSD '04)
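The bootstrapping algorithm can be sketched on a discretized grid; the Gaussian toy data, grid range, and iteration count below are all illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: true X ~ N(30, 5), known noise Y ~ N(0, 10).
n = 2000
x = rng.normal(30, 5, n)
w = x + rng.normal(0, 10, n)        # published values w_i = x_i + y_i

mids = np.linspace(-20, 80, 201)    # discretization grid for f_X

def f_Y(v):
    # Known noise density N(0, 10)
    return np.exp(-v ** 2 / 200.0) / (10 * np.sqrt(2 * np.pi))

# f_Y(w_i - a) for every observation i (rows) and grid point a (columns);
# this does not change across iterations, so compute it once.
fy = f_Y(w[:, None] - mids[None, :])

# Bootstrapping: start from the uniform distribution and iterate the update.
fx = np.full(mids.size, 1.0 / mids.size)
for _ in range(200):
    denom = fy @ fx                             # ~ integral of f_Y(w_i - z) f_X(z) dz
    fx = fx * (fy / denom[:, None]).mean(axis=0)
    fx /= fx.sum()                              # keep it a distribution

est_mean = float(fx @ mids)                     # should be close to 30
est_var = float(fx @ (mids - est_mean) ** 2)    # much smaller than var(w)
```

The reconstructed distribution concentrates around the true X even though each individual w_i is heavily perturbed.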
[Figure: histogram of Number of People vs. Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions.]
9
Individual Value Reconstruction (Additive Noise)
• Methods:
  - Spectral Filtering, Kargupta et al., ICDM '03
  - PCA, Huang, Du, and Chen, SIGMOD '05
  - SVD, Guo, Wu and Li, PKDD '06
• All aim to remove the noise by projecting the data onto a lower-dimensional space.
10
Individual Reconstruction Algorithm
Perturbed = Original + Noise:  U_p = U + V

• Apply EVD: using some published information about V, extract the first k components of the covariance matrix of U_p as the principal components. λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ λ_e, and e_1, e_2, ..., e_k are the corresponding eigenvectors. Q_k = [e_1 e_2 ··· e_k] forms an orthonormal basis of a subspace X.
• Find the orthogonal projection onto X:  P = Q_k Q_kᵀ
• Get the estimated data set:  Û = P U_p
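A minimal sketch of this projection-based reconstruction, assuming a toy rank-1 data set and k = 1 (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Highly correlated original data U (attributes x records) plus i.i.d. noise V.
n = 1000
t = rng.uniform(0, 10, n)
U = np.vstack([t, 2 * t + 1, -t + 5])           # rank-1 structure plus offsets
V = rng.normal(0, 0.5, U.shape)
Up = U + V                                      # perturbed data

# EVD of the covariance matrix of the (centered) perturbed data.
mean = Up.mean(axis=1, keepdims=True)
Uc = Up - mean
cov = Uc @ Uc.T / n
vals, vecs = np.linalg.eigh(cov)                # eigenvalues in ascending order

# Keep the k leading principal components (k = 1, since U is rank 1).
Qk = vecs[:, -1:]                               # top eigenvector
P = Qk @ Qk.T                                   # orthogonal projector onto X
Uhat = P @ Uc + mean                            # estimate of the original data
```

Because the signal is concentrated in the leading component while the noise spreads over all of them, the projected estimate Û is closer to U than the published Up is.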
12
Why it works
• Original data are correlated
• Noise is not correlated
[Figure: original signal + noise = perturbed data. The signal lies along the 1st principal vector; the noise spreads along the 2nd. A 1-d estimation (projection onto the 1st principal vector) removes most of the noise, while a 2-d estimation retains it.]
14
Additive Noise vs. Projection
• Additive perturbation is not safe
  - Spectral Filtering Technique: H. Kargupta et al., ICDM '03
  - PCA Based Technique: Huang et al., SIGMOD '05
  - SVD based & Bound Analysis: Guo et al., SAC '06, PKDD '06
• How about the projection-based perturbation?
  - Projection models
  - Vulnerabilities
  - Potential attacks

Y = X + E   (Perturbed = Original + Noise)
Y = R X     (Perturbed = Transformation × Original)
15
Rotation Randomization Example
Y = R X:

R =
   0.3333   0.6667   0.6667
  -0.6667   0.6667  -0.3333
  -0.6667  -0.3333   0.6667

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

Y = R X =
   61.33   63.67  110.00  119.67   63.33
   49.33   30.67   55.00  -59.33  -31.67
  -33.67  -21.33  -30.00   51.67  -51.67
Bal income … IntP
1 10k 85k … 2k
2 15k 70k … 18k
3 50k 120k … 35k
4 45k 23k … 134k
. . . … .
N 80k 110k … 15k
R Rᵀ = Rᵀ R = I
16
Rotation Approach (R is orthonormal)
• When R is an orthonormal matrix (Rᵀ R = R Rᵀ = I):
  - Vector length: |R x| = |x|
  - Euclidean distance: |R x_i − R x_j| = |x_i − x_j|
  - Inner product: ⟨R x_i, R x_j⟩ = ⟨x_i, x_j⟩
• Many clustering and classification methods are invariant to this rotation perturbation.
  - Classification: Chen and Liu, ICDM '05
  - Distributed data mining: Liu and Kargupta, TKDE '06
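These invariants are easy to verify numerically (a sketch; generating a random orthonormal R via QR decomposition is an assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# A random orthonormal matrix R via QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

x1 = rng.normal(size=3)
x2 = rng.normal(size=3)

# Rotation preserves lengths, distances, and inner products.
assert np.isclose(np.linalg.norm(R @ x1), np.linalg.norm(x1))
assert np.isclose(np.linalg.norm(R @ x1 - R @ x2), np.linalg.norm(x1 - x2))
assert np.isclose((R @ x1) @ (R @ x2), x1 @ x2)
```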
17
Example
Y = R X, with R a 30° rotation:

R =
   0.866  -0.500
   0.500   0.866

R Rᵀ = Rᵀ R = I

[Figure: a 2-D data point is rotated by R; the rotated point keeps its length.]
18
Weakness of Rotation
• Known sample attack
[Figure: an attacker who knows a few original records (known info) and observes their perturbed images can estimate R by regression and recover the original data.]
19
General Linear Transformation
• Y = R X + E
  - When R = I: Y = X + E (Additive Noise Model)
  - When R Rᵀ = Rᵀ R = I and E = 0: Y = R X (Rotation Model)
  - R can be an arbitrary matrix

Y = R X + E:

R =
  4.751  2.429  2.282
  1.156  4.457  0.093
  3.034  3.811  4.107

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y = R X + E =
  265.95  286.63  475.68  581.71  520.53
  394.30  338.49  569.58  174.22  277.79
  362.55  394.11  665.37  776.46  463.08
20
Is Y = R X + E Safe?
• R can be an arbitrary matrix, hence the regression-based attack won't work
• How about noisy ICA direct attack?
Y = R X + E General Linear Transformation Model
X = A S + N Noisy ICA Model
21
Scope (Part II)
ssn  name  zip    race   ...  age  Sex  Bal  Income  ...  IntP
1          28223  Asian  ...  20   M    10k  85k     ...  2k
2          28223  Asian  ...  30   F    15k  70k     ...  18k
3          28262  Black  ...  20   M    50k  120k    ...  35k
4          28261  White  ...  26   M    45k  23k     ...  134k
.          .      .      ...  .    .    .    .       ...  .
N          28223  Asian  ...  20   M    80k  110k    ...  15k
69% unique on zip and birth date
87% with zip, birth date and gender.
Related approaches: k-anonymity, l-diversity, SDC, etc.
Our approach: Randomized Response
22
Randomized Response (Stanley Warner, JASA 1965)

Purpose: estimate the proportion π_A of population members that cheated in the exam (A: cheated in the exam; Ā: didn't cheat).

Procedure: a randomization device asks, with probability p, "Do you belong to A?" and, with probability 1 − p, "Do you belong to Ā?". The interviewer records only the "Yes"/"No" answer, not which question was asked.

The probability of a "Yes" answer is
  λ = π_A p + (1 − π_A)(1 − p)
and an unbiased estimate of π_A is
  π̂_A = (λ̂ + p − 1) / (2p − 1),  p ≠ 1/2,
where λ̂ is the observed proportion of "Yes" answers.
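Warner's procedure and estimator can be simulated directly (p = 0.8, π_A = 0.3, and the sample size are illustrative choices):

```python
import random

random.seed(42)

p = 0.8          # probability the device asks "Do you belong to A?"
pi_A = 0.3       # true (hidden) proportion of cheaters
n = 100_000

yes = 0
for _ in range(n):
    member_of_A = random.random() < pi_A
    ask_about_A = random.random() < p
    # Respondent answers truthfully to whichever question was selected;
    # only the Yes/No answer is recorded.
    yes += member_of_A if ask_about_A else not member_of_A

lam_hat = yes / n
pi_hat = (lam_hat + p - 1) / (2 * p - 1)   # Warner's unbiased estimator
```

No individual answer reveals membership in A, yet π̂_A recovers the population proportion.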
23
Matrix Expression
• RR can be expressed in matrix form as λ = P π (0: No, 1: Yes):

  P =
         0      1
    0    p    1−p
    1   1−p    p

• Unbiased estimate of π:  π̂ = P⁻¹ λ̂
25
Vector Response
• π = (π_1, ..., π_t)′ is the true proportions of the population
• λ = (λ_1, ..., λ_t)′ is the observed proportions in the survey
• P = (p_ij) is the randomization device set by the interviewer

λ = P π:

  i \ j    1     2     3     4
    1    0.60  0.20  0.00  0.10
    2    0.20  0.50  0.20  0.10
    3    0.15  0.15  0.70  0.30
    4    0.05  0.15  0.10  0.50

  π = (0.10, 0.30, 0.20, 0.40)′
  λ = P π = (0.16, 0.25, 0.32, 0.27)′
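The slide's example can be checked directly in NumPy:

```python
import numpy as np

# Randomization device: column j gives the reporting distribution
# for true category j (each column sums to 1).
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])

pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions
lam = P @ pi                               # observed proportions

# Inverting the device recovers the true proportions.
pi_hat = np.linalg.solve(P, lam)
```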
26
Analysis
disp(λ̂) = n⁻¹ (Λ_λ − λ λ′)

With π̂ = P⁻¹ λ̂:

disp(π̂) = n⁻¹ P⁻¹ (Λ_λ − λ λ′) (P⁻¹)′
        = n⁻¹ (Λ_π − π π′) + n⁻¹ (P⁻¹ Λ_λ (P⁻¹)′ − Λ_π)

where Λ_λ (resp. Λ_π) is the diagonal matrix with elements λ_i (resp. π_i). The first term is the dispersion matrix of the regular survey estimation; the second term is nonnegative definite and represents the components of dispersion associated with the RR experiment.
27
Extension to Multi Attributes
• m sensitive attributes A_1, A_2, ..., A_m; attribute A_j has t_j categories A_{j1}, ..., A_{j t_j} (i_j = 1, ..., t_j)
• π_{i_1 ... i_m} denotes the true proportion corresponding to the combination (A_{1 i_1}, ..., A_{m i_m})
• π is the vector with elements π_{i_1 ... i_m}, arranged lexicographically
  e.g., if m = 2, t_1 = 2 and t_2 = 3:  π = (π_11, π_12, π_13, π_21, π_22, π_23)′
• Simultaneous Model: consider all variables as one compounded variable and apply the regular vector response RR technique, with P = P_1 ⊗ P_2 ⊗ ··· ⊗ P_m, where ⊗ stands for the Kronecker product
• Sequential Model
28
Kronecker Product Example
P_1 =
  a11  a12
  a21  a22

P_2 =
  b11  b12  b13
  b21  b22  b23
  b31  b32  b33

P_1 ⊗ P_2 =
  a11·P_2  a12·P_2
  a21·P_2  a22·P_2
=
  a11b11  a11b12  a11b13  a12b11  a12b12  a12b13
  a11b21  a11b22  a11b23  a12b21  a12b22  a12b23
  a11b31  a11b32  a11b33  a12b31  a12b32  a12b33
  a21b11  a21b12  a21b13  a22b11  a22b12  a22b13
  a21b21  a21b22  a21b23  a22b21  a22b22  a22b23
  a21b31  a21b32  a21b33  a22b31  a22b32  a22b33
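NumPy provides this operation as np.kron; the 2×2 and 3×3 matrices below are illustrative distortion matrices, not the slide's symbols:

```python
import numpy as np

P1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
P2 = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]])

# Kronecker product: each entry a_ik of P1 is replaced by the block a_ik * P2.
P = np.kron(P1, P2)              # shape (6, 6)
```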
29
Analysis
π̂ = (P_1 ⊗ ··· ⊗ P_m)⁻¹ λ̂
disp(π̂) = n⁻¹ (P_1 ⊗ ··· ⊗ P_m)⁻¹ (Λ_λ − λ λ′) ((P_1 ⊗ ··· ⊗ P_m)⁻¹)′
Similarly, the dispersion matrix can be decomposed to two parts: one corresponds to that of the regular survey estimation and the other corresponds to the components of dispersion associated with RR experiment
30
Outline(Part II)
• Randomization for Categorical Data
  - Randomized Response Model
  - To what extent does it affect mining results?
  - To what extent does it protect privacy?
• Application in market basket 0-1 data analysis
  - Data swapping
  - Frequent itemset or rule hiding
  - Inverse frequent itemset mining (SDM '05, ICDM '05)
  - Item randomization (PKDD '07, PAKDD '08)
31
Market Basket Data
TID  milk  sugar  bread  ...  cereals
1    1     0      1      ...  1
2    0     1      1      ...  1
3    1     0      0      ...  1
4    1     1      1      ...  0
.    .     .      .      ...  .
N    0     1      1      ...  0

1: presence  0: absence

Association rule (R. Agrawal, SIGMOD 1993): X ⇒ Y with support s = P(XY) and confidence c = P(XY) / P(X)

• Other measures in MBA: Correlation, Lift, Interest, etc.
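A minimal sketch of these two measures on a toy 0-1 table (the rows and the rule {milk} ⇒ {cereals} are illustrative):

```python
# Toy market basket table; columns are milk, sugar, bread, cereals.
rows = [
    (1, 0, 1, 1),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
]

n = len(rows)
n_milk = sum(r[0] for r in rows)                  # transactions with milk
n_milk_cereals = sum(r[0] and r[3] for r in rows) # transactions with both

support = n_milk_cereals / n            # s = P(XY)
confidence = n_milk_cereals / n_milk    # c = P(XY) / P(X)
```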
32
Item Perturbation
TID milk sugar bread … cereals
1 1 0 1 1
2 0 1 1 1
3 1 0 0 1
4 1 1 1 0
. . . . … .
N 0 1 1 0
TID milk sugar bread … cereals
1 0 1 1 1
2 1 1 1 0
3 1 1 1 1
4 0 0 1 1
. . . . … .
N 1 1 0 1
Original Data Randomized Data
Individual privacy is preserved!
33
Research Problems
• How does it affect the accuracy of discovered association rules?
• How does it affect the accuracy of other measures?
• Two scenarios:
  - Known distortion probability
  - Unknown distortion probability
34
Motivation example
TID milk sugar bread … cereals
1 1 0 1 1
2 0 1 1 1
3 1 0 0 1
4 1 1 1 0
. . . . … .
N 0 1 1 0
Original Data Randomized Data
TID milk sugar bread … cereals
1 0 1 1 1
2 1 1 1 0
3 1 1 1 1
4 1 0 1 1
. . . . … .
N 0 1 0 1
RR with distortion matrices (A: Milk, B: Cereals):

  P_A =
    0.8  0.2
    0.2  0.8

  P_B =
    0.9  0.1
    0.1  0.9

Original data (data owners):

         B̄      B
   Ā   0.415  0.043  | 0.458
   A   0.183  0.359  | 0.542
       0.598  0.402

Randomized data (data miners):

         B̄      B
   Ā   0.368  0.097  | 0.465
   A   0.218  0.317  | 0.537
       0.586  0.414

π = (π_00, π_01, π_10, π_11)′ = (0.415, 0.043, 0.183, 0.359)′
λ̂ = (λ̂_00, λ̂_01, λ̂_10, λ̂_11)′ = (0.368, 0.097, 0.218, 0.316)′
π̂ = (P_A ⊗ P_B)⁻¹ λ̂ = (0.427, 0.031, 0.181, 0.362)′

s_AB = π_11, so ŝ_AB = π̂_11
c_AB = π_11 / (π_10 + π_11) = 0.662;  ĉ_AB = π̂_11 / (π̂_10 + π̂_11) = 0.671

We can get the estimate ĉ_AB of c_AB; how accurate an estimate can we achieve?
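The whole pipeline (randomize, observe λ̂, invert P_A ⊗ P_B) can be simulated; the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Distortion matrices for items A and B (a cell value is kept with prob 0.8 / 0.9).
PA = np.array([[0.8, 0.2], [0.2, 0.8]])
PB = np.array([[0.9, 0.1], [0.1, 0.9]])
P = np.kron(PA, PB)                      # device for the joint cell (A, B)

# True joint proportions over (A, B), cells ordered 00, 01, 10, 11.
pi = np.array([0.415, 0.043, 0.183, 0.359])

# Simulate: draw original cells, then flip A with prob 0.2 and B with prob 0.1
# independently -- exactly the randomization encoded by kron(PA, PB).
n = 200_000
cells = rng.choice(4, size=n, p=pi)
a, b = cells // 2, cells % 2
a = np.where(rng.random(n) < 0.8, a, 1 - a)
b = np.where(rng.random(n) < 0.9, b, 1 - b)
lam_hat = np.bincount(2 * a + b, minlength=4) / n

pi_hat = np.linalg.solve(P, lam_hat)          # estimate of the true proportions
c_hat = pi_hat[3] / (pi_hat[2] + pi_hat[3])   # estimated confidence of A => B
```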
35
Motivation
s_min = 23%

[Figure: original vs. estimated supports for several itemsets (values such as 31.5, 35.9, 36.3, 23.8, 22.1, 12.3). Both ŝ_2 and ŝ_6 exceed s_min; since s_2 > s_min, itemset 2 is correctly found frequent, but s_6 < s_min, so rule 6 is falsely recognized as frequent from its estimated value! With lower and upper bounds on each estimate, s_2^l > s_min marks a frequent set with high confidence, while s_6^l < s_min marks a frequent set without confidence: such errors can be avoided.]
36
Accuracy on Support S
• Estimate of support:  π̂ = (P_1 ⊗ ··· ⊗ P_k)⁻¹ λ̂ = P⁻¹ λ̂
• Variance of support:  côv(π̂) = (n − 1)⁻¹ (P⁻¹ Λ̂_λ̂ (P⁻¹)′ − π̂ π̂′)

For the A, B example, côv(π̂) = 10⁻⁵ ×

          00      01      10      11
   00   6.665   2.777   1.874   2.113
   01   2.777   5.766   0.442   3.431
   10   1.874   0.442   2.209   1.866
   11   2.113   3.431   1.866   7.311

with, e.g., v̂ar(π̂_11) = 7.311 × 10⁻⁵ and côv(π̂_10, π̂_11) = 1.866 × 10⁻⁵.

• Interquantile range (normal dist.):
  [ π̂_{i_1...i_k} − z_{α/2} √v̂ar(π̂_{i_1...i_k}),  π̂_{i_1...i_k} + z_{α/2} √v̂ar(π̂_{i_1...i_k}) ]

For ŝ_AB = π̂_11 = 0.362, the 95% range is (0.346, 0.378).
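The normal-approximation range can be reproduced from the slide's numbers (taking v̂ar(ŝ_AB) = 7.311 × 10⁻⁵ from the covariance matrix; a sketch, not the authors' code):

```python
import math

# Estimated support and its variance for the A => B example.
s_hat = 0.362
var_s = 7.311e-5
z = 1.96                      # z_{alpha/2} for alpha = 0.05

lo = s_hat - z * math.sqrt(var_s)
hi = s_hat + z * math.sqrt(var_s)
```

This reproduces the (0.346, 0.378) interval quoted on the slide up to rounding.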
37
Accuracy on Confidence C
• Estimate of confidence of A ⇒ B:
  ĉ = ŝ_AB / ŝ_A = π̂_11 / (π̂_10 + π̂_11)
• Variance of confidence, with π̂_1· = π̂_10 + π̂_11:
  v̂ar(ĉ) = (π̂_10² / π̂_1·⁴) v̂ar(π̂_11) + (π̂_11² / π̂_1·⁴) v̂ar(π̂_10) − (2 π̂_11 π̂_10 / π̂_1·⁴) côv(π̂_10, π̂_11)
• Interquantile range (the ratio distribution F(w) has no simple form); a loose range is derived from Chebyshev's theorem:
  [ ĉ − k √v̂ar(ĉ),  ĉ + k √v̂ar(ĉ) ],  where k = α^{−1/2}
38
Accuracy on Confidence c

Chebyshev's theorem: let X be a random variable with expected value μ and variance σ². Then for any real k > 0,
  Pr(|X − μ| ≥ kσ) ≤ 1/k²
equivalently,
  Pr(|X − μ| < kσ) ≥ 1 − 1/k²

This gives a lower bound on the proportion of data that lie within a given number of standard deviations of the mean.
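The bound can be checked empirically (uniform draws and k = 2 are illustrative choices):

```python
import random

random.seed(7)

# Chebyshev: at least 1 - 1/k^2 of draws lie within k standard deviations.
n = 100_000
xs = [random.random() for _ in range(n)]
mu = sum(xs) / n
sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5

k = 2
within = sum(abs(x - mu) < k * sigma for x in xs) / n
```

For k = 2 the theorem guarantees at least 75%; for this uniform sample the empirical fraction is far higher, illustrating how loose the bound can be.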
39
Accuracy on Confidence c

Based on the theorem, the loose (1 − α)·100% interquantile range can be approximated as:

  [ Ê(ĉ) − √(v̂ar(ĉ) / α),  Ê(ĉ) + √(v̂ar(ĉ) / α) ]

For the previous example with items A, B (ĉ_AB = 0.662), the 95% interquantile range of ĉ_AB is:

  [ Ê(ĉ) − √(v̂ar(ĉ) / 0.05),  Ê(ĉ) + √(v̂ar(ĉ) / 0.05) ] = [0.266, 0.904]
40
Accuracy vs. varying p for individual rule
[Figure: (a) Support and (b) Confidence accuracy vs. varying p for the individual rule G ⇒ H (s = 35.9%, c = 66.2%) from the COIL data set.]
41
Accuracy Bounds
• With an unknown distribution, Chebyshev's theorem only gives loose bounds.
Support bounds vs. varying p for G => H
42
Other measures
Accuracy Bounds
43
Future work
• Conduct accuracy vs. disclosure analysis for general categorical data
• Develop a general randomization framework which combines
  - Additive noise for numerical data
  - Randomized response for categorical data
• Build a prototype system for real-world applications
  - Various query, analysis, and mining tasks
  - Complex privacy requirements in different scenarios: non-confidential correlated attributes, potential combinations (e.g., Dividends + Wages + Interests = Total Income)