SAC’06 April 23-27, 2006, Dijon, France
On the Use of Spectral Filtering for Privacy Preserving Data Mining
Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte
Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg
Source: http://www.privacyinternational.org/survey/dpmap.jpg
HIPAA for health care
California State Bill 1386
Gramm-Leach-Bliley Act for financial institutions
COPPA for children's online privacy
PIPEDA 2000
European Union (Directive 95/46/EC)
Mining vs. Privacy
Data mining
  The goal of data mining is to derive summary results (e.g., classification, clustering, association rules) from the data (distribution).
Individual privacy
  Individual values in the database must not be disclosed, or at least no close estimation can be derived by attackers.
Privacy Preserving Data Mining (PPDM)
  How can we "perturb" data such that we can build a good data mining model (data utility) while preserving individuals' privacy at the record level (privacy)?
Outline
Additive Randomization
Distribution Reconstruction
  Bayesian Method: Agrawal & Srikant, SIGMOD 2000
  EM Method: Agrawal & Aggarwal, PODS 2001
Individual Value Reconstruction
  Spectral Filtering: H. Kargupta, ICDM 2003
  PCA Technique: Du et al., SIGMOD 2005
Error Bound Analysis for Spectral Filtering
  Upper Bound
Conclusion and Future Work
Additive Randomization
To hide the sensitive data by randomly modifying the data values using some additive noise:
  $\tilde{U} = U + V$
  where $U$ is the original data, $V$ is the noise, and $\tilde{U}$ is the perturbed data
Privacy preserving aims at
  $\|\hat{U} - U\|$ and $\|\tilde{U} - U\|$
Utility preserving aims at
  The aggregate characteristics remain unchanged or can be recovered
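A minimal NumPy sketch of additive randomization. The Gaussian data, the noise scale, and the layout (attributes as rows, records as columns) are illustrative assumptions, not values from the slides. It shows that record-level values move by the noise while an aggregate such as the per-attribute mean stays nearly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 attributes (rows) x 10,000 records (columns).
U = rng.normal(loc=50.0, scale=10.0, size=(3, 10_000))
# Additive noise V, drawn independently of U (illustrative scale).
V = rng.normal(loc=0.0, scale=5.0, size=U.shape)
U_tilde = U + V  # perturbed data that would be released

# Record-level distortion: ||U_tilde - U||_F equals ||V||_F by construction.
distortion = np.linalg.norm(U_tilde - U)
# Aggregate characteristic: the per-attribute means barely move.
mean_shift = np.abs(U_tilde.mean(axis=1) - U.mean(axis=1)).max()
print(f"distortion = {distortion:.1f}, largest mean shift = {mean_shift:.3f}")
```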
Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise's distribution (Agrawal & Srikant, SIGMOD 2000)
  Independent random noises with any distribution

  f_X^0 := Uniform distribution
  j := 0 // Iteration number
  repeat
      $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz}$
      j := j+1
  until (stopping criterion met)

It cannot reconstruct individual values
[Figure: number of people versus age, comparing the Original, Randomized, and Reconstructed distributions]
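The iterative Bayesian update can be sketched on a discretized grid as follows. This is an illustrative sketch rather than the authors' implementation: the function name `reconstruct_density`, the evenly spaced grid, the Gaussian noise, and the stopping rule (L1 change between iterations below `tol`) are all assumptions. The integral in the denominator is approximated by a Riemann sum over the grid.

```python
import numpy as np

def reconstruct_density(w, noise_pdf, grid, max_iter=500, tol=1e-4):
    """Estimate f_X on an evenly spaced grid from perturbed values w_i = x_i + y_i."""
    da = grid[1] - grid[0]
    f_x = np.full(grid.size, 1.0 / (grid.size * da))  # f_X^0: uniform density
    # f_Y(w_i - a) for every sample i (rows) and grid point a (columns).
    lik = noise_pdf(w[:, None] - grid[None, :])
    for _ in range(max_iter):
        joint = lik * f_x                              # f_Y(w_i - a) * f_X^j(a)
        denom = joint.sum(axis=1, keepdims=True) * da  # Riemann sum over z
        f_next = (joint / denom).mean(axis=0)          # average over the n samples
        change = np.abs(f_next - f_x).sum() * da
        f_x = f_next
        if change < tol:  # assumed stopping criterion
            break
    return f_x

rng = np.random.default_rng(0)
sigma = 10.0                                  # known noise standard deviation
x = rng.normal(40.0, 5.0, size=2000)          # hypothetical "ages"
w = x + rng.normal(0.0, sigma, size=x.size)   # perturbed values
grid = np.linspace(0.0, 80.0, 161)
step = grid[1] - grid[0]

def gauss(t):
    return np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

f_hat = reconstruct_density(w, gauss, grid)
mass = f_hat.sum() * step                # total probability mass of the estimate
mean_est = (grid * f_hat).sum() * step   # mean of the reconstructed density
```

The result is a density over the grid only; no individual `x[i]` is estimated, which is the limitation the slide points out.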
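Individual Value Reconstruction
Spectral Filtering, Kargupta et al., ICDM 2003
  1. Apply EVD: $\tilde{U}\tilde{U}^T = \tilde{Q}\tilde{\Lambda}\tilde{Q}^T$
  2. Using some published information about V, extract the first k components of $\tilde{U}$ as the principal components.
     - $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \dots \ge \tilde{\lambda}_k$ are the k largest eigenvalues, and $\tilde{e}_1, \tilde{e}_2, \dots, \tilde{e}_k$ are the corresponding eigenvectors.
     - $\tilde{X} = [\tilde{e}_1\ \tilde{e}_2\ \cdots\ \tilde{e}_k]$ forms an orthonormal basis of a subspace $\tilde{\mathcal{X}}$.
  3. Find the orthogonal projection onto $\tilde{\mathcal{X}}$: $P_{\tilde{X}} = \tilde{X}\tilde{X}^T$
  4. Get the estimated data set: $\hat{U} = P_{\tilde{X}}\tilde{U}$
PCA Technique, Huang, Du and Chen, SIGMOD 2005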
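The four spectral filtering steps can be sketched in NumPy as below. This is a hedged illustration, not the code of Kargupta et al.: the function name `spectral_filter`, the synthetic rank-2 data, and the layout (attributes as rows, records as columns, zero-mean data so that the product of the data matrix with its transpose plays the role of the covariance matrix) are assumptions, and k is taken as given.

```python
import numpy as np

def spectral_filter(U_tilde, k):
    """Project perturbed data onto the span of the k leading eigenvectors."""
    A_tilde = U_tilde @ U_tilde.T            # step 1: matrix to decompose
    _, eigvecs = np.linalg.eigh(A_tilde)     # eigenvalues come back ascending
    X_tilde = eigvecs[:, -k:]                # step 2: k leading eigenvectors
    P = X_tilde @ X_tilde.T                  # step 3: orthogonal projector
    return P @ U_tilde                       # step 4: estimated data

rng = np.random.default_rng(0)
n, m, k = 10, 2000, 2
# Hypothetical correlated data: n attributes driven by k latent factors.
U = rng.normal(size=(n, k)) @ rng.normal(scale=5.0, size=(k, m))
V = rng.normal(scale=1.0, size=(n, m))       # i.i.d. Gaussian noise
U_tilde = U + V

U_hat = spectral_filter(U_tilde, k)
err_perturbed = np.linalg.norm(U_tilde - U)  # ||U_tilde - U||_F
err_filtered = np.linalg.norm(U_hat - U)     # ||U_hat - U||_F
print(f"before filtering: {err_perturbed:.1f}, after: {err_filtered:.1f}")
```

In this setting the filtered estimate is closer to the original than the perturbed data is, because the projection discards the noise components outside the k-dimensional subspace.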
Motivation
Previous work on individual value reconstruction is only empirical
  The relationship between the estimation accuracy and the noise was not clear
Two questions
  Attacker question: How close is the data estimated using SF to the original data?
  Data owner question: How much noise should be added to preserve privacy at a given tolerance level?
Our Work
Investigate the explicit relationship between the estimation accuracy and the noise
Derive an upper bound of $\|\hat{U} - U\|_F$ in terms of V
  The upper bound determines how close the estimated data achieved by attackers is to the original data
  It poses a serious threat of privacy breaches
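Preliminary
F-norm and 2-norm
  $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$
  $\|A\|_2 = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2}$
Some properties
  $\|AB\|_F \le \|A\|_F \|B\|_F$ and $\|AB\|_2 \le \|A\|_2 \|B\|_2$
  $\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2$ and $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, the square root of the largest eigenvalue of $A^T A$
  If A is symmetric, then $\|A\|_2 = \max_i |\lambda_i(A)|$, the largest eigenvalue of A in absolute value (equal to $\lambda_{\max}(A)$ when A is positive semidefinite, as for $UU^T$)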
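The norm properties above can be checked numerically. The following is a small sanity check on random matrices (the matrix sizes and seed are arbitrary choices), not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))   # m = 4 rows, n = 6 columns
B = rng.normal(size=(6, 5))
S = A @ A.T                   # symmetric positive semidefinite example

def fro(M):
    return np.linalg.norm(M, "fro")   # Frobenius norm

def two(M):
    return np.linalg.norm(M, 2)       # spectral norm (largest singular value)

checks = {
    "F-norm definition": np.isclose(fro(A), np.sqrt((A ** 2).sum())),
    "F-norm submultiplicative": fro(A @ B) <= fro(A) * fro(B),
    "2-norm submultiplicative": two(A @ B) <= two(A) * two(B),
    "2-norm <= F-norm <= sqrt(n) * 2-norm":
        two(A) <= fro(A) <= np.sqrt(A.shape[1]) * two(A),
    "2-norm from eigenvalues of A^T A":
        np.isclose(two(A), np.sqrt(np.linalg.eigvalsh(A.T @ A).max())),
    "symmetric case":
        np.isclose(two(S), np.abs(np.linalg.eigvalsh(S)).max()),
}
for name, ok in checks.items():
    print(f"{name}: {bool(ok)}")
```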
Matrix Perturbation
Traditional matrix perturbation theory
  How the derived perturbation E affects the covariance matrix A
Our scenario
  How the primary perturbation V affects the data matrix U

  $\tilde{A} = \tilde{U}\tilde{U}^T = (U + V)(U + V)^T = UU^T + VU^T + UV^T + VV^T = A + E$
  where $A = UU^T$ and $E = VU^T + UV^T + VV^T$
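A short numerical check of this decomposition, with arbitrary random matrices standing in for the data U and the noise V (sizes and scales are illustrative assumptions): the perturbed covariance matrix equals A plus the derived perturbation E, and E is symmetric.

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.normal(size=(5, 200))                # original data (attributes x records)
V = rng.normal(scale=0.3, size=U.shape)      # primary perturbation (noise)
U_tilde = U + V

A = U @ U.T                                  # original covariance matrix
A_tilde = U_tilde @ U_tilde.T                # perturbed covariance matrix
E = V @ U.T + U @ V.T + V @ V.T              # derived perturbation

decomposition_holds = np.allclose(A_tilde, A + E)
E_is_symmetric = np.allclose(E, E.T)
print(decomposition_holds, E_is_symmetric)
```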
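Error Bound Analysis
  $\hat{U} - U = P_{\tilde{X}}\tilde{U} - P_X U = (P_{\tilde{X}} - P_X)U + P_{\tilde{X}}V$
  $\|\hat{U} - U\|_F \le \|P_{\tilde{X}} - P_X\|_F \|U\|_F + \|P_{\tilde{X}}V\|_F$
Prop 1. Let the covariance matrix of the perturbed data be $\tilde{A} = A + E$. Given the eigengap between the $k$-th and $(k+1)$-th eigenvalues and $\|E\|_F$:
  [Equation garbled in extraction: condition on $\|E\|_F$ and the resulting bound on $\|P_{\tilde{X}} - P_X\|_F$]
Prop 2. Let $\epsilon_1 \ge \epsilon_2 \ge \dots \ge \epsilon_n$ be the eigenvalues of E. Then
  $\lambda_i \in [\tilde{\lambda}_i - \epsilon_1,\ \tilde{\lambda}_i - \epsilon_n]$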
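Theorem
Given a data set $U \in R^{n \times m}$ and a noise set $V \in R^{n \times m}$, we have the perturbed data set $\tilde{U} = U + V$. Let $\hat{U}$ be the estimation obtained from Spectral Filtering; then
  [Equation garbled in extraction: an upper bound on $\|\hat{U} - U\|_F$ involving $U$, $E$ (in both F-norm and 2-norm), the $k$-th eigenvalue $\tilde{\lambda}_k$, and the projected noise $P_{\tilde{X}}V$]
where $E = VU^T + UV^T + VV^T$ is the derived perturbation on the original covariance matrix $A = UU^T$
Proof is skipped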
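Special Cases
When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance
  [Equation garbled in extraction: the upper bound on $\|\hat{U} - U\|_F$, with the projected-noise term given as $\sqrt{k/n}\,\|V\|_F$]
When the noise is completely correlated with the data
  [Equation garbled in extraction: the upper bound on $\|\hat{U} - U\|_F$, with the projected-noise term given as $\|V\|_F$]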
Experimental Results
Artificial Dataset
  35 correlated variables
  30,000 tuples
[Figure: the 35 variables, labeled F1 to F35]
Experimental Results
Scenarios of noise addition
  Case 1: i.i.d. Gaussian noise N(0, COV), where COV = diag(σ², …, σ²)
  Case 2: Independent Gaussian noise N(0, COV), where COV = c * diag(σ_1², …, σ_n²)
  Case 3: Correlated Gaussian noise N(0, COV), where COV = c * Σ_U (or c * A……)
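The three noise scenarios can be sketched as follows. The function name `make_noise`, the default values of `sigma2` and `c`, and the synthetic data are hypothetical; for Case 2 the sketch assumes that the σ_i² are the per-attribute variances of the original data, which the slides do not state explicitly.

```python
import numpy as np

def make_noise(case, U, rng, sigma2=0.1, c=0.1):
    """Draw a noise matrix V with the same shape as U (attributes x records)."""
    n, m = U.shape
    if case == 1:        # i.i.d.: same variance for every attribute
        cov = sigma2 * np.eye(n)
    elif case == 2:      # independent: variance scaled per attribute (assumption)
        cov = c * np.diag(U.var(axis=1))
    elif case == 3:      # correlated: proportional to the data covariance
        cov = c * np.cov(U)
    else:
        raise ValueError("case must be 1, 2, or 3")
    # One n-dimensional noise vector per record, transposed to n x m.
    return rng.multivariate_normal(np.zeros(n), cov, size=m).T

rng = np.random.default_rng(3)
# Hypothetical correlated data: 4 attributes, 20,000 records.
U = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 20_000))
noise = {case: make_noise(case, U, rng) for case in (1, 2, 3)}
for case, V in noise.items():
    print(case, V.shape, round(float(np.linalg.norm(V)), 1))
```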
Measure
  Absolute error: $ae(U, \hat{U}) = \|\hat{U} - U\|_F$
  Relative error: $re(U, \hat{U}) = \frac{\|\hat{U} - U\|_F}{\|U\|_F}$
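Both error measures are one-liners in NumPy. The function names below are illustrative, and the tiny matrices are only there to make the arithmetic easy to follow.

```python
import numpy as np

def absolute_error(U, U_hat):
    """ae(U, U_hat): Frobenius norm of the reconstruction error."""
    return np.linalg.norm(U_hat - U, "fro")

def relative_error(U, U_hat):
    """re(U, U_hat): absolute error divided by the Frobenius norm of U."""
    return absolute_error(U, U_hat) / np.linalg.norm(U, "fro")

U = np.array([[3.0, 0.0], [0.0, 4.0]])        # ||U||_F = 5
U_hat = np.array([[3.0, 0.0], [0.0, 1.0]])    # differs from U by 3 in one entry
ae = absolute_error(U, U_hat)                 # 3.0
re = relative_error(U, U_hat)                 # 3 / 5 = 0.6
```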
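Determining k
Determine k in Spectral Filtering
  According to matrix perturbation theory:
    [Equation garbled in extraction: k = max{ i | condition comparing $\tilde{\lambda}_i$ with $\|E\|_2$ }]
  Our heuristic approach: check $\tilde{\lambda}_i$ against $\|E\|_2$
    [Equation garbled in extraction: K = min over i of an expression in $\tilde{\lambda}_i$ and $\|E\|_2$]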
Effect of varying k (case 1)
N(0, COV), where COV = diag(σ², …, σ²)
Relative error

||V||_F    229      323      561      725      1025
σ²         0.05     0.10     0.3      0.5      1.0
K=1        0.43     0.44     0.45     0.46     0.48
K=2        0.22     0.23     0.26     0.29     0.36
K=3        0.16     0.18     0.24     0.29     *0.31
K=4        *0.09    *0.12    *0.22    *0.28    0.40
K=5        0.10     0.14     0.25     0.32     0.45
Effect of varying k (case 2)
N(0, COV), where COV = c * diag(σ_1², σ_2², …, σ_n²)
Relative error

||V||_F    229      323      561      725      1025
c          0.07     0.15     0.44     0.74     1.45
K=1        0.44     0.44     0.45     0.46     0.49
K=2        0.22     0.23     0.27     *0.30    *0.36
K=3        0.16     0.18     0.24     0.33     0.44
K=4        *0.07    *0.11    *0.23    0.37     0.50
K=5        0.09     0.13     0.26     0.40     0.56
Effect of varying k (case 3)
N(0, COV), where COV = c * Σ_U

||V||_F    229      323      561      725      1025
c          0.07     0.15     0.44     0.74     1.45
K=1        0.50     0.55     0.73     0.88     *1.17
K=2        0.34     0.43     0.68     0.86     1.19
K=3        0.30     0.41     0.67     0.86     1.20
K=4        *0.27    *0.38    *0.65    *0.85    1.20
K=5        0.27     0.38     0.65     0.85     1.20
Effect of varying noise
[Figure: panels for σ² = 0.1, σ² = 0.5, and σ² = 1.0; annotation: ||V||_F / ||U||_F = 87.8%]
Effect of covariance matrix
[Figure: panels for Case 1, Case 2, and Case 3; annotation: ||V||_F / ||U||_F = 39.1%]
Conclusion
Spectral filtering based techniques have been investigated as a major means of point-wise data reconstruction.
We present an upper bound that enables attackers to determine how close their estimated data is to the original data.
Future Work
We are working on the lower bound
  which represents the best estimate the attacker can achieve using SF
  which can be used by data owners to determine how much noise should be added to preserve privacy
Bound analysis at the point-wise level
Acknowledgement
NSF Grant
  CCR-0310974
  IIS-0546027
Personnel
  Xintao Wu
  Songtao Guo
  Ling Guo
More Info
  http://www.cs.uncc.edu/~xwu/
  [email protected]