
SAC’06 April 23-27, 2006, Dijon, France

On the Use of Spectral Filtering for Privacy Preserving Data Mining

Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte


Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg

Source: http://www.privacyinternational.org/survey/dpmap.jpg

HIPAA for health care
California State Bill 1386
Gramm-Leach-Bliley Act for financial data
COPPA for children's online privacy
PIPEDA 2000
European Union (Directive 95/46/EC)


Mining vs. Privacy
Data mining: the goal is to obtain summary results (e.g., classification, clusters, association rules) from the data (its distribution).
Individual privacy: individual values in the database must not be disclosed, or at least no close estimate of them should be derivable by attackers.
Privacy Preserving Data Mining (PPDM): how to "perturb" the data so that we can build a good data mining model (data utility) while preserving individuals' privacy at the record level (privacy)?


Outline
Additive Randomization
Distribution Reconstruction: Bayesian method (Agrawal & Srikant, SIGMOD 2000); EM method (Agrawal & Aggarwal, PODS 2001)
Individual Value Reconstruction: Spectral Filtering (Kargupta et al., ICDM 2003); PCA technique (Huang, Du & Chen, SIGMOD 2005)
Error Bound Analysis for Spectral Filtering: Upper Bound
Conclusion and Future Work


Additive Randomization
Hide the sensitive data by randomly modifying the data values with additive noise: Ũ = U + V
Privacy preserving aims at making ||U − Ũ|| and ||U − Û|| large
Utility preserving aims at keeping the aggregate characteristics unchanged, or at least recoverable


Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise distribution (Agrawal & Srikant, SIGMOD 2000); it works for independent random noise with any distribution.

f_X^0 := uniform distribution
j := 0   // iteration number
repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} [ f_Y((x_i + y_i) − a) f_X^j(a) / ∫ f_Y((x_i + y_i) − z) f_X^j(z) dz ]
    j := j + 1
until (stopping criterion met)

It cannot reconstruct individual values.
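For concreteness, the following is a minimal numpy sketch of the iteration above, discretized over bins; the bin grid, the Gaussian noise model, and the fixed iteration count are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, n_iter=200):
    """Iterative Bayesian reconstruction of f_X from perturbed values w_i = x_i + y_i."""
    a = 0.5 * (bins[:-1] + bins[1:])          # bin midpoints
    width = np.diff(bins)
    f_x = np.full(a.size, 1.0 / (bins[-1] - bins[0]))   # f_X^0 := uniform
    f_y = noise_pdf(w[:, None] - a[None, :])  # f_Y(w_i - a_j), precomputed once
    for _ in range(n_iter):                   # "repeat until stopping criterion"
        num = f_y * f_x[None, :]              # f_Y(w_i - a) f_X^j(a)
        den = (num * width[None, :]).sum(axis=1, keepdims=True)
        f_x = (num / den).mean(axis=0)        # average the posteriors over the n records
        f_x /= (f_x * width).sum()            # renormalize to a proper density
    return a, f_x

# Illustrative usage (all numbers are assumptions):
rng = np.random.default_rng(0)
x = rng.normal(40.0, 8.0, size=10_000)                 # hidden original values
w = x + rng.normal(0.0, 5.0, size=x.size)              # published w = x + y
gauss = lambda t: np.exp(-t**2 / 50.0) / np.sqrt(50.0 * np.pi)  # N(0, 25) density
mids, f_hat = reconstruct_distribution(w, gauss, np.linspace(0, 80, 161))
```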

[Figure: number of people vs. age for the original, randomized, and reconstructed distributions]


Individual Value Reconstruction
Spectral Filtering, Kargupta et al., ICDM 2003
1. Apply EVD to the covariance matrix of the perturbed data: Ã = ŨᵀŨ = Q̃Λ̃Q̃ᵀ
2. Using some published information about V, extract the first k components of Λ̃ as the principal components: λ̃_1 ≥ ... ≥ λ̃_k, with ẽ_1, ..., ẽ_k the corresponding eigenvectors; X̃ = [ẽ_1 ẽ_2 ... ẽ_k] forms an orthonormal basis of a subspace 𝒳̃.
3. Find the orthogonal projection onto 𝒳̃: P̃ = X̃X̃ᵀ
4. Get the estimated data set: Û = ŨP̃
PCA Technique, Huang, Du and Chen, SIGMOD 2005
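A minimal numpy sketch of the four steps above (rows are records, columns are attributes); the data sizes, the noise level, and the choice k = 4 are illustrative assumptions.

```python
import numpy as np

def spectral_filter(u_tilde, k):
    """Estimate U from perturbed data u_tilde by projecting onto the
    top-k eigenvectors of the perturbed covariance matrix."""
    a_tilde = u_tilde.T @ u_tilde                 # step 1: EVD target A~ = U~' U~
    lam, q = np.linalg.eigh(a_tilde)              # eigenvalues in ascending order
    x = q[:, np.argsort(lam)[::-1][:k]]           # step 2: top-k eigenvectors
    p = x @ x.T                                   # step 3: orthogonal projector P~
    return u_tilde @ p                            # step 4: U_hat = U~ P~

# Illustrative usage (sizes and noise level are assumptions):
rng = np.random.default_rng(1)
u = rng.normal(size=(5_000, 35)) @ rng.normal(size=(35, 35))  # correlated data
v = rng.normal(0.0, 1.0, size=u.shape)                        # additive noise
u_hat = spectral_filter(u + v, k=4)
print(np.linalg.norm(u - u_hat) / np.linalg.norm(u))          # relative error
```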


Motivation
Previous work on individual value reconstruction is only empirical; the relationship between the estimation accuracy and the noise was not clear.
Two questions:
Attacker question: how close is the estimate obtained by Spectral Filtering to the original data?
Data owner question: how much noise should be added to preserve privacy at a given tolerated level?


Our Work
Investigate the explicit relationship between the estimation accuracy and the noise.
Derive an upper bound of ||U − Û||_F in terms of V.
The upper bound determines how close the estimate achieved by attackers can be to the original data; it poses a serious threat of privacy breach.


Preliminary
F-norm and 2-norm:
    ||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij² )^{1/2}
    ||A||_2 = max_{x ≠ 0} ||Ax|| / ||x||
Some properties:
    ||AB||_F ≤ ||A||_F ||B||_F  and  ||AB||_2 ≤ ||A||_2 ||B||_2
    ||A||_2 ≤ ||A||_F ≤ √n ||A||_2
    ||A||_2 = √(λ_max(AᵀA)), the square root of the largest eigenvalue of AᵀA
    If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A
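The small numpy check below illustrates these norm facts on random matrices; the matrix sizes are arbitrary assumptions (the symmetric case uses a positive semi-definite matrix, so its largest eigenvalue is indeed its 2-norm).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))
B = rng.normal(size=(4, 5))

fro = np.linalg.norm(A, 'fro')          # sqrt of the sum of squared entries
two = np.linalg.norm(A, 2)              # largest singular value

# ||A||_2 is the square root of the largest eigenvalue of A^T A
assert np.isclose(two, np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))
# submultiplicativity in both norms
assert np.linalg.norm(A @ B, 'fro') <= fro * np.linalg.norm(B, 'fro') + 1e-12
assert np.linalg.norm(A @ B, 2) <= two * np.linalg.norm(B, 2) + 1e-12
# ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2, with n the number of columns
assert two <= fro + 1e-12 and fro <= np.sqrt(A.shape[1]) * two + 1e-12
# symmetric (here PSD) case: ||S||_2 equals the largest eigenvalue of S
S = A.T @ A
assert np.isclose(np.linalg.norm(S, 2), np.linalg.eigvalsh(S).max())
```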


Matrix Perturbation
Traditional matrix perturbation theory: how the derived perturbation E affects the covariance matrix A.
Our scenario: how the primary perturbation V affects the data matrix U.
    Ã = ŨᵀŨ = (U + V)ᵀ(U + V) = UᵀU + (UᵀV + VᵀU + VᵀV) = A + E
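A quick numpy illustration of this decomposition (random data and sizes are assumptions): the derived perturbation E is computed directly from U and V and recovers Ã = A + E exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 1_000, 10                          # records x attributes (assumed sizes)
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # original data
V = rng.normal(0.0, 0.5, size=(n, m))                    # primary perturbation

A = U.T @ U                               # original covariance-style matrix
E = U.T @ V + V.T @ U + V.T @ V           # derived perturbation on A
A_tilde = (U + V).T @ (U + V)             # covariance of the perturbed data
assert np.allclose(A_tilde, A + E)        # A_tilde = A + E holds exactly
```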


Error Bound Analysis
Prop 1. Let the covariance matrix of the perturbed data be Ã = A + E. Given Û = ŨP̃ and δ = λ_k − λ_{k+1} − ||E||_2 > 0 (where λ_k − λ_{k+1} is the eigengap), then
    ||Û − U||_F ≤ ||U||_F ||P̃ − P||_F + ||P̃V||_F   and   ||P̃ − P||_F ≤ √2 ||E||_F / δ
Prop 2 (eigenvalues of E). With λ_1 ≥ λ_2 ≥ ... ≥ λ_n the eigenvalues of A and ε_1 ≥ ... ≥ ε_n those of E,
    λ̃_i ∈ [λ_i + ε_n, λ_i + ε_1]
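Prop 2 is in the spirit of Weyl's eigenvalue perturbation bound; the sketch below checks that interval numerically on randomly generated U and V (sizes and noise level are assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 8
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
V = rng.normal(0.0, 0.5, size=(n, m))

A = U.T @ U
E = U.T @ V + V.T @ U + V.T @ V
lam = np.sort(np.linalg.eigvalsh(A))[::-1]            # eigenvalues of A, descending
lam_tilde = np.sort(np.linalg.eigvalsh(A + E))[::-1]  # eigenvalues of A_tilde
eps = np.sort(np.linalg.eigvalsh(E))[::-1]            # eigenvalues of E

# Weyl: lam_i + eps_min <= lam_tilde_i <= lam_i + eps_max for every i
assert np.all(lam + eps[-1] <= lam_tilde + 1e-8)
assert np.all(lam_tilde <= lam + eps[0] + 1e-8)
```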


Theorem
Given a data set U ∈ R^{n×m} and a noise set V ∈ R^{n×m}, we have the perturbed data set Ũ = U + V. Let Û be the estimate obtained by Spectral Filtering with k principal components; then
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + ||P̃V||_F
where E = UᵀV + VᵀU + VᵀV is the derived perturbation on the original covariance matrix A = UᵀU.
Proof is skipped.


Special Cases
When the noise matrix V is generated by an i.i.d. Gaussian distribution with zero mean and known variance:
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + √(k/m) ||V||_F
When the noise is completely correlated with the data:
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + ||V||_F
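The contrast between the two cases can be illustrated numerically: projecting i.i.d. noise onto a low-dimensional principal subspace removes most of its energy, while noise that is correlated with the data largely survives the projection. The sketch below (random low-rank data, assumed sizes and noise levels) compares ||P̃V||_F / ||V||_F in the two settings.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 10_000, 35, 4
U = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))   # data with k strong components

# top-k projector built from the clean covariance (for illustration only)
lam, Q = np.linalg.eigh(U.T @ U)
X = Q[:, np.argsort(lam)[::-1][:k]]
P = X @ X.T

V_iid = rng.normal(0.0, 1.0, size=(n, m))     # case: i.i.d. Gaussian noise
V_corr = 0.3 * U                              # case: noise fully correlated with the data

for name, V in [("i.i.d.", V_iid), ("correlated", V_corr)]:
    ratio = np.linalg.norm(V @ P) / np.linalg.norm(V)
    print(f"{name:>10} noise: ||PV||_F / ||V||_F = {ratio:.2f}")
```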


Experimental Results
Artificial dataset: 35 correlated variables (F1–F35), 30,000 tuples
[Figure: the 35 correlated variables F1–F35 of the artificial dataset]


Experimental Results
Scenarios of noise addition:
Case 1: i.i.d. Gaussian noise N(0, COV), where COV = diag(σ², ..., σ²)
Case 2: independent Gaussian noise N(0, COV), where COV = c · diag(σ₁², ..., σₙ²)
Case 3: correlated Gaussian noise N(0, COV), where COV = c · Σ_U (or c · A ...)
Measures:
Absolute error: ae(U, Û) = ||U − Û||_F
Relative error: re(U, Û) = ||U − Û||_F / ||U||_F
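A sketch of how these noise scenarios and error measures might be set up in numpy; the data, the constant c, σ², and the use of the sample covariance of U for Σ_U (and of its diagonal for case 2) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 5_000, 35                          # smaller than the 30,000 tuples, for speed
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # correlated artificial data

sigma2, c = 0.5, 0.74                     # assumed noise parameters
cov_u = np.cov(U, rowvar=False)           # sample covariance, standing in for Sigma_U

covs = {
    "case 1": sigma2 * np.eye(m),                    # i.i.d.: diag(sigma^2, ..., sigma^2)
    "case 2": c * np.diag(np.diag(cov_u)),           # independent, per-attribute variances
    "case 3": c * cov_u,                             # correlated: c * Sigma_U
}

def ae(U, U_hat):                         # absolute error
    return np.linalg.norm(U - U_hat)

def re(U, U_hat):                         # relative error
    return ae(U, U_hat) / np.linalg.norm(U)

for name, cov in covs.items():
    V = rng.multivariate_normal(np.zeros(m), cov, size=n)
    print(name, np.linalg.norm(V), re(U, U + V))     # ||V||_F and re of the raw perturbed data
```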


Determining k
Determine k in Spectral Filtering.
According to matrix perturbation theory: k = max{ i : λ̃_i ≥ ||E||_2 }
Our heuristic approach: check min_i | λ̃_i − ||E||_2 |
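One way such a rule might be implemented is sketched below: retain the components of the perturbed covariance whose eigenvalues exceed the noise level ||E||_2. Computing E exactly from U and V is only possible in simulation; an attacker would have to estimate ||E||_2, e.g., from the published noise variance. All sizes and parameters are assumptions.

```python
import numpy as np

def choose_k(u_tilde, e_norm2):
    """Keep the components of the perturbed covariance whose eigenvalues
    exceed the (estimated) noise level ||E||_2."""
    lam = np.sort(np.linalg.eigvalsh(u_tilde.T @ u_tilde))[::-1]
    return int(np.sum(lam >= e_norm2))

# Illustrative usage; here ||E||_2 is computed exactly because U and V are known.
rng = np.random.default_rng(7)
n, m = 5_000, 35
U = rng.normal(size=(n, 4)) @ rng.normal(size=(4, m))    # data with 4 strong components
V = rng.normal(0.0, 0.3, size=(n, m))
E = U.T @ V + V.T @ U + V.T @ V
k = choose_k(U + V, np.linalg.norm(E, 2))
print("selected k =", k)
```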


Effect of varying k (Case 1): N(0, COV), where COV = diag(σ², ..., σ²); relative error (* = lowest error for that noise level)

σ²       0.05   0.10   0.3    0.5    1.0
||V||_F  229    323    561    725    1025
K=1      0.43   0.44   0.45   0.46   0.48
K=2      0.22   0.23   0.26   0.29   0.36
K=3      0.16   0.18   0.24   0.29   *0.31
K=4      *0.09  *0.12  *0.22  *0.28  0.40
K=5      0.10   0.14   0.25   0.32   0.45


Effect of varying k (Case 2): N(0, COV), where COV = c · diag(σ₁², σ₂², ..., σₙ²); relative error (* = lowest error for that noise level)

c        0.07   0.15   0.44   0.74   1.45
||V||_F  229    323    561    725    1025
K=1      0.44   0.44   0.45   0.46   0.49
K=2      0.22   0.23   0.27   *0.30  *0.36
K=3      0.16   0.18   0.24   0.33   0.44
K=4      *0.07  *0.11  *0.23  0.37   0.50
K=5      0.09   0.13   0.26   0.40   0.56


Effect of varying k (Case 3): N(0, COV), where COV = c · Σ_U; relative error (* = lowest error for that noise level)

c        0.07   0.15   0.44   0.74   1.45
||V||_F  229    323    561    725    1025
K=1      0.50   0.55   0.73   0.88   *1.17
K=2      0.34   0.43   0.68   0.86   1.19
K=3      0.30   0.41   0.67   0.86   1.20
K=4      *0.27  *0.38  *0.65  *0.85  1.20
K=5      0.27   0.38   0.65   0.85   1.20


Effect of varying noise (||V||_F / ||U||_F = 87.8%)
[Figure: reconstruction results for σ² = 0.1, 0.5, and 1.0]


Effect of covariance matrix (||V||_F / ||U||_F = 39.1%)
[Figure: reconstruction results for Cases 1, 2, and 3]


Conclusion
Spectral-filtering-based techniques have been investigated as a major means of point-wise data reconstruction.
We present an upper bound that determines how close the estimate achieved by attackers can be to the original data.


Future Work
We are working on a lower bound, which represents the best estimate the attacker can achieve using Spectral Filtering and which can be used by data owners to determine how much noise should be added to preserve privacy.
Bound analysis at the point-wise level.


Acknowledgement
NSF Grants CCR-0310974 and IIS-0546027
Personnel: Xintao Wu, Songtao Guo, Ling Guo
More info: http://www.cs.uncc.edu/~xwu/  [email protected]


Questions?

Thank you!