
SAC’06 April 23-27, 2006, Dijon, France

On the Use of Spectral Filtering for Privacy Preserving Data Mining

Songtao Guo UNC Charlotte Xintao Wu UNC Charlotte


Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg

Source: http://www.privacyinternational.org/survey/dpmap.jpg

HIPAA for health care
California State Bill 1386
Gramm-Leach-Bliley Act for financial data
COPPA for children's online privacy
PIPEDA 2000
European Union (Directive 95/46/EC)


Mining vs. Privacy
Data mining: the goal is to obtain summary results (e.g., classification, clusters, association rules) from the data (its distribution).
Individual privacy: individual values in the database must not be disclosed, or at least no close estimate of them should be derivable by attackers.
Privacy Preserving Data Mining (PPDM): how to "perturb" the data so that we can build a good data mining model (data utility) while preserving individuals' privacy at the record level (privacy)?


Outline
Additive Randomization
Distribution Reconstruction: Bayesian method (Agrawal & Srikant, SIGMOD 2000); EM method (Agrawal & Aggarwal, PODS 2001)
Individual Value Reconstruction: Spectral Filtering (Kargupta et al., ICDM 2003); PCA technique (Huang, Du & Chen, SIGMOD 2005)
Error Bound Analysis for Spectral Filtering: Upper Bound
Conclusion and Future Work


Additive Randomization
Hide the sensitive data by randomly modifying the data values with additive noise: Ũ = U + V
Privacy preserving aims at making ||U − Ũ|| and ||U − Û|| large
Utility preserving aims at keeping the aggregate characteristics unchanged, or at least recoverable


Distribution Reconstruction
The original density distribution can be reconstructed effectively given the perturbed data and the noise distribution (Agrawal & Srikant, SIGMOD 2000); it works for independent random noise with any distribution.

f_X^0 := uniform distribution
j := 0   // iteration number
repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} [ f_Y((x_i + y_i) − a) f_X^j(a) / ∫ f_Y((x_i + y_i) − z) f_X^j(z) dz ]
    j := j + 1
until (stopping criterion met)

It cannot reconstruct individual values.
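For concreteness, the following is a minimal numpy sketch of the iteration above, discretized over bins; the bin grid, the Gaussian noise model, and the fixed iteration count are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, n_iter=200):
    """Iterative Bayesian reconstruction of f_X from perturbed values w_i = x_i + y_i."""
    a = 0.5 * (bins[:-1] + bins[1:])          # bin midpoints
    width = np.diff(bins)
    f_x = np.full(a.size, 1.0 / (bins[-1] - bins[0]))   # f_X^0 := uniform
    f_y = noise_pdf(w[:, None] - a[None, :])  # f_Y(w_i - a_j), precomputed once
    for _ in range(n_iter):                   # "repeat until stopping criterion"
        num = f_y * f_x[None, :]              # f_Y(w_i - a) f_X^j(a)
        den = (num * width[None, :]).sum(axis=1, keepdims=True)
        f_x = (num / den).mean(axis=0)        # average the posteriors over the n records
        f_x /= (f_x * width).sum()            # renormalize to a proper density
    return a, f_x

# Illustrative usage (all numbers are assumptions):
rng = np.random.default_rng(0)
x = rng.normal(40.0, 8.0, size=10_000)                 # hidden original values
w = x + rng.normal(0.0, 5.0, size=x.size)              # published w = x + y
gauss = lambda t: np.exp(-t**2 / 50.0) / np.sqrt(50.0 * np.pi)  # N(0, 25) density
mids, f_hat = reconstruct_distribution(w, gauss, np.linspace(0, 80, 161))
```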

[Figure: number of people vs. age for the original, randomized, and reconstructed distributions]


Individual Value Reconstruction
Spectral Filtering, Kargupta et al., ICDM 2003
1. Apply EVD to the covariance matrix of the perturbed data: Ã = ŨᵀŨ = Q̃Λ̃Q̃ᵀ
2. Using some published information about V, extract the first k components of Λ̃ as the principal components: λ̃_1 ≥ ... ≥ λ̃_k, with ẽ_1, ..., ẽ_k the corresponding eigenvectors; X̃ = [ẽ_1 ẽ_2 ... ẽ_k] forms an orthonormal basis of a subspace 𝒳̃.
3. Find the orthogonal projection onto 𝒳̃: P̃ = X̃X̃ᵀ
4. Get the estimated data set: Û = ŨP̃
PCA Technique, Huang, Du and Chen, SIGMOD 2005
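A minimal numpy sketch of the four steps above (rows are records, columns are attributes); the data sizes, the noise level, and the choice k = 4 are illustrative assumptions.

```python
import numpy as np

def spectral_filter(u_tilde, k):
    """Estimate U from perturbed data u_tilde by projecting onto the
    top-k eigenvectors of the perturbed covariance matrix."""
    a_tilde = u_tilde.T @ u_tilde                 # step 1: EVD target A~ = U~' U~
    lam, q = np.linalg.eigh(a_tilde)              # eigenvalues in ascending order
    x = q[:, np.argsort(lam)[::-1][:k]]           # step 2: top-k eigenvectors
    p = x @ x.T                                   # step 3: orthogonal projector P~
    return u_tilde @ p                            # step 4: U_hat = U~ P~

# Illustrative usage (sizes and noise level are assumptions):
rng = np.random.default_rng(1)
u = rng.normal(size=(5_000, 35)) @ rng.normal(size=(35, 35))  # correlated data
v = rng.normal(0.0, 1.0, size=u.shape)                        # additive noise
u_hat = spectral_filter(u + v, k=4)
print(np.linalg.norm(u - u_hat) / np.linalg.norm(u))          # relative error
```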


Motivation
Previous work on individual value reconstruction is only empirical; the relationship between the estimation accuracy and the noise was not clear.
Two questions:
Attacker question: how close is the estimate obtained by Spectral Filtering to the original data?
Data owner question: how much noise should be added to preserve privacy at a given tolerated level?


Our Work
Investigate the explicit relationship between the estimation accuracy and the noise.
Derive an upper bound of ||U − Û||_F in terms of V.
The upper bound determines how close the estimate achieved by attackers can be to the original data; it poses a serious threat of privacy breach.


Preliminary
F-norm and 2-norm:
    ||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij² )^{1/2}
    ||A||_2 = max_{x ≠ 0} ||Ax|| / ||x||
Some properties:
    ||AB||_F ≤ ||A||_F ||B||_F  and  ||AB||_2 ≤ ||A||_2 ||B||_2
    ||A||_2 ≤ ||A||_F ≤ √n ||A||_2
    ||A||_2 = √(λ_max(AᵀA)), the square root of the largest eigenvalue of AᵀA
    If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A
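The small numpy check below illustrates these norm facts on random matrices; the matrix sizes are arbitrary assumptions (the symmetric case uses a positive semi-definite matrix, so its largest eigenvalue is indeed its 2-norm).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))
B = rng.normal(size=(4, 5))

fro = np.linalg.norm(A, 'fro')          # sqrt of the sum of squared entries
two = np.linalg.norm(A, 2)              # largest singular value

# ||A||_2 is the square root of the largest eigenvalue of A^T A
assert np.isclose(two, np.sqrt(np.linalg.eigvalsh(A.T @ A).max()))
# submultiplicativity in both norms
assert np.linalg.norm(A @ B, 'fro') <= fro * np.linalg.norm(B, 'fro') + 1e-12
assert np.linalg.norm(A @ B, 2) <= two * np.linalg.norm(B, 2) + 1e-12
# ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2, with n the number of columns
assert two <= fro + 1e-12 and fro <= np.sqrt(A.shape[1]) * two + 1e-12
# symmetric (here PSD) case: ||S||_2 equals the largest eigenvalue of S
S = A.T @ A
assert np.isclose(np.linalg.norm(S, 2), np.linalg.eigvalsh(S).max())
```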


Matrix Perturbation
Traditional matrix perturbation theory: how the derived perturbation E affects the covariance matrix A.
Our scenario: how the primary perturbation V affects the data matrix U.
    Ã = ŨᵀŨ = (U + V)ᵀ(U + V) = UᵀU + (UᵀV + VᵀU + VᵀV) = A + E
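A quick numpy illustration of this decomposition (random data and sizes are assumptions): the derived perturbation E is computed directly from U and V and recovers Ã = A + E exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 1_000, 10                          # records x attributes (assumed sizes)
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # original data
V = rng.normal(0.0, 0.5, size=(n, m))                    # primary perturbation

A = U.T @ U                               # original covariance-style matrix
E = U.T @ V + V.T @ U + V.T @ V           # derived perturbation on A
A_tilde = (U + V).T @ (U + V)             # covariance of the perturbed data
assert np.allclose(A_tilde, A + E)        # A_tilde = A + E holds exactly
```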


Error Bound Analysis
Prop 1. Let the covariance matrix of the perturbed data be Ã = A + E. Given Û = ŨP̃ and δ = λ_k − λ_{k+1} − ||E||_2 > 0 (where λ_k − λ_{k+1} is the eigengap), then
    ||Û − U||_F ≤ ||U||_F ||P̃ − P||_F + ||P̃V||_F   and   ||P̃ − P||_F ≤ √2 ||E||_F / δ
Prop 2 (eigenvalues of E). With λ_1 ≥ λ_2 ≥ ... ≥ λ_n the eigenvalues of A and ε_1 ≥ ... ≥ ε_n those of E,
    λ̃_i ∈ [λ_i + ε_n, λ_i + ε_1]
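Prop 2 is in the spirit of Weyl's eigenvalue perturbation bound; the sketch below checks that interval numerically on randomly generated U and V (sizes and noise level are assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 8
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
V = rng.normal(0.0, 0.5, size=(n, m))

A = U.T @ U
E = U.T @ V + V.T @ U + V.T @ V
lam = np.sort(np.linalg.eigvalsh(A))[::-1]            # eigenvalues of A, descending
lam_tilde = np.sort(np.linalg.eigvalsh(A + E))[::-1]  # eigenvalues of A_tilde
eps = np.sort(np.linalg.eigvalsh(E))[::-1]            # eigenvalues of E

# Weyl: lam_i + eps_min <= lam_tilde_i <= lam_i + eps_max for every i
assert np.all(lam + eps[-1] <= lam_tilde + 1e-8)
assert np.all(lam_tilde <= lam + eps[0] + 1e-8)
```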


Theorem
Given a data set U ∈ R^{n×m} and a noise set V ∈ R^{n×m}, we have the perturbed data set Ũ = U + V. Let Û be the estimate obtained by Spectral Filtering with k principal components; then
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + ||P̃V||_F
where E = UᵀV + VᵀU + VᵀV is the derived perturbation on the original covariance matrix A = UᵀU.
Proof is skipped.


Special Cases
When the noise matrix V is generated by an i.i.d. Gaussian distribution with zero mean and known variance:
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + √(k/m) ||V||_F
When the noise is completely correlated with the data:
    ||Û − U||_F ≤ √2 ||Ũ||_F ||E||_F / (λ̃_k − λ̃_{k+1} − 2||E||_2) + ||V||_F
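The contrast between the two cases can be illustrated numerically: projecting i.i.d. noise onto a low-dimensional principal subspace removes most of its energy, while noise that is correlated with the data largely survives the projection. The sketch below (random low-rank data, assumed sizes and noise levels) compares ||P̃V||_F / ||V||_F in the two settings.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 10_000, 35, 4
U = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))   # data with k strong components

# top-k projector built from the clean covariance (for illustration only)
lam, Q = np.linalg.eigh(U.T @ U)
X = Q[:, np.argsort(lam)[::-1][:k]]
P = X @ X.T

V_iid = rng.normal(0.0, 1.0, size=(n, m))     # case: i.i.d. Gaussian noise
V_corr = 0.3 * U                              # case: noise fully correlated with the data

for name, V in [("i.i.d.", V_iid), ("correlated", V_corr)]:
    ratio = np.linalg.norm(V @ P) / np.linalg.norm(V)
    print(f"{name:>10} noise: ||PV||_F / ||V||_F = {ratio:.2f}")
```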


Experimental Results
Artificial dataset: 35 correlated variables (F1–F35), 30,000 tuples
[Figure: the 35 correlated variables F1–F35 of the artificial dataset]


Experimental Results
Scenarios of noise addition:
Case 1: i.i.d. Gaussian noise N(0, COV), where COV = diag(σ², ..., σ²)
Case 2: independent Gaussian noise N(0, COV), where COV = c · diag(σ₁², ..., σₙ²)
Case 3: correlated Gaussian noise N(0, COV), where COV = c · Σ_U (or c · A ...)
Measures:
Absolute error: ae(U, Û) = ||U − Û||_F
Relative error: re(U, Û) = ||U − Û||_F / ||U||_F
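A sketch of how these noise scenarios and error measures might be set up in numpy; the data, the constant c, σ², and the use of the sample covariance of U for Σ_U (and of its diagonal for case 2) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 5_000, 35                          # smaller than the 30,000 tuples, for speed
U = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # correlated artificial data

sigma2, c = 0.5, 0.74                     # assumed noise parameters
cov_u = np.cov(U, rowvar=False)           # sample covariance, standing in for Sigma_U

covs = {
    "case 1": sigma2 * np.eye(m),                    # i.i.d.: diag(sigma^2, ..., sigma^2)
    "case 2": c * np.diag(np.diag(cov_u)),           # independent, per-attribute variances
    "case 3": c * cov_u,                             # correlated: c * Sigma_U
}

def ae(U, U_hat):                         # absolute error
    return np.linalg.norm(U - U_hat)

def re(U, U_hat):                         # relative error
    return ae(U, U_hat) / np.linalg.norm(U)

for name, cov in covs.items():
    V = rng.multivariate_normal(np.zeros(m), cov, size=n)
    print(name, np.linalg.norm(V), re(U, U + V))     # ||V||_F and re of the raw perturbed data
```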


Determining k
Determine k in Spectral Filtering.
According to matrix perturbation theory: k = max{ i : λ̃_i ≥ ||E||_2 }
Our heuristic approach: check min_i | λ̃_i − ||E||_2 |
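One way such a rule might be implemented is sketched below: retain the components of the perturbed covariance whose eigenvalues exceed the noise level ||E||_2. Computing E exactly from U and V is only possible in simulation; an attacker would have to estimate ||E||_2, e.g., from the published noise variance. All sizes and parameters are assumptions.

```python
import numpy as np

def choose_k(u_tilde, e_norm2):
    """Keep the components of the perturbed covariance whose eigenvalues
    exceed the (estimated) noise level ||E||_2."""
    lam = np.sort(np.linalg.eigvalsh(u_tilde.T @ u_tilde))[::-1]
    return int(np.sum(lam >= e_norm2))

# Illustrative usage; here ||E||_2 is computed exactly because U and V are known.
rng = np.random.default_rng(7)
n, m = 5_000, 35
U = rng.normal(size=(n, 4)) @ rng.normal(size=(4, m))    # data with 4 strong components
V = rng.normal(0.0, 0.3, size=(n, m))
E = U.T @ V + V.T @ U + V.T @ V
k = choose_k(U + V, np.linalg.norm(E, 2))
print("selected k =", k)
```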


Effect of varying k (Case 1): N(0, COV), where COV = diag(σ², ..., σ²); relative error (* = lowest error for that noise level)

σ²       0.05   0.10   0.3    0.5    1.0
||V||_F  229    323    561    725    1025
K=1      0.43   0.44   0.45   0.46   0.48
K=2      0.22   0.23   0.26   0.29   0.36
K=3      0.16   0.18   0.24   0.29   *0.31
K=4      *0.09  *0.12  *0.22  *0.28  0.40
K=5      0.10   0.14   0.25   0.32   0.45


Effect of varying k (Case 2): N(0, COV), where COV = c · diag(σ₁², σ₂², ..., σₙ²); relative error (* = lowest error for that noise level)

c        0.07   0.15   0.44   0.74   1.45
||V||_F  229    323    561    725    1025
K=1      0.44   0.44   0.45   0.46   0.49
K=2      0.22   0.23   0.27   *0.30  *0.36
K=3      0.16   0.18   0.24   0.33   0.44
K=4      *0.07  *0.11  *0.23  0.37   0.50
K=5      0.09   0.13   0.26   0.40   0.56


Effect of varying k (Case 3): N(0, COV), where COV = c · Σ_U; relative error (* = lowest error for that noise level)

c        0.07   0.15   0.44   0.74   1.45
||V||_F  229    323    561    725    1025
K=1      0.50   0.55   0.73   0.88   *1.17
K=2      0.34   0.43   0.68   0.86   1.19
K=3      0.30   0.41   0.67   0.86   1.20
K=4      *0.27  *0.38  *0.65  *0.85  1.20
K=5      0.27   0.38   0.65   0.85   1.20


Effect of varying noise (||V||_F / ||U||_F = 87.8%)
[Figure: reconstruction results for σ² = 0.1, 0.5, and 1.0]


Effect of covariance matrix (||V||_F / ||U||_F = 39.1%)
[Figure: reconstruction results for Cases 1, 2, and 3]


Conclusion
Spectral-filtering-based techniques have been investigated as a major means of point-wise data reconstruction.
We present an upper bound that determines how close the estimate achieved by attackers can be to the original data.


Future Work
We are working on a lower bound, which represents the best estimate the attacker can achieve using Spectral Filtering and which can be used by data owners to determine how much noise should be added to preserve privacy.
Bound analysis at the point-wise level.


Acknowledgement
NSF Grants CCR-0310974 and IIS-0546027
Personnel: Xintao Wu, Songtao Guo, Ling Guo
More info: http://www.cs.uncc.edu/~xwu/  [email protected]


Questions?

Thank you!