Randomization based Privacy Preserving Data Mining Xintao Wu University of North Carolina at Charlotte August 30, 2012

Randomization based Privacy Preserving Data Mining

Xintao Wu

University of North Carolina at Charlotte

August 30, 2012


Scope


Outline

Part I: Randomization for Numerical Data
  - Additive noise
  - Projection
  - Modeling based

Part II: Randomization for Categorical Data
  - Randomized Response
  - Application to Market Basket Data Analysis


Additive Noise Randomization Example

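     Bal    income   …   IntP
1    10k    85k      …   2k
2    15k    70k      …   18k
3    50k    120k     …   35k
4    45k    23k      …   134k
.    .      .        …   .
N    80k    110k     …   15k

Y = X + E (each row is one record: Bal, income, IntP, in thousands)

$$
\underbrace{\begin{pmatrix}
17.334 & 88.759 & 2.099 \\
19.199 & 77.537 & 25.939 \\
59.199 & 128.447 & 38.678 \\
51.208 & 30.313 & 135.939 \\
89.048 & 115.692 & 21.318
\end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix}
10 & 85 & 2 \\
15 & 70 & 18 \\
50 & 120 & 35 \\
45 & 23 & 134 \\
80 & 110 & 15
\end{pmatrix}}_{X}
+
\underbrace{\begin{pmatrix}
7.334 & 3.759 & 0.099 \\
4.199 & 7.537 & 7.939 \\
9.199 & 8.447 & 3.678 \\
6.208 & 7.313 & 1.939 \\
9.048 & 5.692 & 6.318
\end{pmatrix}}_{E}
$$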


Additive Randomization (Z=X+Y)

[Figure: additive randomization pipeline]

Original records (Age | Salary | ...): 50 | 40K | ..., 30 | 70K | ..., ...
Randomizer: add a random number to Age, e.g. Alice's age 30 becomes 65 (30+35)
Randomized records: 65 | 20K | ..., 25 | 60K | ..., ...
Reconstruct distribution of Age; reconstruct distribution of Salary
Classification Algorithm → Model

• R. Agrawal and R. Srikant, SIGMOD 00


Reconstruction Problem

• Original values x1, x2, ..., xn from probability distribution X (unknown)

• To hide these values, we use y1, y2, ..., yn from probability distribution Y

• Given
  - x1+y1, x2+y2, ..., xn+yn
  - the probability distribution of Y

  estimate the probability distribution of X.


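Distribution Reconstruction Alg.

• Bootstrapping Algorithm

$$
\begin{aligned}
& f_X^{0} := \text{uniform distribution} \\
& j := 0 \\
& \textbf{repeat} \\
& \qquad f_X^{j+1}(a) := \frac{1}{n}\sum_{i=1}^{n}
  \frac{f_Y\big((x_i+y_i)-a\big)\, f_X^{j}(a)}
       {\int_{-\infty}^{\infty} f_Y\big((x_i+y_i)-z\big)\, f_X^{j}(z)\, dz} \\
& \qquad j := j+1 \\
& \textbf{until } (\text{stopping criterion met})
\end{aligned}
$$

• Converges to maximum likelihood estimate, Agrawal and Aggarwal PODS 01

• Extension to multi-variate case, Domingo-Ferrer et al. PSD04

[Figure: number of people vs. age (20 to 60) for the original, randomized, and reconstructed distributions]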
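The iterative update can be sketched on a discretized grid. This is a minimal illustration, not the authors' implementation: the grid, the uniform noise on [-20, 20], the synthetic "ages", and the names `reconstruct_distribution` and `uniform_pdf` are all assumptions made for the example, and the integral is replaced by a Riemann sum.

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, n_iter=200, tol=1e-6):
    """Estimate the density of X on `grid` from w_i = x_i + y_i (Bayes-style iteration)."""
    da = grid[1] - grid[0]                            # uniform grid spacing
    f = np.full(grid.size, 1.0 / (grid.size * da))    # f_X^0: uniform distribution
    lik = noise_pdf(w[:, None] - grid[None, :])       # lik[i, a] = f_Y(w_i - a)
    for _ in range(n_iter):
        numer = lik * f[None, :]                      # f_Y(w_i - a) * f_X^j(a)
        denom = numer.sum(axis=1, keepdims=True) * da # Riemann sum for the integral
        f_new = (numer / denom).mean(axis=0)          # average over the n observations
        converged = np.abs(f_new - f).sum() * da < tol
        f = f_new
        if converged:                                 # stopping criterion (assumed)
            break
    return f

def uniform_pdf(v):
    """Density of the assumed noise distribution Y: uniform on [-20, 20]."""
    return ((v >= -20.0) & (v <= 20.0)) / 40.0

rng = np.random.default_rng(0)
x = rng.normal(40.0, 5.0, size=2000)        # hypothetical original values
y = rng.uniform(-20.0, 20.0, size=x.size)   # noise drawn from the known distribution
w = x + y                                   # only w and the noise density are used below
grid = np.linspace(0.0, 80.0, 161)
f_hat = reconstruct_distribution(w, uniform_pdf, grid)

da = grid[1] - grid[0]
mean_hat = (grid * f_hat).sum() * da
std_hat = np.sqrt((((grid - mean_hat) ** 2) * f_hat).sum() * da)
print(f"reconstructed mean {mean_hat:.2f}, std {std_hat:.2f}; randomized std {w.std():.2f}")
```

The estimate only recovers the distribution of X, not the individual values x_i, which is the point of the reconstruction problem as stated.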


Individual Value Reconstruction (Additive Noise)

• Methods
  - Spectral Filtering, Kargupta et al. ICDM03
  - PCA, Huang, Du, and Chen SIGMOD05
  - SVD, Guo, Wu and Li, PKDD06

• All aim to remove noise by projecting onto a lower-dimensional space.


Individual Reconstruction Algorithm

U_p = U + V   (Perturbed = Original + Noise)

• Apply EVD to the covariance matrix of U_p: $\Sigma_{U_p} = Q \Lambda Q^T$

• Using some published information about V, extract the first k components of U_p as the principal components.
  - λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, and e1, e2, ···, ek are the corresponding eigenvectors.
  - Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X.

• Find the orthogonal projection onto X: $P = Q_k Q_k^T$

• Get estimate data set: $\hat{U} = U_p P$
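A minimal sketch of this projection step, under assumptions that are not in the slides: the data are synthetic (one latent factor driving five correlated attributes), k is fixed by hand rather than chosen from published information about V, and the data are centered before the eigendecomposition. The function name `pca_filter` is hypothetical.

```python
import numpy as np

def pca_filter(U_p, k):
    """Project perturbed data (rows = records) onto its top-k principal subspace."""
    mean = U_p.mean(axis=0)
    centered = U_p - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]               # indices of the k largest
    Q_k = eigvecs[:, top]                             # orthonormal basis of the subspace
    P = Q_k @ Q_k.T                                   # orthogonal projector P = Q_k Q_k^T
    return centered @ P + mean                        # estimate of the original data

rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 1))
loadings = np.array([[3.0, 2.0, 1.0, 2.5, 1.5]])
U = latent @ loadings + 0.05 * rng.normal(size=(1000, 5))  # correlated original data
V = rng.normal(scale=1.0, size=U.shape)                    # uncorrelated additive noise
U_p = U + V

U_hat = pca_filter(U_p, k=1)
err_before = np.mean((U_p - U) ** 2)
err_after = np.mean((U_hat - U) ** 2)
print(f"mean squared error: perturbed {err_before:.3f}, filtered {err_after:.3f}")
```

Because the signal lives in one direction while the noise is spread over all five, discarding the other four directions removes most of the noise energy.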


Why it works

• Original data are correlated
• Noise is not correlated

[Figure: original signal + noise = perturbed data, shown with the 1st and 2nd principal vectors; 1-d estimation vs. 2-d estimation]


Additive Noise vs. Projection

• Additive perturbation is not safe: Y = X + E (Perturbed = Original + Noise)
  - Spectral Filtering Technique, H. Kargupta et al. ICDM03
  - PCA Based Technique, Huang et al. SIGMOD05
  - SVD based & Bound Analysis, Guo et al. SAC06, PKDD06

• How about the projection based perturbation? Y = R X (Perturbed = Transformation × Original)
  - Projection models
  - Vulnerabilities
  - Potential attacks


Rotation Randomization Example

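     Bal    income   …   IntP
1    10k    85k      …   2k
2    15k    70k      …   18k
3    50k    120k     …   35k
4    45k    23k      …   134k
.    .      .        …   .
N    80k    110k     …   15k

Y = R X (each column of X is one record: Bal, income, IntP, in thousands)

$$
\underbrace{\begin{pmatrix}
61.33 & 63.67 & 110.00 & 119.67 & 63.33 \\
49.33 & 30.67 & 55.00 & -59.33 & -31.67 \\
-33.67 & -21.33 & -30.00 & 51.67 & -51.67
\end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix}
0.3333 & 0.6667 & 0.6667 \\
-0.6667 & 0.6667 & -0.3333 \\
-0.6667 & -0.3333 & 0.6667
\end{pmatrix}}_{R}
\underbrace{\begin{pmatrix}
10 & 15 & 50 & 45 & 80 \\
85 & 70 & 120 & 23 & 110 \\
2 & 18 & 35 & 134 & 15
\end{pmatrix}}_{X}
$$

$R R^T = R^T R = I$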


Rotation Approach (R is orthonormal)

• When R is an orthonormal matrix ($R^T R = R R^T = I$)
  - Vector length: |Rx| = |x|
  - Euclidean distance: |Rxi - Rxj| = |xi - xj|
  - Inner product: <Rxi, Rxj> = <xi, xj>

• Many clustering and classification methods are invariant to this rotation perturbation.
  - Classification, Chen and Liu, ICDM 05
  - Distributed data mining, Liu and Kargupta, TKDE 06
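These invariants can be checked numerically on the 3×3 rotation example. The sketch below writes R with exact thirds (the slide shows the rounded values 0.3333 and 0.6667) and treats each column of X as one record; the variable names are illustrative.

```python
import numpy as np

# Rotation matrix of the example, with entries 1/3 and 2/3 written exactly.
R = np.array([[ 1.0,  2.0,  2.0],
              [-2.0,  2.0, -1.0],
              [-2.0, -1.0,  2.0]]) / 3.0

# Columns are records (Bal, income, IntP) in thousands.
X = np.array([[10.0, 15.0,  50.0,  45.0,  80.0],
              [85.0, 70.0, 120.0,  23.0, 110.0],
              [ 2.0, 18.0,  35.0, 134.0,  15.0]])

Y = R @ X  # perturbed data

orthonormal = np.allclose(R @ R.T, np.eye(3)) and np.allclose(R.T @ R, np.eye(3))
lengths_kept = np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0))
inner_products_kept = np.allclose(Y.T @ Y, X.T @ X)

def pairwise_distances(M):
    """Euclidean distance between every pair of columns of M."""
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

distances_kept = np.allclose(pairwise_distances(Y), pairwise_distances(X))
print(orthonormal, lengths_kept, inner_products_kept, distances_kept)
```

Any method that depends on the data only through lengths, distances, or inner products therefore gives the same result on Y as on X.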


Example

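$$
Y = R X, \qquad
R = \begin{pmatrix} 0.866 & \pm 0.500 \\ \mp 0.500 & 0.866 \end{pmatrix}
$$

[Figure: scatter plots of the original data X and the rotated data Y; labelled values 0.2902 and 1.3086]

$R R^T = R^T R = I$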


Weakness of Rotation

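• Known sample attack

[Figure: known info (a sample of the original data) together with the perturbed data is fed to regression, which estimates the rotation R (entries of magnitude 0.866 and 0.500) and raises the question of whether the original data can be recovered; labelled values 0.2902 and 1.3086]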
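A minimal sketch of how a known-sample attack via regression could work, under assumptions that are not in the slides: the attacker knows the original values of a handful of records and their positions in the published data, the rotation is a 30° rotation with an assumed sign convention, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.pi / 6                                   # 30 degrees: cos = 0.866, sin = 0.500
R = np.array([[np.cos(theta), -np.sin(theta)],      # secret rotation (sign convention assumed)
              [np.sin(theta),  np.cos(theta)]])

X = rng.normal(size=(2, 200))                       # original data, columns = records
Y = R @ X                                           # published perturbed data

known = [0, 1, 2, 3, 4]                             # records whose originals leaked (assumed)

# Regression step: Y_k = R X_k  <=>  X_k^T R^T = Y_k^T, solved by least squares.
R_T_hat, *_ = np.linalg.lstsq(X[:, known].T, Y[:, known].T, rcond=None)
R_hat = R_T_hat.T

# Invert the estimated transformation to recover every record.
X_hat = np.linalg.solve(R_hat, Y)
print("max recovery error:", np.abs(X_hat - X).max())
```

With a noise-free rotation, as few known records as there are attributes (if linearly independent) determine R, which is why rotation alone is considered weak.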


General Linear Transformation

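• Y = R X + E
  - When R = I: Y = X + E (Additive Noise Model)
  - When $R R^T = R^T R = I$ and E = 0: Y = RX (Rotation Model)
  - R can be an arbitrary matrix

$$
\underbrace{\begin{pmatrix}
265.95 & 286.63 & 475.68 & 581.71 & 520.53 \\
394.30 & 338.49 & 569.58 & 174.22 & 277.79 \\
362.55 & 394.11 & 665.37 & 776.46 & 463.08
\end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix}
4.751 & 2.429 & 2.282 \\
1.156 & 4.457 & 0.093 \\
3.034 & 3.811 & 4.107
\end{pmatrix}}_{R}
\underbrace{\begin{pmatrix}
10 & 15 & 50 & 45 & 80 \\
85 & 70 & 120 & 23 & 110 \\
2 & 18 & 35 & 134 & 15
\end{pmatrix}}_{X}
+
\underbrace{\begin{pmatrix}
7.334 & 4.199 & 9.199 & 6.208 & 9.048 \\
3.759 & 7.537 & 8.447 & 7.313 & 5.692 \\
0.099 & 7.939 & 3.678 & 1.939 & 6.318
\end{pmatrix}}_{E}
$$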


Is Y = R X + E Safe?

• R can be an arbitrary matrix, hence the regression-based attack won't work

• How about a noisy ICA direct attack?
  - Y = R X + E (General Linear Transformation Model)
  - X = A S + N (Noisy ICA Model)


Scope (Part II)

     ssn   name   zip     race    …   age   Sex   Bal   income   …   IntP
1                 28223   Asian   …   20    M     10k   85k      …   2k
2                 28223   Asian   …   30    F     15k   70k      …   18k
3                 28262   Black   …   20    M     50k   120k     …   35k
4                 28261   White   …   26    M     45k   23k      …   134k
.                 .       .       …   .     .     .     .        …   .
N                 28223   Asian   …   20    M     80k   110k     …   15k

69% unique on zip and birth date

87% with zip, birth date and gender.

k-anonymity, L-diversity

SDC etc.

Our approach: Randomized Response


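Randomized Response (Stanley Warner, JASA 1965)

Purpose: Get the proportion ($\pi_A$) of population members that cheated in the exam.

A: Cheated in the exam
Ā: Didn't cheat in the exam

Procedure: a randomization device selects the question each respondent answers
  - "Do you belong to A?" (with probability p)
  - "Do you belong to Ā?" (with probability 1-p)

  and the respondent gives a "Yes" or "No" answer.

As: $P(\text{Yes}) = \lambda = \pi_A\, p + (1-\pi_A)(1-p)$

Unbiased estimate of $\pi_A$ is:

$$
\hat{\pi}_{AW} = \frac{p-1}{2p-1} + \frac{\hat{\lambda}}{2p-1}
$$

where $\hat{\lambda}$ is the observed proportion of "Yes" answers.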
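The estimator can be illustrated with a small simulation. This is a sketch under assumed parameters (true proportion 0.3, p = 0.7, 100,000 respondents); the function names `warner_estimate` and `simulate_yes_fraction` are hypothetical.

```python
import random

def warner_estimate(yes_fraction, p):
    """Warner's estimate of pi_A from the observed fraction of "Yes" answers."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers uninformative (2p - 1 = 0)")
    return (p - 1) / (2 * p - 1) + yes_fraction / (2 * p - 1)

def simulate_yes_fraction(pi_true, p, n, seed=0):
    """Simulate n respondents and return the fraction answering "Yes"."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        in_A = rng.random() < pi_true      # respondent truly belongs to A
        asked_about_A = rng.random() < p   # device picks "Do you belong to A?"
        # "Yes" when asked about A and in A, or asked about not-A and not in A.
        yes += (in_A == asked_about_A)
    return yes / n

lam_hat = simulate_yes_fraction(pi_true=0.3, p=0.7, n=100_000)
pi_hat = warner_estimate(lam_hat, p=0.7)
print(f"observed Yes fraction {lam_hat:.4f}, estimated pi_A {pi_hat:.4f}")
```

The interviewer never learns which question a respondent answered, yet the population proportion is still recoverable because p is known.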


Matrix Expression

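• RR can be expressed by matrix as (0: No, 1: Yes):

$$
\begin{pmatrix} \lambda_0 \\ \lambda_1 \end{pmatrix}
=
\begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix}
\begin{pmatrix} \pi_0 \\ \pi_1 \end{pmatrix},
\qquad \text{i.e. } \lambda = P\pi
$$

Unbiased estimate of $\pi$ is: $\hat{\pi} = P^{-1}\hat{\lambda}$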


Vector Response

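$\pi = (\pi_1, \ldots, \pi_t)'$ is the true proportions of the population

$\lambda = (\lambda_1, \ldots, \lambda_t)'$ is the observed proportions in the survey

$P = (p_{ij})$ is the randomization device set by the interviewer (row i: reported category, column j: true category).

$\lambda = P\pi$:

$$
\begin{pmatrix} 0.16 \\ 0.25 \\ 0.32 \\ 0.27 \end{pmatrix}
=
\begin{pmatrix}
0.60 & 0.20 & 0.00 & 0.10 \\
0.20 & 0.50 & 0.20 & 0.10 \\
0.15 & 0.15 & 0.70 & 0.30 \\
0.05 & 0.15 & 0.10 & 0.50
\end{pmatrix}
\begin{pmatrix} 0.10 \\ 0.30 \\ 0.20 \\ 0.40 \end{pmatrix}
$$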
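The numbers of this example can be checked directly: multiplying the device matrix by the true proportions gives the observed proportions, and solving the linear system recovers the true proportions. A short sketch (variable names are illustrative):

```python
import numpy as np

# Randomization device: column j is the distribution of the reported
# category when the true category is j, so every column sums to 1.
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])

pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions

lam = P @ pi                              # observed proportions: lambda = P pi
pi_back = np.linalg.solve(P, lam)         # estimate: pi_hat = P^{-1} lambda

print("lambda =", np.round(lam, 2))
print("recovered pi =", np.round(pi_back, 2))
```

In a real survey only a sample estimate of λ is available, so the recovered proportions carry sampling error in addition to the error introduced by the randomization.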


Analysis

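$$
\lambda = P\pi, \qquad \hat{\pi} = P^{-1}\hat{\lambda}
$$

$$
\mathrm{disp}(\hat{\lambda}) = n^{-1}(\lambda^{\delta} - \lambda\lambda')
$$

$$
\mathrm{disp}(\hat{\pi}) = n^{-1} P^{-1}(\lambda^{\delta} - \lambda\lambda')P'^{-1} = \Sigma_1 + \Sigma_2
$$

$$
\Sigma_1 = n^{-1}(\pi^{\delta} - \pi\pi'), \qquad
\Sigma_2 = n^{-1} P^{-1}(\lambda^{\delta} - P\pi^{\delta}P')P'^{-1}
$$

  - $\Sigma_1$: the dispersion matrix of the regular survey estimation
  - $\Sigma_2$: nonnegative definite, represents the components of dispersion associated with RR experiment
  - $\lambda^{\delta}$: diagonal matrix with elements $\lambda_1, \ldots, \lambda_t$ (similarly $\pi^{\delta}$)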


Extension to Multi Attributes

m sensitive attributes $A_1, A_2, \ldots, A_m$: each $A_j$ has $t_j$ categories $A_{j1}, \ldots, A_{jt_j}$.

$\pi_{i_1, \ldots, i_m}$ denotes the true proportion corresponding to the combination $(A_{1i_1}, \ldots, A_{mi_m})$.

Let $\pi$ be the vector with elements $\pi_{i_1, \ldots, i_m}$ $(i_j = 1, \ldots, t_j)$, arranged lexicographically.

e.g., if m = 2, t1 = 2 and t2 = 3: $\pi = (\pi_{11}, \pi_{12}, \pi_{13}, \pi_{21}, \pi_{22}, \pi_{23})'$

Simultaneous Model

Consider all variables as one compounded variable and apply the regular vector response RR technique

Sequential Model

$\lambda = (P_1 \otimes P_2 \otimes \cdots \otimes P_m)\,\pi$, where $\otimes$ stands for Kronecker product


Kronecker Product Example

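$$
P_1 = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
P_2 = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
$$

$$
P_1 \otimes P_2
= \begin{pmatrix} a_{11}P_2 & a_{12}P_2 \\ a_{21}P_2 & a_{22}P_2 \end{pmatrix}
= \begin{pmatrix}
a_{11}b_{11} & a_{11}b_{12} & a_{11}b_{13} & a_{12}b_{11} & a_{12}b_{12} & a_{12}b_{13} \\
a_{11}b_{21} & a_{11}b_{22} & a_{11}b_{23} & a_{12}b_{21} & a_{12}b_{22} & a_{12}b_{23} \\
a_{11}b_{31} & a_{11}b_{32} & a_{11}b_{33} & a_{12}b_{31} & a_{12}b_{32} & a_{12}b_{33} \\
a_{21}b_{11} & a_{21}b_{12} & a_{21}b_{13} & a_{22}b_{11} & a_{22}b_{12} & a_{22}b_{13} \\
a_{21}b_{21} & a_{21}b_{22} & a_{21}b_{23} & a_{22}b_{21} & a_{22}b_{22} & a_{22}b_{23} \\
a_{21}b_{31} & a_{21}b_{32} & a_{21}b_{33} & a_{22}b_{31} & a_{22}b_{32} & a_{22}b_{33}
\end{pmatrix}
$$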
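NumPy's `np.kron` builds the same block structure. The sketch below uses two made-up device matrices (their entries are assumptions chosen only so that columns sum to 1) and also checks the property that makes the sequential model convenient: the inverse of a Kronecker product is the Kronecker product of the inverses.

```python
import numpy as np

# Hypothetical randomization matrices for a 2-category and a 3-category attribute.
P1 = np.array([[0.8, 0.2],
               [0.2, 0.8]])
P2 = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.2, 0.7]])

P = np.kron(P1, P2)                       # 6 x 6 matrix P1 (x) P2

# Block (r, c) of P equals a_rc * P2.
top_left_ok = np.allclose(P[:3, :3], P1[0, 0] * P2)
bottom_left_ok = np.allclose(P[3:, :3], P1[1, 0] * P2)

# Columns of P still sum to 1, so P is a valid device for the compound variable.
columns_ok = np.allclose(P.sum(axis=0), 1.0)

# (P1 (x) P2)^{-1} = P1^{-1} (x) P2^{-1}: only the small matrices need inverting.
inverse_ok = np.allclose(np.linalg.inv(P),
                         np.kron(np.linalg.inv(P1), np.linalg.inv(P2)))
print(P.shape, top_left_ok, bottom_left_ok, columns_ok, inverse_ok)
```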


Analysis

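$$
\lambda = (P_1 \otimes \cdots \otimes P_m)\,\pi,
\qquad
\hat{\pi} = (P_1^{-1} \otimes \cdots \otimes P_m^{-1})\,\hat{\lambda}
$$

$$
\mathrm{disp}(\hat{\pi}) = n^{-1} P^{-1}(\lambda^{\delta} - \lambda\lambda')P'^{-1},
\qquad P = P_1 \otimes \cdots \otimes P_m
$$

Similarly, the dispersion matrix can be decomposed into two parts: one corresponds to that of the regular survey estimation, and the other corresponds to the components of dispersion associated with the RR experiment.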


Outline(Part II)

Randomization for Categorical Data
  - Randomized Response Model
  - To what extent does it affect mining results?
  - To what extent does it protect privacy?

Application in market basket 0-1 data analysis
  - Data swapping
  - Frequent itemsets or rule hiding
  - Inverse frequent itemset mining (SDM05, ICDM05)
  - Item randomization (PKDD07, PAKDD08)


Market Basket Data

TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
.     .      .       .       …   .
N     0      1       1       …   0

1: presence, 0: absence

• Association rule (R. Agrawal SIGMOD 1993)

  $X \Rightarrow Y$ with support $s = P(XY)$ and confidence $c = \dfrac{P(XY)}{P(X)}$

• Other measures in MBA

  Correlation, Lift, Interest, etc.
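Support and confidence can be computed directly from a 0-1 matrix. The sketch below uses only the five rows visible in the table (rows 1 to 4 and N) and the four named items, so the resulting numbers are illustrative; the helper names `support` and `confidence` are hypothetical.

```python
import numpy as np

items = ["milk", "sugar", "bread", "cereals"]
# The five visible transactions of the table; 1 = presence, 0 = absence.
D = np.array([[1, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0]])

def support(D, cols):
    """Fraction of transactions containing every item in `cols`: P(XY)."""
    return D[:, cols].all(axis=1).mean()

def confidence(D, x_cols, y_cols):
    """Confidence of the rule X => Y: P(XY) / P(X)."""
    return support(D, x_cols + y_cols) / support(D, x_cols)

milk, cereals = [items.index("milk")], [items.index("cereals")]
s = support(D, milk + cereals)
c = confidence(D, milk, cereals)
print(f"milk => cereals: support {s:.2f}, confidence {c:.3f}")
```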


Item Perturbation

Original Data

TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
.     .      .       .       …   .
N     0      1       1       …   0

Randomized Data

TID   milk   sugar   bread   …   cereals
1     0      1       1       …   1
2     1      1       1       …   0
3     1      1       1       …   1
4     0      0       1       …   1
.     .      .       .       …   .
N     1      1       0       …   1

Individual privacy is preserved!


Research Problems

• How does it affect the accuracy of discovered association rules?

• How does it affect the accuracy of other measures?

• Two scenarios
  - Known distortion probability
  - Unknown distortion probability


Motivation example

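Original Data (data owners)

TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
.     .      .       .       …   .
N     0      1       1       …   0

Randomized Data after RR (data miners)

TID   milk   sugar   bread   …   cereals
1     0      1       1       …   1
2     1      1       1       …   0
3     1      1       1       …   1
4     1      0       1       …   1
.     .      .       .       …   .
N     0      1       0       …   1

A: Milk, B: Cereals

$$
P_A = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix},
\qquad
P_B = \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix}
$$

Original contingency table:

        B̄       B
Ā     0.415   0.043   0.458
A     0.183   0.359   0.542
      0.598   0.402

Randomized contingency table:

        B̄       B
Ā     0.368   0.097   0.465
A     0.218   0.317   0.537
      0.586   0.414

$$
\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})' = (0.415, 0.043, 0.183, 0.359)'
$$

$$
\hat{\lambda} = (\hat{\lambda}_{00}, \hat{\lambda}_{01}, \hat{\lambda}_{10}, \hat{\lambda}_{11})' = (0.368, 0.097, 0.218, 0.316)'
$$

$$
\hat{\pi} = (P_A^{-1} \otimes P_B^{-1})\,\hat{\lambda}
= (\hat{\pi}_{00}, \hat{\pi}_{01}, \hat{\pi}_{10}, \hat{\pi}_{11})' = (0.427, 0.031, 0.181, 0.362)'
$$

$$
s_{AB} = \pi_{11}, \qquad \hat{s}_{AB} = \hat{\pi}_{11}
$$

$$
c_{AB} = \frac{\pi_{11}}{\pi_{10} + \pi_{11}} = 0.662,
\qquad
\hat{c}_{AB} = \frac{\hat{\pi}_{11}}{\hat{\pi}_{10} + \hat{\pi}_{11}} = 0.671
$$

We can get the estimate $\hat{c}_{AB}$ of $c_{AB}$; how accurate can it be?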
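The estimation step can be reproduced from the randomized proportions. One caveat when running it: with the matrices as printed (0.8/0.2 for A and 0.9/0.1 for B), the inversion does not return the listed estimate, whereas using 0.9/0.1 for both attributes returns values within rounding of (0.427, 0.031, 0.181, 0.362)'. That suggests, though the slides do not confirm it, that the example was computed with 0.9/0.1 for both items. The sketch below shows both; the function name `rr_estimate` is hypothetical.

```python
import numpy as np

def rr_estimate(lam_hat, P_A, P_B):
    """pi_hat = (P_A^{-1} (x) P_B^{-1}) lam_hat, elements ordered 00, 01, 10, 11."""
    P_inv = np.kron(np.linalg.inv(P_A), np.linalg.inv(P_B))
    return P_inv @ lam_hat

lam_hat = np.array([0.368, 0.097, 0.218, 0.316])   # randomized proportions

P_08 = np.array([[0.8, 0.2], [0.2, 0.8]])
P_09 = np.array([[0.9, 0.1], [0.1, 0.9]])

pi_hat_printed = rr_estimate(lam_hat, P_08, P_09)  # matrices as printed
pi_hat_both_09 = rr_estimate(lam_hat, P_09, P_09)  # assumption: 0.9/0.1 for both items

for label, pi_hat in [("P_A=0.8, P_B=0.9", pi_hat_printed),
                      ("P_A=P_B=0.9    ", pi_hat_both_09)]:
    support = pi_hat[3]                             # s_AB = pi_11
    conf = pi_hat[3] / (pi_hat[2] + pi_hat[3])      # c_AB = pi_11 / (pi_10 + pi_11)
    print(label, np.round(pi_hat, 3), f"support {support:.3f}, confidence {conf:.3f}")
```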


Motivation

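[Figure: original and estimated supports of rules 2 and 6, with lower and upper bounds, against the threshold $s_{min} = 23\%$; plotted values 31.5, 35.9, 36.3, 22.1, 12.3, 23.8]

• Original values: $s_2 \ge s_{min}$ (frequent set), $s_6 < s_{min}$ (not frequent set)

• Estimated values: $\hat{s}_2, \hat{s}_6 \ge s_{min}$ (both are frequent set)

  Rule 6 is falsely recognized from the estimated value!

• Lower & upper bound: $s_2^{l} \ge s_{min}$ (frequent set with high confidence), $s_6^{l} < s_{min}$ (frequent set without confidence)

  Such errors can be avoided!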


Accuracy on Support S

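• Estimate of support

$$
\hat{\pi} = P^{-1}\hat{\lambda} = (P_1^{-1} \otimes \cdots \otimes P_k^{-1})\,\hat{\lambda}
$$

• Variance of support

$$
\widehat{\mathrm{cov}}(\hat{\pi}) = (n-1)^{-1} P^{-1}(\hat{\lambda}^{\delta} - \hat{\lambda}\hat{\lambda}')P'^{-1}
$$

For the example, with rows and columns ordered $\hat{\pi}_{00}, \hat{\pi}_{01}, \hat{\pi}_{10}, \hat{\pi}_{11}$:

$$
\widehat{\mathrm{cov}}(\hat{\pi}) = 10^{-5} \times
\begin{pmatrix}
7.113 & -1.668 & -3.134 & -2.311 \\
-1.668 & 2.902 & 0.244 & -1.478 \\
-3.134 & 0.244 & 5.667 & -2.777 \\
-2.311 & -1.478 & -2.777 & 6.566
\end{pmatrix}
$$

The last diagonal element is $\widehat{\mathrm{var}}(\hat{\pi}_{11})$; the element next to it in the last row is $\widehat{\mathrm{cov}}(\hat{\pi}_{10}, \hat{\pi}_{11})$.

• Interquantile range (normal dist.)

$$
\left[\hat{\pi}_{i_1 \cdots i_k} - z_{\alpha/2}\sqrt{\widehat{\mathrm{var}}(\hat{\pi}_{i_1 \cdots i_k})},\;
\hat{\pi}_{i_1 \cdots i_k} + z_{\alpha/2}\sqrt{\widehat{\mathrm{var}}(\hat{\pi}_{i_1 \cdots i_k})}\right]
$$

For $\hat{\pi}_{11} = 0.362$ the range is [0.346, 0.378].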
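The range [0.346, 0.378] around 0.362 is what the normal-approximation formula gives when the variance of the estimate is 6.566 × 10⁻⁵ and α = 0.05 (the value of α is an inference from the numbers, not stated on the slide). A short check, with a hypothetical helper name:

```python
from math import sqrt
from statistics import NormalDist

def normal_interval(estimate, variance, alpha=0.05):
    """(1 - alpha) interquantile range under a normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate + half_width

low, high = normal_interval(estimate=0.362, variance=6.566e-5)
print(f"[{low:.3f}, {high:.3f}]")
```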


Accuracy on Confidence C

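• Estimate of confidence A ⇒ B

$$
\hat{c} = \frac{\hat{s}_{AB}}{\hat{s}_A} = \frac{\hat{\pi}_{11}}{\hat{\pi}_{10} + \hat{\pi}_{11}}
$$

• Variance of confidence

$$
\widehat{\mathrm{var}}(\hat{c}) \approx
\frac{\hat{\pi}_{10}^{2}}{\hat{\pi}_{1+}^{4}}\,\widehat{\mathrm{var}}(\hat{\pi}_{11})
+ \frac{\hat{\pi}_{11}^{2}}{\hat{\pi}_{1+}^{4}}\,\widehat{\mathrm{var}}(\hat{\pi}_{10})
- 2\,\frac{\hat{\pi}_{10}\hat{\pi}_{11}}{\hat{\pi}_{1+}^{4}}\,\widehat{\mathrm{cov}}(\hat{\pi}_{11}, \hat{\pi}_{10}),
\qquad \hat{\pi}_{1+} = \hat{\pi}_{10} + \hat{\pi}_{11}
$$

• Interquantile range (ratio dist. is F(w))

  Loose range derived from Chebyshev's theorem:

$$
\left[\hat{c} - k\sqrt{\widehat{\mathrm{var}}(\hat{c})},\; \hat{c} + k\sqrt{\widehat{\mathrm{var}}(\hat{c})}\right],
\qquad \text{where } k = \alpha^{-1/2}
$$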
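The variance formula is a first-order (delta-method) approximation for the ratio c = π11 / (π10 + π11). The sketch below evaluates it with the example estimates π̂10 = 0.181 and π̂11 = 0.362 and the covariance entries from the support slide; the negative sign of the covariance is an assumption (minus signs do not survive in the extracted matrix, and negative off-diagonals are what make its rows sum to zero). The function name is hypothetical.

```python
def confidence_variance(pi10, pi11, var10, var11, cov_10_11):
    """First-order approximation of var(c_hat) for c = pi11 / (pi10 + pi11)."""
    total4 = (pi10 + pi11) ** 4
    return (pi10 ** 2 / total4) * var11 \
         + (pi11 ** 2 / total4) * var10 \
         - 2 * (pi10 * pi11 / total4) * cov_10_11

var_c = confidence_variance(
    pi10=0.181, pi11=0.362,
    var10=5.667e-5, var11=6.566e-5,
    cov_10_11=-2.777e-5,          # sign assumed, see the note above
)
print(f"var(c_hat) ~ {var_c:.4e}, std ~ {var_c ** 0.5:.4f}")
```

A negative covariance between π̂10 and π̂11 increases the variance of the ratio, because an overestimate of one tends to coincide with an underestimate of the other.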


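Accuracy on Confidence c

Chebyshev's theorem: Let X be a random variable with expected value $\mu$ and finite variance $\sigma^2$. Then for any real $k > 0$:

$$
\Pr(|X - \mu| \ge k\sigma) \le 1/k^{2}
$$

Equivalently,

$$
\Pr(|X - \mu| < k\sigma) \ge 1 - 1/k^{2},
\qquad\text{so with } \alpha = \frac{1}{k^{2}}:\quad
\Pr\!\left(|X - \mu| < \frac{1}{\sqrt{\alpha}}\,\sigma\right) \ge 1 - \alpha
$$

This gives a lower bound on the proportion of data that are within a certain number of standard deviations from the mean.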


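Accuracy on Confidence c

Based on the theorem, the loose $(1-\alpha)100\%$ interquantile range can be approximated as:

$$
\left[\hat{E}(\hat{c}) - \frac{1}{\sqrt{\alpha}}\sqrt{\widehat{\mathrm{var}}(\hat{c})},\;
\hat{E}(\hat{c}) + \frac{1}{\sqrt{\alpha}}\sqrt{\widehat{\mathrm{var}}(\hat{c})}\right]
$$

For the previous example with items A and B, the 95% interquantile range of $c_{AB}$ is:

$$
\left[\hat{E}(\hat{c}) - \frac{1}{\sqrt{0.05}}\sqrt{\widehat{\mathrm{var}}(\hat{c})},\;
\hat{E}(\hat{c}) + \frac{1}{\sqrt{0.05}}\sqrt{\widehat{\mathrm{var}}(\hat{c})}\right] = [0.266, 0.904]
\qquad (c_{AB} = 0.662)
$$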
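A small sketch of the Chebyshev-based range. The center and variance used here are hypothetical inputs chosen for illustration, not the values behind [0.266, 0.904]; the comparison with the normal quantile shows why the range is called loose.

```python
from math import sqrt
from statistics import NormalDist

def chebyshev_interval(center, variance, alpha=0.05):
    """Distribution-free (1 - alpha) range: center +/- sqrt(variance / alpha)."""
    k = 1 / sqrt(alpha)                      # from alpha = 1 / k^2
    half_width = k * sqrt(variance)
    return center - half_width, center + half_width

# Hypothetical inputs, for illustration only.
low, high = chebyshev_interval(center=0.60, variance=1.0e-4, alpha=0.05)
print(f"Chebyshev 95% range: [{low:.4f}, {high:.4f}]")

# For the same alpha, the Chebyshev multiplier is more than twice the normal one.
k_chebyshev = 1 / sqrt(0.05)
z_normal = NormalDist().inv_cdf(0.975)
print(f"multipliers: Chebyshev {k_chebyshev:.3f} vs normal {z_normal:.3f}")
```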


Accuracy vs. varying p for individual rule

[Figure: (a) support and (b) confidence vs. varying p for rule G => H (35.9%, 66.2%) from COIL]


Accuracy Bounds

• With unknown distribution, Chebyshev's theorem only gives loose bounds.

Support bounds vs. varying p for G => H


Other measures

Accuracy Bounds


Future work

• Conduct accuracy vs. disclosure analysis for general categorical data

• Develop a general randomization framework which combines
  - Additive noise for numerical data
  - Randomized response for categorical data

• Build a prototype system for real world applications
  - Various query, analysis, and mining tasks
  - Complex privacy requirements in different scenarios
    - non-confidential correlated attributes
    - potential combinations

      Dividends + Wages + Interests = Total Income