Randomization based Privacy Preserving Data Mining
Xintao Wu
University of North Carolina at Charlotte
August 30, 2012
2
Scope
3
Outline
Part I: Randomization for Numerical Data
• Additive noise
• Projection
• Modeling based

Part II: Randomization for Categorical Data
• Randomized Response
• Application to Market Basket Data Analysis
4
Additive Noise Randomization Example
     Bal   Income  ...  IntP
1    10k   85k     ...  2k
2    15k   70k     ...  18k
3    50k   120k    ...  35k
4    45k   23k     ...  134k
.    .     .       ...  .
N    80k   110k    ...  15k

Y = X + E (Perturbed = Original + Noise):

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y =
   17.334   19.199   59.199   51.208   89.048
   88.759   77.537  128.447   30.313  115.692
    2.099   25.939   38.678  135.939   21.318
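This step can be sketched in NumPy; the uniform noise range is an illustrative assumption (the slides do not fix a particular noise distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data X: one row per attribute (Bal, Income, IntP),
# one column per record, as in the slide's example.
X = np.array([[10, 15, 50, 45, 80],
              [85, 70, 120, 23, 110],
              [2, 18, 35, 134, 15]], dtype=float)

# Additive noise E drawn i.i.d. from a known distribution
# (uniform on [0, 10] here, purely illustrative).
E = rng.uniform(0, 10, size=X.shape)

# Published (perturbed) data
Y = X + E
```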
5
Additive Randomization (Z=X+Y)
[Figure: each record (e.g., 30 | 70K) passes through a randomizer before release; a random number is added to Age, so Alice's age 30 becomes 65 (30 + 35). From the randomized records (e.g., 65 | 20K, 25 | 60K), the server reconstructs the distribution of Age and the distribution of Salary and feeds them to a classification algorithm to build a model.]
• R. Agrawal and R. Srikant, SIGMOD '00
6
Reconstruction Problem
• Original values x_1, x_2, ..., x_n drawn from an unknown probability distribution X
• To hide these values, we use y_1, y_2, ..., y_n drawn from a known probability distribution Y
• Given x_1 + y_1, x_2 + y_2, ..., x_n + y_n and the probability distribution of Y, estimate the probability distribution of X.
7
Distribution Reconstruction Alg.
• Bootstrapping Algorithm:

  f_X^0 := uniform distribution
  j := 0
  repeat
    f_X^{j+1}(a) := (1/n) Σ_{i=1}^{n} f_Y((x_i + y_i) − a) f_X^j(a) / ∫ f_Y((x_i + y_i) − z) f_X^j(z) dz
    j := j + 1
  until (stopping criterion met)

• Converges to the maximum likelihood estimate (Agrawal and Aggarwal, PODS '01)
• Extension to the multivariate case (Domingo-Ferrer et al., PSD '04)
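The bootstrapping algorithm can be sketched on a discretized grid; the Gaussian toy data, grid range, and iteration count below are all illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: true X ~ N(30, 5), known noise Y ~ N(0, 10).
n = 2000
x = rng.normal(30, 5, n)
w = x + rng.normal(0, 10, n)        # published values w_i = x_i + y_i

mids = np.linspace(-20, 80, 201)    # discretization grid for f_X

def f_Y(v):
    # Known noise density N(0, 10)
    return np.exp(-v ** 2 / 200.0) / (10 * np.sqrt(2 * np.pi))

# f_Y(w_i - a) for every observation i (rows) and grid point a (columns);
# this does not change across iterations, so compute it once.
fy = f_Y(w[:, None] - mids[None, :])

# Bootstrapping: start from the uniform distribution and iterate the update.
fx = np.full(mids.size, 1.0 / mids.size)
for _ in range(200):
    denom = fy @ fx                             # ~ integral of f_Y(w_i - z) f_X(z) dz
    fx = fx * (fy / denom[:, None]).mean(axis=0)
    fx /= fx.sum()                              # keep it a distribution

est_mean = float(fx @ mids)                     # should be close to 30
est_var = float(fx @ (mids - est_mean) ** 2)    # much smaller than var(w)
```

The reconstructed distribution concentrates around the true X even though each individual w_i is heavily perturbed.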
[Figure: histogram of Number of People vs. Age (20 to 60), comparing the Original, Randomized, and Reconstructed distributions.]
9
Individual Value Reconstruction (Additive Noise)
• Methods:
  - Spectral Filtering, Kargupta et al., ICDM '03
  - PCA, Huang, Du, and Chen, SIGMOD '05
  - SVD, Guo, Wu and Li, PKDD '06
• All aim to remove the noise by projecting the data onto a lower-dimensional space.
10
Individual Reconstruction Algorithm
Perturbed = Original + Noise:  U_p = U + V

• Apply EVD: using some published information about V, extract the first k components of the covariance matrix of U_p as the principal components. λ_1 ≥ λ_2 ≥ ··· ≥ λ_k ≥ λ_e, and e_1, e_2, ..., e_k are the corresponding eigenvectors. Q_k = [e_1 e_2 ··· e_k] forms an orthonormal basis of a subspace X.
• Find the orthogonal projection onto X:  P = Q_k Q_kᵀ
• Get the estimated data set:  Û = P U_p
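A minimal sketch of this projection-based reconstruction, assuming a toy rank-1 data set and k = 1 (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Highly correlated original data U (attributes x records) plus i.i.d. noise V.
n = 1000
t = rng.uniform(0, 10, n)
U = np.vstack([t, 2 * t + 1, -t + 5])           # rank-1 structure plus offsets
V = rng.normal(0, 0.5, U.shape)
Up = U + V                                      # perturbed data

# EVD of the covariance matrix of the (centered) perturbed data.
mean = Up.mean(axis=1, keepdims=True)
Uc = Up - mean
cov = Uc @ Uc.T / n
vals, vecs = np.linalg.eigh(cov)                # eigenvalues in ascending order

# Keep the k leading principal components (k = 1, since U is rank 1).
Qk = vecs[:, -1:]                               # top eigenvector
P = Qk @ Qk.T                                   # orthogonal projector onto X
Uhat = P @ Uc + mean                            # estimate of the original data
```

Because the signal is concentrated in the leading component while the noise spreads over all of them, the projected estimate Û is closer to U than the published Up is.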
12
Why it works
• Original data are correlated
• Noise is not correlated
[Figure: original signal + noise = perturbed data. The signal lies along the 1st principal vector; the noise spreads along the 2nd. A 1-d estimation (projection onto the 1st principal vector) removes most of the noise, while a 2-d estimation retains it.]
14
Additive Noise vs. Projection
• Additive perturbation is not safe
  - Spectral Filtering Technique: H. Kargupta et al., ICDM '03
  - PCA Based Technique: Huang et al., SIGMOD '05
  - SVD based & Bound Analysis: Guo et al., SAC '06, PKDD '06
• How about the projection-based perturbation?
  - Projection models
  - Vulnerabilities
  - Potential attacks

Y = X + E   (Perturbed = Original + Noise)
Y = R X     (Perturbed = Transformation × Original)
15
Rotation Randomization Example
Y = R X:

R =
   0.3333   0.6667   0.6667
  -0.6667   0.6667  -0.3333
  -0.6667  -0.3333   0.6667

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

Y = R X =
   61.33   63.67  110.00  119.67   63.33
   49.33   30.67   55.00  -59.33  -31.67
  -33.67  -21.33  -30.00   51.67  -51.67
Bal income … IntP
1 10k 85k … 2k
2 15k 70k … 18k
3 50k 120k … 35k
4 45k 23k … 134k
. . . … .
N 80k 110k … 15k
R Rᵀ = Rᵀ R = I
16
Rotation Approach (R is orthonormal)
• When R is an orthonormal matrix (Rᵀ R = R Rᵀ = I):
  - Vector length: |R x| = |x|
  - Euclidean distance: |R x_i − R x_j| = |x_i − x_j|
  - Inner product: ⟨R x_i, R x_j⟩ = ⟨x_i, x_j⟩
• Many clustering and classification methods are invariant to this rotation perturbation.
  - Classification: Chen and Liu, ICDM '05
  - Distributed data mining: Liu and Kargupta, TKDE '06
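These invariants are easy to verify numerically (a sketch; generating a random orthonormal R via QR decomposition is an assumption, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# A random orthonormal matrix R via QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

x1 = rng.normal(size=3)
x2 = rng.normal(size=3)

# Rotation preserves lengths, distances, and inner products.
assert np.isclose(np.linalg.norm(R @ x1), np.linalg.norm(x1))
assert np.isclose(np.linalg.norm(R @ x1 - R @ x2), np.linalg.norm(x1 - x2))
assert np.isclose((R @ x1) @ (R @ x2), x1 @ x2)
```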
17
Example
Y = R X, with R a 30° rotation:

R =
   0.866  -0.500
   0.500   0.866

R Rᵀ = Rᵀ R = I

[Figure: a 2-D data point is rotated by R; the rotated point keeps its length.]
18
Weakness of Rotation
• Known sample attack
[Figure: an attacker who knows a few original records (known info) and observes their perturbed images can estimate R by regression and recover the original data.]
19
General Linear Transformation
• Y = R X + E
  - When R = I: Y = X + E (Additive Noise Model)
  - When R Rᵀ = Rᵀ R = I and E = 0: Y = R X (Rotation Model)
  - R can be an arbitrary matrix

Y = R X + E:

R =
  4.751  2.429  2.282
  1.156  4.457  0.093
  3.034  3.811  4.107

X =
   10   15   50   45   80
   85   70  120   23  110
    2   18   35  134   15

E =
  7.334  4.199  9.199  6.208  9.048
  3.759  7.537  8.447  7.313  5.692
  0.099  7.939  3.678  1.939  6.318

Y = R X + E =
  265.95  286.63  475.68  581.71  520.53
  394.30  338.49  569.58  174.22  277.79
  362.55  394.11  665.37  776.46  463.08
20
Is Y = R X + E Safe?
• R can be an arbitrary matrix, hence the regression-based attack won't work
• How about noisy ICA direct attack?
Y = R X + E General Linear Transformation Model
X = A S + N Noisy ICA Model
21
Scope (Part II)
ssn  name  zip    race   ...  age  Sex  Bal  Income  ...  IntP
1          28223  Asian  ...  20   M    10k  85k     ...  2k
2          28223  Asian  ...  30   F    15k  70k     ...  18k
3          28262  Black  ...  20   M    50k  120k    ...  35k
4          28261  White  ...  26   M    45k  23k     ...  134k
.          .      .      ...  .    .    .    .       ...  .
N          28223  Asian  ...  20   M    80k  110k    ...  15k
69% unique on zip and birth date
87% with zip, birth date and gender.
Related approaches: k-anonymity, l-diversity, SDC, etc.
Our approach: Randomized Response
22
Randomized Response (Stanley Warner, JASA 1965)

Purpose: estimate the proportion π_A of population members that cheated in the exam (A: cheated in the exam; Ā: didn't cheat).

Procedure: a randomization device asks, with probability p, "Do you belong to A?" and, with probability 1 − p, "Do you belong to Ā?". The interviewer records only the "Yes"/"No" answer, not which question was asked.

The probability of a "Yes" answer is
  λ = π_A p + (1 − π_A)(1 − p)
and an unbiased estimate of π_A is
  π̂_A = (λ̂ + p − 1) / (2p − 1),  p ≠ 1/2,
where λ̂ is the observed proportion of "Yes" answers.
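Warner's procedure and estimator can be simulated directly (p = 0.8, π_A = 0.3, and the sample size are illustrative choices):

```python
import random

random.seed(42)

p = 0.8          # probability the device asks "Do you belong to A?"
pi_A = 0.3       # true (hidden) proportion of cheaters
n = 100_000

yes = 0
for _ in range(n):
    member_of_A = random.random() < pi_A
    ask_about_A = random.random() < p
    # Respondent answers truthfully to whichever question was selected;
    # only the Yes/No answer is recorded.
    yes += member_of_A if ask_about_A else not member_of_A

lam_hat = yes / n
pi_hat = (lam_hat + p - 1) / (2 * p - 1)   # Warner's unbiased estimator
```

No individual answer reveals membership in A, yet π̂_A recovers the population proportion.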
23
Matrix Expression
• RR can be expressed in matrix form as λ = P π (0: No, 1: Yes):

  P =
         0      1
    0    p    1−p
    1   1−p    p

• Unbiased estimate of π:  π̂ = P⁻¹ λ̂
25
Vector Response
• π = (π_1, ..., π_t)′ is the true proportions of the population
• λ = (λ_1, ..., λ_t)′ is the observed proportions in the survey
• P = (p_ij) is the randomization device set by the interviewer

λ = P π:

  i \ j    1     2     3     4
    1    0.60  0.20  0.00  0.10
    2    0.20  0.50  0.20  0.10
    3    0.15  0.15  0.70  0.30
    4    0.05  0.15  0.10  0.50

  π = (0.10, 0.30, 0.20, 0.40)′
  λ = P π = (0.16, 0.25, 0.32, 0.27)′
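The slide's example can be checked directly in NumPy:

```python
import numpy as np

# Randomization device: column j gives the reporting distribution
# for true category j (each column sums to 1).
P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])

pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions
lam = P @ pi                               # observed proportions

# Inverting the device recovers the true proportions.
pi_hat = np.linalg.solve(P, lam)
```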
26
Analysis
disp(λ̂) = n⁻¹ (Λ_λ − λ λ′)

With π̂ = P⁻¹ λ̂:

disp(π̂) = n⁻¹ P⁻¹ (Λ_λ − λ λ′) (P⁻¹)′
        = n⁻¹ (Λ_π − π π′) + n⁻¹ (P⁻¹ Λ_λ (P⁻¹)′ − Λ_π)

where Λ_λ (resp. Λ_π) is the diagonal matrix with elements λ_i (resp. π_i). The first term is the dispersion matrix of the regular survey estimation; the second term is nonnegative definite and represents the components of dispersion associated with the RR experiment.
27
Extension to Multi Attributes
• m sensitive attributes A_1, A_2, ..., A_m; attribute A_j has t_j categories A_{j1}, ..., A_{j t_j} (i_j = 1, ..., t_j)
• π_{i_1 ... i_m} denotes the true proportion corresponding to the combination (A_{1 i_1}, ..., A_{m i_m})
• π is the vector with elements π_{i_1 ... i_m}, arranged lexicographically
  e.g., if m = 2, t_1 = 2 and t_2 = 3:  π = (π_11, π_12, π_13, π_21, π_22, π_23)′
• Simultaneous Model: consider all variables as one compounded variable and apply the regular vector response RR technique, with P = P_1 ⊗ P_2 ⊗ ··· ⊗ P_m, where ⊗ stands for the Kronecker product
• Sequential Model
28
Kronecker Product Example
P_1 =
  a11  a12
  a21  a22

P_2 =
  b11  b12  b13
  b21  b22  b23
  b31  b32  b33

P_1 ⊗ P_2 =
  a11·P_2  a12·P_2
  a21·P_2  a22·P_2
=
  a11b11  a11b12  a11b13  a12b11  a12b12  a12b13
  a11b21  a11b22  a11b23  a12b21  a12b22  a12b23
  a11b31  a11b32  a11b33  a12b31  a12b32  a12b33
  a21b11  a21b12  a21b13  a22b11  a22b12  a22b13
  a21b21  a21b22  a21b23  a22b21  a22b22  a22b23
  a21b31  a21b32  a21b33  a22b31  a22b32  a22b33
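NumPy provides this operation as np.kron; the 2×2 and 3×3 matrices below are illustrative distortion matrices, not the slide's symbols:

```python
import numpy as np

P1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
P2 = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]])

# Kronecker product: each entry a_ik of P1 is replaced by the block a_ik * P2.
P = np.kron(P1, P2)              # shape (6, 6)
```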
29
Analysis
π̂ = (P_1 ⊗ ··· ⊗ P_m)⁻¹ λ̂
disp(π̂) = n⁻¹ (P_1 ⊗ ··· ⊗ P_m)⁻¹ (Λ_λ − λ λ′) ((P_1 ⊗ ··· ⊗ P_m)⁻¹)′
Similarly, the dispersion matrix can be decomposed to two parts: one corresponds to that of the regular survey estimation and the other corresponds to the components of dispersion associated with RR experiment
30
Outline(Part II)
• Randomization for Categorical Data
  - Randomized Response Model
  - To what extent does it affect mining results?
  - To what extent does it protect privacy?
• Application in market basket 0-1 data analysis
  - Data swapping
  - Frequent itemset or rule hiding
  - Inverse frequent itemset mining (SDM '05, ICDM '05)
  - Item randomization (PKDD '07, PAKDD '08)
31
Market Basket Data
TID  milk  sugar  bread  ...  cereals
1    1     0      1      ...  1
2    0     1      1      ...  1
3    1     0      0      ...  1
4    1     1      1      ...  0
.    .     .      .      ...  .
N    0     1      1      ...  0

1: presence  0: absence

Association rule (R. Agrawal, SIGMOD 1993): X ⇒ Y with support s = P(XY) and confidence c = P(XY) / P(X)

• Other measures in MBA: Correlation, Lift, Interest, etc.
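A minimal sketch of these two measures on a toy 0-1 table (the rows and the rule {milk} ⇒ {cereals} are illustrative):

```python
# Toy market basket table; columns are milk, sugar, bread, cereals.
rows = [
    (1, 0, 1, 1),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
]

n = len(rows)
n_milk = sum(r[0] for r in rows)                  # transactions with milk
n_milk_cereals = sum(r[0] and r[3] for r in rows) # transactions with both

support = n_milk_cereals / n            # s = P(XY)
confidence = n_milk_cereals / n_milk    # c = P(XY) / P(X)
```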
32
Item Perturbation
TID milk sugar bread … cereals
1 1 0 1 1
2 0 1 1 1
3 1 0 0 1
4 1 1 1 0
. . . . … .
N 0 1 1 0
TID milk sugar bread … cereals
1 0 1 1 1
2 1 1 1 0
3 1 1 1 1
4 0 0 1 1
. . . . … .
N 1 1 0 1
Original Data Randomized Data
Individual privacy is preserved!
33
Research Problems
• How does it affect the accuracy of discovered association rules?
• How does it affect the accuracy of other measures?
• Two scenarios:
  - Known distortion probability
  - Unknown distortion probability
34
Motivation example
TID milk sugar bread … cereals
1 1 0 1 1
2 0 1 1 1
3 1 0 0 1
4 1 1 1 0
. . . . … .
N 0 1 1 0
Original Data Randomized Data
TID milk sugar bread … cereals
1 0 1 1 1
2 1 1 1 0
3 1 1 1 1
4 1 0 1 1
. . . . … .
N 0 1 0 1
RR with distortion matrices (A: Milk, B: Cereals):

  P_A =
    0.8  0.2
    0.2  0.8

  P_B =
    0.9  0.1
    0.1  0.9

Original data (data owners):

         B̄      B
   Ā   0.415  0.043  | 0.458
   A   0.183  0.359  | 0.542
       0.598  0.402

Randomized data (data miners):

         B̄      B
   Ā   0.368  0.097  | 0.465
   A   0.218  0.317  | 0.537
       0.586  0.414

π = (π_00, π_01, π_10, π_11)′ = (0.415, 0.043, 0.183, 0.359)′
λ̂ = (λ̂_00, λ̂_01, λ̂_10, λ̂_11)′ = (0.368, 0.097, 0.218, 0.316)′
π̂ = (P_A ⊗ P_B)⁻¹ λ̂ = (0.427, 0.031, 0.181, 0.362)′

s_AB = π_11, so ŝ_AB = π̂_11
c_AB = π_11 / (π_10 + π_11) = 0.662;  ĉ_AB = π̂_11 / (π̂_10 + π̂_11) = 0.671

We can get the estimate ĉ_AB of c_AB; how accurate an estimate can we achieve?
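The whole pipeline (randomize, observe λ̂, invert P_A ⊗ P_B) can be simulated; the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Distortion matrices for items A and B (a cell value is kept with prob 0.8 / 0.9).
PA = np.array([[0.8, 0.2], [0.2, 0.8]])
PB = np.array([[0.9, 0.1], [0.1, 0.9]])
P = np.kron(PA, PB)                      # device for the joint cell (A, B)

# True joint proportions over (A, B), cells ordered 00, 01, 10, 11.
pi = np.array([0.415, 0.043, 0.183, 0.359])

# Simulate: draw original cells, then flip A with prob 0.2 and B with prob 0.1
# independently -- exactly the randomization encoded by kron(PA, PB).
n = 200_000
cells = rng.choice(4, size=n, p=pi)
a, b = cells // 2, cells % 2
a = np.where(rng.random(n) < 0.8, a, 1 - a)
b = np.where(rng.random(n) < 0.9, b, 1 - b)
lam_hat = np.bincount(2 * a + b, minlength=4) / n

pi_hat = np.linalg.solve(P, lam_hat)          # estimate of the true proportions
c_hat = pi_hat[3] / (pi_hat[2] + pi_hat[3])   # estimated confidence of A => B
```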
35
Motivation
s_min = 23%

[Figure: original vs. estimated supports for several itemsets (values such as 31.5, 35.9, 36.3, 23.8, 22.1, 12.3). Both ŝ_2 and ŝ_6 exceed s_min; since s_2 > s_min, itemset 2 is correctly found frequent, but s_6 < s_min, so rule 6 is falsely recognized as frequent from its estimated value! With lower and upper bounds on each estimate, s_2^l > s_min marks a frequent set with high confidence, while s_6^l < s_min marks a frequent set without confidence: such errors can be avoided.]
36
Accuracy on Support S
• Estimate of support:  π̂ = (P_1 ⊗ ··· ⊗ P_k)⁻¹ λ̂ = P⁻¹ λ̂
• Variance of support:  côv(π̂) = (n − 1)⁻¹ (P⁻¹ Λ̂_λ̂ (P⁻¹)′ − π̂ π̂′)

For the A, B example, côv(π̂) = 10⁻⁵ ×

          00      01      10      11
   00   6.665   2.777   1.874   2.113
   01   2.777   5.766   0.442   3.431
   10   1.874   0.442   2.209   1.866
   11   2.113   3.431   1.866   7.311

with, e.g., v̂ar(π̂_11) = 7.311 × 10⁻⁵ and côv(π̂_10, π̂_11) = 1.866 × 10⁻⁵.

• Interquantile range (normal dist.):
  [ π̂_{i_1...i_k} − z_{α/2} √v̂ar(π̂_{i_1...i_k}),  π̂_{i_1...i_k} + z_{α/2} √v̂ar(π̂_{i_1...i_k}) ]

For ŝ_AB = π̂_11 = 0.362, the 95% range is (0.346, 0.378).
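The normal-approximation range can be reproduced from the slide's numbers (taking v̂ar(ŝ_AB) = 7.311 × 10⁻⁵ from the covariance matrix; a sketch, not the authors' code):

```python
import math

# Estimated support and its variance for the A => B example.
s_hat = 0.362
var_s = 7.311e-5
z = 1.96                      # z_{alpha/2} for alpha = 0.05

lo = s_hat - z * math.sqrt(var_s)
hi = s_hat + z * math.sqrt(var_s)
```

This reproduces the (0.346, 0.378) interval quoted on the slide up to rounding.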
37
Accuracy on Confidence C
• Estimate of confidence of A ⇒ B:
  ĉ = ŝ_AB / ŝ_A = π̂_11 / (π̂_10 + π̂_11)
• Variance of confidence, with π̂_1· = π̂_10 + π̂_11:
  v̂ar(ĉ) = (π̂_10² / π̂_1·⁴) v̂ar(π̂_11) + (π̂_11² / π̂_1·⁴) v̂ar(π̂_10) − (2 π̂_11 π̂_10 / π̂_1·⁴) côv(π̂_10, π̂_11)
• Interquantile range (the ratio distribution F(w) has no simple form); a loose range is derived from Chebyshev's theorem:
  [ ĉ − k √v̂ar(ĉ),  ĉ + k √v̂ar(ĉ) ],  where k = α^{−1/2}
38
Accuracy on Confidence c

Chebyshev's theorem: let X be a random variable with expected value μ and variance σ². Then for any real k > 0,
  Pr(|X − μ| ≥ kσ) ≤ 1/k²
equivalently,
  Pr(|X − μ| < kσ) ≥ 1 − 1/k²

This gives a lower bound on the proportion of data that lie within a given number of standard deviations of the mean.
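The bound can be checked empirically (uniform draws and k = 2 are illustrative choices):

```python
import random

random.seed(7)

# Chebyshev: at least 1 - 1/k^2 of draws lie within k standard deviations.
n = 100_000
xs = [random.random() for _ in range(n)]
mu = sum(xs) / n
sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5

k = 2
within = sum(abs(x - mu) < k * sigma for x in xs) / n
```

For k = 2 the theorem guarantees at least 75%; for this uniform sample the empirical fraction is far higher, illustrating how loose the bound can be.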
39
Accuracy on Confidence c

Based on the theorem, the loose (1 − α)·100% interquantile range can be approximated as:

  [ Ê(ĉ) − √(v̂ar(ĉ) / α),  Ê(ĉ) + √(v̂ar(ĉ) / α) ]

For the previous example with items A, B (ĉ_AB = 0.662), the 95% interquantile range of ĉ_AB is:

  [ Ê(ĉ) − √(v̂ar(ĉ) / 0.05),  Ê(ĉ) + √(v̂ar(ĉ) / 0.05) ] = [0.266, 0.904]
40
Accuracy vs. varying p for individual rule
[Figure: (a) Support and (b) Confidence accuracy vs. varying p for the individual rule G ⇒ H (s = 35.9%, c = 66.2%) from the COIL data set.]
41
Accuracy Bounds
• With an unknown distribution, Chebyshev's theorem only gives loose bounds.
Support bounds vs. varying p for G => H
42
Other measures
Accuracy Bounds
43
Future work
• Conduct accuracy vs. disclosure analysis for general categorical data
• Develop a general randomization framework which combines
  - Additive noise for numerical data
  - Randomized response for categorical data
• Build a prototype system for real-world applications
  - Various query, analysis, and mining tasks
  - Complex privacy requirements in different scenarios: non-confidential correlated attributes, potential combinations (e.g., Dividends + Wages + Interests = Total Income)