
Privacy Preserving Data Mining

Moheeb Rajab

Agenda

Overview and Terminology

Motivation

Active Research Areas
  Secure Multi-party Computation (SMC)
  Randomization approach

Limitations

Summary and Insights

Overview

What is Data Mining? Extracting implicit, non-obvious patterns and relationships from warehoused data sets.

This information can increase the efficiency of the organization and aid future planning.

Can be done at an organizational level, by establishing a data warehouse.

Can also be done at a global scale.

Data Mining System Architecture

[Figure: Entity I, Entity II, ..., Entity n each hold local data that feeds a Global Aggregator]

Distributed Data Mining Architecture

[Figure: distributed data mining architecture, in which lower-scale mining is performed at each entity and the results are aggregated globally]

Challenges

Privacy concerns

Proprietary information disclosure

Concerns about association breaches

Misuse of mining results

These concerns provide the motivation for privacy preserving data mining solutions.

Approaches to preserve privacy

Restrict access to the data (protect individual records)

Protect both the data and its source:
  Secure Multi-party Computation (SMC)
  Input data randomization

There is no single solution that fits all purposes.

SMC vs Randomization

[Figure (Pinkas et al.): qualitative comparison of randomization schemes and SMC in terms of overhead, accuracy, and privacy]

Secure Multi-party Computation

Multiple parties share the burden of creating the data aggregate.

Final processing, if needed, can be delegated to any party.

The computation is considered secure if each party only knows its input and the result of its computation.

SMC

[Figure: multiple parties jointly computing an aggregate over their local data]

Each party knows its input and the result of the operation, and nothing else.

Key Assumptions

The ONLY information that can be leaked is the information we can get as the overall output of the computation (aggregation) process.

Users are not malicious, but may be honest-but-curious. All users are supposed to abide by the SMC protocol.

Otherwise, the case of malicious participants is not easy to model! [Pinkas et al., Agrawal]

"Tools for Privacy Preserving Distributed Data Mining", Clifton et al. [SIGKDD]

Secure Sum

Given values $x_1, x_2, \ldots, x_n$ belonging to n entities, we need to compute

$\sum_{i=1}^{n} x_i$

such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data).
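As a concrete illustration, the following is a minimal sketch of the ring-based secure sum, assuming values are summed modulo a large constant; the modulus and the sample values (which match the example on the next slide) are illustrative choices, not from the slides.

```python
# Minimal sketch of the ring-based secure sum: the master picks a random
# offset R, the masked running total is passed around the ring, each party
# adds its private value, and the master removes R at the end.
import random

MODULUS = 2**64                          # illustrative modulus, not from the slides

def secure_sum(values):
    R = random.randrange(MODULUS)        # known only to the master
    running = R
    for v in values:                     # each party only ever sees a total masked by R
        running = (running + v) % MODULUS
    return (running - R) % MODULUS       # master subtracts its own offset

print(secure_sum([10, 20, 15, 45, 50]))  # -> 140, as in the example that follows
```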

Examples (Secure Sum)

[Figure: ring example with values 10, 20, 15, 45, 50. The master picks R = 15 and sends R + 10; the masked totals R + 30, R + 45, R + 90, R + 140 are passed around the ring, and the master computes Sum = (R + 140) - R = 140]

Problem: colluding members.

Solution: divide values into shares and have each share traverse a disjoint path (no site has the same neighbor twice).

Split path solution

[Figure: each value is split into two shares and two disjoint rings are run with random seeds R1 = 15 and R2 = 12. Along one ring the masked partial sums R + 5, R + 15, R + 22.5, R + 45, R + 70 are passed. Sum = (R1 + 70 - R1) + (R2 + 70 - R2) = 140]
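A sketch of the same split-path idea, reusing secure_sum() and MODULUS from the sketch above. Instead of halving each value as the figure does, it uses random additive shares, which is equivalent for the final sum; the data is again the slide example.

```python
# Split-path defence against colluding neighbours: each party splits its value
# into two shares, and each share travels its own disjoint ring with its own
# random offset. No single ring ever sees a party's whole value.
import random

def split_path_sum(values):
    shares_a = [random.randrange(MODULUS) for _ in values]            # random first shares
    shares_b = [(v - a) % MODULUS for v, a in zip(values, shares_a)]  # remaining shares
    return (secure_sum(shares_a) + secure_sum(shares_b)) % MODULUS

print(split_path_sum([10, 20, 15, 45, 50]))  # -> 140, matching the figure
```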

Secure Set Union

Consider n sets $S_1, S_2, \ldots, S_n$. Compute

$U = S_1 \cup S_2 \cup S_3 \cup \ldots \cup S_n$

such that each entity ONLY knows U and nothing else.

Secure Set Union

Uses the properties of commutative encryption.

For any permutations i, j of the key order, the following holds:

$E_{K_{i_1}}(\ldots E_{K_{i_n}}(M) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M) \ldots)$

and for two different items $M_1 \ne M_2$ the probability of a collision is negligible:

$P\left(E_{K_{i_1}}(\ldots E_{K_{i_n}}(M_1) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M_2) \ldots)\right) < \epsilon$

Secure Set Union

Each of the n entities holds its own key, $K_1, K_2, \ldots, K_n$. A global union set U is passed around. Each site:

Encrypts its items, creating an array M, and adds it to U.

Upon receiving U, an entity encrypts all items in U that it did not encrypt before.

In the end, all entries are encrypted with all keys. Remove the duplicates:
  Identical plaintexts result in the same ciphertext regardless of the order in which the encryption keys were applied.

Decryption of U: done by all entities, in any order.
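A minimal sketch of this protocol, assuming a Pohlig-Hellman style commutative cipher $E_k(m) = m^k \bmod p$; the slides do not fix a particular cipher, and the prime, the hashing of items into the group, and the three example sites are illustrative assumptions.

```python
# Secure set union via commutative encryption, simulated in one process.
import hashlib
import random
from math import gcd

P = 2**127 - 1                     # a large (Mersenne) prime, illustration only

def make_key():
    # the exponent must be invertible mod P-1 so that decryption k^-1 exists
    while True:
        k = random.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:
            return k

def encode(item):
    # hash an item into the multiplicative group mod P
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % (P - 1) + 1

def enc(c, k):
    return pow(c, k, P)

sets = [{"A"}, {"A", "C"}, {"A"}]              # local sets of three sites
keys = [make_key() for _ in sets]              # K1, K2, K3, kept private

# Step 1: each site encrypts its own items with its own key.
encrypted = [{enc(encode(x), keys[i]) for x in s} for i, s in enumerate(sets)]

# Step 2: each encrypted set is passed around and every *other* site adds its
# own layer of encryption; commutativity makes the application order irrelevant.
for i in range(len(encrypted)):
    for j, k in enumerate(keys):
        if j != i:
            encrypted[i] = {enc(c, k) for c in encrypted[i]}

# Step 3: merge. Identical plaintexts now have identical ciphertexts, so the
# set union drops duplicates without revealing which site contributed them.
union = set().union(*encrypted)
print("union size:", len(union))               # 2, i.e. |{A, C}|
```

Decryption, in which each site applies its inverse exponent $k^{-1} \bmod (p-1)$ in any order, is omitted for brevity.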

Secure Set Union (Example)

[Figure: three sites holding {A}, {C}, {A}. The union set evolves as it is passed around:

U = {E1(A)}
U = {E2(E1(A)), E2(C)}
U = {E3(E2(E1(A))), E3(E2(C)), E3(A)}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E1(E3(A))}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E2(E1(E3(A)))}

By commutativity, E3(E2(E1(A))) = E2(E1(E3(A))), so the duplicate A can be removed.]

Problem: computation overhead; the number of exchanged messages is O(n*m).

Problems with SMC

Scalability

High overhead

Details of the trust model assumptions:
  Users are honest and follow the protocol

Randomization Approach

"Privacy Preserving Data Mining", Agrawal et al. [SIGMOD]

Generally applied to provide estimates of data distributions rather than single point estimates.

A user is allowed to alter the value provided to the aggregator.

The alteration scheme should be known to the aggregator.

The aggregator estimates the overall global distribution of the input by removing the randomization from the aggregate data.

Randomization Approach (cont'd)

Assumptions:
  Users are willing to divulge some form of their data.
  The aggregator is not malicious but may be honest-but-curious (it follows the protocol).

Two main data perturbation schemes:
  Value-class membership (discretization)
  Value distortion

Randomization Methods

Value Distortion Method

Given a value $x_i$, the client is allowed to report a distorted value $(x_i + r)$, where $r$ is a random variable drawn from a known distribution.

Uniform distribution: $r \sim U[-\alpha, +\alpha]$, i.e. $\mu = 0$

Gaussian distribution: $r \sim N(0, \sigma)$
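A minimal sketch of value distortion at the client side; the noise parameters and the synthetic attribute below are illustrative assumptions.

```python
# Value distortion: each client reports x_i + r, where r comes from a noise
# distribution that is publicly known to the aggregator.
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(loc=30.0, scale=5.0, size=10_000)   # some sensitive attribute

alpha = 10.0                                                  # uniform noise half-width
reported_uniform = true_values + rng.uniform(-alpha, alpha, size=true_values.size)

sigma = 10.0                                                  # Gaussian noise std dev
reported_gaussian = true_values + rng.normal(0.0, sigma, size=true_values.size)

# Individual reports are heavily distorted, yet aggregate statistics survive:
print(true_values.mean(), reported_uniform.mean(), reported_gaussian.mean())
```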

Quantifying the privacy of different randomization schemes

Confidence        50%          95%           99.9%
Discretization    0.5 x W      0.95 x W      0.999 x W
Uniform           0.5 x 2α     0.95 x 2α     0.999 x 2α
Gaussian          1.34 x σ     3.92 x σ      6.8 x σ

(W is the width of the discretization interval; the uniform noise is drawn from [-α, +α].)

The Gaussian distribution provides the best privacy at higher confidence levels.
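The 95% column can be checked with nothing but the standard normal quantile: the symmetric 95% interval of N(0, σ) has width about 3.92σ, and a uniform U[-α, +α] covers 95% of its mass in an interval of width 0.95 x 2α. A quick check, standard library only:

```python
# Verify the 95% row of the table above.
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)    # upper quantile of the two-sided 95% interval
print(round(2 * z, 2), "x sigma")  # -> 3.92 x sigma for the Gaussian
print(0.95 * 2, "x alpha")         # -> 1.9 x alpha, i.e. 0.95 x 2*alpha for the uniform
```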

Problem Statement

[Figure: Entity I, Entity II, ..., Entity m each send their randomized values $x_{i1} + y_{i1}, x_{i2} + y_{i2}, \ldots, x_{ik} + y_{ik}$ to a Global Aggregator. From the pooled values $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$ and the known noise density $f_Y(a)$, an estimator reconstructs the original density $f_X(z)$.]

Reconstruction of the Original Distribution

The reconstruction problem can be viewed in the general framework of "inverse problems".

Inverse problems: describing a system's internal structure from indirect, noisy data.

Bayesian estimation is an effective tool for such settings.

Formal problem statement

Given a one-dimensional array of randomized data

$x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

where the $x_i$'s are iid random variables, each with the same distribution as the random variable X, and the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$.

Purpose: estimate $F_X$.

Background: Bayesian Inference

An estimation method that involves collecting observational data and using it to adjust (either support or refute) a prior belief.

The previous knowledge (hypothesis) has an established probability, called the prior probability.

The adjusted hypothesis given the new observational data is called the posterior probability.

Bayesian Inference

Let $P(H_0)$ be the prior probability. Then Bayes' rule states that the posterior probability of $H_0$ given an observation D is:

$P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D)}$

Bayes' rule follows from the general form of the joint probability theorem:

$P(D, H_0) = P(H_0 \mid D)\, P(D)$

Bayesian Inference (Classical Example)

Two boxes:
  Box-I: 30 red balls and 10 white balls
  Box-II: 20 red balls and 20 white balls

A person draws a red ball; what is the probability that the ball came from Box-I?

Prior probability: P(Box-I) = 0.5

From the data we know that:
  P(Red | Box-I) = 30/40 = 0.75
  P(Red | Box-II) = 20/40 = 0.5

Example (cont'd)

Now, given the new observation (the red ball), we want the posterior probability of Box-I, i.e. P(Box-I | Red):

$P(\text{Box-I} \mid \text{Red}) = \frac{P(\text{Red} \mid \text{Box-I})\, P(\text{Box-I})}{P(\text{Red})}$

Computing the joint probability:

$P(\text{Red}) = P(\text{Red}, \text{Box-I}) + P(\text{Red}, \text{Box-II})$

$P(\text{Red}) = P(\text{Red} \mid \text{Box-I})\, P(\text{Box-I}) + P(\text{Red} \mid \text{Box-II})\, P(\text{Box-II})$

$P(\text{Red}) = 0.75 \times 0.5 + 0.5 \times 0.5$

Substituting:

$P(\text{Box-I} \mid \text{Red}) = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6$

The posterior probability of Box-I is amplified by the observation of the red ball.

Back: Formal problem statement

Given a one-dimensional array of randomized data

$x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

where the $x_i$'s are iid random variables, each with the same distribution as the random variable X, and the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$.

Purpose: estimate $F_X$.

Continuous probability distributions

$P\{r \le z\} = \int_{-\infty}^{z} f_X(k)\, dk = CDF(z) = F_X(z)$

$P\{r = z\} = 0$

$\int_{-\infty}^{+\infty} f_X(k)\, dk = 1$

CDF and PDF

Estimation of $F_X$

Bayes' rule (posterior probability):

$P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D)}$

Applying Bayes' rule, the posterior CDF of $X_1$ given the observation $w_1 = x_1 + y_1$ is

$F_{X_1}(a) = \int_{-\infty}^{a} f_{X_1}(z \mid X_1 + Y_1 = w_1)\, dz$

$F_{X_1}(a) = \int_{-\infty}^{a} \frac{f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)}{f_{X_1+Y_1}(w_1)}\, dz$

Estimation of $F_X$

We want to evaluate $f_{X_1+Y_1}(w_1)$:

$f_{X_1+Y_1}(w_1) = \int_{-\infty}^{+\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\, dk$

Substituting:

$F_{X_1}(a) = \frac{\int_{-\infty}^{a} f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)\, dz}{\int_{-\infty}^{+\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\, dk}$

Estimation of $F_X$

Simplification (independence of X and Y): $f_{X_i+Y_i}(w_i \mid X_i = z) = f_Y(w_i - z)$, so

$F_{X_i}(a) = \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Estimation of $F_X$

Averaging over all n observations:

$F_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Estimation of the PDF $f_X$

$f_X$ is just the derivative of the CDF:

$f_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Algorithm

$f_X^0 :=$ uniform distribution
$j := 0$

While (not stopping condition):

    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^j(a)}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X^j(z)\, dz}$

    $j := j + 1$

Stopping Criteria

The algorithm should terminate when $f_X^{j+1}(a) \approx f_X^j(a)$.

For each round a $\chi^2$ goodness-of-fit test is performed.

Iteration is stopped when the difference between the two estimates is too small (lower than a certain threshold).
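A discretized sketch of this reconstruction loop. The Gaussian noise with σ = 0.25 matches the evaluation slide that follows, while the synthetic true distribution, the grid, and the simple max-difference stopping test (used here instead of the χ² test) are illustrative assumptions.

```python
# Iterative Bayesian reconstruction of f_X from randomized values w = x + y,
# discretized on a fixed grid.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.2, 0.8, size=n)              # hidden "true" values (demo only)
sigma = 0.25
w = x + rng.normal(0.0, sigma, size=n)         # what the aggregator actually receives

def f_Y(v):                                    # publicly known noise density N(0, sigma)
    return np.exp(-v**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(0.0, 1.0, 51)               # support of X, discretized
dz = grid[1] - grid[0]
f_X = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))   # f_X^0 := uniform

fy = f_Y(w[:, None] - grid[None, :])           # fy[i, a] = f_Y(w_i - a), fixed across rounds
for _ in range(500):
    denom = (fy * f_X[None, :]).sum(axis=1) * dz        # per-observation normalizer
    f_new = (fy * f_X[None, :] / denom[:, None]).mean(axis=0)
    f_new /= f_new.sum() * dz                           # keep a proper density
    done = np.abs(f_new - f_X).max() < 1e-4             # simple stopping criterion
    f_X = f_new
    if done:
        break

print("estimated mass on [0.2, 0.8]:",
      round(f_X[(grid >= 0.2) & (grid <= 0.8)].sum() * dz, 3))
```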

Evaluation

[Figure: reconstruction results with a Gaussian randomizing function, µ = 0, σ = 0.25]

Evaluation

[Figure: reconstruction results with a uniform randomizing function on [-0.5, 0.5]]

How is this different from the Kalman Estimator?

Both are estimation techniques.

Kalman is stateless.

In the Kalman filter case we knew the distribution, and estimation is used to validate whether the trend of the data matches that distribution.

In Bayesian inference the observation data is used to adjust the prior hypothesis (probability distribution).

Is the Problem Solved?

Suppose a client randomizes age records using a uniform random variable on [-50, 50].

If the aggregator receives the value 120, it knows with 100% confidence that the actual age is ≥ 70.

Simple randomization does not guarantee absolute privacy.

How to achieve a better randomization scheme

"Limiting Privacy Breaches in Privacy Preserving Data Mining", Evfimievski et al.

Define an evaluation metric of how privacy preserving a scheme is.

Based on the developed metric, develop a randomization scheme that abides by it.

How privacy preserving is a scheme?

Information-theoretic approach:
  Computes the average information disclosed by a randomized attribute, by measuring the mutual information between the actual and the randomized attribute.

Privacy breach:
  Defines a criterion that should be satisfied for a randomization scheme to be privacy preserving.

What is a privacy breach?

A privacy breach occurs when the disclosure of a randomized value $y_i$ to the aggregator reveals that a certain property $Q(x_i)$ of the "individual" input $x_i$ holds with high probability.

Privacy Breach

Back to Bayes’

Prior probability: $P(Q(x))$, where $Q(x)$ is the property.

Posterior probability: $P(Q(x) \mid y_i)$

Amplification

Amplification is defined in terms of the transition probability $P[x \to y]$, where y is a fixed randomized output value.

Intuitive definition: if there are many $x_i$'s that can be mapped to y by the randomizing scheme, then disclosing y gives little information about $x_i$.

We say that the scheme amplifies the probability $P[x \to y]$.

Amplification factor

Let R be a randomization operator and let $y \in V_y$ be a randomized value of x. R is at most $\gamma$-amplifying for y if, for all inputs $x_1, x_2$,

$\frac{P[x_1 \to y]}{P[x_2 \to y]} \le \gamma$

Revealing R(x) = y will not cause a privacy breach (from a prior probability $p_1$ to a posterior probability $p_2$) if:

$\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)} > \gamma$
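A minimal sketch of checking this condition for a finite randomization operator given as a transition matrix; the small randomized-response operator and the breach thresholds below are illustrative assumptions, not from the slides.

```python
# Check gamma-amplification for a finite operator trans[x][y] = P[x -> y].
import numpy as np

# Each input value is kept with prob 0.5 and flipped to one of the others w.p. 0.25
trans = np.array([[0.50, 0.25, 0.25],
                  [0.25, 0.50, 0.25],
                  [0.25, 0.25, 0.50]])

def amplification(trans, y):
    """gamma for output y: the largest ratio P[x1 -> y] / P[x2 -> y]."""
    col = trans[:, y]
    return col.max() / col.min()

def no_breach(p1, p2, gamma):
    """Revealing y cannot upgrade belief in a property from p1 to p2 if this holds."""
    return p2 * (1 - p1) / (p1 * (1 - p2)) > gamma

for y in range(trans.shape[1]):
    g = amplification(trans, y)
    print(f"y={y}: gamma={g:.1f}, blocks a 10% -> 60% breach: {no_breach(0.1, 0.6, g)}")
```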

Summary

No one solution can fit all.

Which area looks more promising?

Can we create randomization schemes that are robust across a wide range of applications and different data distributions?

How do we deal with the case of malicious participants?
