
Privacy Preserving Data Mining

Moheeb Rajab

Agenda

Overview and Terminology

Motivation

Active Research Areas
  Secure Multi-party Computation (SMC)
  Randomization approach

Limitations

Summary and Insights

Overview

What is Data Mining? Extracting implicit, non-obvious patterns and relationships from warehoused data sets.

This information can increase the efficiency of the organization and aid future planning.

Can be done at an organizational level, by establishing a data warehouse.

Can also be done at a global scale.

Data Mining System Architecture

[Figure: Entity I, Entity II, ..., Entity n each hold local data that feeds a Global Aggregator]

Distributed Data Mining Architecture

[Figure: distributed data mining architecture, in which lower-scale mining is performed at each entity and the results are aggregated globally]

Challenges

Privacy concerns

Proprietary information disclosure

Concerns about association breaches

Misuse of mining results

These concerns provide the motivation for privacy preserving data mining solutions.

Approaches to preserve privacy

Restrict access to the data (protect individual records)

Protect both the data and its source:
  Secure Multi-party Computation (SMC)
  Input data randomization

There is no single solution that fits all purposes.

SMC vs Randomization

[Figure (Pinkas et al.): qualitative comparison of randomization schemes and SMC in terms of overhead, accuracy, and privacy]

Secure Multi-party Computation

Multiple parties share the burden of creating the data aggregate.

Final processing, if needed, can be delegated to any party.

The computation is considered secure if each party only knows its input and the result of its computation.

SMC

[Figure: multiple parties jointly computing an aggregate over their local data]

Each party knows its input and the result of the operation, and nothing else.

Key Assumptions

The ONLY information that can be leaked is the information we can get as the overall output of the computation (aggregation) process.

Users are not malicious, but may be honest-but-curious. All users are supposed to abide by the SMC protocol.

Otherwise, the case of malicious participants is not easy to model! [Pinkas et al., Agrawal]

"Tools for Privacy Preserving Distributed Data Mining", Clifton et al. [SIGKDD]

Secure Sum

Given values $x_1, x_2, \ldots, x_n$ belonging to n entities, we need to compute

$\sum_{i=1}^{n} x_i$

such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data).
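As a concrete illustration, the following is a minimal sketch of the ring-based secure sum, assuming values are summed modulo a large constant; the modulus and the sample values (which match the example on the next slide) are illustrative choices, not from the slides.

```python
# Minimal sketch of the ring-based secure sum: the master picks a random
# offset R, the masked running total is passed around the ring, each party
# adds its private value, and the master removes R at the end.
import random

MODULUS = 2**64                          # illustrative modulus, not from the slides

def secure_sum(values):
    R = random.randrange(MODULUS)        # known only to the master
    running = R
    for v in values:                     # each party only ever sees a total masked by R
        running = (running + v) % MODULUS
    return (running - R) % MODULUS       # master subtracts its own offset

print(secure_sum([10, 20, 15, 45, 50]))  # -> 140, as in the example that follows
```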

Examples (Secure Sum)

[Figure: ring example with values 10, 20, 15, 45, 50. The master picks R = 15 and sends R + 10; the masked totals R + 30, R + 45, R + 90, R + 140 are passed around the ring, and the master computes Sum = (R + 140) - R = 140]

Problem: colluding members.

Solution: divide values into shares and have each share traverse a disjoint path (no site has the same neighbor twice).

Split path solution

[Figure: each value is split into two shares and two disjoint rings are run with random seeds R1 = 15 and R2 = 12. Along one ring the masked partial sums R + 5, R + 15, R + 22.5, R + 45, R + 70 are passed. Sum = (R1 + 70 - R1) + (R2 + 70 - R2) = 140]
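A sketch of the same split-path idea, reusing secure_sum() and MODULUS from the sketch above. Instead of halving each value as the figure does, it uses random additive shares, which is equivalent for the final sum; the data is again the slide example.

```python
# Split-path defence against colluding neighbours: each party splits its value
# into two shares, and each share travels its own disjoint ring with its own
# random offset. No single ring ever sees a party's whole value.
import random

def split_path_sum(values):
    shares_a = [random.randrange(MODULUS) for _ in values]            # random first shares
    shares_b = [(v - a) % MODULUS for v, a in zip(values, shares_a)]  # remaining shares
    return (secure_sum(shares_a) + secure_sum(shares_b)) % MODULUS

print(split_path_sum([10, 20, 15, 45, 50]))  # -> 140, matching the figure
```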

Secure Set Union

Consider n sets $S_1, S_2, \ldots, S_n$. Compute

$U = S_1 \cup S_2 \cup S_3 \cup \ldots \cup S_n$

such that each entity ONLY knows U and nothing else.

Secure Set Union

Uses the properties of commutative encryption.

For any permutations i, j of the key order, the following holds:

$E_{K_{i_1}}(\ldots E_{K_{i_n}}(M) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M) \ldots)$

and for two different items $M_1 \ne M_2$ the probability of a collision is negligible:

$P\left(E_{K_{i_1}}(\ldots E_{K_{i_n}}(M_1) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M_2) \ldots)\right) < \epsilon$

Secure Set Union

Each of the n entities holds its own key, $K_1, K_2, \ldots, K_n$. A global union set U is passed around. Each site:

Encrypts its items, creating an array M, and adds it to U.

Upon receiving U, an entity encrypts all items in U that it did not encrypt before.

In the end, all entries are encrypted with all keys. Remove the duplicates:
  Identical plaintexts result in the same ciphertext regardless of the order in which the encryption keys were applied.

Decryption of U: done by all entities, in any order.
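A minimal sketch of this protocol, assuming a Pohlig-Hellman style commutative cipher $E_k(m) = m^k \bmod p$; the slides do not fix a particular cipher, and the prime, the hashing of items into the group, and the three example sites are illustrative assumptions.

```python
# Secure set union via commutative encryption, simulated in one process.
import hashlib
import random
from math import gcd

P = 2**127 - 1                     # a large (Mersenne) prime, illustration only

def make_key():
    # the exponent must be invertible mod P-1 so that decryption k^-1 exists
    while True:
        k = random.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:
            return k

def encode(item):
    # hash an item into the multiplicative group mod P
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % (P - 1) + 1

def enc(c, k):
    return pow(c, k, P)

sets = [{"A"}, {"A", "C"}, {"A"}]              # local sets of three sites
keys = [make_key() for _ in sets]              # K1, K2, K3, kept private

# Step 1: each site encrypts its own items with its own key.
encrypted = [{enc(encode(x), keys[i]) for x in s} for i, s in enumerate(sets)]

# Step 2: each encrypted set is passed around and every *other* site adds its
# own layer of encryption; commutativity makes the application order irrelevant.
for i in range(len(encrypted)):
    for j, k in enumerate(keys):
        if j != i:
            encrypted[i] = {enc(c, k) for c in encrypted[i]}

# Step 3: merge. Identical plaintexts now have identical ciphertexts, so the
# set union drops duplicates without revealing which site contributed them.
union = set().union(*encrypted)
print("union size:", len(union))               # 2, i.e. |{A, C}|
```

Decryption, in which each site applies its inverse exponent $k^{-1} \bmod (p-1)$ in any order, is omitted for brevity.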

Secure Set Union (Example)

[Figure: three sites holding {A}, {C}, {A}. The union set evolves as it is passed around:

U = {E1(A)}
U = {E2(E1(A)), E2(C)}
U = {E3(E2(E1(A))), E3(E2(C)), E3(A)}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E1(E3(A))}
U = {E3(E2(E1(A))), E1(E3(E2(C))), E2(E1(E3(A)))}

By commutativity, E3(E2(E1(A))) = E2(E1(E3(A))), so the duplicate A can be removed.]

Problem: computation overhead; the number of exchanged messages is O(n*m).

Problems with SMC

Scalability

High overhead

Details of the trust model assumptions:
  Users are honest and follow the protocol

Randomization Approach

"Privacy Preserving Data Mining", Agrawal et al. [SIGMOD]

Generally applied to provide estimates of data distributions rather than single point estimates.

A user is allowed to alter the value provided to the aggregator.

The alteration scheme should be known to the aggregator.

The aggregator estimates the overall global distribution of the input by removing the randomization from the aggregate data.

Randomization Approach (cont'd)

Assumptions:
  Users are willing to divulge some form of their data.
  The aggregator is not malicious but may be honest-but-curious (it follows the protocol).

Two main data perturbation schemes:
  Value-class membership (discretization)
  Value distortion

Randomization Methods

Value Distortion Method

Given a value $x_i$, the client is allowed to report a distorted value $(x_i + r)$, where $r$ is a random variable drawn from a known distribution.

Uniform distribution: $r \sim U[-\alpha, +\alpha]$, i.e. $\mu = 0$

Gaussian distribution: $r \sim N(0, \sigma)$
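A minimal sketch of value distortion at the client side; the noise parameters and the synthetic attribute below are illustrative assumptions.

```python
# Value distortion: each client reports x_i + r, where r comes from a noise
# distribution that is publicly known to the aggregator.
import numpy as np

rng = np.random.default_rng(0)
true_values = rng.normal(loc=30.0, scale=5.0, size=10_000)   # some sensitive attribute

alpha = 10.0                                                  # uniform noise half-width
reported_uniform = true_values + rng.uniform(-alpha, alpha, size=true_values.size)

sigma = 10.0                                                  # Gaussian noise std dev
reported_gaussian = true_values + rng.normal(0.0, sigma, size=true_values.size)

# Individual reports are heavily distorted, yet aggregate statistics survive:
print(true_values.mean(), reported_uniform.mean(), reported_gaussian.mean())
```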

Quantifying the privacy of different randomization schemes

Confidence        50%          95%           99.9%
Discretization    0.5 x W      0.95 x W      0.999 x W
Uniform           0.5 x 2α     0.95 x 2α     0.999 x 2α
Gaussian          1.34 x σ     3.92 x σ      6.8 x σ

(W is the width of the discretization interval; the uniform noise is drawn from [-α, +α].)

The Gaussian distribution provides the best privacy at higher confidence levels.
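The 95% column can be checked with nothing but the standard normal quantile: the symmetric 95% interval of N(0, σ) has width about 3.92σ, and a uniform U[-α, +α] covers 95% of its mass in an interval of width 0.95 x 2α. A quick check, standard library only:

```python
# Verify the 95% row of the table above.
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)    # upper quantile of the two-sided 95% interval
print(round(2 * z, 2), "x sigma")  # -> 3.92 x sigma for the Gaussian
print(0.95 * 2, "x alpha")         # -> 1.9 x alpha, i.e. 0.95 x 2*alpha for the uniform
```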

Problem Statement

[Figure: Entity I, Entity II, ..., Entity m each send their randomized values $x_{i1} + y_{i1}, x_{i2} + y_{i2}, \ldots, x_{ik} + y_{ik}$ to a Global Aggregator. From the pooled values $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$ and the known noise density $f_Y(a)$, an estimator reconstructs the original density $f_X(z)$.]

Reconstruction of the Original Distribution

The reconstruction problem can be viewed in the general framework of "inverse problems".

Inverse problems: describing a system's internal structure from indirect, noisy data.

Bayesian estimation is an effective tool for such settings.

Formal problem statement

Given a one-dimensional array of randomized data

$x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

where the $x_i$'s are iid random variables, each with the same distribution as the random variable X, and the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$.

Purpose: estimate $F_X$.

Background: Bayesian Inference

An estimation method that involves collecting observational data and using it to adjust (either support or refute) a prior belief.

The previous knowledge (hypothesis) has an established probability, called the prior probability.

The adjusted hypothesis given the new observational data is called the posterior probability.

Bayesian Inference

Let $P(H_0)$ be the prior probability. Then Bayes' rule states that the posterior probability of $H_0$ given an observation D is:

$P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D)}$

Bayes' rule follows from the general form of the joint probability theorem:

$P(D, H_0) = P(H_0 \mid D)\, P(D)$

Bayesian Inference (Classical Example)

Two boxes:
  Box-I: 30 red balls and 10 white balls
  Box-II: 20 red balls and 20 white balls

A person draws a red ball; what is the probability that the ball came from Box-I?

Prior probability: P(Box-I) = 0.5

From the data we know that:
  P(Red | Box-I) = 30/40 = 0.75
  P(Red | Box-II) = 20/40 = 0.5

Example (cont'd)

Now, given the new observation (the red ball), we want the posterior probability of Box-I, i.e. P(Box-I | Red):

$P(\text{Box-I} \mid \text{Red}) = \frac{P(\text{Red} \mid \text{Box-I})\, P(\text{Box-I})}{P(\text{Red})}$

Computing the joint probability:

$P(\text{Red}) = P(\text{Red}, \text{Box-I}) + P(\text{Red}, \text{Box-II})$

$P(\text{Red}) = P(\text{Red} \mid \text{Box-I})\, P(\text{Box-I}) + P(\text{Red} \mid \text{Box-II})\, P(\text{Box-II})$

$P(\text{Red}) = 0.75 \times 0.5 + 0.5 \times 0.5$

Substituting:

$P(\text{Box-I} \mid \text{Red}) = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6$

The posterior probability of Box-I is amplified by the observation of the red ball.

Back: Formal problem statement

Given a one-dimensional array of randomized data

$x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$

where the $x_i$'s are iid random variables, each with the same distribution as the random variable X, and the $y_i$'s are realizations of a globally known random distribution with CDF $F_Y$.

Purpose: estimate $F_X$.

Continuous probability distributions

$P\{r \le z\} = \int_{-\infty}^{z} f_X(k)\, dk = CDF(z) = F_X(z)$

$P\{r = z\} = 0$

$\int_{-\infty}^{+\infty} f_X(k)\, dk = 1$

CDF and PDF

Estimation of $F_X$

Bayes' rule (posterior probability):

$P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D)}$

Applying Bayes' rule, the posterior CDF of $X_1$ given the observation $w_1 = x_1 + y_1$ is

$F_{X_1}(a) = \int_{-\infty}^{a} f_{X_1}(z \mid X_1 + Y_1 = w_1)\, dz$

$F_{X_1}(a) = \int_{-\infty}^{a} \frac{f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)}{f_{X_1+Y_1}(w_1)}\, dz$

Estimation of $F_X$

We want to evaluate $f_{X_1+Y_1}(w_1)$:

$f_{X_1+Y_1}(w_1) = \int_{-\infty}^{+\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\, dk$

Substituting:

$F_{X_1}(a) = \frac{\int_{-\infty}^{a} f_{X_1+Y_1}(w_1 \mid X_1 = z)\, f_{X_1}(z)\, dz}{\int_{-\infty}^{+\infty} f_{X_1+Y_1}(w_1 \mid X_1 = k)\, f_{X_1}(k)\, dk}$

Estimation of $F_X$

Simplification (independence of X and Y): $f_{X_i+Y_i}(w_i \mid X_i = z) = f_Y(w_i - z)$, so

$F_{X_i}(a) = \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Estimation of $F_X$

Averaging over all n observations:

$F_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Estimation of the PDF $f_X$

$f_X$ is just the derivative of the CDF:

$f_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X(z)\, dz}$

Algorithm

$f_X^0 :=$ uniform distribution
$j := 0$

While (not stopping condition):

    $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^j(a)}{\int_{-\infty}^{+\infty} f_Y(w_i - z)\, f_X^j(z)\, dz}$

    $j := j + 1$

Stopping Criteria

The algorithm should terminate when $f_X^{j+1}(a) \approx f_X^j(a)$.

For each round a $\chi^2$ goodness-of-fit test is performed.

Iteration is stopped when the difference between the two estimates is too small (lower than a certain threshold).
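A discretized sketch of this reconstruction loop. The Gaussian noise with σ = 0.25 matches the evaluation slide that follows, while the synthetic true distribution, the grid, and the simple max-difference stopping test (used here instead of the χ² test) are illustrative assumptions.

```python
# Iterative Bayesian reconstruction of f_X from randomized values w = x + y,
# discretized on a fixed grid.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0.2, 0.8, size=n)              # hidden "true" values (demo only)
sigma = 0.25
w = x + rng.normal(0.0, sigma, size=n)         # what the aggregator actually receives

def f_Y(v):                                    # publicly known noise density N(0, sigma)
    return np.exp(-v**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(0.0, 1.0, 51)               # support of X, discretized
dz = grid[1] - grid[0]
f_X = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))   # f_X^0 := uniform

fy = f_Y(w[:, None] - grid[None, :])           # fy[i, a] = f_Y(w_i - a), fixed across rounds
for _ in range(500):
    denom = (fy * f_X[None, :]).sum(axis=1) * dz        # per-observation normalizer
    f_new = (fy * f_X[None, :] / denom[:, None]).mean(axis=0)
    f_new /= f_new.sum() * dz                           # keep a proper density
    done = np.abs(f_new - f_X).max() < 1e-4             # simple stopping criterion
    f_X = f_new
    if done:
        break

print("estimated mass on [0.2, 0.8]:",
      round(f_X[(grid >= 0.2) & (grid <= 0.8)].sum() * dz, 3))
```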

Evaluation

[Figure: reconstruction results with a Gaussian randomizing function, µ = 0, σ = 0.25]

Evaluation

[Figure: reconstruction results with a uniform randomizing function on [-0.5, 0.5]]

How is this different from the Kalman Estimator?

Both are estimation techniques.

Kalman is stateless.

In the Kalman filter case we knew the distribution, and estimation is used to validate whether the trend of the data matches that distribution.

In Bayesian inference the observation data is used to adjust the prior hypothesis (probability distribution).

Is the Problem Solved?

Suppose a client randomizes age records using a uniform random variable on [-50, 50].

If the aggregator receives the value 120, it knows with 100% confidence that the actual age is ≥ 70.

Simple randomization does not guarantee absolute privacy.

How to achieve a better randomization scheme

"Limiting Privacy Breaches in Privacy Preserving Data Mining", Evfimievski et al.

Define an evaluation metric of how privacy preserving a scheme is.

Based on the developed metric, develop a randomization scheme that abides by it.

How privacy preserving is a scheme?

Information-theoretic approach:
  Computes the average information disclosed by a randomized attribute, by measuring the mutual information between the actual and the randomized attribute.

Privacy breach:
  Defines a criterion that should be satisfied for a randomization scheme to be privacy preserving.

What is a privacy breach?

A privacy breach occurs when the disclosure of a randomized value $y_i$ to the aggregator reveals that a certain property $Q(x_i)$ of the "individual" input $x_i$ holds with high probability.

Privacy Breach

Back to Bayes’

Prior probability: $P(Q(x))$, where $Q(x)$ is the property.

Posterior probability: $P(Q(x) \mid y_i)$

Amplification

Amplification is defined in terms of the transition probability $P[x \to y]$, where y is a fixed randomized output value.

Intuitive definition: if there are many $x_i$'s that can be mapped to y by the randomizing scheme, then disclosing y gives little information about $x_i$.

We say that the scheme amplifies the probability $P[x \to y]$.

Amplification factor

Let R be a randomization operator and let $y \in V_y$ be a randomized value of x. R is at most $\gamma$-amplifying for y if, for all inputs $x_1, x_2$,

$\frac{P[x_1 \to y]}{P[x_2 \to y]} \le \gamma$

Revealing R(x) = y will not cause a privacy breach (from a prior probability $p_1$ to a posterior probability $p_2$) if:

$\frac{p_2 (1 - p_1)}{p_1 (1 - p_2)} > \gamma$
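A minimal sketch of checking this condition for a finite randomization operator given as a transition matrix; the small randomized-response operator and the breach thresholds below are illustrative assumptions, not from the slides.

```python
# Check gamma-amplification for a finite operator trans[x][y] = P[x -> y].
import numpy as np

# Each input value is kept with prob 0.5 and flipped to one of the others w.p. 0.25
trans = np.array([[0.50, 0.25, 0.25],
                  [0.25, 0.50, 0.25],
                  [0.25, 0.25, 0.50]])

def amplification(trans, y):
    """gamma for output y: the largest ratio P[x1 -> y] / P[x2 -> y]."""
    col = trans[:, y]
    return col.max() / col.min()

def no_breach(p1, p2, gamma):
    """Revealing y cannot upgrade belief in a property from p1 to p2 if this holds."""
    return p2 * (1 - p1) / (p1 * (1 - p2)) > gamma

for y in range(trans.shape[1]):
    g = amplification(trans, y)
    print(f"y={y}: gamma={g:.1f}, blocks a 10% -> 60% breach: {no_breach(0.1, 0.6, g)}")
```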

Summary

No one solution can fit all.

Which area looks more promising?

Can we create randomization schemes that are robust across a wide range of applications and different data distributions?

How do we deal with the case of malicious participants?
