
Privacy-Preserving Data Mining

Jaideep Vaidya (jsvaidya@rbs.rutgers.edu)

Joint work with

Chris Clifton (Purdue University)

Outline

• Introduction
  – Privacy-Preserving Data Mining
  – Horizontal / Vertical Partitioning of Data
  – Secure Multi-party Computation

• Privacy-Preserving Outlier Detection

• Privacy-Preserving Association Rule Mining

• Conclusion

Back in the good ol’ days

Safeway, Dominick’s, Jewel

Now … and the Future

A “real” example

• Ford / Firestone
  – Individual databases
  – Possible to join both databases (find corresponding transactions)
  – Commercial reasons not to share data
  – Valuable corporate information: cost structures / business structures

• Ford Explorers with Firestone tires → Tread Separation Problems (Accidents!)

• Might have been able to figure out the problem a bit earlier (tires from the Decatur, Ill. plant, under certain conditions)

Public (mis)Perception of Data Mining: Attack on Privacy

• Fears of loss of privacy constrain data mining
  – Protests over a National Registry in Japan
  – Data Mining Moratorium Act
    • Would stop all data mining R&D by the DoD
  – Terrorism Information Awareness program ended
    • Data mining could be a key technology

Is Data Mining a Threat?

• Data mining summarizes data
  – (Possible?) exception: anomaly / outlier detection

• Summaries aren’t private
  – Or are they?
  – Does generating them raise issues?

• Data mining can be a privacy solution
  – Data mining enables safe use of private data

Privacy-Preserving Data Mining

• How can we mine data if we cannot see it?

• Perturbation
  – Agrawal & Srikant, Evfimievski et al.
  – Extremely scalable, approximate results
  – Debate about security properties

• Cryptographic
  – Lindell & Pinkas, Vaidya & Clifton
  – Completely accurate, completely secure (tight bound on disclosure), appropriate for small numbers of parties

• Condensation / Hybrid

Assumptions

• Data distributed
  – Each data set held by a source authorized to see it
  – Nobody is allowed to see the aggregate data

• Knowing all data about an individual violates privacy

• Data holders don’t want to disclose data
  – Won’t collude to violate privacy

Gold Standard: Trusted Third Party

Horizontal Partitioning of Data

Bank of America:
CC#   Active?  Delinquent?  Amount
123   Yes      Yes          <$300
324   No       No           $300-500
919   Yes      No           >$1000

Chase Manhattan:
CC#   Active?  Delinquent?  Amount
3450  Yes      Yes          <$300
4127  No       No           $300-500
8772  Yes      No           >$1000

Vertical Partitioning of Data

Medical Records                  Cell Phone Data
TID  Brain Tumor?  Diabetes?     TID  Model  Battery
RPJ  Yes           Diabetic      RPJ  5210   Li/Ion
CAC  No Tumor      No            CAC  none   none
PTR  No Tumor      Diabetic      PTR  3650   NiCd

Global Database View:
TID  Brain Tumor?  Diabetes?  Model  Battery

Secure Multi-Party Computation (SMC)

• Given a function f and n inputs x1, x2, …, xn, distributed at n sites, compute the result

  y = f(x1, x2, …, xn)

  while revealing nothing to any site except its own input(s) and the result.

Secure Multi-Party Computation: It can be done!

• Yao’s Millionaires’ problem (Yao ’86)
  – Secure computation possible if the function can be represented as a circuit
  – Idea: securely compute each gate
    • Continue to evaluate the circuit

• Extended to multiple parties (BGW/GMW ’87)

• Biggest problem: efficiency
  – Will not work for lots of parties / large quantities of data

SMC – Models of Computation

• Semi-honest model
  – Parties follow the protocol faithfully

• Malicious model
  – Anything goes!

• Provably secure

• In either case, input can always be modified

What is an Outlier?

• An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O

• Centralized solution from Knorr and Ng
  – Nested-loop comparison
  – Maintain a count of objects inside the distance threshold
  – If the count exceeds the threshold, declare non-outlier and move to the next object

• Clever processing order minimizes I/O cost
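The centralized nested-loop test can be sketched as follows (a minimal Python illustration, not from the slides; the function and parameter names `is_db_outlier`, `data`, `p`, `dt` are mine). An object is a non-outlier as soon as more than (1 − p)·|T| objects lie within distance dt, which lets the inner loop stop early.

```python
import math

def is_db_outlier(o, data, p, dt):
    """o is a DB(p,dt)-outlier if at least fraction p of the objects in
    data lie at distance greater than dt from o."""
    max_close = (1 - p) * len(data)
    close = 0
    for x in data:
        if math.dist(o, x) <= dt:
            close += 1
            if close > max_close:
                return False  # enough close neighbors: not an outlier
    return True

def find_outliers(data, p, dt):
    return [o for o in data if is_db_outlier(o, data, p, dt)]

data = [(0, 0), (0.1, 0), (0, 0.1), (0.2, 0.1), (10, 10)]
print(find_outliers(data, 0.7, 1.0))  # [(10, 10)]
```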

Privacy-Preserving Solution

• Key idea: share splitting
  – Computations leave results (randomly) split between parties
  – Only outcome is whether the count of points within the distance threshold exceeds the outlier threshold

• Requires pairwise comparison of all points
  – Failure to compare all points reveals information about non-outliers
    • This alone makes it possible to cluster points
    • This is a privacy violation
  – Asymptotically equivalent to Knorr & Ng

Solution: Horizontal Partition

• Compare locally with your own points

• For remote points, get a random share of the distance
  – Calculate a random share of “exceeds threshold or doesn’t”

• Sum shares and test if there are enough “close” points
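The summing step can be sketched with additive secret sharing (an illustrative Python sketch, not the slides’ protocol: the 0/1 indicators are given in the clear here, and the final comparison, done via secure circuit evaluation in the real protocol, is done openly):

```python
import random

MOD = 2**32  # shares are additive modulo a large number

def split(value):
    """Split a value into two random additive shares mod MOD."""
    r = random.randrange(MOD)
    return r, (value - r) % MOD

# Illustrative 0/1 "within threshold" indicators for five point pairs
indicators = [1, 0, 1, 1, 0]
shares_a, shares_b = zip(*(split(v) for v in indicators))

# Each party sums its own shares locally; each sum alone looks random
sum_a = sum(shares_a) % MOD
sum_b = sum(shares_b) % MOD

# Only the recombined sum (inside a secure comparison circuit in the
# real protocol) reveals the count of close points
count = (sum_a + sum_b) % MOD
print(count)  # 3
```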


Random share of distance

• The x^2 and y^2 terms are local; the sum of the xy terms is a scalar product
  – Several protocols for share-splitting scalar product (Du & Atallah ’01; Vaidya & Clifton ’02; Ioannidis, Grama & Atallah ’02)

  Distance(X, Y)^2 = Σ_{r=1}^{m} (x_r − y_r)^2
                   = Σ_{r=1}^{m} x_r^2 + Σ_{r=1}^{m} y_r^2 − 2 Σ_{r=1}^{m} x_r y_r
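The decomposition can be checked in a few lines of Python. This is only a sketch: the scalar-product subprotocol is replaced by an openly computed stand-in (`insecure_shared_scalar_product`, my name) that merely returns random additive shares; a real protocol produces these shares without either party seeing the other’s vector.

```python
import random

def insecure_shared_scalar_product(x, y):
    """Stand-in for a secure scalar-product protocol: returns random
    additive shares of the dot product x . y."""
    dot = sum(a * b for a, b in zip(x, y))
    r = random.uniform(-100, 100)
    return r, dot - r  # one share per party

x = [1.0, 2.0, 3.0]  # held by party 1
y = [0.5, 1.5, 2.5]  # held by party 2

s1, s2 = insecure_shared_scalar_product(x, y)

# Party 1 adds its local sum(x_r^2); party 2 adds its local sum(y_r^2)
share1 = sum(v * v for v in x) - 2 * s1
share2 = sum(v * v for v in y) - 2 * s2

# The shares sum to the true squared distance
true_dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
print(abs((share1 + share2) - true_dist2) < 1e-9)  # True
```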

Shares of “Within Threshold”

• Goal: is x + y ≤ dt ?

• Essentially Yao’s Millionaires’ problem (Yao ’86)
  – Represent the function to be computed as a circuit
  – Cryptographic protocol gives random shares of each wire

• Solves “sum of shares from within dt exceeds minimum” as well

Vertically Partitioned Data

• Each party computes its part of the distance:

  Distance(X, Y)^2 = Σ_{r=1}^{a} (x_r − y_r)^2 + Σ_{r=a+1}^{m} (x_r − y_r)^2

• Secure comparison (circuit evaluation) gives each party shares of 1/0 (close / not close)

• Sum and compare as with horizontal partitioning
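Since each party holds a disjoint subset of the attributes, its partial sum needs no interaction at all; a minimal Python check of the split (the data and the split point `a = 2` are illustrative):

```python
# Attributes 0..1 belong to party 1, attributes 2..4 to party 2
x = [1.0, 4.0, 2.0, 0.0, 3.0]
y = [2.0, 1.0, 2.0, 1.0, 5.0]
a = 2

# Each party computes its part of the squared distance locally
part1 = sum((x[r] - y[r]) ** 2 for r in range(a))           # party 1
part2 = sum((x[r] - y[r]) ** 2 for r in range(a, len(x)))   # party 2

full = sum((xr - yr) ** 2 for xr, yr in zip(x, y))
print(part1 + part2 == full)  # True
```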

Why is this Secure?

• Random shares are indistinguishable from random values
  – Contain no knowledge in isolation
  – Assuming no collusion, so shares are viewed in isolation

• Number of values (= number of shares) is known
  – Nothing new revealed

• Too few close points is the outlier definition
  – This is the desired result

• No knowledge that can’t be discovered from one’s own input and the result!

Conclusion (Outlier Detection)

• Outlier detection feasible without revealing anything but the outliers
  – Possibly expensive (quadratic)
  – But a more efficient solution for this definition of outlier inherently reveals potentially privacy-violating information

• Key: privacy of non-outliers preserved
  – The reason why outliers are outliers is also hidden

• Allows search for “unusual” entities without disclosing private information about entities

Association Rules

• Association rules are a common data mining task
  – Find A, B, C such that AB ⇒ C holds frequently (e.g. Diapers ⇒ Beer)

• Fast algorithms for centralized and distributed computation
  – Basic idea: for AB ⇒ C to be frequent, AB, AC, and BC must all be frequent
  – Require sharing data

• Generic Secure Multiparty Computation too expensive

Association Rule Mining

• Find out if itemset {A1, B1} is frequent (i.e., if the support of {A1, B1} ≥ k)

• The support of an itemset is the number of transactions in which all attributes of the itemset are present

• For binary data, support = |{i : Ai ∧ Bi}|

Party A:       Party B:
Key  A1        Key  B1
k1   1         k1   0
k2   0         k2   1
k3   0         k3   0
k4   1         k4   1
k5   1         k5   1

• Idea based on the TID-list representation of data
  – Represent attribute A as TID-list Atid
  – Support of ABC is |Atid ∩ Btid ∩ Ctid|

• Use a secure protocol for the cardinality of set intersection to find candidate sets
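The TID-list view of the example tables can be sketched in Python (values taken from the tables above; in the protocol itself these lists are only ever exchanged in encrypted form):

```python
# TID-lists from the example: transactions where the attribute is 1
A_tid = {"k1", "k4", "k5"}  # attribute A1, held by party A
B_tid = {"k2", "k4", "k5"}  # attribute B1, held by party B

# Support of itemset {A1, B1} = size of the TID-list intersection
support = len(A_tid & B_tid)
print(support)  # 2 (transactions k4 and k5)
```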

Association Rule Mining

Cardinality of Set Intersection

• Use a secure commutative hash function
  – Pohlig–Hellman encryption: commutative, so the order of encryption does not matter, i.e. Ei(Ej(X)) = Ej(Ei(X))

• Each party generates its own encryption key

• All parties encrypt all the input sets
  – Result: the number of objects common to all sets
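A toy Python sketch of Pohlig–Hellman-style commutative encryption: E_k(x) = x^k mod p, with exponent keys coprime to p − 1 so encryption is a permutation. The prime and element encodings here are illustrative only; real deployments need large, properly chosen primes and hashing of elements into the group.

```python
from math import gcd
import random

p = 2**127 - 1  # a Mersenne prime; illustrative parameter choice

def keygen():
    # Exponent must be coprime to p - 1 so x -> x^k is invertible
    while True:
        k = random.randrange(3, p - 1)
        if gcd(k, p - 1) == 1:
            return k

def enc(key, x):
    """Pohlig-Hellman-style 'commutative hash': x^key mod p."""
    return pow(x, key, p)

k1, k2 = keygen(), keygen()
x = 123456789

# Commutativity: encryption order does not matter
print(enc(k1, enc(k2, x)) == enc(k2, enc(k1, x)))  # True

# Cardinality of set intersection on doubly encrypted sets:
# equal ciphertexts correspond exactly to equal plaintexts
X = {11, 22, 33, 44}
Y = {22, 44, 55}
X_enc = {enc(k2, enc(k1, v)) for v in X}
Y_enc = {enc(k1, enc(k2, v)) for v in Y}
print(len(X_enc & Y_enc))  # 2
```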

Cardinality of Set Intersection

• Hashing
  – All parties hash (encrypt) all sets with their key

• Initial intersection
  – Each party finds the intersection of all sets except its own

• Final intersection
  – Parties exchange the initial intersection sets and compute the intersection of all sets

Computing Size of Intersection

Example with three parties:

  X = {α, λ, σ, β}
  Y = {λ, σ, φ, υ, β}
  Z = {α, β, κ, λ, γ}

Pairwise intersections: X∩Y = {λ, σ, β}, X∩Z = {α, β, λ}, Y∩Z = {λ, β}

Final result: X∩Y∩Z = {λ, β}, an intersection of size 2.

Proof of Security

• Proof by simulation

• What is known
  – The size of the intersection set
  – Site i learns the sizes of the intersections of the (encrypted) sets it sees

• How it can be simulated
  – Protocol is symmetric, so simulating the view of one party is sufficient

Proof of Security

• Hashing
  – Party i receives an encrypted set from party i−1
  – Can use random numbers to simulate this

• Intersection
  – Party i receives the fully hashed sets of all parties
  – From these it learns only the sizes of all subset intersections, e.g. with
    |ABC| = 2, |AB| = 3, |AC| = 4, |BC| = 2, |A| = 6, |B| = 7, |C| = 8
    the exclusive regions are: AB only = 3−2 = 1, AC only = 4−2 = 2, BC only = 2−2 = 0,
    A only = 6−2−1−2 = 1, B only = 7−2−1−0 = 4, C only = 8−2−2−0 = 4
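The exclusive-region arithmetic can be sketched in Python (numbers from the example; variable names are mine), showing that the sizes of all subset intersections determine every region of the Venn diagram by subtraction:

```python
# Intersection sizes from the example
sizes = {"A": 6, "B": 7, "C": 8, "AB": 3, "AC": 4, "BC": 2, "ABC": 2}

abc = sizes["ABC"]
ab_only = sizes["AB"] - abc                     # 3 - 2 = 1
ac_only = sizes["AC"] - abc                     # 4 - 2 = 2
bc_only = sizes["BC"] - abc                     # 2 - 2 = 0
a_only = sizes["A"] - ab_only - ac_only - abc   # 6 - 1 - 2 - 2 = 1
b_only = sizes["B"] - ab_only - bc_only - abc   # 7 - 1 - 0 - 2 = 4
c_only = sizes["C"] - ac_only - bc_only - abc   # 8 - 2 - 0 - 2 = 4

print(a_only, b_only, c_only)  # 1 4 4
```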

Simulating Fully Encrypted Sets

• The simulator fills the three sets with random values R1, R2, … whose overlaps match the known intersection sizes:

  A = {R1, R2, R3, R4, R5, R6}
  B = {R1, R2, R3, R7, R8, R9, R10}
  C = {R1, R2, R4, R5, R11, R12, R13, R14}

  (|A∩B| = 3, |A∩C| = 4, |B∩C| = 2, |A∩B∩C| = 2, matching the example)

Optimized version

Association Rule Mining (Revisited)

• Naïve algorithm: simply use APRIORI; a single set intersection determines the frequency of a single candidate itemset
  – Thousands of itemsets

• Key intuition
  – The set intersection algorithm developed also allows computation of intermediate sets
  – All parties get fully encrypted sets for all attributes
  – Local computation then allows efficient discovery of all association rules

Communication Cost

• k parties, set size m, p frequent attributes
  – k·(2k−2) = O(k²) messages
  – p·(2p−2)·m·(encrypted message size) = O(p²m) bits
  – k rounds

• Independent of the number of itemsets found

Other Results

• ID3 decision tree learning
  – Horizontal partitioning: Lindell & Pinkas ’00
  – Also vertical partitioning (Du; Vaidya)

• Association rules
  – Horizontal partitioning: Kantarcıoğlu

• K-Means / EM clustering
• K-nearest neighbor
• Naïve Bayes, Bayes network structure
• And many more

Challenges

• What do the results reveal?

• A general approach (instead of per data mining technique)

• Experimental results

• Incentive Compatibility

• Note: Upcoming book in the Advances in Information Security series by Springer-Verlag

Questions
