
Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Page 1: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy

March 3, 2005, 10th Estonian Winter School in Computer Science

Privacy Preserving Data Mining

Lecture 3

Non-Cryptographic Approaches for Preserving Privacy

(Based on Slides of Kobbi Nissim)

Benny Pinkas, HP Labs, Israel

Page 2: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Why not use cryptographic methods?

• Many users contribute data. We cannot require them to participate in a cryptographic protocol.
  – In particular, we cannot require peer-to-peer communication between users.
• Cryptographic protocols incur considerable overhead.

Page 3: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Data Privacy

[Diagram: users query the data only through an access mechanism; the privacy question is whether their access breaches privacy.]

Page 4: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Easy Tempting Solution

A Bad Solution

Idea: (a) remove identifying information (name, SSN, …); (b) publish the data.

• But ‘harmless’ attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, …).
• Recall: DOB + gender + zip code identify people whp (with high probability).
• Worse: ‘rare’ attributes (e.g., a disease with probability 1/3000).

[Figure: the published table still keeps one row per person, e.g. Mr. Brown, Ms. John, Mr. Doe.]

Page 5: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


What is Privacy?

• Something should not be computable from the query answers.
  – E.g. Joe = {Joe’s private data}.
  – The definition should take into account the adversary’s power (computational power, number of queries, prior knowledge, …).
• Quite often it is much easier to say what is surely non-private.
  – E.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody’s private data.

Intuition: privacy is breached if it is possible to compute someone’s private information from his identity.

Page 6: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The Data Privacy Game: an Information-Privacy Tradeoff

• Private functions:
  – want to hide π_x(DB) = d_x (x’s private data)
• Information functions:
  – want to reveal f(q, DB) for queries q
• Here: an explicit definition of the private functions.
  – The question: which information functions may be allowed?
• Different from crypto (secure function evaluation):
  – There, we want to reveal f() (an explicit definition of the information function),
  – and to hide all functions π() not computable from f() (an implicit definition of the private functions).
  – The question of whether f() itself should be revealed is not asked.

Page 7: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


A simplistic model: Statistical Database (SDB)

• Database: d ∈ {0,1}^n — one private bit per person (e.g. Mr. Fox 0/1, Ms. John 0/1, Mr. Doe 0/1).
• Query: a subset q ⊆ [n].
• Answer: a_q = Σ_{i∈q} d_i.
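A minimal Python sketch of this model (class and variable names are illustrative, not from the slides): the database is a bit vector and a query is a subset of row indices.

```python
import random

class SDB:
    """Statistical database: d in {0,1}^n, one private bit per person."""
    def __init__(self, bits):
        self.d = list(bits)

    def answer(self, q):
        """Exact answer a_q = sum of d_i over i in q."""
        return sum(self.d[i] for i in q)

db = SDB(random.getrandbits(1) for _ in range(10))
print(db.answer({0, 3, 7}))   # subset-sum query over rows 0, 3, 7
```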

Page 8: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Approaches to SDB Privacy

• Studied extensively since the 70s.
• Perturbation:
  – Add randomness; give ‘noisy’ or ‘approximate’ answers.
  – Techniques:
    • Data perturbation (perturb the data, then answer queries as usual) [Reiss 84, Liew Choi Liew 85, Traub Yemini Wozniakowski 84] …
    • Output perturbation (perturb the answers to queries) [Denning 80, Beck 80, Achugbue Chin 79, Fellegi Phillips 74] …
  – Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], …
• Query restriction:
  – Answer queries accurately but sometimes disallow queries.
  – Require queries to obey some structure [Dobkin Jones Lipton 79]; this restricts the number of queries.
  – Auditing [Chin Ozsoyoglu 82, Kleinberg Papadimitriou Raghavan 01].

Page 9: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Some Recent Privacy Definitions

X – data, Y – (noisy) observation of X.

• [Agrawal, Srikant ’00] Interval of confidence:
  – Let Y = X + noise (e.g., uniform noise in [-100, 100]).
  – Perturb the input data. One can still estimate the underlying distribution.
  – Tradeoff: more noise → less accuracy but more privacy.
  – Intuition: a large possible interval → privacy preserved.
    • Given Y, we know that with c% confidence X is in [a1, a2]. For example, for Y = 200, with 50% confidence X is in [150, 250].
    • a2 − a1 defines the amount of privacy at c% confidence.
  – Problem: there might be some a-priori information about X.
    • E.g., X = someone’s age and Y = −97.
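A small sketch of the interval-of-confidence computation for Y = X + U[−w, w] (helper names are illustrative; it assumes a locally flat prior, under which the posterior of X given Y is uniform on [Y − w, Y + w]):

```python
import random

def randomize(x, w=100):
    return x + random.uniform(-w, w)       # the published value Y

def confidence_interval(y, c, w=100):
    """Symmetric interval containing X with confidence c, given Y = X + U[-w, w]."""
    return (y - c * w, y + c * w)          # width 2*c*w

print(confidence_interval(200, 0.5))       # (150.0, 250.0), as in the slide
```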

Page 10: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The [AS] scheme can be turned against itself

• Assume that N (the number of data points) is large.
  – Even if the data miner doesn’t have a-priori information about X, it can estimate the distribution of X given the randomized data Y.
• Example: the perturbation is uniform in [-1, 1].
  – [AS]: a privacy interval of size 2 with confidence 100%.
  – Let f_X put 50% of its mass uniformly on x ∈ [0, 1] and 50% on x ∈ [4, 5].
  – But after learning f_X, the value of X can easily be localized within an interval of size at most 1.
• Problem: aggregate information provides information that can be used to attack individual data.
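A sketch of the attack (illustrative helper names): once the bimodal prior is known, intersecting [Y − 1, Y + 1] with the support [0, 1] ∪ [4, 5] always leaves a single interval of length at most 1, since the gap between the components (3) exceeds the noise width (2).

```python
import random

def sample_x():
    """X uniform on [0,1] with prob. 1/2, uniform on [4,5] with prob. 1/2."""
    return random.uniform(0, 1) if random.random() < 0.5 else random.uniform(4, 5)

def localize(y):
    """Intersect [y-1, y+1] with the known support [0,1] ∪ [4,5]."""
    lo, hi = y - 1, y + 1
    pieces = [(max(lo, a), min(hi, b)) for a, b in [(0, 1), (4, 5)]]
    return [(a, b) for a, b in pieces if a <= b]

x = sample_x()
y = x + random.uniform(-1, 1)              # the [AS]-perturbed observation
print(localize(y))                          # one interval of length <= 1 containing x
```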

Page 11: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Some Recent Privacy Definitions

X – data, Y – (noisy) observation of X.

• [Agrawal, Aggarwal ’01] Mutual information:
  – Intuition: high entropy is good. I(X;Y) = H(X) − H(X|Y) (mutual information).
    • Small I(X;Y) → privacy preserved (Y provides little information about X).
  – Problem [EGS]:
    • An average notion: a privacy loss can happen with low but significant probability, while barely affecting I(X;Y).
    • Sometimes I(X;Y) seems good but privacy is breached.
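A numeric illustration of the objection (a hypothetical channel, not from the slides): let Y reveal X exactly with small probability p and reveal nothing otherwise; then I(X;Y) = p·H(X) is tiny even though a p-fraction of the users are fully exposed.

```python
# X is a uniform bit; with probability p, Y = X exactly; otherwise Y = '⊥'.
# Then H(X) = 1 bit, H(X|Y) = 1 - p, so I(X;Y) = p bits.
p = 0.001
I_XY = p * 1.0
print(f"I(X;Y) = {I_XY} bits, but Pr[full disclosure] = {p}")
```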

Page 12: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Output Perturbation (Randomization Approach)

• Exact answer to query q: a_q = Σ_{i∈q} d_i.
• Actual SDB answer: â_q.
• Perturbation E:
  – For all q: |â_q − a_q| ≤ E.
• Questions:
  – Does perturbation give any privacy?
  – How much perturbation is needed for privacy?
  – Usability?

Page 13: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Privacy Preserved by Perturbation ≈ √n

Database: d ∈_R {0,1}^n (uniform input distribution!)

Algorithm: on query q,
1. Let a_q = Σ_{i∈q} d_i.
2. If |a_q − |q|/2| < E, return â_q = |q|/2.
3. Otherwise return â_q = a_q.

E ≈ √n·(lg n)² ⇒ privacy is preserved:
– Assume poly(n) queries.
– If E ≈ √n·(lg n)², whp rule 2 is always used.

• No information about d is given!
• (But the database is completely useless…)
• Shows that a perturbation of ≈ √n is sometimes enough for privacy. Can we do better?
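The answer rule above as runnable Python (the parameters are illustrative):

```python
import math, random

n = 10_000
d = [random.getrandbits(1) for _ in range(n)]
E = math.sqrt(n) * math.log2(n) ** 2       # perturbation threshold ≈ √n (lg n)^2

def answer(q):
    a_q = sum(d[i] for i in q)
    # Rule 2: if the exact sum is within E of |q|/2, hide it behind |q|/2.
    return len(q) / 2 if abs(a_q - len(q) / 2) < E else a_q

q = random.sample(range(n), 5000)
print(answer(q))   # E ≈ 1.8e4 > n/2 here, so rule 2 fires: returns |q|/2 = 2500.0
```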

Page 14: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Perturbation ≪ √n Implies no Privacy (strong breaking of privacy)

• The previous useless database achieves the best possible perturbation.
• Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d′ s.t. dist(d, d′) = o(n).

Page 15: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The Adversary as a Decoding Algorithm

[Diagram: d → encode → partial sums a_{q1}, a_{q2}, a_{q3}, …, a_{qt} → perturb → perturbed sums â_{q1}, â_{q2}, â_{q3}, …, â_{qt} → decode → d. The adversary treats the perturbed answers as a noisy encoding of the database and decodes.]

Page 16: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Proof of Theorem [DN03]: The Adversary’s Reconstruction Algorithm

• Query phase: get â_{qj} for t random subsets q_1, …, q_t.
• Weeding phase: solve the linear program (over ℝ):
  – 0 ≤ x_i ≤ 1
  – |Σ_{i∈qj} x_i − â_{qj}| ≤ E
  (Observation: a solution always exists, e.g. x = d.)
• Rounding: let c_i = round(x_i); output c.
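The reconstruction algorithm, sketched as a feasibility LP with numpy/scipy (the experiment parameters at the bottom are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def reconstruct(n, queries, answers, E):
    """queries: list of index arrays; answers: perturbed sums; E: perturbation bound."""
    A = np.zeros((len(queries), n))
    for j, q in enumerate(queries):
        A[j, q] = 1.0
    # Weeding phase: find x with 0 <= x_i <= 1 and |sum_{i in q_j} x_i - â_j| <= E.
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([np.asarray(answers) + E, -(np.asarray(answers) - E)])
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n)
    return np.round(res.x).astype(int)     # rounding phase: c_i = round(x_i)

rng = np.random.default_rng(0)
n, t, E = 60, 600, 2                        # E = o(sqrt(n)): the attack succeeds whp
d = rng.integers(0, 2, n)
queries = [np.nonzero(rng.integers(0, 2, n))[0] for _ in range(t)]
answers = [d[q].sum() + rng.integers(-E, E + 1) for q in queries]
print("wrong bits:", int((reconstruct(n, queries, answers, E) != d).sum()))
```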

Page 17: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Why does the Reconstruction Algorithm Work?

• Consider x ∈ {0,1}^n s.t. dist(x, d) = c·n = Ω(n).
• Observation:
  – A random q contains c′·n coordinates in which x ≠ d.
  – The difference between the sums over these coordinates is, with constant probability, at least Ω(√n) (> E = o(√n)).
  – Such a q disqualifies x as a solution for the LP.
• Since the total number of queries q is polynomial, all such vectors x are disqualified with overwhelming probability.

Page 18: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Summary of Results (statistical database)

• [Dinur, Nissim 03]:
  – Unlimited adversary: a perturbation of magnitude Ω(n) is required.
  – Polynomial-time adversary: a perturbation of magnitude Ω(√n) is required (shown above).
  – In both cases the adversary may reconstruct a good approximation of the database; this disallows even very weak notions of privacy.
  – Bounded adversary, restricted to T ≪ n queries (SuLQ):
    • There is a privacy-preserving access mechanism with perturbation ≈ √T (≪ √n).
    • A chance for usability.
    • A reasonable model as databases grow larger and larger.


Page 19: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


SuLQ for Multi-Attribute Statistical Database (SDB)

• Query (q, f): q ⊆ [n], f : {0,1}^k → {0,1}.
• Answer: a_{q,f} = Σ_{i∈q} f(d_i).

[Figure: the database {d_{i,j}} is an n × k table of bits (n persons, k attributes); f is applied to each row selected by q and the results are summed. Row i is drawn from a distribution D_i; D = (D_1, D_2, …, D_n).]

Page 20: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Privacy and Usability Concerns for the Multi-Attribute Model [DN]

• Rich set of queries: subset sums over any property of the k attributes.
  – Obviously increases usability, but how is privacy affected?
• More to protect: functions of the k attributes.
• Relevant factors:
  – What is the adversary’s goal?
  – Row dependency.
• Vertically split data (between k or fewer databases):
  – Can privacy still be maintained with independently operating databases?

Page 21: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Privacy Definition - Intuition

• A 3-phase adversary:
  – Phase 0: defines a target set G of poly(n) functions g : {0,1}^k → {0,1}.
    • It will try to learn some of this information about someone.
  – Phase 1: adaptively queries the database T = o(n) times.
  – Phase 2: uses all the information gained to choose an index i of a row it intends to attack and a function g ∈ G.
    • The attack: given d_{−i}, try to guess g(d_{i,1}, …, d_{i,k}).

Page 22: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The Privacy Definition

• p^0_{i,g} – the a-priori probability that g(d_{i,1}, …, d_{i,k}) = 1.
• p^T_{i,g} – the a-posteriori probability that g(d_{i,1}, …, d_{i,k}) = 1, given the answers to the T queries and d_{−i}.
• Define conf(p) = log(p / (1−p)):
  – a 1-1 relationship between p and conf(p);
  – conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞.
• Δconf_{i,g} = conf(p^T_{i,g}) − conf(p^0_{i,g}).
• (δ, T)-privacy (“relative privacy”): for all distributions D_1, …, D_n, every row i, every function g, and any adversary making at most T queries: Pr[Δconf_{i,g} > δ] = neg(n).
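For concreteness, the conf function as code; the base-2 logarithm is inferred from conf(2/3) = 1 above:

```python
import math

def conf(p):
    """Confidence: log-odds of p (base 2, so conf(1/2)=0, conf(2/3)=1)."""
    return math.log2(p / (1 - p))

print(conf(1/2), conf(2/3), conf(0.99))   # 0.0, 1.0, ≈6.63
```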

Page 23: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The SuLQ* Database

• The adversary is restricted to T ≪ n queries.
• On query (q, f):
  – q ⊆ [n], f : {0,1}^k → {0,1} (a binary function)
  – Let a_{q,f} = Σ_{i∈q} f(d_{i,1}, …, d_{i,k}).
  – Let N ∼ Binomial(0, √T) (zero-mean binomial noise of magnitude ≈ √T).
  – Return a_{q,f} + N.

*SuLQ – Sub-Linear Queries
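A minimal sketch of the answer procedure (the noise is implemented as a centered Binomial(T, 1/2) − T/2, one plausible reading of the Binomial(0, √T) notation: zero mean, standard deviation √T/2):

```python
import random

def sulq_answer(rows, q, f, T):
    """rows: list of k-bit tuples; q: subset of [n]; f: {0,1}^k -> {0,1}."""
    a_qf = sum(f(rows[i]) for i in q)
    noise = sum(random.getrandbits(1) for _ in range(T)) - T / 2   # B(T,1/2) - T/2
    return a_qf + noise

rows = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1)]
print(sulq_answer(rows, {0, 2, 3}, lambda r: r[0] & r[2], T=100))
```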

Page 24: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Privacy Analysis of the SuLQ Database

• p^m_{i,g} – the a-posteriori probability that g(d_{i,1}, …, d_{i,k}) = 1, given d_{−i} and the answers to the first m queries.
• conf(p^m_{i,g}) describes a random walk on the line:
  – starting point: conf(p^0_{i,g});
  – compromise: conf(p^m_{i,g}) − conf(p^0_{i,g}) > δ.
• W.h.p. more than T steps are needed to reach the compromise point conf(p^0_{i,g}) + δ.

Page 25: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Usability: One multi-attribute SuLQ DB

• Statistics of any property f of the k attributes:
  – i.e., for what fraction of the (sub)population does f(d_1, …, d_k) hold?
  – Easy: just put f in the query.
• Other applications:
  – k independent multi-attribute SuLQ DBs;
  – vertically partitioned SuLQ DBs;
  – testing whether Pr[β|α] ≥ Pr[β] + ε.
• Caveat: we hide g() about a specific row (not about multiple rows).


Page 26: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Overview of Methods

• Input perturbation — [Diagram: Data → perturbation → SDB′; the user queries SDB′ and receives a response.]
• Output perturbation — [Diagram: the user sends a (restricted) query to the SDB and receives a perturbed response.]
• Query restriction — [Diagram: the user sends a (restricted) query to the SDB and receives an exact response or a denial.]

Page 27: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Query restriction

• The decision whether to answer or deny the query:
  – can be based on the content of the query and on the answers to previous queries,
  – or can be based on the above and on the content of the database.

[Diagram: the user sends a (restricted) query to the SDB; exact response or denial.]

Page 28: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Auditing

• [AW89] classify auditing as a query-restriction method:
  – “Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued.”
• Partial motivation: may allow more queries to be posed, as long as no privacy threat occurs.
• Early work: Hofmann 1977; Schlörer 1976; Chin, Ozsoyoglu 1981, 1986.
• Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003.

Page 29: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


How Auditors may Inadvertently Compromise Privacy

Page 30: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The Setting

• Dataset: d = {d_1, …, d_n}.
  – Entries d_i: real, integer, or Boolean.
• Query: q = (f, i_1, …, i_k); the database returns f(d_{i_1}, …, d_{i_k}).
  – f: min, max, median, sum, average, count, …
• Bad users will try to breach the privacy of individuals.
• Compromise = uniquely determining some d_i (a very weak definition).

Page 31: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Auditing

[Diagram: the statistical database keeps a query log q_1, …, q_i. The user submits a new query q_{i+1}; the auditor either returns the answer, or denies the query (as the answer would cause privacy loss).]

Page 32: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Example 1: Sum/Max auditing

d_i real; sum/max queries; privacy is breached if some d_i is learned.

• User: q1 = sum(d1, d2, d3). Answer: sum(d1, d2, d3) = 15.
• User: q2 = max(d1, d2, d3). Auditor: denied (the answer would cause privacy loss). (“Oh well…”)
• But q2 is denied iff d1 = d2 = d3 = 5: the max is always ≥ the mean 15/3 = 5, with equality iff all three values are equal, and only in that case does answering q2 pin down the values. “There must be a reason for the denial…” — so the denial itself reveals all three values. (“I win!”)

Page 33: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Sounds Familiar?

David Duncan, former auditor for Enron and partner in Andersen:

“Mr. Chairman, I would like to answer the committee’s questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States.”

Page 34: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Max Auditing

d_i real.

• q1 = max(d1, d2, d3, d4) → answer M_1234.
• q2 = max(d1, d2, d3) → answer M_123, or denied. If denied: d4 = M_1234 (the auditor denies exactly when answering would reveal that d4 is the maximum, so the denial reveals it anyway).
• q3 = max(d1, d2) → answer M_12, or denied. If denied: d3 = M_123.
• Repeating this scheme across d_1, …, d_n, the attacker learns an item with probability ½.

Page 35: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Boolean Auditing?

d_i Boolean; queries q1 = sum(d1, d2), q2 = sum(d2, d3), …

• Each answer is 1, or the query is denied.
• q_i is denied iff d_i = d_{i+1} (the answer 0 or 2 would reveal both bits) → the answer/denial pattern determines the database up to complement (see the sketch below).
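The leak as runnable Python (the auditor here is illustrative): the denial pattern alone reconstructs d up to complement.

```python
d = [0, 0, 1, 0, 1, 1, 1, 0]                 # secret Boolean database

def audited_sum(i):
    """Auditor for q = sum(d_i, d_{i+1}); None means 'denied'."""
    return None if d[i] == d[i + 1] else d[i] + d[i + 1]

guess = [0]                                   # assume d_0 = 0 (else take complement)
for i in range(len(d) - 1):
    a = audited_sum(i)
    # denied -> bits equal; answered (sum = 1) -> bits differ
    guess.append(guess[i] if a is None else 1 - guess[i])
print(guess == d or [1 - g for g in guess] == d)   # True: database recovered
```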

Page 36: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


The Problem

• The problem: query denials leak (potentially sensitive) information.
• Users cannot decide denials by themselves.

[Diagram: within the space of possible assignments to {d_1, …, d_n}, the denial of q_{i+1} shrinks the set of assignments consistent with (q_1, …, q_i, a_1, …, a_i).]

Page 37: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Solution to the problem: Simulatable Auditing

An auditor is simulatable if a simulator exists such that:

[Diagram: the real auditor sees the statistical database and the query log q_1, …, q_i with answers a_1, …, a_i; the simulator sees only q_1, …, q_i and a_1, …, a_i, without the database. On the new query q_{i+1}, both produce the same deny/answer decision.]

Simulation ⇒ denials do not leak information.

Page 38: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Why do Simulatable Auditors not Leak Information?

[Diagram: the deny/allow decision on q_{i+1} depends only on (q_1, …, q_i, a_1, …, a_i), so it does not further shrink the set of assignments to {d_1, …, d_n} consistent with those answers.]

Page 39: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Simulatable auditing

Page 40: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Query Restriction for Sum Queries

• Given:
  – D = {x_1, …, x_n} dataset, x_i ∈ ℝ.
  – S is a subset of the dataset. Query: Σ_{x_i∈S} x_i.
• Is it possible to compromise D?
  – Here compromise means: uniquely determine some x_i from the queries.
• Compromise is trivial if subsets may be arbitrarily small: sum(x_9) = x_9.

Page 41: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Query Set Size Control

• Do not permit queries that involve a small subset of the database.
• Compromise is still possible. To discover x (see the two-query sketch below):
  – sum(x, y_1, …, y_k) − sum(y_1, …, y_k) = x.
• Issue: overlap. In general, restricting overlap alone is not enough:
  – the number of queries must also be restricted;
  – note that overlap itself sometimes restricts the number of queries (e.g., with query size cn and constant overlap, only about 1/c queries are possible).
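The two-query differencing attack from the second bullet, concretely (toy values):

```python
data = {"x": 7, "y1": 3, "y2": 4, "y3": 9}          # x is the target value
big   = sum(data[k] for k in ("x", "y1", "y2", "y3"))  # sum(x, y1, ..., yk)
small = sum(data[k] for k in ("y1", "y2", "y3"))       # sum(y1, ..., yk)
print(big - small)                                      # 7 = x, despite large queries
```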

Page 42: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Restricting Set-Sum Queries

• Restrict the sum queries based on:
  – the number of database elements in the sum,
  – the overlap with previous sum queries,
  – the total number of queries.
• Note that these criteria are known to the user — they do not depend on the contents of the database.
• Therefore the user can simulate the denial/no-denial answer given by the DB: this is simulatable auditing (see the sketch below).

Page 43: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Restricting Overlap and Number of Queries

• Assume:
  – every query satisfies |Q_i| ≥ k;
  – pairwise overlap satisfies |Q_i ∩ Q_j| ≤ r;
  – the adversary knows at most L values a-priori, with L + 1 < k.
• Claim: the data cannot be compromised with fewer than 1 + (2k − L)/r sum queries.

[Figure: the queries as a 0/1 matrix times the vector (x_1, …, x_n): rows Q_1, …, Q_t each contain ≥ k ones, and any two rows share ≤ r ones.]

Page 44: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Overlap + Number of Queries

Claim [Dobkin, Jones, Lipton; Reiss]: the data cannot be compromised with fewer than 1 + (2k − L)/r sum queries (k = minimum query size, r = maximum overlap, L = number of a-priori known items).

• Suppose x_c is compromised after t queries, each query represented by Q_i = x_{i_1} + x_{i_2} + … + x_{i_k}, for i = 1, …, t.
• Then x_c is a linear combination of the query answers:
  – x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{i_j}.
  – Let η_{iℓ} = 1 if x_ℓ is in query i, and 0 otherwise.
  – x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ.

Page 45: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Overlap + Number of Queries

We have: x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ.

• For x_c to be compromised, the coefficient Σ_{i=1..t} α_i η_{iℓ} must be 0 for every x_ℓ except x_c.
• For a given ℓ this happens iff η_{iℓ} = 0 for all i, or there are i ≠ j with η_{iℓ} = η_{jℓ} = 1 whose coefficients α_i, α_j have opposite signs (the contributions cancel) — or α_i = 0, in which case the i-th query didn’t matter.

Page 46: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Overlap + Number of Queries

• W.l.o.g. the first query contains x_c, and the second query has a coefficient of the opposite sign.
• The first query probes k elements.
• The second query adds at least k − r new elements.
• Elements from the first and second queries cannot be canceled within the same additional query (the opposite signs prevent it).
• Therefore each new query cancels items from the first query or from the second, but not from both.
• In total, 2k − r − L elements need to be canceled, and each additional query cancels at most r of them:
  – so 2 + (2k − r − L)/r queries are needed, i.e. 1 + (2k − L)/r.

Page 47: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Notes

• The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small:
  – If k = n/c for some constant c and r = O(1), there are only ~c queries in which no two overlap by more than r.
  – Hence the allowed query sequence may be uncomfortably short.
  – Alternatively, if r = k/c (the overlap is a constant fraction of the query size), then the number of allowed queries, 1 + (2k − L)/r, is O(c).

Page 48: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Conclusions

• Privacy should be defined and analyzed rigorously.
  – In particular, assuming that randomization ⇒ privacy is dangerous.
• High perturbation is needed for privacy against polynomial-time adversaries.
  – A threshold phenomenon: above √n, total privacy; below √n, no privacy (for a poly-time adversary).
  – Main tool: a reconstruction algorithm.
• Careless auditing might leak private information.
• Self-auditing (simulatable auditors) is safe:
  – the decision whether to allow a query is based on previous ‘good’ queries and their answers, without access to the DB contents;
  – users may apply the decision procedure by themselves.

Page 49: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


ToDo

• Come up with a good model and requirements for database privacy.
  – Learn from crypto.
  – Protect against more general loss of privacy.
• Simulatable auditors are a starting point for designing more reasonable audit mechanisms.

Page 50: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


References

• Course web page: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html
• Privacy and Databases: http://theory.stanford.edu/~rajeev/privacy.html

Page 51: Privacy Preserving Data Mining Lecture 3 Non-Cryptographic Approaches for Preserving Privacy


Foundations of CS at the Weizmann Institute

• Uri Feige, Oded Goldreich, Shafi Goldwasser, David Harel, Moni Naor, David Peleg, Amir Pnueli, Ran Raz, Omer Reingold, Adi Shamir
• All students receive a fellowship.
• Language of instruction: English.

(In the original slide, the crypto researchers’ names are highlighted in yellow.)