
Page 1:

IPAM 2010

Privacy Protection from Sampling and Perturbation in Surveys

Natalie Shlomo and Chris Skinner

Southampton Statistical Sciences Research Institute

Page 2:

Topics:

1. Social surveys in official statistics
2. Releasing survey microdata
3. Disclosure limitation methods for survey microdata
4. Assessing disclosure risk in survey microdata
5. Misclassification/perturbation
6. Differential privacy definition
7. Differential privacy under:
   - sampling
   - misclassification/perturbation
   - sampling and misclassification/perturbation
8. Conclusions

Page 3:

Social Surveys in Official Statistics

• Examples: Labour Force Survey, Family Expenditure Survey, Social Survey

• Unit of investigation: household or individuals

• Sample fractions are small (generally from about 0.01% to 0.05%)

• Multistage sampling:
  – PSU is an address
  – SSU is a household
  – USU is an individual (or all individuals) in a household

• Population characteristics are NOT known

Page 4:

Releasing Survey Microdata

Agencies release microdata arising from social surveys:

• Safe settings – data is licensed out to users through data archives, microdata under contract, safe data labs

• Safe data – depending on the mode of release, Agencies protect the data through disclosure limitation methods

Page 5:

Disclosure Limitation Methods in Survey Microdata

• Sampling from the population
• Broad banding, e.g. age groups, high-level geographies
• Recoding by collapsing categories, e.g. top or bottom categories
• Categorizing continuous variables, e.g. income or expenditures
• Deletion of high-risk variables

Page 6:

Disclosure Limitation Methods in Survey Microdata

For high-risk records with respect to population uniqueness, Agencies might add a perturbative method:

• Delete-impute method
• Post-randomisation method (PRAM)
• Data swapping

Reweighting and post-editing to preserve logical consistencies

Page 7:

Assessing disclosure risk in sample microdata

Assumptions:

• Two types of variables:

Key variables - categorical variables that are identifying, visible, traceable and/or common with publicly available datasets containing population units

Key – the cells of the cross-classification of the key variables, k = 1,…,K

Sensitive variables – all other variables in the dataset

Page 8:

Assessing disclosure risk in sample microdata

Assumptions (cont):

• Main concern is risk of identification

- Notion of population uniqueness, e.g. the ability to link a unit in the microdata to a unit in the population through key variables which may be known to an intruder

- Risk of identification used in statistics acts, confidentiality pledges to respondents and codes of practice

- Identification is a pre-requisite for attribute and inferential disclosure from sensitive variables (what an intruder may learn)

Page 9:

Assessing disclosure risk in sample microdata

Assumptions (cont):

• Intruder has no direct knowledge of what units fall into the sample and sampling is part of the SDL mechanism

• Intruder has prior knowledge relating to some aspects of the population

• First assume no misclassification/perturbation error on the key variables and an SDL method based on non-perturbative techniques (we later drop this assumption)

Page 10:

Assessing disclosure risk in sample microdata

Notation:

U: population with elements i = 1,…,N

s: sample drawn with a specified probability sampling scheme p(s) from U, containing elements i = 1,…,n

Let $x_i$ be the $1 \times K$ vector indicating which cell population unit i belongs to

Denote by $e_j$ the $1 \times K$ vector with a 1 in the jth position and zeros everywhere else, so that for each i = 1,…,N we have $x_i = e_j$ for some $j \in \{1, 2, \dots, K\}$

Page 11:

Assessing disclosure risk in sample microdata

$X_U$: matrix of dimension $N \times K$ with rows $x_i$ for the elements of U

$X_s$: matrix of dimension $n \times K$ with rows for the elements of s

$X_s$ is the data known to the Agency; $X_U$ includes values of variables from non-sampled units, which are typically unknown.

Agency releases: $\mathbf f = (f_1, \dots, f_K)^T = X_s^T \mathbf 1_n$

Unknown: $\mathbf F = (F_1, \dots, F_K)^T = X_U^T \mathbf 1_N$
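A minimal sketch of this notation in Python (not from the talk; the cell memberships, population size and sampled rows are invented for illustration):

import numpy as np

# Hypothetical example: K = 3 cells, population of N = 6 units.
# Each row of X_U is some e_j, indicating the cell the unit falls in.
X_U = np.array([[1, 0, 0],
                [1, 0, 0],
                [0, 1, 0],
                [0, 1, 0],
                [0, 1, 0],
                [0, 0, 1]])
X_s = X_U[[0, 2]]                   # rows of the (illustrative) sampled units, n = 2

F = X_U.T @ np.ones(X_U.shape[0])   # population cell counts F = X_U' 1_N (unknown)
f = X_s.T @ np.ones(X_s.shape[0])   # released sample cell counts f = X_s' 1_n
print(F, f)                         # [2. 3. 1.] [1. 1. 0.]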

Page 12:

Assessing disclosure risk in sample microdata

For identity disclosure we are interested in evidence that a particular record belongs to a known individual.

Disclosure risk measures:

Probability of a population unique given a sample unique: $p(F_k = 1 \mid f_k = 1)$

Expected number of correct matches for a sample unique: $E(1/F_k \mid f_k = 1)$

Aggregate to obtain global risk measures.

Page 13:

Assessing disclosure risk in sample microdata

Estimation based on a probabilistic model since population counts are unknown.

Natural assumption in the contingency table literature: $F_k \sim \mathrm{Pois}(\lambda_k)$

From sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, so $f_k \sim \mathrm{Pois}(\pi_k \lambda_k)$

and conditionally: $F_k \mid f_k \sim f_k + \mathrm{Pois}(\lambda_k (1 - \pi_k))$, independently across cells.

Estimation by:

$p(F_k = 1 \mid f_k = 1) = e^{-\lambda_k (1 - \pi_k)}$

$E(1/F_k \mid f_k = 1) = \big(1 - e^{-\lambda_k (1 - \pi_k)}\big) \big/ \big(\lambda_k (1 - \pi_k)\big)$
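A minimal sketch of these two measures in Python (function names are mine; the $\lambda_k$ grid and sampling fraction are invented):

import numpy as np

def p_unique(lam, pi):
    # Pr(F_k = 1 | f_k = 1) = exp(-lambda_k (1 - pi_k))
    return np.exp(-lam * (1.0 - pi))

def exp_inv_match(lam, pi):
    # E(1/F_k | f_k = 1) = (1 - exp(-lambda_k (1 - pi_k))) / (lambda_k (1 - pi_k))
    t = lam * (1.0 - pi)
    return (1.0 - np.exp(-t)) / t

lam = np.array([0.1, 0.5, 1.0, 1.6])
print(p_unique(lam, pi=0.02))         # risk falls as lambda_k grows
print(exp_inv_match(lam, pi=0.02))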

Page 14:

Assessing disclosure risk in sample microdata

From $f_k \sim \mathrm{Pois}(\mu_k)$, $\mu_k = \pi_k \lambda_k$, use log-linear models to estimate parameters: $\log \mu_k = x_k^T \beta$

based on the contingency table of sample counts spanned by the identifying key variables.

Estimated parameters: $\hat\mu_k = \exp(x_k^T \hat\beta)$ and $\hat\lambda_k = \hat\mu_k / \pi_k$

Plug into the estimated risk measures.
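A minimal sketch of this estimation step (not from the talk: the data are simulated, the model is a main-effects log-linear model, and a common sampling fraction is assumed):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated contingency table over two key variables (4 x 3 = 12 cells)
cells = pd.DataFrame([(a, b) for a in range(4) for b in range(3)], columns=["A", "B"])
f_k = rng.poisson(3.0, size=len(cells))    # sample cell counts
pi_k = 0.02                                # common sampling fraction (assumption)

# Main-effects log-linear model: log mu_k = x_k' beta
X = pd.get_dummies(cells.astype("category"), drop_first=True).astype(float)
X = sm.add_constant(X)
fit = sm.GLM(f_k, X, family=sm.families.Poisson()).fit()

mu_hat = fit.fittedvalues                  # mu_hat_k = exp(x_k' beta_hat)
lam_hat = mu_hat / pi_k                    # lambda_hat_k = mu_hat_k / pi_k
risk = np.exp(-lam_hat * (1 - pi_k))       # plug-in Pr(F_k = 1 | f_k = 1)
print(risk[f_k == 1])                      # risk evaluated at the sample uniques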

Page 15:

Misclassification / Perturbation

Misclassification arises naturally from survey processes

Perturbation introduced as a SDL technique

Assume $\tilde X_s$ is a misclassified or perturbed version of $X_s$.

The SDL technique leads to an arbitrary ordering of the records in the microdata, so the released data is $\tilde{\mathbf f} = (\tilde f_1, \dots, \tilde f_K)^T = \tilde X_s^T \mathbf 1_n$

Misclassification/perturbation is assumed generated through a probability mechanism:

$\Pr(\tilde x_i = e_{j_2} \mid x_i = e_{j_1}) = M_{j_1 j_2}, \quad i = 1,\dots,n, \quad j_1, j_2 = 1,\dots,K$
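A minimal sketch of drawing perturbed categories from the rows of M (the 3-category matrix here is invented):

import numpy as np

rng = np.random.default_rng(1)

K = 3
M = np.array([[0.90, 0.05, 0.05],    # M[j1, j2] = Pr(x_i~ = e_j2 | x_i = e_j1)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

x = rng.integers(0, K, size=1000)                        # true cell of each record
x_tilde = np.array([rng.choice(K, p=M[j]) for j in x])   # perturbed cells
print(np.bincount(x, minlength=K))
print(np.bincount(x_tilde, minlength=K))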

Page 16:

Assessing disclosure risk in perturbed sample microdata

Let the rows of $\tilde X_s$ be denoted $\tilde X_j$ for $j \in \{1,\dots,n\}$

Let their identities be denoted by $B_1, \dots, B_n$

Let j(i) be the microdata label for population unit i.

We are interested in: $p_{ij} = \Pr(B_j = i), \quad i = 1,\dots,N, \; j = 1,\dots,n$

Page 17:

Assessing disclosure risk in perturbed sample microdata

Suppose we observe $\tilde X_j$ (with identity $B_j$) and $X_D$ for a known population unit $D \in \{1,2,\dots,N\}$

Identification risk: $\Pr(E_D) \big/ \Pr\big(\bigcup_{i \in U} E_i\big)$

where $E_i$ is the event that:
- population unit i is sampled
- its value $\tilde X_{j(i)}$ matches $X_D$
- no other population unit is both sampled and has a value of $\tilde X$ which matches $X_D$

Page 18:

Assessing disclosure risk in perturbed sample microdata

Let $X_D = e_j$ and $\tilde X_{j(i)} = e_k$, and suppose units are sampled independently with inclusion probability $\pi$, so that $\Pr(E_i) \propto \pi M_{jk} / (1 - \pi M_{jk})$

Identification risk:

$\big[\pi M_{jj}/(1 - \pi M_{jj})\big] \Big/ \Big[\textstyle\sum_k F_k\, \pi M_{kj}/(1 - \pi M_{kj})\Big] \;\approx\; M_{jj} \big/ \textstyle\sum_k F_k M_{kj} \;=\; M_{jj} / \tilde F_j$

where $\tilde F_j = \sum_l F_l M_{lj}$ and $F_l$ is the number of units in the population with $X = e_l$

Estimated by $E(1/\tilde F_j \mid \tilde f_j = 1)$ as before, multiplied by $M_{jj}$
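A small numeric sketch of the approximation $M_{jj}/\tilde F_j$ (the matrix and population counts are invented):

import numpy as np

M = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
F = np.array([1.0, 40.0, 200.0])   # population counts F_l by cell

F_tilde = F @ M                    # F_j~ = sum_l F_l M[l, j]
risk = np.diag(M) / F_tilde        # approximate identification risk M_jj / F_j~
print(risk)                        # highest for the rare cell j = 0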

Page 19:

Misclassification / Perturbation Matrix

Examples:

Natural misclassification errors – measured by quality studies, e.g. miscoding in data capture, misreporting by respondents, etc.

Perturbation errors (categorical variables):

Random Data Swapping:

Probability of selecting any 2 records for swapping: $1 \big/ \binom{n}{2}$

Let $n_j$ and $n_k$ be the number of records taking values j and k; then

$M_{jk} = n_k \big/ \binom{n}{2}$ for $j \ne k$, and $M_{jj} = 1 - (n - n_j) \big/ \binom{n}{2}$

After Q swaps: $M^Q$
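A minimal sketch of this swap matrix as reconstructed above (category counts invented), with the Q-swap matrix computed as a matrix power:

import numpy as np

n_cat = np.array([10.0, 30.0, 60.0])           # n_j records per category
n = n_cat.sum()
pairs = n * (n - 1) / 2                        # C(n, 2)

M = np.tile(n_cat / pairs, (len(n_cat), 1))    # M[j, k] = n_k / C(n,2) for j != k
np.fill_diagonal(M, 1 - (n - n_cat) / pairs)   # M[j, j] so that each row sums to 1
assert np.allclose(M.sum(axis=1), 1.0)

Q = 50
M_Q = np.linalg.matrix_power(M, Q)             # transition matrix after Q swaps
print(M_Q.round(3))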

Page 20:

Misclassification / Perturbation

Examples (cont):

PRAM: use a probability mechanism $M = \{M_{jk}\}$ to make random changes across categories.

Define the property of invariance on M: $M^T v = v$, where $v = (n_1/n, \dots, n_K/n)^T$ is the vector of sample proportions,

to ensure that the perturbed marginal distribution is similar to the original marginal distribution.
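A minimal sketch of the invariance check; the construction $M = (1-\alpha)I + \alpha \mathbf 1 v^T$ is one simple invariant choice of my own, not from the talk:

import numpy as np

v = np.array([0.2, 0.3, 0.5])       # sample proportions n_j / n (invented)
alpha = 0.1

# M = (1 - alpha) I + alpha 1 v^T: each row is a distribution and M'v = v
M = (1 - alpha) * np.eye(len(v)) + alpha * np.outer(np.ones(len(v)), v)

assert np.allclose(M.sum(axis=1), 1.0)   # valid transition matrix
assert np.allclose(M.T @ v, v)           # invariance: M'v = v
print(M)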

Page 21:

Misclassification / Perturbation

Example of a non-perturbative method:

Recoding categories 1 to a into category 1:

$M_{jk} = 1$ if ($j = 1,\dots,a$ and $k = 1$) or ($j > a$ and $k = j$); $M_{jk} = 0$ otherwise
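A minimal sketch of this deterministic matrix (K = 5 and a = 3 are invented); note that it contains zero entries, which matters for differential privacy on Page 31:

import numpy as np

K, a = 5, 3
M = np.zeros((K, K))
M[:a, 0] = 1.0              # categories 1..a are recoded into category 1
for j in range(a, K):
    M[j, j] = 1.0           # categories above a are left unchanged
print(M)                    # contains zeros, so no finite epsilon exists (Page 31)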

Page 22:

Differential privacy definition

Dwork et al. (2006) refer to a database and attribute (inferential) disclosure of a target unit, where the intruder has knowledge of the entire database except for the target unit.

No distinction between key variables and sensitive variables.

In a survey setting, two notions of database: $X_s$ and $X_U$

- $X_s$ defines the data collected
- $X_U$ includes non-sampled units, which are unknown to the agency

Page 23:

Differential privacy definition

Assume:

- Intruder has no knowledge of who is in the sample
- Sampling is part of the SDL mechanism
- Intruder's prior knowledge relates to $X_U$

Let $\Pr(f \mid X_U)$ denote the probability of f under the sampling mechanism, where $X_U$ is fixed.

$\epsilon$-differential privacy holds if, for some $\epsilon \ge 0$:

$\max \left| \ln \dfrac{\Pr(f \mid X_U^{(1)})}{\Pr(f \mid X_U^{(2)})} \right| \le \epsilon$

where the maximum is over all pairs $(X_U^{(1)}, X_U^{(2)})$ which differ in one row, and over all possible values of f.

Page 24:

Differential privacy definition

Let $S(f)$ denote the set of samples s, with $f_s = X_s^T \mathbf 1_n$, for which $p(s) > 0$ and $f_s = f$

Then $\Pr[f \mid X_U] = \sum_{s \in S(f)} p(s)$

Example: simple random sampling of size n:

$\Pr[f \mid X_U] = \prod_j \binom{F_j}{f_j} \Big/ \binom{N}{n}$
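A direct sketch of the SRS formula (population counts invented):

from math import comb, prod

def prob_f_given_XU(f, F, n):
    # Pr[f | X_U] = prod_j C(F_j, f_j) / C(N, n) under SRS of size n
    N = sum(F)
    if sum(f) != n or any(fj > Fj for fj, Fj in zip(f, F)):
        return 0.0
    return prod(comb(Fj, fj) for Fj, fj in zip(F, f)) / comb(N, n)

F = [3, 5, 2]                               # population cell counts, N = 10
print(prob_f_given_XU([1, 2, 0], F, n=3))   # 3 * 10 * 1 / 120 = 0.25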

Page 25:

Does Sampling ensure differential privacy?

Let $i_0$ be a target unit and suppose the intruder wishes to learn about $x_{i_0}$

Let $F_j^{(i_0)}$ be the count in cell j in the population excluding $i_0$, so $F_j = F_j^{(i_0)} + I(x_{i_0} = e_j)$

Intruder knows all units in the population except the target unit, so $F_j^{(i_0)}$ is known

Suppose the intruder knows that $f_j = F_j^{(i_0)} + 1$

Then, since $f_j \le F_j$, the intruder can infer $x_{i_0} = e_j$ and attribute disclosure arises:

$\epsilon$-differential privacy does not hold

Page 26:

Differential privacy under sampling

Is $\epsilon$-differential privacy too strong a condition?

1. Small sample fractions in social surveys

2. Proportions are typically released with sufficiently wide confidence intervals to ensure uncertainty, i.e. a sample count is generally not equal to the population count

(Dwork et al. (2006) discuss the low 'sensitivity' of a sample proportion, requiring only small perturbation)

3. All possible 'population databases' are considered (even ones that are implausible)

Page 27:

Differential privacy under sampling

4. The population database is not known to the Agency, and an intruder would not know the whole database except for one target unit (the population is a random variable)

5. All possible samples are considered when in fact only one is drawn (the sample is fixed), to which SDL methods are applied

To obtain $\epsilon$-differential privacy we need to avoid the case where $f_j = F_j$, and in particular there must be no population uniques

Page 28:

Differential privacy under sampling

How likely is it to get a population unique (leakage)?

Assume: N = 100,000, n = 2,000 (2% sample); 11 key variables generated in {0,1} with Bernoulli(0.2)

Count the number of cases where $f_j = F_j \ge 1$ (typically only relevant for $f_j = F_j = 1$) and calculate the proportion of population uniques out of sample uniques

Average across 1000 populations and samples: 'leakage' = 0.016
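A minimal sketch of this simulation (a reduced number of replications so it runs quickly; the talk averages over 1000):

import numpy as np

rng = np.random.default_rng(2)
N, n, reps = 100_000, 2_000, 20

leak = []
for _ in range(reps):
    # 11 binary key variables ~ Bernoulli(0.2) -> cell index in 0..2**11 - 1
    keys = (rng.random((N, 11)) < 0.2).astype(np.int64)
    cell = keys @ (2 ** np.arange(11))
    F = np.bincount(cell, minlength=2**11)               # population cell counts
    f = np.bincount(rng.choice(cell, n, replace=False),  # SRS of size n
                    minlength=2**11)
    su = f == 1                                          # sample uniques
    leak.append(((f == F) & su).sum() / max(su.sum(), 1))

print(np.mean(leak))   # the talk reports 'leakage' = 0.016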

Page 29:

Differential privacy under sampling

Population not known, and the probability of a population unique given a sample unique under SRS is estimated by:

$\Pr(F_k = 1 \mid f_k = 1) = e^{-\lambda_k (1 - \pi_k)}$

[Figure: 'Probability of Population Unique (1:50 Sample)', plotting Pr(PU | SU) on a 0 to 1 scale against $\lambda_k$ ranging from 0 to 1.6]

Page 30:

Differential privacy under perturbation

Assume no sampling and independent misclassification:

$\Pr[\tilde X_U \mid X_U] = \prod_{i \in U} \Pr(\tilde x_i \mid x_i)$

Suppose $X_U^{(1)}$ differs from $X_U^{(2)}$ only in row i, so that $x_i^{(1)} \ne x_i^{(2)}$. Then

$\dfrac{\Pr[\tilde X_U \mid X_U^{(1)}]}{\Pr[\tilde X_U \mid X_U^{(2)}]} = \dfrac{\Pr(\tilde x_i \mid x_i^{(1)})}{\Pr(\tilde x_i \mid x_i^{(2)})} = \dfrac{M_{j^{(1)} \tilde j}}{M_{j^{(2)} \tilde j}}$

where $\tilde j$, $j^{(1)}$, $j^{(2)}$ are the cells indicated by $\tilde x_i$, $x_i^{(1)}$, $x_i^{(2)}$

Page 31:

Differential privacy under perturbation

There exists a finite $\epsilon$ for which $\epsilon$-differential privacy holds iff all elements of M are positive.

Write f before and after the change in one row as $f^{(1)}$ and $f^{(2)}$, where $|f^{(1)} - f^{(2)}| \le 2$ (Abowd and Vilhuber, 2008). Then

$\max \ln \dfrac{\Pr[\tilde{\mathbf f} \mid f^{(1)}]}{\Pr[\tilde{\mathbf f} \mid f^{(2)}]} \le \max \ln \dfrac{\Pr[\tilde X_U \mid X_U^{(1)}]}{\Pr[\tilde X_U \mid X_U^{(2)}]} = \max_{\tilde j,\, j^{(1)},\, j^{(2)}} \ln \dfrac{M_{j^{(1)} \tilde j}}{M_{j^{(2)} \tilde j}} = \max_{\tilde j} \Big[ \ln \big( \max_j M_{j \tilde j} \big) - \ln \big( \min_j M_{j \tilde j} \big) \Big]$
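A minimal sketch computing this bound for a given M (the matrix is the invented 3-category example used earlier):

import numpy as np

def dp_epsilon(M):
    # max over columns j~ of ln(max_j M[j, j~]) - ln(min_j M[j, j~]);
    # finite only when every element of M is strictly positive
    M = np.asarray(M, dtype=float)
    if (M <= 0).any():
        return float("inf")
    return float(np.max(np.log(M.max(axis=0)) - np.log(M.min(axis=0))))

M = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
print(dp_epsilon(M))    # ln(0.90 / 0.05) ~ 2.89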

Page 32:

Differential privacy under perturbation

Considering categorical key variables, some forms of random data swapping, PRAM and other synthetic data generating methods preserve differential privacy

Non-perturbative methods do not preserve differential privacy

Remark: to preserve the logical consistency of the data, edit rules are often imposed in the form of structural zeros in M, in which case no finite $\epsilon$ can be attained

Page 33:

Differential Privacy under sampling and perturbation

Assume a two-step process:

- selection of $X_s$, with probability $p(s) > 0$
- generation of $\tilde X_s$

Write:

$\Pr[\tilde{\mathbf f} \mid X_U] = \sum_s p(s) \sum_{x:\, x^T \mathbf 1_n = \tilde{\mathbf f}} p(\tilde X_s = x \mid X_s)$

Differential privacy will not hold for any $\epsilon$ iff there is a pair $(X_U^{(1)}, X_U^{(2)})$ which differ only in one row and an $\tilde{\mathbf f}$ for which $\Pr[\tilde{\mathbf f} \mid X_U^{(1)}] = 0$ and $\Pr[\tilde{\mathbf f} \mid X_U^{(2)}] > 0$

If all elements of M are positive, then $p(\tilde X_s = x \mid X_s) > 0$ for any s such that $p(s) > 0$, so $\Pr[\tilde{\mathbf f} \mid X_U] > 0$ for every $\tilde{\mathbf f}$ with $\tilde{\mathbf f}^T \mathbf 1_K = n$, and differential privacy holds

Page 34:

Conclusions

Sampling does not preserve differential privacy when sample counts equal population counts

- Agencies generally avoid these cases through small sampling fractions and by implementing SDL methods

Misclassification/perturbation preserves differential privacy if the probability mechanism M has all positive elements