IPAM 2010
Privacy Protection from Sampling and Perturbation in Surveys
Natalie Shlomo and Chris Skinner
Southampton Statistical Sciences Research Institute
Topics:
1. Social surveys in official statistics
2. Releasing survey microdata
3. Disclosure limitation methods for survey microdata
4. Assessing disclosure risk in survey microdata
5. Misclassification/perturbation
6. Differential privacy definition
7. Differential privacy under:
   – Sampling
   – Misclassification/perturbation
   – Sampling and misclassification/perturbation
8. Conclusions
Social Surveys in Official Statistics
• Examples: Labour Force Survey, Family Expenditure Survey, Social Survey
• Unit of investigation: households or individuals
• Sampling fractions are small (generally from about 0.01% to 0.05%)
• Multistage sampling:
  – PSU is an address
  – SSU is a household
  – USU is an individual (or all individuals) in a household
• Population characteristics are NOT known
Releasing Survey Microdata
Agencies release microdata arising from social surveys:
• Safe settings – data is licensed out to users through data archives, microdata under contract, safe data labs
• Safe data – depending on the mode of release, Agencies protect data through disclosure limitation methods
Disclosure Limitation Methods in Survey Microdata
• Sampling from the population
• Broad banding, e.g. age groups, high-level geographies
• Recoding by collapsing categories, e.g. top or bottom categories
• Categorizing continuous variables, e.g. income or expenditures
• Deletion of high-risk variables
Disclosure Limitation Methods in Survey Microdata
For high-risk records with respect to population uniqueness, Agencies might add a perturbative method:
• Delete-impute method
• Post-randomisation method (PRAM)
• Data swapping
Reweighting and post-editing are used to preserve logical consistencies.
Assessing disclosure risk in sample microdata
Assumptions:
• Two types of variables:
Key variables - categorical variables that are identifying, visible, traceable and/or common with publicly available datasets containing population units
Key – the cells of the cross-classification of the key variables, k=1,…,K
Sensitive variables – all other variables in the dataset
Assessing disclosure risk in sample microdata
Assumptions (cont):
• Main concern is risk of identification
- Notion of population uniqueness, e.g. the ability to link a unit in the microdata to a unit in the population through key variables which may be known to an intruder
- Risk of identification is used in statistics acts, confidentiality pledges to respondents and codes of practice
- Identification is a pre-requisite for attribute and inferential disclosure from sensitive variables (what an intruder may learn)
Assessing disclosure risk in sample microdata
Assumptions (cont):
• Intruder has no direct knowledge of what units fall into the sample and sampling is part of the SDL mechanism
• Intruder has prior knowledge relating to some aspects of the population
• First assume no misclassification/perturbation error on the key variables and SDL method based on non-perturbative methods (we later drop this assumption)
Assessing disclosure risk in sample microdata

Notation:
U – population with elements i=1,…,N
s – sample drawn with a specified probability sampling scheme p(s) from U, containing elements i=1,…,n

Let $x_i$ be the $1 \times K$ vector indicating which cell population unit i belongs to.

Denote by $e_j$ the $1 \times K$ vector with a 1 in the jth position and zeros everywhere else, so that for each i=1,…,N we have $x_i = e_j$ for some $j \in \{1,2,\dots,K\}$.
Assessing disclosure risk in sample microdata

$X_U$ – matrix of dimension $N \times K$ with rows $x_i$ for elements in U
$X_s$ – matrix of dimension $n \times K$ with rows $x_i$ for elements of s

$X_s$ is the data known to the Agency; $X_U$ includes values for non-sampled units, which are typically unknown.

Agency releases the sample counts:
$$f = (f_1,\dots,f_K)^T = X_s^T 1_n$$

The population counts
$$F = (F_1,\dots,F_K)^T = X_U^T 1_N$$
are unknown.
Assessing disclosure risk in sample microdata

For identity disclosure, we are interested in evidence that a particular record belongs to a known individual.

Disclosure risk measures:

Probability of a population unique given a sample unique:
$$\Pr(F_k = 1 \mid f_k = 1)$$

Expected number of correct matches for a sample unique:
$$E(1/F_k \mid f_k = 1)$$

Aggregate to obtain global risk measures.
Assessing disclosure risk in sample microdata

Estimation is based on a probabilistic model since population counts are unknown.

Natural assumption in the contingency table literature:
$$F_k \sim \mathrm{Pois}(\lambda_k)$$

From sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, so $f_k \sim \mathrm{Pois}(\pi_k \lambda_k)$

and, conditionally independently of $f_k$:
$$F_k \mid f_k \sim f_k + \mathrm{Pois}(\lambda_k (1 - \pi_k))$$

Estimation by:
$$\Pr(F_k = 1 \mid f_k = 1) = e^{-\lambda_k (1 - \pi_k)}$$
$$E(1/F_k \mid f_k = 1) = \frac{1 - e^{-\lambda_k (1 - \pi_k)}}{\lambda_k (1 - \pi_k)}$$
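As a minimal sketch (my own, not from the slides), these two per-cell measures can be computed directly once $\lambda_k$ and $\pi_k$ are given; the function name and example values are hypothetical.

```python
import numpy as np

def risk_measures(lam, pi):
    """Per-cell disclosure risk measures under the Poisson/Binomial model.

    lam : Poisson rate lambda_k of the population count F_k
    pi  : sampling (inclusion) probability pi_k, assumed < 1

    Returns Pr(F_k = 1 | f_k = 1) and E(1/F_k | f_k = 1).
    """
    mu = lam * (1.0 - pi)               # rate of the unobserved remainder F_k - f_k
    p_unique = np.exp(-mu)              # Pr(population unique | sample unique)
    e_match = (1.0 - np.exp(-mu)) / mu  # expected correct-match probability
    return p_unique, e_match

# Example: lambda_k = 1.5, 2% sampling fraction
print(risk_measures(1.5, 0.02))
```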
Assessing disclosure risk in sample microdata

From $f_k \sim \mathrm{Pois}(\mu_k)$, $\mu_k = \pi_k \lambda_k$, use log-linear models to estimate the parameters:
$$\log \mu_k = x_k' \beta$$
based on the contingency table of sample counts spanned by the identifying key variables.

Estimated parameters: $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$

Plug into the estimated risk measures.
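A sketch of this estimation step (my own illustration, assuming a statsmodels Poisson GLM; the key variables and counts below are hypothetical): fit a log-linear model to the sample contingency table, then rescale by the sampling fraction.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical sample contingency table over two key variables.
# 'count' holds the sample cell counts f_k.
table = pd.DataFrame({
    "age_group": ["a1", "a1", "a2", "a2"],
    "region":    ["r1", "r2", "r1", "r2"],
    "count":     [3, 0, 5, 1],
})

# Design matrix for a main-effects (independence) log-linear model.
X = pd.get_dummies(table[["age_group", "region"]], drop_first=True)
X = sm.add_constant(X.astype(float))

# Fit f_k ~ Pois(mu_k) with log mu_k = x_k' beta.
fit = sm.GLM(table["count"], X, family=sm.families.Poisson()).fit()
mu_hat = fit.fittedvalues           # estimated mu_k
pi = 0.02                           # sampling fraction (assumed known)
lam_hat = mu_hat / pi               # estimated lambda_k

# Plug into the per-cell risk measure for sample uniques:
risk = np.exp(-lam_hat * (1 - pi))  # Pr(F_k = 1 | f_k = 1)
print(risk[table["count"] == 1])
```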
Misclassification / Perturbation

Misclassification arises naturally from survey processes; perturbation is introduced as an SDL technique.

Assume $\tilde{X}_s$ is a misclassified or perturbed version of $X_s$.

The SDL technique leads to an arbitrary ordering of the records in the microdata, so the released data is $\tilde{X}_s$, with counts
$$\tilde{f} = (\tilde{f}_1,\dots,\tilde{f}_K)^T = \tilde{X}_s^T 1_n$$

Misclassification/perturbation is assumed to be generated through a probability mechanism:
$$\Pr(\tilde{x}_i = e_{j_2} \mid x_i = e_{j_1}) = M_{j_1 j_2}, \quad i = 1,\dots,n, \quad j_1, j_2 = 1,\dots,K$$
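A minimal sketch (my own) of this mechanism: each record's key-cell index is redrawn from the row of M for its true cell. The matrix and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(cells, M):
    """Apply the mechanism Pr(x~ = e_j2 | x = e_j1) = M[j1, j2].

    cells : array of true cell indices (0-based), one per record
    M     : K x K row-stochastic misclassification matrix
    """
    K = M.shape[0]
    return np.array([rng.choice(K, p=M[j]) for j in cells])

# Example: K = 3 cells, 10% chance of moving to each other cell.
M = np.full((3, 3), 0.1) + np.eye(3) * 0.7
cells = np.array([0, 0, 1, 2, 2])
print(perturb(cells, M))
```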
Assessing disclosure risk in perturbed sample microdata

Let the rows of $\tilde{X}$ be denoted $\tilde{X}_j$ for $j \in \{1,\dots,n\}$.

Let their identities be denoted by $B_1,\dots,B_n$.

Let j(i) be the microdata label for population unit i.

We are interested in:
$$p_{ij} = \Pr(B_j = i), \quad i = 1,\dots,N, \quad j = 1,\dots,n$$
Assessing disclosure risk in perturbed sample microdata

Suppose we observe $\tilde{X}_j$ (with identity $B_j$) and $X_D$ for a known population unit $D \in \{1,2,\dots,N\}$.

Identification risk:
$$\Pr(E_D) \Big/ \Pr\Big(\bigcup_{i \in U} E_i\Big)$$

where $E_i$ is the event that:
- population unit i is sampled
- its value $\tilde{X}_{j(i)}$ matches $X_D$
- no other population unit is both sampled and has a value of $\tilde{X}$ which matches $X_D$
Assessing disclosure risk in perturbed sample microdata

Let $X_D = e_j$ and $\tilde{X}_{j(i)} = e_j$ (a match on cell j), with the sample selected independently with inclusion probability $\pi$.

For a population unit i with $x_i = e_k$, $\Pr(E_i) \propto \pi M_{kj} / (1 - \pi M_{kj})$.

We obtain the identification risk:
$$\frac{\Pr(E_D)}{\Pr(\bigcup_i E_i)} = \frac{\pi M_{jj}/(1 - \pi M_{jj})}{\sum_k F_k \, \pi M_{kj}/(1 - \pi M_{kj})} \approx \frac{M_{jj}}{\sum_k F_k M_{kj}} = M_{jj}/\tilde{F}_j$$

where $\tilde{F}_j = \sum_l M_{lj} F_l$ and $F_l$ is the number of units in the population with $x = e_l$.

Estimated by $E(1/\tilde{F}_j \mid \tilde{f}_j = 1)$ as before, multiplied by $M_{jj}$.
Misclassification / Perturbation Matrix
Examples:
Natural misclassification errors – measured by quality studies, eg. miscoding in data capture, misreporting by respondents, etc.
Perturbation errors (categorical variables):
Random Data Swapping:
Probability of selecting any 2 records for swapping: $1/\binom{n}{2}$

Let $n_j$ and $n_k$ be the number of records taking values j and k:
$$M_{jk} = \frac{n_k}{\binom{n}{2}}, \quad k \neq j, \qquad M_{jj} = 1 - \sum_{k \neq j} M_{jk}$$

After Q swaps: $M^Q$
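A sketch (my own reconstruction) that builds the one-swap transition matrix from the sample counts and iterates it for Q swaps:

```python
import numpy as np
from math import comb

def swap_matrix(counts, Q=1):
    """One-swap transition matrix for random data swapping, iterated Q times.

    counts : length-K array, n_j = number of records with value j
    """
    n = counts.sum()
    pairs = comb(int(n), 2)
    M = np.tile(counts / pairs, (len(counts), 1))  # M[j, k] = n_k / C(n, 2), k != j
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, 1.0 - M.sum(axis=1))       # rows must sum to 1
    return np.linalg.matrix_power(M, Q)            # after Q independent swaps

counts = np.array([10, 30, 60])
print(swap_matrix(counts, Q=5))
```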
Misclassification / Perturbation

Examples (cont):

PRAM: use a probability mechanism to make random changes across categories.

Define the property of invariance on $M = \{M_{jk}\}$:
$$M^T v = v$$
where $v = (n_1/n, \dots, n_K/n)^T$ is the vector of sample proportions, to ensure that the perturbed marginal distribution is similar to the original marginal distribution.
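A sketch (my own) that checks the invariance property for a candidate M; the matrix below was built by hand to preserve this particular v:

```python
import numpy as np

def is_invariant(M, v, tol=1e-10):
    """Check the invariance property M^T v = v for sample proportions v."""
    return np.allclose(M.T @ v, v, atol=tol)

v = np.array([0.2, 0.3, 0.5])
M = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.02, 0.03, 0.95],
])
print(is_invariant(M, v))  # True: perturbed marginals match the original
```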
Misclassification / Perturbation

Example of a non-perturbative method:

Recoding categories 1 to a into category 1:
$$M_{jk} = \begin{cases} 1 & k = 1 \text{ and } j = 1,\dots,a, \text{ or } k = j \text{ and } j > a \\ 0 & \text{otherwise} \end{cases}$$
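A sketch (my own) of this recoding matrix; note the structural zeros, which matter for the differential privacy results below:

```python
import numpy as np

def recode_matrix(K, a):
    """M for recoding categories 1..a into category 1 (1-based, as on the slide)."""
    M = np.zeros((K, K))
    M[:a, 0] = 1.0        # categories 1..a map to category 1 with certainty
    for j in range(a, K):
        M[j, j] = 1.0     # categories above a are left unchanged
    return M

print(recode_matrix(K=5, a=3))
```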
Differential privacy definition

Dwork et al. (2006) refer to a database and attribute (inferential) disclosure of a target unit, where the intruder has knowledge of the entire database except for the target unit.

No distinction is made between key variables and sensitive variables.

In a survey setting there are two notions of database, $X_s$ and $X_U$:
- $X_s$ defines the data collected
- $X_U$ includes non-sampled units, which are unknown to the agency
Differential privacy definition

Assume:
- Intruder has no knowledge of who is in the sample
- Sampling is part of the SDL mechanism
- Intruder's prior knowledge relates to $X_U$

Let $\Pr(f \mid X_U)$ denote the probability of f under the sampling mechanism, where $X_U$ is fixed.

$\varepsilon$-differential privacy holds if:
$$\max \ln \frac{\Pr(f \mid X_U^{(1)})}{\Pr(f \mid X_U^{(2)})} \leq \varepsilon \quad \text{for some } \varepsilon > 0$$

where the maximum is over all pairs $(X_U^{(1)}, X_U^{(2)})$ which differ in one row, and over all possible values of f.
Differential privacy definition

Let $S(f)$ denote the set of samples s with $X_s^T 1_n = f$ for which $p(s) > 0$.

Then
$$\Pr[f \mid X_U] = \sum_{s \in S(f)} p(s)$$

Example: simple random sampling of size n:
$$\Pr[f \mid X_U] = \frac{\prod_{j=1}^{K} \binom{F_j}{f_j}}{\binom{N}{n}}$$
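A sketch (my own) of the SRS formula, which is a multivariate hypergeometric probability; the counts are hypothetical:

```python
from math import comb

def prob_f_given_XU(f, F):
    """Pr[f | X_U] under simple random sampling of size n = sum(f)."""
    N, n = sum(F), sum(f)
    num = 1
    for fj, Fj in zip(f, F):
        num *= comb(Fj, fj)   # ways to pick f_j of the F_j units in cell j
    return num / comb(N, n)

# Example: population counts F, sample counts f
print(prob_f_given_XU(f=[1, 2, 1], F=[10, 40, 50]))
```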
Does sampling ensure differential privacy?

Let $i_0$ be a target unit and suppose the intruder wishes to learn about $x_{i_0}$.

Let $F_j^{(i_0)}$ be the count in cell j in the population excluding $i_0$, so
$$F_j = F_j^{(i_0)} + I(x_{i_0} = e_j)$$

The intruder knows all units in the population except the target unit, so $F_j^{(i_0)}$ is known.

Suppose the intruder observes that $f_j = F_j^{(i_0)} + 1$.

Then, since $f_j \leq F_j$, the intruder can infer $x_{i_0} = e_j$ and attribute disclosure arises:

$\varepsilon$-differential privacy does not hold.
Differential privacy under sampling
Is $\varepsilon$-differential privacy too strong a condition?

1. Small sampling fractions in social surveys

2. Proportions are typically released with sufficiently wide confidence intervals to ensure uncertainty, i.e. a sample count is generally not equal to the population count

Dwork et al. (2006) discuss the low 'sensitivity' of a sample proportion, requiring only a small perturbation

3. All possible 'population databases' are considered (even ones that are implausible)
Differential privacy under sampling
4. The population database is not known to the Agency, and an intruder would not know the whole database except one target unit (the population is a random variable)

5. All possible samples are considered when, in fact, only one is drawn (the sample is fixed) to which SDL methods are applied

To obtain $\varepsilon$-differential privacy we need to avoid the case where $f_j = F_j$, and in particular allow no population uniques.
Differential privacy under sampling
How likely is it to get a population unique (leakage)?

Assume: N = 100,000, n = 2,000 (2% sample); 11 key variables generated as {0,1} with Bernoulli(0.2)

Count the number of cases where $f_j = F_j$ (typically only relevant for $f_j = F_j = 1$): calculate the proportion of population uniques out of sample uniques

Average across 1,000 populations and samples: 'leakage' = 0.016
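A sketch (my own) of this simulation; it runs a single replicate rather than 1,000, so the printed proportion will only be near the quoted 0.016:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
N, n, K = 100_000, 2_000, 11

def leakage_once():
    # Population: 11 binary key variables, each Bernoulli(0.2).
    pop = rng.random((N, K)) < 0.2
    keys = [row.tobytes() for row in pop]        # one key-cell label per unit
    F = Counter(keys)                            # population cell counts
    sample_idx = rng.choice(N, size=n, replace=False)
    f = Counter(keys[i] for i in sample_idx)     # sample cell counts
    sample_uniques = [k for k, c in f.items() if c == 1]
    pop_uniques = sum(1 for k in sample_uniques if F[k] == 1)
    return pop_uniques / len(sample_uniques)

print(leakage_once())  # averaging many replicates gave 0.016 on the slide
```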
Differential privacy under sampling
The population is not known, and the probability of a population unique given a sample unique under SRS is estimated by:
$$\Pr(F_k = 1 \mid f_k = 1) = e^{-\lambda_k (1 - \pi)}$$

[Figure: "Probability of Population Unique (1:50 Sample)" – $\Pr(\mathrm{PU} \mid \mathrm{SU})$ plotted against $\lambda_k$ from 0 to 1.6, decaying from 1 towards 0]
Differential privacy under perturbation
Assume no sampling and independent misclassification.

Suppose $X_U^{(1)}$ differs from $X_U^{(2)}$ only in row i, so that $x_i^{(1)} \neq x_i^{(2)}$. Then
$$\frac{\Pr[\tilde{X}_U \mid X_U^{(1)}]}{\Pr[\tilde{X}_U \mid X_U^{(2)}]} = \frac{\Pr(\tilde{x}_i \mid x_i^{(1)})}{\Pr(\tilde{x}_i \mid x_i^{(2)})} = \frac{M_{j^{(1)} \tilde{j}}}{M_{j^{(2)} \tilde{j}}}$$
where $e_{\tilde{j}}, e_{j^{(1)}}, e_{j^{(2)}}$ are the entries $\tilde{x}_i, x_i^{(1)}, x_i^{(2)}$ of row i of $\tilde{X}_U, X_U^{(1)}, X_U^{(2)}$.
Differential privacy under perturbation
There exists a finite $\varepsilon$ for which $\varepsilon$-differential privacy holds iff all elements of M are positive.

Write $\tilde{f}$ before and after the change of row i as $\tilde{f}^{(1)}$ and $\tilde{f}^{(2)}$, where $|\tilde{f}^{(1)} - \tilde{f}^{(2)}| \leq 2$ (Abowd and Vilhuber, 2008):
$$\max \ln \frac{\Pr[\tilde{f} \mid X_U^{(1)}]}{\Pr[\tilde{f} \mid X_U^{(2)}]} \leq \max \ln \frac{\Pr[\tilde{X}_U \mid X_U^{(1)}]}{\Pr[\tilde{X}_U \mid X_U^{(2)}]} = \max_{j^{(1)}, j^{(2)}, \tilde{j}} \ln \frac{M_{j^{(1)} \tilde{j}}}{M_{j^{(2)} \tilde{j}}} = \max_{\tilde{j}} \Big[ \max_j \ln M_{j \tilde{j}} - \min_j \ln M_{j \tilde{j}} \Big]$$
Differential privacy under perturbation
For categorical key variables, some forms of random data swapping, PRAM and other synthetic data generation methods preserve differential privacy.

Non-perturbative methods do not preserve differential privacy.

Remark: to preserve the logical consistency of the data, edit rules are often enforced in the form of structural zeros in M, which violates the positivity condition.
Differential privacy under sampling and perturbation

Assume a two-step process:
- sampling of $X_s$ with probability $p(s) > 0$
- generation of $\tilde{X}_s$

Write:
$$\Pr[\tilde{f} \mid X_U] = \sum_s p(s) \sum_{\tilde{x}:\, \tilde{x}^T 1_n = \tilde{f}} p(\tilde{X}_s = \tilde{x} \mid X_s)$$

Differential privacy will not hold for any finite $\varepsilon$ iff there is a pair $(X_U^{(1)}, X_U^{(2)})$ which differ only in one row and for which $\Pr[\tilde{f} \mid X_U^{(1)}] > 0$ and $\Pr[\tilde{f} \mid X_U^{(2)}] = 0$.

If all elements of M are positive, then for any s such that $p(s) > 0$ we have $p(\tilde{X}_s = \tilde{x} \mid X_s) > 0$ for every $\tilde{x}$, so $\Pr[\tilde{f} \mid X_U] > 0$ for every $\tilde{f}$ with $1_K^T \tilde{f} = 1_K^T f = n$, and differential privacy holds.
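A brute-force sketch (my own, feasible only for toy sizes) that computes $\Pr[\tilde{f} \mid X_U]$ by enumeration under SRS plus perturbation; with an all-positive M every achievable $\tilde{f}$ gets positive probability, keeping the DP log-ratio finite:

```python
import numpy as np
from itertools import combinations, product

def prob_f_tilde(x_U, n, M, f_tilde):
    """Pr[f~ | X_U] by enumeration: SRS of size n, then independent perturbation."""
    N, K = len(x_U), M.shape[0]
    samples = list(combinations(range(N), n))          # p(s) = 1 / len(samples)
    total = 0.0
    for s in samples:
        for x_tilde in product(range(K), repeat=n):    # all perturbed outcomes
            if np.bincount(x_tilde, minlength=K).tolist() == list(f_tilde):
                p = 1.0
                for unit, new in zip(s, x_tilde):
                    p *= M[x_U[unit], new]             # Pr(x~ = e_new | x = e_old)
                total += p / len(samples)
    return total

M = np.full((3, 3), 0.1) + np.eye(3) * 0.7             # all elements positive
x_U = [0, 0, 1, 2]                                     # population of N = 4, K = 3
print(prob_f_tilde(x_U, 2, M, (0, 1, 1)))              # > 0 for every f~ summing to n
```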
Conclusions

Sampling does not preserve differential privacy when sample counts equal population counts.

- Agencies generally avoid these cases through small sampling fractions and by implementing SDL methods

Misclassification/perturbation preserves differential privacy if the probability mechanism M has all positive elements.