Issues in Designing a Confidentiality Preserving Model Server
by Philip M Steel & Arnold Reznek
Talk Outline
Background
Basic design
Description of operation
Confidentiality outline
Constraints on universe formation
Other constraints
Summary
Background
PUBLIC remote access to confidential data
Restriction of queries and responses rather than registering and monitoring the user
Current Population Survey (CPS): employment and economic well-being; demographic supplement
Software development by Synectics
HTML, MySQL, PHP to develop the query …
SAS as the statistical package run against the data
Risk Model for Microdata
• Intruder has access to record linkage software and identified data sources
• Disclosure occurs if the intruder is successful in linking his identified data with the published microdata
Risk Model for a Model Server
• Intruder has access to record linkage software and identified data sources
• Intruder uses model server to reconstruct microdata for both the variables overlapping his data sources and a sensitive variable
• Disclosure occurs if the intruder is successful in linking his identified data with the reconstructed microdata and has valid estimate of a sensitive characteristic or value
Basic Design Choice
Enable: Choose which functions will operate
– Must construct a friendly interface
– Limited to the procedures developed
– Safe from unknown code
Disable: Choose which functions will not operate
– User free to program within disabling constraints
– No limit on complexity
– Must be monitored (human, program or mix)
Operation
User visits web site, chooses data set, explores data, chooses geography, analysis type
User chooses population, constructs model, selects output
Web site constructs code to send behind firewall
Code checked and run against data at Census
Results checked and returned to user
Structure of Confidentiality Rules
Data preparation
Data exploration
Model universe formation
Model Statement
Model Output
Data exploration rules
Users may request tables for categorical variables and numeric recodes up to e1 dimensions. (start e1=4 including geo)
Users may transform numeric recodes using a limited set of functions: log, root, square.
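The two exploration rules can be sketched as a simple check. This is an illustration only, assuming a request is a list of variable names plus an optional transform name. The function name `check_exploration_request` and the variable names in the usage note are hypothetical; only e1=4 and the three permitted functions come from the rules above. In practice the interface would more likely offer only permitted choices than reject a submitted one.

```python
E1 = 4  # maximum table dimensions, geography included (starting value from the rules)
ALLOWED_TRANSFORMS = {"log", "root", "square"}  # the limited set named in the rules


def check_exploration_request(table_vars, transform=None):
    """Return a list of rule violations; an empty list means the request passes."""
    problems = []
    # Count distinct variables so a repeated name is not counted twice.
    if len(set(table_vars)) > E1:
        problems.append(f"table has more than {E1} dimensions")
    if transform is not None and transform not in ALLOWED_TRANSFORMS:
        problems.append(f"transform '{transform}' is not permitted")
    return problems
```

For example, a four-way table of hypothetical variables `state`, `sex`, `race` and `educ` passes, while adding a fifth variable or asking for an `exp` transform does not.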
Universe formation: Categorical Variables
Example: Hispanic heads of household with a college degree.
Conditions: X1=H,X2=1,X3=5 (table cell)
Implication: Data preparation must support safe lower dimensional tables
Universe formation rules: Categorical Variables
Limit on the number of categorical variables (u1=3)
Minimum on the size of the universe selected (u2=75)
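A minimal sketch of the two categorical rules, assuming records are held as Python dictionaries and a universe is defined by equality conditions such as X1=H, X2=1, X3=5. The function name `select_universe` and the record layout are assumptions made for illustration; u1=3 and u2=75 are the values given above.

```python
U1 = 3   # maximum number of categorical variables in a universe definition
U2 = 75  # minimum number of observations in the selected universe


def select_universe(records, conditions):
    """Apply equality conditions to a list of dict records and enforce u1 and u2."""
    if len(conditions) > U1:
        raise ValueError(f"at most {U1} categorical variables may define a universe")
    universe = [
        r for r in records
        if all(r.get(var) == value for var, value in conditions.items())
    ]
    if len(universe) < U2:
        raise ValueError(f"universe must contain at least {U2} observations")
    return universe
```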
Universe Formation: Numeric Variables
Example: Families in poverty
Condition: Family income<18,500. Or Family income<18,501?
Implication: Rounding or pre-assigned cutpoints.
Universe formation rules: Numeric variables
Users will select categorical variables first
Numeric variables can be used only at pre-assigned cutpoints.
The number of observations in the whole CPS universe between cutpoints shall be at least u3 for every numeric variable. (start u3=80)
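One way the pre-assigned cutpoints might be derived during data preparation is sketched below. It is not the method used by the system, which the slides do not describe. It assumes `values` holds one numeric variable for the whole CPS universe, that `candidates` is a list of round-number cutpoints proposed by an analyst, and that an observation lies below a cutpoint c when its value is less than c. The function name `assign_cutpoints` is hypothetical; u3=80 is the starting value from the rule.

```python
import bisect

U3 = 80  # minimum observations between adjacent cutpoints


def assign_cutpoints(values, candidates, u3=U3):
    """Greedily keep candidate cutpoints so every interval holds at least u3 values."""
    values = sorted(values)
    kept = []
    lower = 0  # index of the first value above the last kept cutpoint
    for c in sorted(candidates):
        upper = bisect.bisect_left(values, c)  # number of values strictly below c
        if upper - lower >= u3:
            kept.append(c)
            lower = upper
    # The open interval above the highest cutpoint must also hold at least u3 values.
    while kept and len(values) - lower < u3:
        kept.pop()
        lower = bisect.bisect_left(values, kept[-1]) if kept else 0
    return kept
```

With values 0 to 999 and candidates 50, 100, 200, 950 and 990, the sketch keeps 100 and 200: the candidate 50 leaves too few observations below it, and 950 and 990 leave too few above.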
Universe formation rules (cont)
If a cutpoint is used in universe formation then the difference in the size of the model universe obtained by incrementing the cutpoint up or down cannot be less than u4. (start u4=4)
The universe for the model must have at least u2 observations. (start u2=75)
There will be no cutpoints above the 97th percentile of nonzero points or the last half percentile of all points.
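The u4 rule can be illustrated with a short check. The sketch assumes that "incrementing the cutpoint up or down" means moving to the adjacent pre-assigned cutpoint, and that the universe condition has the form value < cutpoint; both readings are assumptions, as is the function name `cutpoint_is_usable`. The percentile cap and the u2 minimum are not covered here.

```python
U4 = 4  # minimum change in universe size when the cutpoint moves one step


def cutpoint_is_usable(universe_values, cutpoints, index, u4=U4):
    """Check the u4 rule for the condition value < cutpoints[index]."""
    def size(c):
        # Number of observations in the universe that satisfy value < c.
        return sum(1 for v in universe_values if v < c)

    here = size(cutpoints[index])
    # Adjacent cutpoints that exist in the pre-assigned list.
    neighbours = [i for i in (index - 1, index + 1) if 0 <= i < len(cutpoints)]
    return all(abs(size(cutpoints[i]) - here) >= u4 for i in neighbours)
```

The point of the rule is that two neighbouring queries should never differ by only a handful of records, since differencing their results could isolate those records.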
Model statement rules
At most m1 variables may be used in the model statement (start m1=20)
Dummy variables must distinguish at least m2 observations (start m2=20)
No interaction term may involve more than 4 variables. (m3=4)
No model involving 3 or more variables can be fully interacted. (m4=3)
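Rules m1, m3 and m4 depend only on the model statement, so they can be sketched without touching the data. The sketch assumes a model is a collection of terms, each a tuple of variable names, and reads "fully interacted" as "every non-empty combination of the model's variables appears as a term"; that reading and the function name `check_model_terms` are assumptions. Rule m2 needs the data and is not covered.

```python
from itertools import combinations

M1 = 20  # maximum number of variables in the model statement
M3 = 4   # maximum number of variables in one interaction term
M4 = 3   # models with this many variables or more may not be fully interacted


def check_model_terms(terms):
    """terms: iterable of tuples of variable names, e.g. ("age",) or ("age", "sex")."""
    term_sets = {frozenset(t) for t in terms}
    variables = sorted(set().union(*term_sets)) if term_sets else []
    problems = []
    if len(variables) > M1:
        problems.append("too many variables")
    if any(len(t) > M3 for t in term_sets):
        problems.append("interaction too deep")
    # A full interaction needs the top-order term, so test for that first.
    if len(variables) >= M4 and frozenset(variables) in term_sets:
        full = all(
            frozenset(c) in term_sets
            for k in range(1, len(variables))
            for c in combinations(variables, k)
        )
        if full:
            problems.append("model is fully interacted")
    return problems
```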
Model Output
Residuals will be based on synthetic data
Limit on the number of significant digits?
R² cannot be 1?
Rules for other diagnostics
Synthetic Residuals
Users may see synthetic bar charts or distributions and synthetic 2-way plots.
Synthetic data must be generated from fixed random number starts and topcoded (and bottom coded where appropriate) at 4 standard deviations from the mean.
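The two requirements on synthetic data, a fixed random number start and coding at 4 standard deviations, can be sketched as follows. How the synthetic values are actually generated is not stated in the slides; drawing from a normal distribution with the mean and standard deviation of the real residuals is an assumption made purely for illustration, as are the seed value and the function name `synthetic_residuals`.

```python
import random
import statistics

SEED = 12345  # hypothetical fixed start, so repeated requests return identical output
K = 4         # top and bottom coding distance, in standard deviations


def synthetic_residuals(residuals, seed=SEED, k=K):
    """Draw normal values matching the real mean and SD, then clip at k SD from the mean."""
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    rng = random.Random(seed)  # fresh generator each call, always from the same start
    lo, hi = mean - k * sd, mean + k * sd
    return [min(max(rng.gauss(mean, sd), lo), hi) for _ in residuals]
```

Fixing the start matters because a user who could redraw the synthetic data many times could average the draws and learn more about the underlying distribution than a single release reveals.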
Data preparation
The topcode for numeric data needs to be calculated
Cutpoints must be determined
Separate lists of variables for exploration, universe formation, dependent and independent variables, model estimation
Standard recodes added
Inference from the collection of all 4-way categorical tables checked
Major Hurdles
Implementing facility for dummy variables
Presentation of geographic options
Implementing synthetic residuals
Architecture for differing variable roles
Future development
Relaxation of top codes
Implementation of model variance estimation (NSO weighting)
Introduction of new dataset
Introduction of new statistical procedures
Facility to add contextual data or merge files
Use of non-sampled data
Overview
• Avoids (as much as possible) tests which accept or reject a user's choice.
• Restricts the dimension of the data access.
• Has some flexibility in setting system confidentiality parameters.
• Changes the intruder model.
• Introduces a modification of k-anonymity.
My thanks to Jerry Reiter, Laura Zayatz and Stephen Wenck
http://204.52.186.190/
Contact: [email protected]