Progress on the SDC Strategy for the 2011 Census
23rd June 2008
Keith Spicer and Caroline Young
Outline
• Context
• Work plan
• Description of the short-listed methods
• Quantitative Evaluation – some results!
• Conclusions and Further Work
Context
•SDC for 2011 Census outputs is a major concern for users
•Different SDC methodologies were adopted for tabular 2001 Census outputs across UK
•Late addition of small cell adjustment by ONS/NISRA resulted in high level of user confusion and dissatisfaction
•Publicised commitment to aim for a common UK SDC methodology for all 2011 Census outputs
Workplan
• Phase 1 (March ’06 – Jan ’07)
– UK agreement of key SDC policy issues
• Phase 2 (Jan ’07 – Sept ’08)
– Evaluation of all methods complying with agreed SDC policy position in terms of risk/utility framework and feasibility of implementation
• Phase 3 (Sept ’08 – Spring/Summer ’09)
– Recommendations and UK agreement of SDC methodologies for 2011 Census tabular outputs
• Phase 4 (Feb ’09 onwards)
– Evaluate and develop SDC methods for microdata, future work on output specification, system specification, development and testing
Progress
• Development of SDC Strategy
– UK SDC working group, consisting of representatives from Wales, Northern Ireland and Scotland, established to take forward methodological work
– UKCDMAC subgroup set up to QA work
• Methodological research:
– Determine the short-list of SDC methods (Aug ’07)
– Quantitative evaluation of short-list (complete Sep ’08)
• Focus on tabular outputs whilst considering impact on other outputs (e.g. microdata)
Quantitative Evaluation
•Examine how methods protect and manage risk and how they impact on data utility
•Using a range of 2001 Census tables, varying parameters, different geographies
•Information Loss software used to evaluate each short-listed method
Short-listed Methods being considered for 2011 Census data
•Applied so that ‘safe’ tabular outputs can be released
• Record Swapping
• Over-imputation
•ABS Cell Perturbation (developed by the Australian Bureau of Statistics)
•2001 Census SDC methods used as a baseline for comparison: Record Swapping and Small Cell Adjustment (SCA)
Short-listed SDC methods
•Record Swapping: pre-tabular (applied directly to the microdata)
•Over-imputation: pre-tabular (applied directly to the microdata)
•ABS Cell Perturbation: post-tabular (applied to tables)
•SCA (a type of rounding) is also a post-tabular method
Record Swapping
• Swap the geographical location of a small number of households
• Households are paired according to similar characteristics (to avoid too much data distortion)
• Creates uncertainty in the data
• Can swap unique records only (those at greater risk)
Record Swapping
Example:
• Record in Area A: unique, as the only person with 3 cars in Area A
  Age: 22, Sex: Male, Marital Status: Married, No of Cars: 3, Region: Area A
• Record in Area B: matches all variables except No of Cars
  Age: 22, Sex: Male, Marital Status: Married, No of Cars: 1, Region: Area B
• Treatment: find a different geographical area, identify another individual in that area with virtually all the same characteristics, and swap the two records
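A minimal Python sketch of the swapping idea, using the Area A / Area B example. It is an illustration only, not the ONS procedure: the record layout, the name `swap_unique_records`, the choice of matching variables and the rule "unique on number of cars within an area" are all assumptions made for this sketch, and it works on individual records rather than whole households.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "age": 22, "sex": "M", "marital": "Married", "cars": 3, "area": "A"},
    {"id": 2, "age": 22, "sex": "M", "marital": "Married", "cars": 1, "area": "B"},
    {"id": 3, "age": 40, "sex": "F", "marital": "Single", "cars": 1, "area": "A"},
    {"id": 4, "age": 40, "sex": "F", "marital": "Single", "cars": 1, "area": "A"},
]
MATCH_VARS = ("age", "sex", "marital")  # assumed: partners must agree on these
RISK_VAR = "cars"                        # assumed: uniqueness is checked on this


def swap_unique_records(records):
    """Swap the area of each record that is unique on RISK_VAR within its area."""
    # Uniqueness is judged on the data as it was before any swap.
    counts = Counter((r["area"], r[RISK_VAR]) for r in records)
    swapped = set()
    for r in records:
        if r["id"] in swapped or counts[(r["area"], r[RISK_VAR])] != 1:
            continue  # already swapped, or not unique: leave untouched
        for partner in records:
            if (partner["id"] not in swapped
                    and partner["area"] != r["area"]
                    and all(partner[v] == r[v] for v in MATCH_VARS)):
                # Exchange geography only; all other values stay with the record.
                r["area"], partner["area"] = partner["area"], r["area"]
                swapped.update({r["id"], partner["id"]})
                break
    return records
```

Calling `swap_unique_records(records)` moves record 1 to Area B and record 2 to Area A. The number of records in each area is unchanged, because a swap only exchanges locations.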
Over-Imputation
• Imputation is a standard procedure for census data used to insert plausible values for those missing due to non-response
• Since it is not known which records hold true values and which hold imputed ones, imputation can also be used for SDC
• Carried out by the Edit and Imputation team at ONS using CANCEIS
• Algorithm: distance-based nearest neighbour search to select a donor, based on a set of matching variables
1) Blank out values for certain records in the data
2) Replace blanked out values with ‘imputed values’ using a nearest neighbour donor
Over-Imputation
Example:
• Record 1: age 25, male, single, 6 people in hhld, 0 cars, student
• Record 2: age 21, male, single, 6 people in hhld, 0 cars, student
• Blank out age from record
• Find a donor to impute age
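The two steps (blank out a value, then fill it from a nearest neighbour donor) can be sketched in Python as below. This is not CANCEIS and does not reproduce its distance function: the mismatch-count distance, the function names and the record layout are assumptions chosen to keep the sketch short.

```python
# Assumed matching variables; a real application would choose these carefully.
MATCH_VARS = ("sex", "marital", "hh_size", "cars", "student")


def nearest_donor(target, donors, match_vars=MATCH_VARS):
    """Return the donor that disagrees with target on the fewest matching variables."""
    def distance(donor):
        return sum(donor[v] != target[v] for v in match_vars)
    # min() returns the first donor with the smallest distance.
    return min(donors, key=distance)


def over_impute(record, donors, variable, match_vars=MATCH_VARS):
    """Blank out `variable` in a copy of `record`, then impute it from a donor."""
    result = dict(record)
    result[variable] = None                      # step 1: blank out the value
    donor = nearest_donor(result, donors, match_vars)
    result[variable] = donor[variable]           # step 2: copy the donor's value
    return result


# Hypothetical data loosely based on the example records.
record = {"age": 25, "sex": "M", "marital": "Single",
          "hh_size": 6, "cars": 0, "student": True}
donors = [
    {"age": 60, "sex": "F", "marital": "Married",
     "hh_size": 2, "cars": 1, "student": False},
    {"age": 21, "sex": "M", "marital": "Single",
     "hh_size": 6, "cars": 0, "student": True},
]
imputed = over_impute(record, donors, "age")
```

Here the second donor matches on every variable, so `imputed["age"]` is 21. If the closest donor happened to hold the same age as the original record, the imputed value would equal the true one and the record would be unchanged.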
Over-Imputation
• Which variables to impute?
• Risky variables? Ethnicity, elderly, other minority populations
• CANCEIS may impute exactly if using nearest neighbour donor
• Impute age (all donors) and small area geography (use only donors within same local authority): get a small margin of error
(ABS) Cell Perturbation
•Developed by the Australian Bureau of Statistics (ABS)
•Perturb each cell value in a table to create uncertainty around the true value
•Two stage method:
– Stage 1: Adding Perturbation
– Stage 2: Restoring Additivity
(ABS) Cell Perturbation
• Stage 1: Each cell is always perturbed in the same way using microdata keys – CONSISTENCY
• Stage 2: Restoring ADDITIVITY means consistency is lost slightly
• An improved approach is being developed in collaboration with Southampton University: optimise consistency and additivity – INVARIANT cell perturbation.
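A rough Python sketch of how Stage 1 can give consistency through microdata keys: each record carries a fixed random key, a cell's key is derived from the keys of the records it contains, and the perturbation is a deterministic function of that cell key. The modulus, the range of the perturbation and the mapping from key to perturbation are assumptions for illustration and are not the ABS parameters. Stage 2 (restoring additivity) is not shown.

```python
import random

MODULUS = 2 ** 16   # assumed size of the key space
MAX_SHIFT = 2       # assumed largest perturbation, in either direction


def assign_record_keys(n_records, seed=0):
    """Give every microdata record a fixed pseudo-random key."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(n_records)]


def cell_key(record_keys, member_ids):
    """Combine the keys of the records in a cell; the order of members is irrelevant."""
    return sum(record_keys[i] for i in member_ids) % MODULUS


def perturbation(key):
    """Map a cell key to an integer in [-MAX_SHIFT, MAX_SHIFT] (illustrative rule)."""
    return key % (2 * MAX_SHIFT + 1) - MAX_SHIFT


def perturb_cell(record_keys, member_ids):
    """Return the perturbed count for the cell containing the given records."""
    count = len(member_ids)
    if count == 0:
        return 0  # assumed here: empty cells are left at zero
    return max(0, count + perturbation(cell_key(record_keys, member_ids)))


keys = assign_record_keys(10, seed=1)
first = perturb_cell(keys, [0, 3, 5])
again = perturb_cell(keys, [5, 0, 3])   # same cell requested in another table
```

Because the cell key depends only on which records are in the cell, `first` and `again` are equal: the same cell receives the same perturbed value wherever it appears.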
Results
• What is the effect on statistical quality of the data?
– Tendency to increase correlations?
– Tendency to distort distance metrics?
– etc. (many ways to measure information loss)
• Impact on disclosure risk
• Examine different types of data
Results
• Only Over-Imputation, Record Swapping and Record Swapping with SCA have been evaluated so far.
• Both targeted and random approaches are being looked at.
• Note there are different ways of carrying out swapping and imputation, so interpretation of the results should take this into account.
Data for Analysis
• SJ EA; approx. 200,000 households and 500,000 persons
• Four census tables so far:
(1) Country of birth by religion by sex: individuals at ward level
(2) Number of persons by accommodation type: households at OA and ED level
(3) Age by religion by gender: individuals at OA and ED level
(4) Origin-destination table: flows between home and travel to work location
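Measures of Quality
• Impact on Tests for Independence:
Cramer’s V measure of association:
$$CV = \sqrt{\frac{\chi^2 / n}{\min(R-1,\; C-1)}}$$
where $\chi^2$ is the Pearson chi-square statistic
$$RCV(D_{pert}, D_{orig}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})}$$
Also, the same measure for entropy and the Pearson Statistic
• Variance of Cell Counts:
For each row $k$:
$$V(D^k_{orig}) = \frac{1}{n_k - 1} \sum_{c \in k} \left( D^k_{orig}(c) - \bar{D}^k_{orig} \right)^2 \quad \text{and} \quad V(D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} V(D^k_{orig})$$
$$VR(D^k_{pert}, D^k_{orig}) = \frac{V(D^k_{pert})}{V(D^k_{orig})}, \qquad VR(D_{pert}, D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} VR(D^k_{pert}, D^k_{orig})$$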
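As a hedged illustration of these quality measures, the Python sketch below computes Cramer's V from the Pearson chi-square statistic, the percent relative difference in Cramer's V between a perturbed and an original table, and the ratio of row variances averaged over rows. Tables are plain lists of rows, and the function names are choices made for this sketch, not the Information Loss software.

```python
import math


def pearson_chi2(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """CV = sqrt((chi2 / n) / min(R - 1, C - 1))."""
    n = sum(map(sum, table))
    n_rows, n_cols = len(table), len(table[0])
    return math.sqrt(pearson_chi2(table) / n / min(n_rows - 1, n_cols - 1))


def rcv(pert, orig):
    """Percent relative difference in Cramer's V (perturbed vs original)."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)


def row_variance(row):
    """Sample variance of the cell counts in one row."""
    mean = sum(row) / len(row)
    return sum((x - mean) ** 2 for x in row) / (len(row) - 1)


def mean_variance_ratio(pert, orig):
    """Average over rows of V(perturbed row) / V(original row).

    Assumes every original row has non-zero variance.
    """
    ratios = [row_variance(p) / row_variance(o) for p, o in zip(pert, orig)]
    return sum(ratios) / len(ratios)
```

A perfectly associated table such as `[[10, 0], [0, 10]]` gives Cramer's V of 1, and comparing any table with itself gives an RCV of 0 and a variance ratio of 1.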
Measures of Utility
• Impact on Rank Correlations:
Sort original cell counts and define deciles $v_{orig}$. Repeat on perturbed cell counts to give $v_{pert}$.
$$RC = \frac{100}{n_r} \sum_{k=1}^{n_r} I\left( v^k_{orig} \neq v^k_{pert} \right)$$
where $I$ is the indicator function and $n_r$ the number of rows
• Log Linear Analysis:
Ratio of the deviance (likelihood ratio test statistic) between perturbed table and original table for a given model:
$$LR = \frac{L^2_{pert}}{L^2_{orig}}$$
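The two utility measures can be sketched in Python as follows. Several details are assumptions: counts are passed as a flat list, deciles are assigned from rank position with ties broken by position, the rank measure counts the percentage of entries whose decile changes, and the log-linear model is taken to be simple independence in a two-way table (the slides say only "a given model").

```python
import math


def decile_groups(counts):
    """Assign each count a decile group 0..9 from its position in sorted order."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    groups = [0] * len(counts)
    for rank, i in enumerate(order):
        groups[i] = rank * 10 // len(counts)
    return groups


def rank_change(pert, orig):
    """Percentage of entries whose decile group differs after perturbation."""
    changed = sum(a != b for a, b in zip(decile_groups(orig), decile_groups(pert)))
    return 100 * changed / len(orig)


def deviance_independence(table):
    """Deviance L^2 = 2 * sum(obs * ln(obs / exp)) under the independence model."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    l2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # zero cells contribute nothing to the sum
                expected = row_tot[i] * col_tot[j] / n
                l2 += 2 * observed * math.log(observed / expected)
    return l2


def deviance_ratio(pert, orig):
    """LR = L^2(perturbed) / L^2(original); assumes the original deviance is non-zero."""
    return deviance_independence(pert) / deviance_independence(orig)
```

For ten distinct counts, exchanging the two smallest moves two entries into a different decile, so `rank_change` returns 20.0. Comparing a table with itself gives a deviance ratio of 1.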
Impact on Disclosure Risk
[Figure: Disclosure Risk Remaining in Table. Probability of a small cell not being perturbed (scale 0 to 1) for tables 1, 2 and 3 under each method: Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20%, Target imput 20%.]
Quality Measures
[Figure: Ratio of Variance (after/before). Change in variance (scale 0.85 to 1.1) for tables 1, 2 and 3 under each method: Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20%, Target imput 20%.]
Quality Measures
[Figure: Percent Relative Difference in Level of Association. % change in Cramer's V (scale -14 to 6) for tables 1, 2 and 3 under each method: Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20%, Target imput 20%.]
Changes to Totals / Subtotals
• Swapping does not change the overall set of household locations
– Totals and subtotals by geography preserved
• Over-Imputation does change set of locations
– Totals and subtotals by geography not preserved
• Swapping has no impact on Origin-Destination total flows – NO PROTECTION
• Over-Imputation does not preserve O/D total flows – POOR QUALITY
Conclusions
• Decide whether to drop over-imputation: test on another EA?
• Quantitative Evaluation to be finished by September ’08
• ABS cell perturbation method currently being evaluated – results are looking good
Further Work
• Setting of parameter values for final method; e.g. level of perturbation
• Protection of microdata samples
• Communal establishments
• Output specification / geography
• System specification, development and testing
Contact Details
[email protected]@ons.gov.uk
Useful links:
www.statistics.gov.uk/census/2011census/producingdata/outputconfidentiality.asp
www.statistics.gov.uk/census2001/discloseprotect.asp