
Progress on the SDC Strategy for the 2011 Census

23rd June 2008

Keith Spicer and Caroline Young

Outline

• Context

• Work plan

• Description of the short-listed methods

• Quantitative Evaluation – some results!

• Conclusions and Further Work

Context

•Statistical disclosure control (SDC) for 2011 Census outputs is a major concern for users

•Different SDC methodologies were adopted for tabular 2001 Census outputs across UK

•Late addition of small cell adjustment by ONS/NISRA resulted in a high level of user confusion and dissatisfaction

•Publicised commitment to aim for a common UK SDC methodology for all 2011 Census outputs

Workplan

•Phase 1 (March ’06 – Jan ’07)
– UK agreement of key SDC policy issues

•Phase 2 (Jan ’07 – Sept ’08)
– Evaluation of all methods complying with agreed SDC policy position in terms of risk/utility framework and feasibility of implementation

•Phase 3 (Sept ’08 – Spring/Summer ’09)
– Recommendations and UK agreement of SDC methodologies for 2011 Census tabular outputs

•Phase 4 (Feb ’09 onwards)
– Evaluate and develop SDC methods for microdata, future work on output specification, system specification, development and testing

Progress

•Development of SDC Strategy
– UK SDC working group established to take forward methodological work consisting of representatives from Wales, Northern Ireland and Scotland
– UKCDMAC subgroup set up to QA work

•Methodological research:
– Determine the short-list of SDC methods (Aug ’07)
– Quantitative evaluation of short-list (complete Sep ’08)

•Focus on tabular outputs whilst considering impact on other outputs (e.g. microdata)

Quantitative Evaluation

•Examine how methods protect and manage risk and how they impact on data utility

•Using a range of 2001 Census tables, varying parameters, different geographies

•Information Loss software used to evaluate each short-listed method

Short-listed Methods being considered for 2011 Census data

•Applied so that ‘safe’ tabular outputs can be released

• Record Swapping

• Over-imputation

•ABS Cell Perturbation (developed by the Australian Bureau of Statistics)

•2001 Census SDC methods used as a baseline for comparison: Record Swapping and Small Cell Adjustment (SCA)

Short-listed SDC methods

•Record Swapping and Over-imputation: pre-tabular (applied directly to the microdata)

•ABS Cell Perturbation: post-tabular (applied to tables)

•SCA (a type of rounding) is also a post-tabular method

Record Swapping

• Swap the geographical location of a small number of households

• Households are paired according to similar characteristics (to avoid too much data distortion)

• Creates uncertainty in the data

• Can swap unique records only (those at greater risk)

Record Swapping

Record in Area A
Characteristics: Age: 22, Sex: Male, Marital Status: Married, No of Cars: 3, Region: Area A
Unique as only person with 3 cars in Area A

Record in Area B
Characteristics: Age: 22, Sex: Male, Marital Status: Married, No of Cars: 1, Region: Area B
Matches all variables except No of Cars

Treatment: Find a different geographical area. Identify another individual in a different area with virtually all the same characteristics. Swap the two records.
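The swap described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for the Census: the function name `swap_records`, the field names and the `rate` parameter are hypothetical, records are matched exactly on the chosen characteristics, and the records to swap are picked at random.

```python
import random

def swap_records(records, match_keys, area_key="area", rate=0.2, seed=0):
    """Swap the area of a random sample of records with a matched record elsewhere.

    Hypothetical sketch: pairs must agree exactly on every key in match_keys.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]   # work on copies, leave the input untouched
    done = set()                           # indices that have already been swapped
    order = list(range(len(swapped)))
    rng.shuffle(order)
    target = int(rate * len(swapped))      # number of records selected for swapping
    for i in order[:target]:
        if i in done:
            continue
        a = swapped[i]
        # Candidates: unswapped records in a different area with the same characteristics
        candidates = [j for j, b in enumerate(swapped)
                      if j != i and j not in done
                      and b[area_key] != a[area_key]
                      and all(b[k] == a[k] for k in match_keys)]
        if not candidates:
            continue                       # no suitable partner, leave the record alone
        j = rng.choice(candidates)
        swapped[i][area_key], swapped[j][area_key] = (
            swapped[j][area_key], swapped[i][area_key])
        done.update((i, j))
    return swapped

# The two records from the example: they match on everything except number of cars
people = [
    {"age": 22, "sex": "Male", "marital": "Married", "cars": 3, "area": "A"},
    {"age": 22, "sex": "Male", "marital": "Married", "cars": 1, "area": "B"},
]
result = swap_records(people, ["age", "sex", "marital"], rate=1.0)
```

After the swap the person with 3 cars is recorded in Area B, and each area still contains one person, so counts by area are unchanged.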

Over-Imputation

• Imputation is a standard procedure for census data used to insert plausible values for those missing due to non-response

• Since users cannot tell whether a given value is true or imputed, imputation can also be used for SDC

• Carried out by the Edit and Imputation team at ONS using CANCEIS

• Algorithm: distance based nearest neighbour to use as a donor based on a set of matching variables

1) Blank out values for certain records in the data

2) Replace blanked out values with ‘imputed values’ using a nearest neighbour donor

Example: blank out age from record, then find a donor to impute age

25, male, single, 6 people in hhld, 0 cars, student

21, male, single, 6 people in hhld, 0 cars, student

Over-Imputation

• Which variables to impute?

• Risky variables? Ethnicity, elderly, other minority populations

• CANCEIS may impute exactly the original value when using a nearest neighbour donor

• Impute age (all donors) and small area geography (use only donors within same local authority): get a small margin of error
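The two over-imputation steps (blank out, then impute from a nearest neighbour donor) can be sketched as follows. This is not CANCEIS and does not reproduce its algorithm: the function `over_impute`, the field names and the simple distance (number of matching variables that differ) are all assumptions made for illustration.

```python
def over_impute(records, target, match_keys, selected):
    """Blank out `target` for the selected records and refill it from a donor.

    Hypothetical sketch: `selected` is a set of record indices, and at least
    one record must remain unselected so that a donor exists.
    """
    out = [dict(r) for r in records]
    for i in selected:
        out[i][target] = None                   # step 1: blank out the value
    # Donors are the records whose value was not blanked out
    donors = [r for j, r in enumerate(out) if j not in selected]
    for i in selected:
        rec = out[i]
        # Distance = number of matching variables on which donor and record differ
        donor = min(donors, key=lambda d: sum(d[k] != rec[k] for k in match_keys))
        rec[target] = donor[target]             # step 2: impute from nearest donor
    return out

people = [
    {"age": 25, "sex": "male", "marital": "single", "hhld": 6, "cars": 0, "status": "student"},
    {"age": 21, "sex": "male", "marital": "single", "hhld": 6, "cars": 0, "status": "student"},
    {"age": 40, "sex": "female", "marital": "married", "hhld": 2, "cars": 1, "status": "employed"},
]
keys = ["sex", "marital", "hhld", "cars", "status"]
imputed = over_impute(people, "age", keys, selected={0})
```

Here the first record has its age blanked out and receives age 21 from the second record, which matches it on every other variable.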

(ABS) Cell Perturbation

•Developed by the Australian Bureau of Statistics (ABS)

•Perturb each cell value in a table to create uncertainty around the true value

•Two stage method:
– Stage 1: Adding Perturbation
– Stage 2: Restoring Additivity

(ABS) Cell Perturbation

• Stage 1: Each cell is always perturbed in the same way using microdata keys – CONSISTENCY

• Stage 2: Restoring ADDITIVITY means consistency is lost slightly

• An improved approach is being developed in collaboration with Southampton University: optimise consistency and additivity – INVARIANT cell perturbation.
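One way to picture the consistency property of Stage 1 is a key-based sketch like the one below. It is an illustration only, under loud assumptions: the modulus `BIG`, the lookup table `PERTURBATIONS` and both function names are hypothetical and are not the ABS parameters. Stage 2 (restoring additivity) is not shown.

```python
import random

BIG = 2**32                              # hypothetical key range
PERTURBATIONS = [-2, -1, 0, 0, 0, 1, 2]  # hypothetical lookup table of noise values

def assign_record_keys(n, seed=0):
    """Give every microdata record a fixed random key."""
    rng = random.Random(seed)
    return [rng.randrange(BIG) for _ in range(n)]

def perturb_cell(record_keys_in_cell):
    """Perturb a cell count using only the keys of the records in the cell."""
    count = len(record_keys_in_cell)
    if count == 0:
        return 0                                   # empty cells stay empty
    cell_key = sum(record_keys_in_cell) % BIG      # depends only on which records are in the cell
    noise = PERTURBATIONS[cell_key % len(PERTURBATIONS)]
    return max(count + noise, 0)                   # never publish a negative count

keys = assign_record_keys(10)
cell = keys[:4]                                    # a cell containing records 0 to 3
value_in_table_1 = perturb_cell(cell)
value_in_table_2 = perturb_cell(list(reversed(cell)))
```

Because the noise is derived from the record keys, the same group of records receives the same perturbed count in every table in which it appears, whatever order the records are processed in.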

Results

• What is the effect on statistical quality of the data?
– Tendency to increase correlations?
– Tendency to distort distance metrics?
– etc (many ways to measure infoloss)

• Impact on disclosure risk

• Examine different types of data

Results

• Only Over-Imputation, Record Swapping and Record Swapping with SCA have been evaluated so far.

• Both targeted and random approaches are being looked at.

• Note there are different ways of carrying out swapping and imputation, so interpretation of the results should take this into account.

Data for Analysis

• SJ EA; approx. 200,000 households and 500,000 persons

• Four census tables so far:
(1) Country of birth by religion by sex: individuals at ward level
(2) Number of persons by accommodation type: households at OA and ED level
(3) Age by religion by gender: individuals at OA and ED level
(4) Origin-destination table: flows between home and travel to work location

Measures of Quality

• Impact on Tests for Independence:

Cramer’s V measure of association: $CV = \sqrt{\dfrac{\chi^2 / n}{\min(R-1,\ C-1)}}$, where $\chi^2$ is the Pearson chi-square statistic

$RCV(D_{pert}, D_{orig}) = 100 \times \dfrac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})}$

Also, the same measure for entropy and the Pearson Statistic

• Variance of Cell Counts:

For each row $k$: $V(D^k_{orig}) = \dfrac{1}{n_k - 1} \sum_{c \in k} \left( D^k_{orig}(c) - \bar{D}^k_{orig} \right)^2$

and $VR(D^k_{pert}, D^k_{orig}) = \dfrac{V(D^k_{pert})}{V(D^k_{orig})}$

Averaged over the $n_r$ rows: $\dfrac{1}{n_r} \sum_{k=1}^{n_r} VR(D^k_{pert}, D^k_{orig})$
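The quality measures on this slide can be sketched in Python as below. The function names are hypothetical, tables are plain lists of rows, and the sketch assumes a two-way table with no zero margins and no row whose counts are all equal (which would give a zero variance). The variance ratio is taken as perturbed over original, averaged across rows.

```python
from math import sqrt
from statistics import variance

def cramers_v(table):
    """Cramer's V for a two-way table given as a list of rows."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected   # Pearson chi-square
    return sqrt((chi2 / n) / min(len(row_tot) - 1, len(col_tot) - 1))

def rcv(pert, orig):
    """Percent relative difference in Cramer's V between perturbed and original."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

def variance_ratio(pert, orig):
    """Average over rows of (variance of perturbed counts / variance of original counts)."""
    ratios = [variance(p) / variance(o) for p, o in zip(pert, orig)]
    return sum(ratios) / len(ratios)

original = [[10, 2], [3, 9]]
perturbed = [[9, 3], [3, 9]]
change_in_association = rcv(perturbed, original)
change_in_variance = variance_ratio(perturbed, original)
```

A negative `rcv` means the perturbation has weakened the association in the table, and a `variance_ratio` below 1 means the cell counts within rows have become more alike.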

Measures of Utility

• Impact on Rank Correlations:

Sort original cell counts and define deciles $v_{orig}$. Repeat on perturbed cell counts to give $v_{pert}$.

$RC = \dfrac{100}{n_r} \sum_{k=1}^{n_r} I\left( v^k_{orig} \neq v^k_{pert} \right)$

where $I$ is the indicator function and $n_r$ the number of rows

• Log Linear Analysis:

Ratio of the deviance (likelihood ratio test statistic) between perturbed table and original table for a given model:

$LR = \dfrac{L^2_{pert}}{L^2_{orig}}$
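A minimal sketch of the two utility measures follows, with several assumptions stated plainly: the rank measure is taken to be the percentage of rows whose decile changes after perturbation, ties are broken by position, and the "given model" for the deviance is assumed to be independence in a two-way table. All function names are hypothetical.

```python
from math import log

def decile_groups(counts):
    """Decile (0 to 9) of each count by rank; ties are broken by position."""
    order = sorted(range(len(counts)), key=lambda k: counts[k])
    groups = [0] * len(counts)
    for rank, k in enumerate(order):
        groups[k] = rank * 10 // len(counts)
    return groups

def rank_change(orig, pert):
    """Percentage of rows whose decile differs between original and perturbed counts."""
    v_orig, v_pert = decile_groups(orig), decile_groups(pert)
    changed = sum(a != b for a, b in zip(v_orig, v_pert))
    return 100 * changed / len(orig)

def deviance_independence(table):
    """Likelihood ratio statistic for the independence model in a two-way table."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return 2 * sum(o * log(o * n / (rows[i] * cols[j]))
                   for i, r in enumerate(table)
                   for j, o in enumerate(r) if o > 0)   # zero cells contribute nothing

def deviance_ratio(pert, orig):
    """Deviance of the perturbed table divided by deviance of the original table."""
    return deviance_independence(pert) / deviance_independence(orig)

original_counts = list(range(1, 21))
perturbed_counts = [20] + list(range(2, 20)) + [1]    # smallest and largest counts exchanged
rc = rank_change(original_counts, perturbed_counts)
```

In this example only the first and last rows move to a different decile, so `rc` is 10.0 (2 of 20 rows). A deviance ratio close to 1 indicates that the perturbed table fits the chosen model about as well as the original.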

Impact on Disclosure Risk

[Figure: Disclosure Risk Remaining in Table. Bar chart of the probability of a small cell not being perturbed (scale 0 to 1) for Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20% and Target imput 20%, shown separately for tables 1, 2 and 3.]
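The risk measure plotted here, the probability of a small cell not being perturbed, could be estimated along the following lines. This is a sketch under assumptions: the function name is hypothetical, and treating counts of 1 and 2 as "small" is an assumption exposed as the `small` parameter rather than a definition taken from the slides.

```python
def prob_small_cell_unperturbed(orig_cells, pert_cells, small=(1, 2)):
    """Share of small cells in the original table whose value is unchanged.

    `small` lists the counts treated as small cells (an assumption).
    Returns None when the original table has no small cells.
    """
    pairs = [(o, p) for o, p in zip(orig_cells, pert_cells) if o in small]
    if not pairs:
        return None
    unchanged = sum(o == p for o, p in pairs)
    return unchanged / len(pairs)

original_cells = [1, 2, 5, 1, 0]
perturbed_cells = [1, 0, 5, 2, 0]
risk = prob_small_cell_unperturbed(original_cells, perturbed_cells)
```

Three of the original cells are small (1, 2 and 1) and only the first is left unchanged, so the estimated probability is one third. Values near 1 mean most small cells are published as they are.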

Quality Measures

[Figure: Ratio of Variance (after/before). Bar chart of the change in variance (scale 0.85 to 1.1) for Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20% and Target imput 20%, shown separately for tables 1, 2 and 3.]

Quality Measures

[Figure: Percent Relative Difference in Level of Association. Bar chart of the % change in Cramer's V (scale -14 to 6) for Random swap 20%, Random swap 20% with SCA, Target swap 20%, Random imput 20% and Target imput 20%, shown separately for tables 1, 2 and 3.]

Changes to Totals / Subtotals

• Swapping does not change the overall set of household locations: totals and subtotals by geography preserved

• Over-Imputation does change set of locations: totals and subtotals by geography not preserved

• Swapping has no impact on Origin-Destination total flows – NO PROTECTION

• Over-Imputation does not preserve O/D total flows – POOR QUALITY

Conclusions

• Decide whether to drop over-imputation: test on another EA?

• Quantitative Evaluation to be finished by September ’08

• ABS cell perturbation method currently being evaluated – results are looking good

Further Work

• Setting of parameter values for final method; e.g. level of perturbation

• Protection of microdata samples

• Communal establishments

• Output specification / geography

• System specification, development and testing

Contact Details

[email protected]@ons.gov.uk

Useful links:

www.statistics.gov.uk/census/2011census/producingdata/outputconfidentiality.asp

www.statistics.gov.uk/census2001/discloseprotect.asp