Jörg Drechsler (Institute for Employment Research, Germany) NTTS 2009 Brussels, 20. February 2009 Disclosure Control in Business Data Experiences with Multiply Imputed Synthetic Datasets for the German IAB Establishment Survey




Overview

Background

Multiple imputation for statistical disclosure control

Challenges for real data applications

Some preliminary results

Conclusions/Future Work


SDC for Business Data

Public release of business data is often considered too risky

- Skewed distributions make identification of single units easy

- Information on businesses in the public domain

- High benefits from identifying a single unit

- High probability of inclusion for large establishments

Coarsening and top-coding alone are not sufficient

Standard perturbation methods would have to be applied at a high level

Release of high quality data is very difficult

Multiply imputed synthetic datasets as a possible solution


Partially synthetic datasets (Little 1993)

only potentially identifying or sensitive variables are replaced

Advantages:

- synthesis can be tailored to the records at risk

- approach is applicable to continuous and discrete variables

- modeling tries to preserve the joint distribution of the data


Challenges for real data applications

Missing data

Skip patterns

Logical constraints


Missing Data

Missing data is a common problem in surveys (more than 200 variables with missing values in our survey)

Most SDL techniques cannot deal with missing values

Imputation in two stages for synthetic data:

- multiply impute missing values on stage one

- generate synthetic datasets for each stage-one nest on stage two

New combining rules necessary (Reiter, 2004)
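
The two-stage design yields m nests (one per stage-one imputation), each containing r synthetic datasets. The sketch below shows how point and variance estimates from all m·r datasets might be combined. It assumes the variance estimator has the form T = (1 + 1/m)·b − w/r + ū, which is how the rules of Reiter (2004) are commonly stated; this should be checked against that paper before use. Function and variable names are illustrative only.

```python
from statistics import mean

def combine_two_stage(q, u):
    """Combine estimates from m nests with r synthetic datasets each.

    q[i][j]: point estimate from synthetic dataset j in nest i.
    u[i][j]: estimated variance of that point estimate.
    Returns (point estimate, variance estimate).
    """
    m, r = len(q), len(q[0])
    nest_means = [mean(row) for row in q]      # mean estimate per nest
    q_bar = mean(nest_means)                   # overall point estimate
    # Between-nest variance: uncertainty from imputing missing values.
    b = sum((qi - q_bar) ** 2 for qi in nest_means) / (m - 1)
    # Within-nest variance: uncertainty from the synthesis step.
    w = sum((q[i][j] - nest_means[i]) ** 2
            for i in range(m) for j in range(r)) / (m * (r - 1))
    u_bar = mean(v for row in u for v in row)  # average sampling variance
    t = (1 + 1 / m) * b - w / r + u_bar        # assumed form of T (see text)
    return q_bar, t
```

Because w enters with a negative sign, the estimate can turn out negative for small m and r; how to handle that case is left open here.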


Skip patterns

Joint modeling very difficult for datasets with skip patterns and different types of variables

Imputation by sequential regression (Raghunathan et al., 2001)

- linear models for continuous variables

- logit models for binary variables

- multinomial models for categorical variables

For skip patterns:

- Use logit model to decide if filtered questions are applicable

- Impute values only for records with a positive outcome from the logit model
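
The two-step treatment of filtered questions can be sketched as follows. The coefficients are hypothetical placeholders; in a real application they would come from models fitted to the data, and all names here are illustrative.

```python
import math
import random

def synthesize_filtered(x, logit_coef, lin_coef, sigma, rng):
    """Two-step draw for a filtered (skip-pattern) question.

    x: covariate value for one record.
    logit_coef: (intercept, slope) of the applicability model.
    lin_coef: (intercept, slope) of the linear model for the answer.
    Returns None when the question is drawn as not applicable.
    """
    # Step 1: the logit model decides whether the question applies.
    p = 1.0 / (1.0 + math.exp(-(logit_coef[0] + logit_coef[1] * x)))
    if rng.random() >= p:
        return None  # skipped: no value is imputed
    # Step 2: impute a value only for records with a positive outcome.
    return rng.gauss(lin_coef[0] + lin_coef[1] * x, sigma)

rng = random.Random(1)
values = [synthesize_filtered(x, (-1.0, 0.5), (10.0, 2.0), 1.0, rng)
          for x in range(6)]
```

Records drawn as not applicable keep a structural missing value (`None` here), so the skip pattern stays consistent in the synthetic data.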


Logical constraints

All continuous variables > 0

- Redraw from the model for negative units until the restriction is fulfilled for all units

- Only possible if the truncation point is at the far end of the distribution

- Otherwise, refine the model

Y1 > Y2, e.g. total number of employees > number of part-time employees

- x = Y2/Y1

- Z = logit(x)

- Use standard linear model on transformed variable

- Back-transform imputed values to get final values
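
Both constraint strategies can be illustrated with a short sketch. The model parameters (`mu`, `sigma`, `z_mean`, `z_sd`) are assumed to come from an already fitted model, and all function names are illustrative.

```python
import math
import random

def draw_positive(mu, sigma, rng, max_tries=1000):
    """Redraw from N(mu, sigma) until the draw is positive."""
    for _ in range(max_tries):
        y = rng.gauss(mu, sigma)
        if y > 0:
            return y
    # Many rejections suggest the truncation point is not in the tail.
    raise RuntimeError("too many rejections; refine the model")

def logit(x):
    return math.log(x / (1.0 - x))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def draw_part(y1, z_mean, z_sd, rng):
    """Draw Y2 given Y1 so that Y2 stays between 0 and Y1."""
    z = rng.gauss(z_mean, z_sd)  # linear model output on the logit scale
    return y1 * inv_logit(z)     # back-transform: Y2 = Y1 * x
```

The logit is undefined for x = 0 and x = 1, so records with Y2 = 0 or Y2 = Y1 would need separate handling, which this sketch does not cover.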


The IAB Establishment Panel

Annually conducted establishment survey

Since 1993 in Western Germany, since 1996 in Eastern Germany

Population: All establishments with at least one employee covered by social security

Source: Official Employment Statistics

Sample of more than 16,000 establishments in the last wave

Contents: employment structure, changes in employment, investment, training, remuneration, working hours, collective wage agreements, works councils


Synthesis of the IAB Establishment Panel

We synthesize only the 2007 wave

Missing values are imputed for all variables

Roughly 25 variables are synthesized

Combination of key variables and sensitive variables

Key variables: region, industry code, personnel structure,…

Sensitive variables: turnover, investments,…

For data quality evaluation, we only look at the synthesis step

Number of imputations for the synthesis: r=10


Confidence interval overlap

Suggested by Karr et al. (2006)

Measure the overlap of CIs from the original data and CIs from the synthetic data

The higher the overlap, the higher the data utility

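Compute the average relative CI overlap for each estimate k:

$$
J_k = \frac{1}{2}\left(\frac{U_{over,k} - L_{over,k}}{U_{orig,k} - L_{orig,k}} + \frac{U_{over,k} - L_{over,k}}{U_{syn,k} - L_{syn,k}}\right)
$$

where [L_orig,k, U_orig,k] is the CI for the original data, [L_syn,k, U_syn,k] is the CI for the synthetic data, and [L_over,k, U_over,k] is their overlap.

[Figure: the CI for the original data (L_orig to U_orig) and the CI for the synthetic data (L_syn to U_syn) on a common axis, with the overlap running from L_over to U_over.]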
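
A minimal sketch of the overlap measure for a single estimate, with intervals passed as (lower, upper) pairs. The function name is illustrative.

```python
def ci_overlap(orig, syn):
    """Relative CI overlap J_k for one estimate.

    orig, syn: (lower, upper) bounds of the two confidence intervals.
    """
    l_over = max(orig[0], syn[0])  # lower bound of the overlap
    u_over = min(orig[1], syn[1])  # upper bound of the overlap
    width = u_over - l_over        # negative if the intervals are disjoint
    return 0.5 * (width / (orig[1] - orig[0]) + width / (syn[1] - syn[0]))
```

Identical intervals give a value of 1. For disjoint intervals the expression is negative; whether to truncate it at zero is a choice this sketch leaves open.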


Two regression results

Regressions suggested by colleagues at the IAB

First regression:

- dependent variable: part-time yes/no

- probit regression on 19 explanatory variables + industry dummies

Second regression:

- Dependent variable: expected employment trend (decrease, no change, increase)

- ordered probit on 38 variables + industry dummies

Both regressions are computed separately for West and East Germany


| Variable | beta org. | beta syn. | CI overlap J_k | z-score org. | z-score syn. |
|---|---|---|---|---|---|
| Intercept | -0.800 | -0.777 | 0.95 | -7.18 | -6.72 |
| 5-10 employees | 0.448 | 0.449 | 0.95 | 8.63 | 7.78 |
| 10-20 employees | 0.666 | 0.593 | 0.68 | 11.16 | 10.48 |
| 20-50 employees | 0.806 | 0.754 | 0.79 | 13.30 | 12.16 |
| 100-200 employees | 0.904 | 0.874 | 0.92 | 9.37 | 8.87 |
| 200-500 employees | 1.134 | 1.092 | 0.91 | 10.02 | 9.49 |
| >500 employees | 1.675 | 1.580 | 0.88 | 8.28 | 8.00 |
| growth in employment exp. | 0.002 | -0.003 | 0.97 | 0.05 | -0.05 |
| decrease in emp. expected | 0.092 | 0.114 | 0.93 | 1.17 | 1.45 |
| share of female workers | 1.453 | 1.378 | 0.76 | 17.79 | 19.22 |
| share of employees with university degree | 0.314 | 0.372 | 0.90 | 2.14 | 2.71 |
| share of low qualified workers | 1.105 | 1.179 | 0.80 | 12.12 | 12.53 |
| share of temporary employees | -0.310 | -0.139 | 0.75 | -1.65 | -1.12 |
| share of agency workers | -0.492 | -0.514 | 0.96 | -3.11 | -3.24 |
| employment in the last 6 months | 0.388 | 0.370 | 0.90 | 8.21 | 7.86 |
| dismissal in the last 6 months | 0.290 | 0.267 | 0.87 | 6.31 | 5.83 |
| foreign ownership | -0.115 | -0.118 | 0.99 | -1.35 | -1.41 |
| good or very good profitability | 0.034 | 0.034 | 0.99 | 0.86 | 0.85 |
| salary above collective wage agreement | 0.007 | 0.010 | 0.99 | 0.12 | 0.18 |
| collective wage agreement | 0.020 | 0.023 | 0.99 | 0.39 | 0.45 |

Regression results for West Germany

Average CI overlap: 0.89


Regression results for East Germany

Average CI overlap: 0.92

| Variable | beta org. | beta syn. | CI overlap J_k | z-score org. | z-score syn. |
|---|---|---|---|---|---|
| Intercept | -0.725 | -0.675 | 0.89 | -6.53 | -6.02 |
| 5-10 employees | 0.272 | 0.277 | 0.94 | 4.93 | 4.44 |
| 10-20 employees | 0.422 | 0.378 | 0.81 | 7.04 | 6.55 |
| 20-50 employees | 0.554 | 0.537 | 0.93 | 9.42 | 8.87 |
| 100-200 employees | 0.780 | 0.812 | 0.91 | 8.29 | 8.50 |
| 200-500 employees | 0.979 | 0.994 | 0.97 | 8.31 | 8.37 |
| >500 employees | 1.412 | 1.410 | 0.99 | 5.72 | 5.64 |
| growth in employment exp. | -0.034 | -0.040 | 0.97 | -0.62 | -0.73 |
| decrease in emp. expected | 0.040 | 0.042 | 0.99 | 0.51 | 0.54 |
| share of female workers | 1.010 | 1.062 | 0.83 | 12.69 | 15.18 |
| share of employees with university degree | 0.208 | 0.164 | 0.90 | 1.75 | 1.46 |
| share of low qualified workers | 0.970 | 1.057 | 0.81 | 8.39 | 9.05 |
| share of temporary employees | -0.072 | -0.002 | 0.78 | -0.46 | -0.02 |
| share of agency workers | -0.288 | -0.243 | 0.93 | -1.67 | -1.42 |
| employment in the last 6 months | 0.230 | 0.206 | 0.87 | 4.96 | 4.47 |
| dismissal in the last 6 months | 0.300 | 0.296 | 0.98 | 6.43 | 6.36 |
| foreign ownership | -0.166 | -0.178 | 0.97 | -1.73 | -1.87 |
| good or very good profitability | 0.098 | 0.100 | 0.99 | 2.40 | 2.45 |
| salary above collective wage agreement | 0.092 | 0.092 | 1.00 | 1.19 | 1.19 |
| collective wage agreement | 0.097 | 0.072 | 0.87 | 1.88 | 1.42 |


Results for the second regression

Average CI overlap: 0.90

Minimum CI overlap: 0.58


Conclusions

Generating synthetic datasets is difficult and labour intensive

Synthetic datasets can handle many real data problems

Synthetic datasets seem to provide high data quality for our establishment survey

Future Work

More data quality evaluations are necessary

Remaining disclosure risk needs to be quantified (Drechsler & Reiter, 2008)

Long term goal: release complete longitudinal data


Thank you for your attention


Categorical Variables with a low number of observations

Standard approach: Multinomial/Dirichlet model

Covariates can only be incorporated indirectly by applying the model separately for different subgroups of the data

Provides good results for subgroups only if original dataset is large

Small datasets don’t provide enough observations to build models for different subgroups

Alternative: CART models

Suggested by Reiter (2005)
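
For comparison, the standard Multinomial/Dirichlet approach for a single subgroup might look like the sketch below. The flat prior count of 1 per category and the function name are assumptions made for illustration.

```python
import random
from collections import Counter

def synthesize_categorical(values, rng, prior=1.0):
    """Draw synthetic categories from a Dirichlet/multinomial model.

    values: observed categories within one subgroup.
    prior: Dirichlet prior count per category (an assumption here).
    """
    counts = Counter(values)
    cats = sorted(counts)
    # Posterior is Dirichlet(prior + count); sample it via normalised gammas.
    g = [rng.gammavariate(prior + counts[c], 1.0) for c in cats]
    total = sum(g)
    probs = [x / total for x in g]
    # Draw one synthetic value per original record.
    return rng.choices(cats, weights=probs, k=len(values))
```

Covariates enter only through the choice of subgroup passed in as `values`, which is why small datasets leave too few records per subgroup.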


CART Models

Flexible tool for estimating the conditional distribution of a univariate outcome given multivariate predictors

Partition the predictor space to form subsets with homogeneous outcomes

Partitions found by recursive binary splits of the predictors

[Figure: example tree starting at the root, with binary splits X1<3 and X2<5 leading to the leaves L1, L2 and L3.]


CART models for synthesis

Grow a tree using the original data

Define the minimum number of records in each leaf

Prune the tree if necessary

Use partially synthesized data to locate leaf for each unit

Draw new values for each unit by using the Bayesian Bootstrap for each leaf

Difficult to define optimal tree size
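
The last two steps, locating each unit's leaf and drawing replacement values with the Bayesian bootstrap, can be sketched as follows. Growing and pruning the tree are not shown: `leaf_ids` is assumed to hold the leaf of an already fitted tree for every record, and all names are illustrative.

```python
import random

def bayesian_bootstrap(donors, k, rng):
    """Draw k values from donors using Bayesian bootstrap weights."""
    n = len(donors)
    # Gaps between sorted uniforms form a Dirichlet(1, ..., 1) draw.
    cuts = sorted(rng.random() for _ in range(n - 1))
    bounds = [0.0] + cuts + [1.0]
    probs = [bounds[i + 1] - bounds[i] for i in range(n)]
    return rng.choices(donors, weights=probs, k=k)

def synthesize_from_leaves(leaf_ids, y, rng):
    """Replace y by draws from the original values in the same leaf.

    leaf_ids[i]: leaf of the fitted tree that record i falls into.
    y[i]: original value of the variable being synthesized.
    """
    members = {}
    for i, leaf in enumerate(leaf_ids):
        members.setdefault(leaf, []).append(i)
    synthetic = list(y)
    for idx in members.values():
        donors = [y[i] for i in idx]
        # One set of bootstrap weights per leaf, one draw per record.
        draws = bayesian_bootstrap(donors, len(idx), rng)
        for i, value in zip(idx, draws):
            synthetic[i] = value
    return synthetic
```

Synthetic values are always original values from the same leaf, so the minimum leaf size directly controls how many donors each unit can draw from.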