Estimation of the Probit Model From Anonymized Micro Data Gerd Ronning and Martin Rosemann Universität Tübingen & IAW Tübingen UNECE Work Session on Statistical Data Confidentiality, Geneva, 9 – 11 November 2005

Agenda:

• The German anonymization project (see also the earlier presentation by Rainer Lenz)

• Main Results

• Estimation of the probit model from anonymized data

Overview

• German project run jointly by the German Statistical Office and the Institute for Applied Economic Research.

• German law allows the Statistical Office to provide scientific researchers with data which are only moderately anonymized.

• These data are said to satisfy “factual anonymization” (in German “faktische Anonymisierung”).

• They can be seen as scientific-use files.

• The main emphasis of the project is on data from enterprises, for which confidentiality is a more sensitive topic than for data from households.

Two objectives of anonymization

• Anonymization of data has to satisfy two objectives which oppose each other:
– (a) minimization of risk of disclosure,
– (b) minimization of loss of data quality.

• A compromise has to be reached.

• However, factual anonymization has to be guaranteed before we may consider the quality of these data.

• Alternative strategies may be possible and some may lead to a smaller loss of data quality.

The business micro data used in the project

• Kostenstrukturerhebung im Verarbeitenden Gewerbe und Bergbau (1999) (cost structure),

• Umsatzsteuerstatistik (2000) (value added tax)

• Einzelhandelsstatistik (1999) (retail business)

Only partly related:

• IAB-Betriebspanel (IAB panel of firms)

Different masking procedures

• We compare different masking procedures, in particular microaggregation and “addition” of noise (also in a multiplicative manner).

• For discrete variables we consider masking by post-randomization.

• Other masking procedures, in particular data swapping, have been found to imply too much distortion with respect to data quality.

„Corrected“ estimation under anonymization

• We also consider the possibility of correcting the estimation procedures in linear and nonlinear models in such a way that consistent (unbiased) estimators are derived.

• Examples given below.

Two different strategies of anonymization

• „Information reducing procedures“
– Reduction with respect to observational units
– Reduction with respect to certain variables
– Reduction or coarsening of possible outcomes

• „Data modifying procedures“
– Microaggregation
– Noise addition
– Post randomization (PRAM)
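Microaggregation, the first of the data modifying procedures, can be illustrated with a minimal sketch for a single continuous variable. It assumes the simplest variant (sort the values, form groups of at least k neighbours, replace each value by its group mean); the function name `microaggregate`, the group size `k=3` and the simulated `turnover` values are hypothetical and are not taken from the project.

```python
import random
from statistics import mean


def microaggregate(values, k=3):
    """Replace each value by the mean of its group of at least k sorted neighbours."""
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    # Indices of the observations, ordered by the value to be masked.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        # A tail shorter than k is merged into the last full group.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        group_mean = mean(values[i] for i in group)
        for i in group:
            masked[i] = group_mean
        start = end
    return masked


# Hypothetical example data: 20 simulated enterprise turnovers.
random.seed(0)
turnover = [random.lognormvariate(10, 1) for _ in range(20)]
masked_turnover = microaggregate(turnover, k=3)
```

In this variant the overall mean of the variable is preserved, while the variance within each group is removed; this is why estimation from microaggregated data needs an adjusted covariance structure.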

Emphasis of project on „data modifying procedures“

• ... employing, however, some „information reducing procedures“ at the outset.

• For example, regional information was deleted with the exception of „west“ and „east“ of Germany.

• „Data modifying procedures“ have the advantage that their impact on the estimation of stochastic models can be formally analyzed.

• For example

Examples of „corrected“ estimation

• For example, in the linear regression model microaggregation can easily be handled by specifying an adequate covariance structure.

• In case of addition of noise we have ‘errors in variables’, which call for instrumental variable estimation.

• Alternatively we may use the “SIMEX” approach.

• Post-randomization of a binary dependent variable leads to a generalization of the probit model which allows consistent estimation.

Problems of „Corrected“ estimation

• Effect of anonymization of a variable depends ...

• ... on the procedure ...

• ... and whether we use the variable as regressor or as regressand!

• For example, if we post randomize a binary variable, it can be used as dependent variable in the probit model or as a „dummy variable“ in the linear regression model.

• The first case will be discussed below in more detail.

The probit model
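
For reference, the probit model in its standard textbook form is sketched here. The symbols (latent variable y*, regressor vector x, coefficient vector β, standard normal distribution function Φ) are assumptions and need not match the notation of the original slide.

```latex
\begin{align*}
y_i^{*} &= x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1), \\
y_i &= \begin{cases} 1 & \text{if } y_i^{*} > 0, \\ 0 & \text{otherwise,} \end{cases} \\
P(y_i = 1 \mid x_i) &= \Phi(x_i'\beta).
\end{align*}
```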

(Symmetric) Post randomization of binary variable
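
A minimal sketch of symmetric post randomization, assuming that each 0/1 value is kept with probability pi and switched to the other state with probability 1 - pi, with the same pi for both states. The function name `pram_binary`, the value pi = 0.8 and the simulated data are hypothetical.

```python
import random


def pram_binary(y, pi, rng):
    """Keep each 0/1 value with probability pi, flip it with probability 1 - pi."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    return [yi if rng.random() < pi else 1 - yi for yi in y]


# Hypothetical example: a binary variable with about 30 percent ones.
rng = random.Random(42)
y = [1 if rng.random() < 0.3 else 0 for _ in range(10000)]
y_masked = pram_binary(y, pi=0.8, rng=rng)

# Share of observations whose state was actually switched.
share_flipped = sum(a != b for a, b in zip(y, y_masked)) / len(y)
```

Because the switching probability is known to the data provider, it can be passed on to the researcher and used in a corrected estimator.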

ML Estimation of the probit model under PRAM (1)

ML Estimation of the probit model under PRAM (2)

• Consistent Estimation of probit model under PRAM is possible if right hand regressors are left unprotected.

• As we will see, it is also possible to estimate consistently the probit model if only right hand variables are protected by addition of noise.

• However, no satisfactory procedure has been found so far for the most relevant case, in which both the dependent and the independent variables have been anonymized.
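
The first case (dependent variable masked by PRAM, regressors unprotected) can be illustrated with a small simulation. The sketch assumes symmetric PRAM with known retention probability pi, so that the masked outcome equals 1 with probability pi·Φ(x'β) + (1 - pi)·(1 - Φ(x'β)). It uses one regressor without intercept and a plain grid search instead of a numerical optimizer; all names and parameter values are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loglik(beta, x, y_masked, pi):
    """Probit log-likelihood for an outcome masked by symmetric PRAM (pi known)."""
    total = 0.0
    for xi, yi in zip(x, y_masked):
        cdf = norm_cdf(beta * xi)
        # Probability that the *masked* outcome equals 1.
        p = pi * cdf + (1.0 - pi) * (1.0 - cdf)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def grid_mle(x, y_masked, pi, grid):
    """Return the grid value of beta with the largest log-likelihood."""
    return max(grid, key=lambda b: loglik(b, x, y_masked, pi))


# Simulated data (hypothetical): true slope 1.0, retention probability 0.8.
rng = random.Random(7)
true_beta, pi, n = 1.0, 0.8, 4000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if true_beta * xi + rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]
y_masked = [yi if rng.random() < pi else 1 - yi for yi in y]

grid = [i / 20 for i in range(61)]  # candidate slopes 0.00, 0.05, ..., 3.00
beta_naive = grid_mle(x, y_masked, 1.0, grid)     # ignores the masking
beta_corrected = grid_mle(x, y_masked, pi, grid)  # accounts for PRAM
```

Setting pi = 1 reproduces the ordinary probit likelihood, so the naive and the corrected estimator differ only in the probability assigned to the masked outcome.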

Addition of noise (in the linear model)

• Additive error

• Multiplicative error
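
The two variants are commonly written as follows. The notation is an assumption (x is the original value, x^a the anonymized value, u and w are noise terms independent of x) and may differ from the authors' own.

```latex
\begin{align*}
\text{additive:} \quad & x_i^{a} = x_i + u_i, & E(u_i) &= 0, & \operatorname{Var}(u_i) &= \sigma_u^2, \\
\text{multiplicative:} \quad & x_i^{a} = x_i \cdot w_i, & E(w_i) &= 1, & \operatorname{Var}(w_i) &= \sigma_w^2.
\end{align*}
```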

Estimation under (additive) noise of regressor

• Inconsistency of estimate:

• Estimate from SIMEX procedure (adding error on purpose):
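
For the simple linear regression with one regressor, both points can be written down under the classical errors-in-variables assumptions. The notation is assumed here: the regressor x with variance σ_x² is observed as x^a = x + u, where the noise u has variance σ_u² and is independent of x and of the regression error.

```latex
\begin{align*}
\operatorname{plim} \hat\beta_{\text{naive}}
  &= \beta \, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}
  \quad (\text{attenuation towards zero}), \\
\operatorname{plim} \hat\beta(\lambda)
  &= \beta \, \frac{\sigma_x^2}{\sigma_x^2 + (1 + \lambda)\,\sigma_u^2}
  \quad (\text{additional noise with variance } \lambda \sigma_u^2), \\
\operatorname{plim} \hat\beta(-1) &= \beta .
\end{align*}
```

The naive estimate corresponds to λ = 0. Estimates can only be computed for λ ≥ 0, so the value at λ = -1 has to be obtained by extrapolation.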

Extrapolation in the SIMEX procedure
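
The extrapolation step can be sketched for a simple linear regression with additive noise of known standard deviation `sigma_u` in the regressor. The sketch assumes only three noise levels (lambda = 0, 1, 2) and a quadratic extrapolation function; practical implementations usually use a finer grid of lambda values and offer several extrapolation functions. All names and parameter values are hypothetical.

```python
import random


def ols_slope(x, y):
    """OLS slope of y on x in a regression with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simex_slope(x_noisy, y, sigma_u, rng, replications=20):
    """SIMEX with lambda in {0, 1, 2} and quadratic extrapolation to lambda = -1."""
    slopes = [ols_slope(x_noisy, y)]  # lambda = 0: the naive estimate
    for lam in (1.0, 2.0):
        extra_sd = (lam ** 0.5) * sigma_u  # extra noise with variance lam * sigma_u**2
        sims = []
        for _ in range(replications):
            x_sim = [xi + rng.gauss(0.0, extra_sd) for xi in x_noisy]
            sims.append(ols_slope(x_sim, y))
        slopes.append(sum(sims) / replications)
    f0, f1, f2 = slopes
    # Value at lambda = -1 of the quadratic through (0, f0), (1, f1), (2, f2).
    return 3.0 * f0 - 3.0 * f1 + f2


# Simulated data (hypothetical): true slope 2.0, noise standard deviation 0.5.
rng = random.Random(123)
true_beta, sigma_u, n = 2.0, 0.5, 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_beta * xi + rng.gauss(0.0, 1.0) for xi in x]
x_noisy = [xi + rng.gauss(0.0, sigma_u) for xi in x]

beta_naive = ols_slope(x_noisy, y)
beta_simex = simex_slope(x_noisy, y, sigma_u, rng)
```

The quadratic is only an approximation of the true relation between the estimate and lambda, so the SIMEX estimate reduces the attenuation bias without removing it completely.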

Report on recent work from estimating the probit model from anonymized micro data (1)

• ML estimation of the generalized probit model combined with the SIMEX procedure did not work satisfactorily, even in the case of no post randomization!

• However, estimation of the generalized linear model for the special case representing the probit model gave good results for the case „noise addition but no PRAM“.

• STATA SIMEX procedure!

Report on recent work from estimating the probit model from anonymized micro data (2)

• So far we have no adequate estimation procedure for the case that both the dependent variable is masked by PRAM and the regressor variable(s) is (are) protected by noise addition.

• Note that we consider here a (frequently used) nonlinear model.

• However for linear models correcting estimation procedures seem to work fine.

• See the research report !

Concluding Remarks (Importance of the project) :

• For the first time, simultaneous consideration of confidentiality issues and data quality aspects as seen from the user's side.

• For the first time consideration of impacts of anonymization on statistical inference.

• Use of real data sets from the German Statistical Office.

• Use of modern matching algorithms in simulating scenarios for disclosure. See the earlier presentation by Rainer Lenz!

• Use of commercial data bases for simulating external knowledge.

Future Research

• So far only cross-section data.

• Extension to the case of panel data.

• Multiple imputation as a masking procedure.

• A project will start very soon!