Page 1: Disclosure detection & control in research environments

Felix Ritchie

Page 2: Why are research environments special?

• Little disclosure control on input

• Few limits on processing

• Unpredictable, complex outputs
  – an infinity of “special cases”

⇒ Manual review for disclosiveness required

Page 3: Problems of reviewing research outputs

• Limited application of rules

• How do we ensure
  – consistency?
  – transparency?
  – security?

• How do we do this with few resources?

Page 4: Classifying the research zoo

• Some outputs inherently “safe”
• Some inherently “unsafe”

• Concentrate on the unsafe
  – Focus training
  – Define limits
  – Discourage use

Page 5: Safe versus unsafe

• Safe outputs
  – Will be released unless certain conditions arise
• Unsafe outputs
  – Won’t be released unless demonstrated to be safe

Examples:

Unsafe                    Safe                      Indeterminate
Quantiles                 Linear regression*        Herfindahl indexes
Graphs                    Panel data estimates
Aggregated tables         Estimated covariances*
Cross-product matrices

* = conditions for release apply
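
The classification above lends itself to a simple first-pass triage step before manual review. Below is a minimal sketch of that idea, assuming a plain lookup table; the categories come from the table above, while the dictionary and function names are illustrative rather than any actual ONS tool.

```python
# Illustrative first-pass triage of requested output types, using the
# safe/unsafe/indeterminate classification from the table above.
# Names and defaults are hypothetical, not an ONS system.

SAFE = "safe"                    # released unless specific conditions arise
UNSAFE = "unsafe"                # withheld unless demonstrated to be safe
INDETERMINATE = "indeterminate"  # needs case-by-case judgement

OUTPUT_CLASSES = {
    "quantiles": UNSAFE,
    "graphs": UNSAFE,
    "aggregated tables": UNSAFE,
    "cross-product matrices": UNSAFE,
    "linear regression": SAFE,        # * conditions for release apply
    "panel data estimates": SAFE,
    "estimated covariances": SAFE,    # * conditions for release apply
    "herfindahl indexes": INDETERMINATE,
}

def triage(output_type: str) -> str:
    """Return the default review category for a requested output type."""
    return OUTPUT_CLASSES.get(output_type.strip().lower(), INDETERMINATE)

if __name__ == "__main__":
    for request in ["Quantiles", "Linear regression", "Herfindahl indexes"]:
        print(request, "->", triage(request))
```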

Page 6: Determining safety

• Key is to understand whether the underlying functional form is safe or unsafe

• Each output type assessed for risk of
  – Primary disclosure
  – Disclosure by differencing

Page 7: Example: linear aggregates of data are unsafe

• Inherent disclosiveness:
  f(X) = Xc
  – each data point needs to be assessed for threshold/dominance limits
  ⇒ resource problem for large datasets

• Disclosure by differencing:
  f(X) − f(Y) = f(X − Y)
  – differencing is feasible
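
The per-cell assessment this implies can be sketched briefly. The check below applies a frequency threshold and a largest-unit dominance test to the contributions behind a single published aggregate; the threshold of 3 contributors and the 45% share limit (figures mentioned later in this deck) are used purely as illustrative parameters, not as the rules any particular facility applies.

```python
# Minimal sketch of the per-cell checks implied by f(X) = Xc being disclosive:
# every released cell needs a frequency threshold and a dominance test.
# The limits below are illustrative parameters only.

def cell_is_safe(contributions, min_count=3, max_share=0.45):
    """Check one cell total against simple threshold and dominance rules.

    contributions: the individual unit values summed into the cell.
    """
    total = sum(contributions)
    if len(contributions) < min_count:
        return False                              # threshold rule fails
    if total > 0 and max(contributions) / total >= max_share:
        return False                              # dominance rule fails
    return True

if __name__ == "__main__":
    print(cell_is_safe([120.0, 80.0, 75.0]))   # True: 3 units, largest ~44%
    print(cell_is_safe([500.0, 20.0, 15.0]))   # False: one unit dominates
    print(cell_is_safe([300.0, 250.0]))        # False: too few contributors
```

This is also where the resource problem bites: for a large dataset every published cell, and every cell it could be differenced against, needs the same treatment.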

Page 8: Example: linear regression coefficients are safe

• Let f(X, y) = β̂ = (X'X)⁻¹X'y
  – f(X, y) ≠ Xc
  ⇒ can’t identify a single data point

• But f(X, y) − f(Z, z) ≠ f(X − Z, y − z)
  ⇒ no risk of differencing

• Exceptions
  – All right-hand variables public and an excellent fit (easily tested, can generate automatic limits on prediction)
  – All observations on a single person/company
  – Must be a valid regression
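
The first exception is described as easily tested. A minimal sketch of one way such a test might look, assuming an R²-based rule with an arbitrary 0.99 cut-off; the cut-off and the function names are illustrative, not the procedure the deck refers to.

```python
import numpy as np

# Sketch of the "public regressors + excellent fit" exception: if every
# right-hand-side variable is public and the fit is near-perfect, the fitted
# values X @ beta effectively reproduce the confidential y.
# The 0.99 R-squared cut-off is an assumption for illustration.

def ols_fit(X, y):
    """Return OLS coefficients (X'X)^(-1) X'y and the R-squared of the fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

def coefficients_releasable(X, y, rhs_all_public, r2_limit=0.99):
    """Flag the exception case: all regressors public plus an excellent fit."""
    _, r2 = ols_fit(X, y)
    if rhs_all_public and r2 > r2_limit:
        return False            # predictions would effectively disclose y
    return True

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
    y_noisy = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=5.0, size=50)
    y_exact = X @ np.array([1.0, 2.0, -0.5])
    print(coefficients_releasable(X, y_noisy, rhs_all_public=True))  # True
    print(coefficients_releasable(X, y_exact, rhs_all_public=True))  # False
```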

Page 9: Example: cross-product/variance-covariance matrices

• Cross-product matrix M = X'X is unsafe
  – Frequencies/totals identified by interaction with the constant
  – And for any other categorical variables

• What about variance-covariance matrices?
  – For β̂ = (X'X)⁻¹X'y, the unweighted V = (X'X)⁻¹σ̂² is unsafe – it can be inverted to produce M
  – But in the more general case V = (X'WX)⁻¹ Z'Z (X'WX)⁻¹ σ̂²
  – Can’t create a table for X unless Z = X and W = I
  ⇒ weighted covariance matrix is safe
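
A small numerical sketch of the two points above: with a constant column in X, the cross-product matrix hands back the observation count and the column totals directly, and an unweighted coefficient covariance matrix can be inverted to recover it. The data and the σ̂² value are made up for illustration.

```python
import numpy as np

# Why M = X'X is unsafe when a constant is included: the constant-by-constant
# entry is the frequency, and the constant-by-variable entries are the column
# totals (the same holds for any categorical dummy). Data below are made up.

rng = np.random.default_rng(1)
n = 6
turnover = rng.uniform(10.0, 500.0, size=n)         # confidential magnitudes
employment = rng.integers(1, 50, size=n).astype(float)

X = np.column_stack([np.ones(n), turnover, employment])  # constant first
M = X.T @ X                                              # cross-product matrix

print(M[0, 0])                     # = n, the cell frequency
print(M[0, 1], turnover.sum())     # = total turnover, a disclosive aggregate
print(M[0, 2], employment.sum())   # = total employment

# The unweighted coefficient covariance V = sigma^2 (X'X)^(-1) can be inverted
# to recover M (exactly if sigma^2 is known, up to scale otherwise).
sigma2 = 2.5
V = sigma2 * np.linalg.inv(M)
M_recovered = sigma2 * np.linalg.inv(V)
print(np.allclose(M_recovered, M))  # True
```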

Page 10: Example: Herfindahl indices

• Composite index of industrial concentration:
  H = Σᵢ sᵢ²,  where sᵢ = xᵢ / Σⱼ xⱼ

• Safe as long as at least 3 firms in the industry?
• No:
  – Quadratic term exacerbates dominance
  – If the second-largest share is much smaller, H ≈ (share of largest firm)²
  – Standard dominance rule of largest unit < 45% share doesn’t prevent this

• Current tests for safety not very satisfactory
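
A small numerical sketch of the dominance problem: with a largest share of 44% (inside the 45% rule) and the remainder spread over many small firms, the square root of the published index is already very close to the largest firm's share. The industry figures are invented.

```python
import numpy as np

# Herfindahl index H = sum_i s_i^2 with shares s_i = x_i / sum_j x_j.
# When the second-largest share is much smaller than the largest,
# H ~ s_1^2, so sqrt(H) effectively reveals the dominant firm's share.

def herfindahl(x):
    shares = np.asarray(x, dtype=float)
    shares = shares / shares.sum()
    return float((shares ** 2).sum())

# Largest firm: 44% of industry sales (passes a <45% dominance rule);
# the remaining 56% is split across 80 small firms. Figures are invented.
sales = [440.0] + [7.0] * 80

H = herfindahl(sales)
top_share = 440.0 / sum(sales)

print(H)           # the index that would be published
print(top_share)   # 0.44
print(H ** 0.5)    # ~0.444: the dominant share is effectively disclosed
```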

Page 11: Questions?

Felix Ritchie

Microdata Analysis and User Support

Office for National Statistics

[email protected]

+44 1633 45 5846