Disclosure detection & control in research environments
Felix Ritchie
Why are research environments special?
• Little disclosure control on input
• Few limits on processing
• Unpredictable, complex outputs: an infinity of “special cases”
Manual review for disclosiveness required
Problems of reviewing research outputs
• Limited application of rules
• How do we ensure
– consistency?
– transparency?
– security?
• How do we do this with few resources?
Classifying the research zoo
• Some outputs inherently “safe”
• Some inherently “unsafe”
• Concentrate on the unsafe
– Focus training
– Define limits
– Discourage use
Safe versus unsafe
• Safe outputs
– Will be released unless certain conditions arise
• Unsafe outputs
– Won’t be released unless demonstrated to be safe
Examples (* = conditions for release apply):

Unsafe                    Safe                      Indeterminate
Quantiles                 Linear regression*        Herfindahl indices
Graphs                    Panel data estimates
Aggregated tables         Estimated covariances*
Cross-product matrices
Determining safety
• Key is to understand whether the underlying functional form is safe or unsafe
• Each output type assessed for risk of
– Primary disclosure
– Disclosure by differencing
Example: linear aggregates of data are unsafe
• Inherent disclosiveness: $f(X) = Xc$
– Each data point needs to be assessed for threshold/dominance limits
– => resource problem for large datasets
• Disclosure by differencing: $f(X) + f(Y) = f(X + Y)$
– Differencing is feasible
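The differencing risk for linear aggregates can be sketched in a few lines of Python. This is only an illustration: the firm names and turnover figures are invented, and the `min_units` and `max_share` parameters are assumed limits chosen for the example, not rules stated in these slides.

```python
def total(values):
    """Linear aggregate: an unweighted sum of the data points."""
    return sum(values)


def passes_limits(values, min_units=3, max_share=0.45):
    """Illustrative threshold and dominance check for one published total."""
    values = list(values)
    if len(values) < min_units:
        return False
    return max(values) / sum(values) < max_share


# Hypothetical turnover figures, invented for illustration.
turnover = {"A": 120.0, "B": 340.0, "C": 95.0, "D": 410.0, "E": 60.0}
without_e = [v for firm, v in turnover.items() if firm != "E"]

# Each total passes the illustrative limits when assessed on its own.
both_pass = passes_limits(turnover.values()) and passes_limits(without_e)

# Because f(X) + f(Y) = f(X + Y), subtracting the two totals isolates firm E.
recovered_e = total(turnover.values()) - total(without_e)
```

Both totals look acceptable in isolation, yet their difference is exactly firm E's value, which is why linear aggregates need checking against everything else already released.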
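Example: linear regression coefficients are safe
• Let $\hat{\beta} = f(X, y) = (X'X)^{-1}X'y$
– $f(X, y) \neq Xc$: the coefficients are not a weighted sum of the data, so can’t identify a single data point
– Unlike linear aggregates, coefficients from different samples do not add up to the coefficients for the combined sample: no risk of differencing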
• Exceptions
– All right-hand variables public and an excellent fit (easily tested, can generate automatic limits on prediction)
– All observations on a single person/company
– Must be a valid regression
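The "excellent fit" exception can be tested mechanically. The sketch below is a minimal illustration with invented data and a hand-written single-regressor OLS; the helper names `ols` and `fit_summary` are hypothetical and the slides do not specify any particular test or limit.

```python
def ols(x, y):
    """Intercept and slope from regressing y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_summary(x, y):
    """R-squared and largest absolute prediction error for the fitted line."""
    intercept, slope = ols(x, y)
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mean_y = sum(y) / len(y)
    ssr = sum(e ** 2 for e in errors)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ssr / sst, max(abs(e) for e in errors)


# Hypothetical data: x is assumed public, y is assumed confidential.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [5.0, 8.0, 11.0, 14.0, 17.0]   # lies exactly on y = 2 + 3x
y_noisy = [4.0, 9.0, 10.0, 15.0, 17.0]   # same trend with scatter

r2_exact, worst_exact = fit_summary(x, y_exact)
r2_noisy, worst_noisy = fit_summary(x, y_noisy)
```

With the exact data the published coefficients plus the public `x` reproduce every confidential `y`. With scatter the predictions miss by a visible margin, so a limit on R-squared or on prediction error is one plausible way to automate the check.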
Example: cross-product/variance-covariance matrices
• Cross-product matrix $M = X'X$ is unsafe
• Frequencies/totals identified by interaction with constant
• And for any other categorical variables
• What about variance-covariance matrices?
– With $\hat{\beta} = (X'X)^{-1}X'y$, $V = \hat{\sigma}^2 (X'X)^{-1}$ is unsafe: can be inverted to produce $M$
– But in the more general case $V = \hat{\sigma}^2 (X'WX)^{-1} (Z'Z) (X'WX)^{-1}$
– Can’t create a table for $X$ unless $Z = X$ and $W = I$
– Weighted covariance matrix is safe
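To see why the cross-product matrix behaves like a set of tables, consider a small sketch in Python. The data and the column layout (constant, a part-time dummy, income) are invented for illustration.

```python
def cross_product(rows):
    """Return M = X'X for a data matrix given as a list of rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


# Hypothetical microdata, columns: constant, part-time dummy, income.
X = [
    (1, 0, 30.0),
    (1, 0, 42.0),
    (1, 1, 55.0),
    (1, 0, 38.0),
]
M = cross_product(X)

n_obs = M[0][0]               # constant x constant: number of observations
n_part_time = M[0][1]         # constant x dummy: frequency of the category
total_income = M[0][2]        # constant x income: overall total
part_time_income = M[1][2]    # dummy x income: total within the category
```

Here the category contains a single person, so the dummy-by-income entry of `M` is that person's income. The matrix would need the same threshold and dominance checks as any published table.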
Example: Herfindahl indices
• Composite index of industrial concentration: $H = \sum_i s_i^2$, where $s_i = x_i / \sum_i x_i$
• Safe as long as at least 3 firms in the industry?
• No:
– Quadratic term exacerbates dominance
– If second-largest share is much smaller, $\sqrt{H} \approx$ share of largest firm
– Standard dominance rule of largest unit <45% share doesn’t prevent this
• Current tests for safety not very satisfactory
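The dominance problem with the Herfindahl index can be checked numerically. The sketch below uses an invented industry in which the largest firm stays under a 45% share and there are far more than three firms; the figures are assumptions for illustration only.

```python
from math import sqrt


def herfindahl(values):
    """Sum of squared market shares."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)


# Hypothetical industry: one firm with output 44, and 56 firms with output 1.
output = [44.0] + [1.0] * 56

h = herfindahl(output)
largest_share = max(output) / sum(output)

# Gap between the square root of H and the largest firm's true share.
gap = abs(sqrt(h) - largest_share)
```

The square root of `h` lands within one percentage point of the largest firm's share, even though the industry passes both a three-firm threshold and the 45% dominance rule.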
Questions?
Felix Ritchie
Microdata Analysis and User Support
Office for National Statistics
+44 1633 45 5846