Balancing Research & Privacy
E. C. Hedberg
Arizona State University & NORC at the University of Chicago


Page 1

Balancing Research & Privacy

E. C. Hedberg
Arizona State University & NORC at the University of Chicago

Page 2

Today

• Topics for conversation
– Why is external research important
– Data products typically produced
– What is disclosure?
– Methods to avoid disclosure (aggregate tables)
– Methods to avoid disclosure (individual-level data)

• Research results about a common disclosure method

• Pilot work on creation of synthetic data

Page 3

Access

Current state of researcher access:
• It is very hard to get access to SLDS data
• In the 8 years of working in this area, I have had access to 16 states
• Very rare to have this level of access

Page 4

Privacy

• Researchers, educators, and parents are increasingly concerned about which student data elements are recorded and who has access to them

• FERPA regulations set a high bar for the release of information
– Must remove personally identifiable information (PII)

• But what constitutes PII?

– Privacy Technical Assistance Center (PTAC) offers some advice

Page 5

FERPA Section 34 CFR § 99.31(b)(1)

Page 6

Search “student data” in news

Page 7

PTAC Advice

Page 8

Access

• There are a wide variety of interpretations of FERPA
– Some states allow data use through the audit and evaluation exception
– Some states don't allow researchers access at all

Page 9

WHO CARES ABOUT RESEARCH?

Page 10

Research

A valid question is: “Why allow research with state longitudinal data systems (SLDS) at all?”

Page 11

Research

• Premise: Good policy is based on the best available evidence as to:
1. Facts on the ground
2. The mechanisms of achievement
3. The results of previous policies

• If we want to enact good policy, we need to (at least) know these three things

Page 12

Research

• SLDS data provide a good source, and sometimes the only source, of evidence to support positions

• The budgets for national surveys of education achievement are declining
– Research about current mechanisms is more difficult

• States such as Arizona usually make up a small portion of those surveys due to sampling plans

Page 13

Research

• The only way to evaluate facts on the ground in Arizona is with Arizona data

• The only way to evaluate Arizona policies is with Arizona data

• Data from a sample of districts is not necessarily representative.

• A complex, representative sample can be just as expensive as the SLDS

Page 14

Research

• Finally, there is return on investment
• Nationally, over 600 million federal dollars have been invested in SLDS
• Who is going to analyze all this data?
• Much of it can be analyzed by the state…
• … but it is also efficient and prudent to partner with trained researchers.

Page 15

Research Ecosystem

• Arizona SLDS provides a key resource to support policy investigation to improve education for Arizona residents

• Arizona can partner with ASU and UofA researchers

• Researchers, in turn, get credit for their work, get tenure, and provide return on investment for Arizona

Page 16

Research Ecosystem

• However, this ecosystem is based on a risky exchange of information

• Private data, protected by FERPA, is the key resource.

• The safest thing to do is to not collect it, but that cripples the ability of Arizona to use evidence to support policies

Page 17

Key Question

How can we balance research and privacy?

Page 18

Types of data products

• The research ecosystem is supported by several types of data products

Page 19

Types of Data Products

• Aggregated tables
• Individual-level data for research
• Also:
– Research centers (e.g., Texas; http://www.utaustinerc.org)
– Web-based interfaces to analyze data on a server (e.g., Rhode Island; http://ridatahub.org)

Page 20

Disclosure Risk

• So, what are we worried about?
• We are concerned that an “intruder” will be able to identify individuals and obtain sensitive information (score, income level, etc.) about them.
– Identification through the use of published tables
– Identification through access to individual-level data

Page 21

Disclosure Risk

• In survey research, the bar is whether someone who knows a person is in the sample can identify them.

• In administrative data, since (almost) everyone is in the data, the bar for risk is far lower

Page 22

AGGREGATE TABLES

Page 23

Aggregate Tables

• Descriptive tables indicating counts or other statistics broken down by other nominal characteristics

• Each table needs to balance disclosure risk with data utility

Page 24

Example

• Random sample from ECLS
• Reading level by poverty by gender… by race

Page 25

Problem?

Page 26

Problem?

Page 27

Problem?

Page 28

Conceptual Diagram

Taken from Duncan, G. T., Fienberg, S. E., Krishnan, R., Padman, R., & Roehrig, S. F. (2001). Disclosure limitation methods and information loss for tabular data. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, 135-166.

Page 29

Options

• One option is to enact cell suppression
– If a cell in a table is based on n or fewer observations, the cell is suppressed
• This is easy to implement, but has problems
– It is often possible to reproduce the cell count using other cells and marginal totals
– Enacting complementary suppression to avoid such tactics is often complicated and removes more data
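The weakness noted above can be made concrete. The following sketch uses a hypothetical 2x2 count table and threshold; it shows primary suppression, and then how an intruder recovers the suppressed cell from a published row margin by simple subtraction.

```python
# Illustrative sketch of primary cell suppression; the threshold n and
# the table values are hypothetical.
def suppress_small_cells(table, n=5):
    """Replace counts of n or fewer with None (suppressed)."""
    return [[c if c > n else None for c in row] for row in table]

# A 2x2 count table whose row margins are published alongside it.
counts = [[40, 3],
          [25, 30]]
masked = suppress_small_cells(counts)   # [[40, None], [25, 30]]

# With the first row margin known (40 + 3 = 43), the suppressed cell
# is recovered by subtraction -- suppression alone did not protect it.
row_margin = 43
recovered = row_margin - masked[0][0]
assert recovered == counts[0][1]
```

This is exactly why complementary suppression (removing additional, otherwise-publishable cells) is needed, and why it removes more data.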

Page 30

Alternatives to Cell Suppression

• Rounding
– All cells in a table are rounded to mask true values
– Problems:
• Can destroy even more information than cell suppression
• Hard to define the rounding rules
• Tables may be inconsistent
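The inconsistency problem above can be shown with a tiny sketch, assuming a hypothetical base-5 rounding rule and made-up counts: the rounded interior cells no longer sum to the rounded margin.

```python
# Sketch of why rounded tables can be internally inconsistent; the
# base-5 rule and the counts are hypothetical.
def round_base(x, base=5):
    """Round a count to the nearest multiple of base."""
    return base * round(x / base)

cells = [12, 12, 12]                            # true interior cells
margin = sum(cells)                             # true margin: 36
rounded_cells = [round_base(c) for c in cells]  # [10, 10, 10]
rounded_margin = round_base(margin)             # 35

# The rounded cells sum to 30, but the published margin is 35:
# the table no longer adds up.
assert sum(rounded_cells) != rounded_margin
```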

Page 31

Overall

• Data products such as aggregate tables should be vetted by a specialist data auditor
– Pre-specified level of risk discussed
– Procedures such as linear programming are used to analyze cells and quantify risk
– Problems:
• It is an expensive position or service

Page 32

MICRO-DATA FILES

Page 33

Rounding, Perturbing

• One option is to limit small cells by rounding covariates to larger units so that large tables that identify individuals are not possible
– Problems:
• Destroys data
• May limit analyses
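As a minimal sketch of coarsening a covariate to larger units, suppose (hypothetically) exact age in months is reduced to whole years, so cross-tabulations have fewer distinct values and therefore fewer small cells.

```python
# Coarsening sketch: exact age in months becomes whole years.
# The ages are invented for illustration.
def coarsen_to_years(age_months):
    return age_months // 12

ages = [127, 131, 140, 143]                      # four distinct values
coarse = [coarsen_to_years(m) for m in ages]     # [10, 10, 11, 11]
```

Only two distinct values remain, which shrinks tables but also destroys detail an analyst might have needed.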

Page 34

Micro-aggregation

• Individuals are grouped based on nominal groups or through a cluster analysis
• Mean scores are assigned to each group
• Groups are analyzed using weights
• See, e.g.,
– Sande, G. (2001). Methods for data directed microaggregation in one dimension. Proceedings of New Techniques and Technologies for Statistics/Exchange of Technology and Know-how, 18-22.
– Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1), 189-201.
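A heavily simplified one-dimensional version of the idea (fixed group size k, invented scores; the cited papers develop far more careful methods) can be sketched as:

```python
# Minimal one-dimensional micro-aggregation sketch: sort, form groups
# of at least k, and replace each value by its group mean.
def microaggregate(scores, k=3):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())   # keep every group at size >= k
    out = [0.0] * len(scores)
    for g in groups:
        m = sum(scores[i] for i in g) / len(g)
        for i in g:
            out[i] = m
    return out

scores = [480, 500, 510, 520, 540, 560, 600]   # hypothetical test scores
masked = microaggregate(scores)

# Group means replace individual values, but the overall mean is
# preserved exactly (it is a weighted average of the group means).
assert abs(sum(masked) / len(masked) - sum(scores) / len(scores)) < 1e-9
```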

Page 35

CONSEQUENCES OF REDACTION

Page 36

Data Redaction

• One common safeguard for securing privacy is to redact the data of unique individuals

• However, this strategy is harmful to the analysis

Page 37

Data Redaction

• Common practice is to redact “small cells” from data before giving it to researchers
• For each demographic combination within a district:school:grade:
– if 5 or fewer students have that combination (gender, disability status, race/ethnicity, English learner, poverty status), their test scores are removed from the data
• This presents major problems for even basic analyses
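The small-cell rule described above can be sketched as follows; the records, field names, and the exact cell key are hypothetical stand-ins for a real SLDS extract.

```python
# Sketch of the small-cell redaction rule: within each
# district:school:grade, if a demographic combination has 5 or fewer
# students, their test scores are dropped.
from collections import Counter

def redact(records, threshold=5):
    """records: dicts with 'cell' (district, school, grade, demographics)
    and 'score'. Scores in small cells are set to None."""
    sizes = Counter(r["cell"] for r in records)
    return [dict(r, score=None if sizes[r["cell"]] <= threshold else r["score"])
            for r in records]

records = ([{"cell": ("d1", "s1", 5, "F", "white"), "score": 500}] * 6 +
           [{"cell": ("d1", "s1", 5, "M", "black"), "score": 520}] * 3)
out = redact(records)

assert all(r["score"] == 500 for r in out[:6])   # cell of 6: scores kept
assert all(r["score"] is None for r in out[6:])  # cell of 3: scores redacted
```

Note that the whole minority cell loses its scores, which is the mechanism behind the disproportionate loss for minority groups shown later.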

Page 38

Data Redaction Test

• 6 states agreed to participate in a study about the consequences of data redaction
– Names withheld for presentation

• Original data provided that was not redacted
• Analyses performed using original data
• Redaction rules applied
• Reanalysis and comparison of results
• Math and reading, grades 3-8, analyzed

Page 39

5th Graders

Page 40

5th grade redaction rates

Page 41

5th grade redaction rates

• The redaction process can remove up to 35 percent of the data!

• For minority groups, much of the data can be removed.

Page 42

Data redaction Consequences

• Mean differences are exaggerated
• Intraclass correlations increase
• The cause is the removal of heterogeneous schools

Page 43

Bias in mean differences

Page 44

Bias in mean differences

Page 45

Bias in mean differences

Page 46

Group      Correlation
Black      0.45
Hispanic   0.50
Poor       0.65

The level of bias in the mean estimate from the redacted sample is positively correlated with the rate of redaction of that particular group. Unit of analysis: state-subject-grade combinations.

Bias is related to the level of redaction

Page 47

Bias is related to the level of redaction

Page 48

Bias in design parameters

Page 49

Alternatives to Data Redaction

• Hedges and Hedberg have three active grants looking at alternative methodologies to data redaction
– Spencer Foundation pilot grant
– IES methodology grant
– NSF Education and Human Resources grant

• The Spencer grant is completing now; IES and NSF are in data gathering stages

Page 50

Pilot test of synthetic data

• Data from the State of Arkansas, 2010
• Examine 5th grade literacy scores
• Use data with pretests from 4th and 3rd grade

Page 51

Pilot test of synthetic data

• Micro-data with sensitive columns (i.e., test scores)

• Replace sensitive columns with synthetic data that preserves the variation and co-variation with covariates

• Uses a model based approach similar to imputation to produce synthetic test scores
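The model-based idea above can be sketched in miniature. This is not the pilot's actual procedure; it is a hypothetical, heavily simplified stand-in (per-group normal models, invented groups and scores) to show how a synthetic column can preserve the mean and a group gap while containing no real values.

```python
# A minimal model-based synthesis sketch, in the spirit of imputation:
# fit a simple model for the sensitive column, then replace real scores
# with draws from it. All data and the model are invented.
import random
import statistics

random.seed(1)

def synthesize(scores, groups):
    """Replace each score with a draw from its group's fitted normal."""
    model = {}
    for g in set(groups):
        vals = [s for s, gg in zip(scores, groups) if gg == g]
        model[g] = (statistics.mean(vals), statistics.stdev(vals))
    return [random.gauss(*model[g]) for g in groups]

# Hypothetical test scores for two demographic groups with a true gap.
scores = ([random.gauss(500, 50) for _ in range(1000)] +
          [random.gauss(540, 50) for _ in range(1000)])
groups = ["A"] * 1000 + ["B"] * 1000

synth = synthesize(scores, groups)

def gap(vals):
    return statistics.mean(vals[1000:]) - statistics.mean(vals[:1000])

# The synthetic column tracks the overall mean and the group gap,
# even though no individual real score survives.
assert abs(statistics.mean(synth) - statistics.mean(scores)) < 10
assert abs(gap(synth) - gap(scores)) < 15
```

Whether the variance and standard errors survive depends heavily on the model, which is exactly the trade-off the pilot results below explore.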

Page 52

Two different tries

Simple model
• Race, gender, and teacher effects
• Fast to implement

Complex model
• Race, gender, teacher, and district effects
• Pretests
• Race by teacher and race by district effects
• Gender by teacher and gender by district effects

Page 53

Results of Pilot

Page 54

Results of Pilot

• Simple model-based synthetic data estimates the mean well

Page 55

Results of Pilot

Page 56

Results of Pilot

• Simple model-based synthetic data doesn’t do so well on the variance: gross underestimation

Page 57

Results of Pilot

Page 58

Results of Pilot

• Complex model-based synthetic data does OK on estimating the mean

Page 59

Results of Pilot

Page 60

Results of Pilot

• But the complex model-based synthetic data overestimates the variance

Page 61

PILOT TEST ON MEAN DIFFERENCES

Page 62

Results of Pilot

Page 63

Results of Pilot

Page 64

Results of Pilot

• Simple model-based synthetic data underestimates the standard error of the Black/White difference

Page 65

Results of Pilot

Page 66

Results of Pilot

Page 67

Results of Pilot

• Complex model-based synthetic data overestimates the standard error

Page 68

Pilot test of synthetic data

• These are not the only options for models
• Also, there are some technical details about the simulation procedures that we are glossing over; we have more options here as well

Page 69

Alternatives to Data Redaction

• We are examining two other alternatives to data redaction
– Masking, perturbing, and coarsening the data
– NORC’s X-ID system of micro grouping (micro-aggregation; http://xid.norc.org)

Page 70

NORC XID