Really Useful Synthetic Data: A Framework to Evaluate the Quality of Differentially Private Synthetic Data
Christian Arnold (Cardiff University), Marcel Neunhoeffer (University of Mannheim)
June 9th, 2020

Really Useful Synthetic Data - A Framework to Evaluate the Quality …christianarnold.org/wp-content/uploads/2020/07/private... · 2020. 7. 7. · GeneralUtility SpecificUtility Training


Vision


So You Want to Share Sensitive Data...

Use cases, e.g.

• Science: Reproduce studies
• Government: Accountability
• Business: Externalise services

Really Useful Synthetic Data | Microsoft | June 9th, 2020 1


How Do We Currently Solve this Dilemma?

• Don’t publish data.
• Publish results without the data themselves.
• Sign contracts to work with such data.
• Remove ID variables and share anonymized data.
• Privacy-preserving synthetic data as a promising idea.


Synthetic data for sharing statistical information

Principled data protection with differential privacy

⇒ Differentially private synthetic data to protect privacy with principled guarantees while maximising data utility


Core Challenges


DP Data Synthesis is a Multi Phase Inference Problem

[Figure: two diagrams contrasting "Inference usually" with "Inference with DP synthetic data"]


Typical Data Challenges Applied Data Analysts Are Facing

Continuous data: Real values.

Discrete data: Integers.

Structural zeros: Logically impossible combinations of values (e.g. pregnant men).

Missing data: Missing data entries. Leads to biased inference if not handled appropriately.

Nested data: Observations are nested in others, e.g. individuals in groups or observations over time. In effect, observations are not iid.


DP Data Synthesizers Ideally Perform Well in Different Settings

Privacy guarantees: Accepted ε values range from 0.01 to 5.

Sample sizes: Typical sample sizes range from 500 to 100,000.


* savvy scientist builds DP data synthesizer *


Now That We Have DP Synthetic Data, How Can We Know It Is Useful?


A General Benchmark for Differentially Private Data Synthesizers

                            General Utility                              Specific Utility
Training Data Similarity    Marginals: Wasserstein randomisation test    Regression coefficients: Average percent bias
                            Joint distribution: pMSE                     Variances: Average variance and covariance ratio
Generalisation Similarity   Marginals: Wasserstein randomisation test    Regression coefficients: Average percent bias
                            Joint distribution: pMSE                     Variances: Coverage
                                                                         Prediction RMSE w.r.t. further samples from population


Training Data Similarity

General utility

• pMSE for joint distribution
• Wasserstein randomisation test for marginals
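The slides name the pMSE without defining it. As a minimal sketch, assuming the common propensity-score definition: stack original and synthetic rows, fit a classifier that tries to tell them apart, and average the squared distance between each fitted probability and the synthetic share c. The plain gradient-descent logistic regression and the names `propensity_scores` and `pmse` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def propensity_scores(X, is_synthetic, n_iter=2000, lr=0.5):
    """Fit a simple logistic regression by gradient descent (illustrative choice)."""
    X = np.asarray(X, dtype=float)
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0                      # guard against constant columns
    Z = (X - X.mean(axis=0)) / scale             # standardise for stable steps
    Z = np.column_stack([np.ones(len(Z)), Z])    # add intercept
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ w))
        w -= lr * Z.T @ (p - is_synthetic) / len(Z)
    return 1.0 / (1.0 + np.exp(-Z @ w))

def pmse(original, synthetic):
    """Mean squared difference between propensity scores and the synthetic share."""
    X = np.vstack([original, synthetic])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])
    c = len(synthetic) / len(X)
    p = propensity_scores(X, y)
    return float(np.mean((p - c) ** 2))

# Toy data standing in for real and synthetic samples (assumption for illustration).
rng = np.random.default_rng(0)
original = rng.normal(size=(400, 2))
close_synth = rng.normal(size=(400, 2))           # same distribution
far_synth = rng.normal(loc=1.5, size=(400, 2))    # shifted distribution
print(pmse(original, close_synth), pmse(original, far_synth))
```

A pMSE near zero means the classifier cannot separate the two data sets. With equal sample sizes the value is bounded above by 0.25.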

Specific utility

• Average percent bias of regression coefficients
• Average variance and covariance ratio of variances
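The two specific-utility measures can be sketched as follows, under stated assumptions: the same linear regression is fitted to the original and to the synthetic data, percent bias is the absolute relative difference of each coefficient in percent, and the variance and covariance ratio is the element-wise ratio of the two estimated coefficient covariance matrices. The slides give no formulas, so these definitions and the helper names `ols`, `average_percent_bias` and `average_cov_ratio` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS with intercept: returns coefficients and their estimated covariance matrix."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])
    return beta, sigma2 * np.linalg.inv(Xd.T @ Xd)

def average_percent_bias(beta_ref, beta_syn):
    """Mean absolute relative deviation of synthetic from reference coefficients, in %."""
    beta_ref = np.asarray(beta_ref, dtype=float)
    beta_syn = np.asarray(beta_syn, dtype=float)
    return float(np.mean(100.0 * np.abs(beta_syn - beta_ref) / np.abs(beta_ref)))

def average_cov_ratio(cov_ref, cov_syn):
    """Mean element-wise ratio of covariance entries (unstable if an entry is near zero)."""
    return float(np.mean(np.asarray(cov_syn) / np.asarray(cov_ref)))

# Toy data; the "synthetic" set is a noisier draw from the same model (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=1000)
X_syn = rng.normal(size=(1000, 2))
y_syn = 1.0 + X_syn @ np.array([2.0, -1.0]) + rng.normal(scale=1.5, size=1000)

beta_ref, cov_ref = ols(X, y)
beta_syn, cov_syn = ols(X_syn, y_syn)
print(average_percent_bias(beta_ref, beta_syn))
print(average_cov_ratio(cov_ref, cov_syn))
```

A percent bias of 0 and a ratio of 1 would indicate that the synthetic data reproduce the original regression exactly.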


Generalisation Similarity

General utility

• pMSE for joint distribution
• Wasserstein randomisation test for marginals
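One plausible reading of the Wasserstein randomisation test is a permutation test on a single marginal: compute the 1-D Wasserstein distance between the reference and the synthetic values, then compare it with the distances obtained after randomly reshuffling the pooled values. The sketch below follows that reading. The ratio of the observed distance to the median null distance and the function names are assumptions, and in the generalisation setting the reference is assumed to be a fresh sample from the population.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D Wasserstein distance as the area between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.sort(np.concatenate([a, b]))
    widths = np.diff(grid)
    cdf_a = np.searchsorted(a, grid[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, grid[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))

def wasserstein_randomisation_test(reference, synthetic, n_perm=200, seed=0):
    """Return observed distance, ratio to the median null distance, and a p-value."""
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    observed = wasserstein_1d(reference, synthetic)
    pooled = np.concatenate([reference, synthetic])
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)            # break the real/synthetic labels
        null[i] = wasserstein_1d(perm[:reference.size], perm[reference.size:])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, observed / np.median(null), float(p_value)

# Toy marginals (assumption for illustration).
rng = np.random.default_rng(1)
reference = rng.normal(size=300)
print(wasserstein_randomisation_test(reference, rng.normal(size=300)))
print(wasserstein_randomisation_test(reference, rng.normal(loc=2.0, size=300)))
```

A ratio close to 1 means the synthetic marginal is about as far from the reference as a random split of the pooled data would be.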

Specific utility

• Average percent bias of regression coefficients
• Coverage for variances
• Prediction RMSE w.r.t. further samples from population
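Coverage and prediction RMSE can be sketched under the following assumptions: coverage is the share of normal-approximation 95% confidence intervals from synthetic-data regressions that contain a target coefficient value, and prediction RMSE scores a model fitted on synthetic data against a fresh sample from the population. The helper names and the toy data-generating process are hypothetical, and repeated draws from the true model stand in for the output of a DP synthesizer.

```python
import numpy as np

def ols(X, y):
    """OLS with intercept: returns coefficients and their standard errors."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (len(y) - Xd.shape[1])
    return beta, np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))

def coverage_rate(beta_target, betas_syn, ses_syn, z=1.96):
    """Share of intervals beta_syn +/- z * se that contain the target coefficients."""
    betas_syn = np.asarray(betas_syn, dtype=float)
    ses_syn = np.asarray(ses_syn, dtype=float)
    lower, upper = betas_syn - z * ses_syn, betas_syn + z * ses_syn
    return float(np.mean((lower <= beta_target) & (beta_target <= upper)))

def prediction_rmse(beta_syn, X_new, y_new):
    """RMSE of a synthetic-data model on further samples from the population."""
    Xd = np.column_stack([np.ones(len(X_new)), X_new])
    return float(np.sqrt(np.mean((y_new - Xd @ beta_syn) ** 2)))

# Toy population: y = 1 + 2x + noise (assumption for illustration).
rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0])

def draw(n):
    X = rng.normal(size=(n, 1))
    return X, beta_true[0] + beta_true[1] * X[:, 0] + rng.normal(size=n)

fits = [ols(*draw(500)) for _ in range(100)]      # 100 stand-in synthetic data sets
betas = np.array([b for b, _ in fits])
ses = np.array([s for _, s in fits])
X_new, y_new = draw(1000)
print(coverage_rate(beta_true, betas, ses))
print(prediction_rmse(betas[0], X_new, y_new))
```

Coverage near the nominal 95% suggests that uncertainty estimates from the synthetic data remain valid. Coverage far below it signals intervals that are too narrow or biased.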


An Application


A Visualization for All Utility Measures

[Figure: one panel per utility measure. Training: Wasserstein distance ratio, pMSE ratio, Covariance Ratio, Coef. Bias (%). Generalisation: Wasserstein distance ratio, pMSE ratio, Coverage Rate, Coef. Bias (%), Prediction RMSE.]


Outlook


What We Would Like to Do Next

Utility measures

• A utility challenge for submitting DP data synthesizers to OpenDP?
• Include further application-specific utility measures.

Statistical properties of DP data synthesizers

• What are the statistical properties of DP GANs (or other synthesizers)?
• What do we know about statistical validity of DP data synthesis?


Dr Christian Arnold
Cardiff University
@chrisguarnold

Marcel Neunhoeffer
University of Mannheim
@mneunho
