PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao
Overview
The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges
The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes
Function F: Linear vs. Logarithmic
Experiments
Data Release
[Figure: a sensitive database is released to companies, institutes, and the public, where an adversary may also access it.]
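Private Data Release
[Figure: a synthetic database with similar properties is released in place of the sensitive database; the company can still make accurate inferences from it, and only the synthetic data reaches the adversary.]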
How can we design such a private data release algorithm?
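Differential Privacy [TCC’06]
Definition of $\varepsilon$-differential privacy
◦ A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D$ and $D'$ and for any possible synthetic data $D^*$,
$$\Pr[A(D) = D^*] \le \exp(\varepsilon) \cdot \Pr[A(D') = D^*].$$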
Two neighboring datasets that differ only in Frank's record:
Name     Has cancer? (first dataset)   Has cancer? (second dataset)
Alice    Yes                           Yes
Bob      No                            No
Chris    Yes                           Yes
Denise   Yes                           Yes
Eric     No                            No
Frank    Yes                           No
A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual.
More details are in the Preliminaries part of the paper.
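The noise injection described above can be sketched for a single counting query. This is a minimal illustration rather than the paper's code: the helper names `laplace_noise` and `noisy_count`, the fixed seed, and the choice of epsilon = 1.0 are assumptions made for the example. The count has sensitivity 1 because changing one person's answer moves it by at most 1.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, epsilon, rng):
    """Count the True entries and perturb the count with Laplace noise."""
    true_count = sum(records)
    # Changing one person's answer moves the count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
dataset = [True, False, True, True, False, True]    # Alice .. Frank
neighbor = [True, False, True, True, False, False]  # Frank's answer differs
print(noisy_count(dataset, epsilon=1.0, rng=rng))
print(noisy_count(neighbor, epsilon=1.0, rng=rng))
```

With a small epsilon the two noisy counts become hard to tell apart, which is the sense in which the noise covers the impact of one individual.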
Our Target
Design a data release algorithm with a differential privacy guarantee.
Challenges of Private Data Release
To build synthetic data, we need to understand the tuple distribution of the sensitive data.
[Figure: the sensitive database is converted into its full-dimensional tuple distribution; noise is added to obtain a noisy distribution, from which the synthetic database is sampled.]
Example: the database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute:
Scalability: the full distribution has $20^{10} \approx 10^{13}$ cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)
Signal-to-noise: the avg. information in each cell is about $10^{7}/10^{13} = 10^{-6}$, while the avg. noise per cell is on the order of $1/\varepsilon$
Previous solutions suffer from either the scalability or the signal-to-noise problem.
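The arithmetic behind this example can be checked directly. In the sketch below, the values of `epsilon` and the per-cell sensitivity of 1 are assumptions chosen for illustration, not figures taken from the slide; only the tuple count, attribute count, and domain size come from the example.

```python
num_tuples = 10_000_000
num_attributes = 10
values_per_attribute = 20

# Size of the full-dimensional domain: one cell per combination of values.
num_cells = values_per_attribute ** num_attributes
avg_tuples_per_cell = num_tuples / num_cells

epsilon = 0.1                # hypothetical privacy budget
noise_scale = 1.0 / epsilon  # Laplace scale, assuming sensitivity 1 per cell

print(f"cells: {num_cells:.3e}")
print(f"average tuples per cell: {avg_tuples_per_cell:.3e}")
print(f"signal-to-noise ratio: {avg_tuples_per_cell / noise_scale:.3e}")
```

Under these assumptions the per-cell noise exceeds the per-cell signal by many orders of magnitude, which is the signal-to-noise problem the slide refers to.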
PrivBayes: Dimension Reduction
[Figure: instead of adding noise to the full-dimensional tuple distribution, PrivBayes converts the sensitive database into a set of low-dimensional distributions, adds noise to obtain noisy low-dimensional distributions, uses them to approximate the full-dimensional tuple distribution, and samples the synthetic database from that approximation.]
The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise
But how do we find a set of low-dimensional distributions that provides a good approximation to the full distribution?
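Bayesian Network
A 5-dimensional database with attributes age, workclass, education, title, and income.
[Figure: a Bayesian network with edges age → workclass, age → education, workclass → title, and workclass → income.]
$$\Pr[*] \approx \Pr[\mathit{age}] \cdot \Pr[\mathit{work} \mid \mathit{age}] \cdot \Pr[\mathit{edu} \mid \mathit{age}] \cdot \Pr[\mathit{title} \mid \mathit{work}] \cdot \Pr[\mathit{income} \mid \mathit{work}]$$
[Figure: a second network in which workclass depends on age and education, title on education and workclass, and income on workclass and title.]
$$\Pr[*] \approx \Pr[\mathit{age}] \cdot \Pr[\mathit{edu} \mid \mathit{age}] \cdot \Pr[\mathit{work} \mid \mathit{age}, \mathit{edu}] \cdot \Pr[\mathit{title} \mid \mathit{edu}, \mathit{work}] \cdot \Pr[\mathit{income} \mid \mathit{work}, \mathit{title}]$$
The quality of the Bayesian network determines the quality of the approximation.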
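The factorization can be made concrete with a small sketch. The records below are invented toy data, and `conditional` and `network_probability` are hypothetical helper names; the network is a three-attribute fragment of the first network on the slide (age as root, with workclass and education depending on age).

```python
from collections import Counter

def conditional(records, child, parents):
    """Estimate Pr[child | parents] from the records by counting."""
    joint = Counter((r[child], tuple(r[p] for p in parents)) for r in records)
    parent_totals = Counter(tuple(r[p] for p in parents) for r in records)
    return {key: count / parent_totals[key[1]] for key, count in joint.items()}

def network_probability(record, network, tables):
    """Approximate Pr[record] as the product of the network's conditionals."""
    prob = 1.0
    for child, parents in network:
        key = (record[child], tuple(record[p] for p in parents))
        prob *= tables[child].get(key, 0.0)
    return prob

# Invented toy records, for illustration only.
records = [
    {"age": "young", "work": "private", "edu": "BSc"},
    {"age": "young", "work": "private", "edu": "BSc"},
    {"age": "old", "work": "gov", "edu": "PhD"},
    {"age": "old", "work": "private", "edu": "BSc"},
]
# Each entry is (attribute, parents); age has no parent.
network = [("age", ()), ("work", ("age",)), ("edu", ("age",))]
tables = {child: conditional(records, child, parents) for child, parents in network}

for record in (records[0], records[2]):
    print(record, network_probability(record, network, tables))
```

The first record is reproduced exactly (0.5, its true frequency), while the third gets 0.125 instead of its true frequency 0.25, because this network ignores the dependence between workclass and education. That gap is what a better network would reduce.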
Outline of the Algorithm
STEP 1: Choose a suitable Bayesian network
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by the network
◦ straightforward to do under differential privacy
◦ inject noise (Laplace mechanism)
STEP 3: Generate synthetic data by sampling from the network
◦ post-processing: no privacy issues
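Steps 2 and 3 can be sketched for a single low-dimensional distribution. This is a simplified illustration under stated assumptions, not the paper's procedure: the helper names are hypothetical, the whole budget `epsilon` is spent on one histogram (splitting the budget across several distributions is omitted), and negative noisy counts are simply clipped to zero before normalizing.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # Difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_distribution(values, domain, epsilon, rng):
    """STEP 2 (sketch): noisy histogram over `domain`, renormalized."""
    counts = Counter(values)
    # Replacing one tuple changes two cells by 1 each: L1 sensitivity 2.
    noisy = {v: counts.get(v, 0) + laplace_noise(2.0 / epsilon, rng) for v in domain}
    clipped = {v: max(c, 0.0) for v, c in noisy.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # Every noisy count was negative: fall back to the uniform distribution.
        return {v: 1.0 / len(domain) for v in domain}
    return {v: c / total for v, c in clipped.items()}

def sample(distribution, k, rng):
    """STEP 3 (sketch): draw k synthetic values; this touches no private data."""
    values = list(distribution)
    return rng.choices(values, weights=[distribution[v] for v in values], k=k)

rng = random.Random(0)
workclass = ["private"] * 60 + ["gov"] * 30 + ["self"] * 10  # invented values
dist = noisy_distribution(workclass, ["private", "gov", "self"], 1.0, rng)
print(dist)
print(sample(dist, 5, rng))
```

Sampling only reads the noisy distribution, which is why the slide calls it post-processing.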
Optimal Bayesian Network
Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges,
$$\sum_{(X,Y):\ \text{edge}} I(X,Y), \quad \text{where} \quad I(X,Y) = \sum_{y \in Y} \sum_{x \in X} \Pr[x,y] \log\left(\frac{\Pr[x,y]}{\Pr[x]\,\Pr[y]}\right).$$
This is equivalent to finding the maximum spanning tree, where the weight of edge $(X,Y)$ is the mutual information $I(X,Y)$.
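The mutual information of a candidate edge can be computed from two data columns as follows. This is a minimal sketch: the function name is hypothetical, and the base-2 logarithm is an assumption (the slide does not state the base). The columns reuse the four binary attributes of the toy database shown on the following slides.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X, Y) in bits, estimated from two equally long columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Cells with zero probability contribute nothing and never appear here.
        total += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return total

# Columns of the toy database (rows Alan .. Jack).
A = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
B = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]
C = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
D = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, column in (("B", B), ("C", C), ("D", D)):
    print(f"I(A, {name}) = {mutual_information(A, column):.4f}")
```

Identical columns reach the maximum of 1 bit for binary attributes, and a constant column carries no information about any other attribute.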
Build a Bayesian Network
Build a 1-degree BN for the database:
Name     A  B  C  D
Alan     0  0  0  0
Bob      0  0  0  0
Cykie    1  1  1  0
David    0  0  0  0
Eric     1  1  0  0
Frank    1  1  0  0
George   0  0  0  0
Helen    1  1  1  0
Ivan     0  0  0  0
Jack     1  1  0  0
Build a Bayesian Network
◦ Start from a random attribute
◦ Repeatedly select the next tree edge by its mutual information, choosing among the candidate edges that connect an attribute already in the tree to one outside it
◦ DONE once all four attributes A, B, C, and D are connected
[Figure: the tree over attributes A, B, C, and D grows by one edge per step; the candidate edges are shown with the joint distributions used to compute their mutual information.]
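The construction walked through above is Prim's algorithm for a maximum spanning tree with mutual information as the edge weight. The sketch below is non-private and makes a few assumptions for illustration: the function names are hypothetical, the root is fixed to A rather than chosen at random so the run is repeatable, ties are broken by iteration order, and logarithms are base 2.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X, Y) in bits, estimated from two equally long columns."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((cx[x] / n) * (cy[y] / n)))
        for (x, y), c in joint.items()
    )

def build_tree(columns, root):
    """Grow a maximum spanning tree over the attributes, one edge per step."""
    in_tree = [root]
    remaining = [name for name in columns if name != root]
    edges = []
    while remaining:
        # Candidates connect an attribute in the tree to one outside it.
        parent, child = max(
            ((p, c) for p in in_tree for c in remaining),
            key=lambda e: mutual_information(columns[e[0]], columns[e[1]]),
        )
        edges.append((parent, child))
        in_tree.append(child)
        remaining.remove(child)
    return edges

rows = [  # Alan .. Jack, attributes A, B, C, D
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
    (1, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
]
columns = {name: [row[i] for row in rows] for i, name in enumerate("ABCD")}
print(build_tree(columns, root="A"))
```

Each returned pair is a (parent, child) edge, so every attribute other than the root has exactly one parent, matching the in-degree bound of a 1-degree network.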
$k$-degree Bayesian Network
It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04].
Most approximation algorithms are too complicated to be converted into private algorithms.
In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases.
In this talk, we focus on the 1-degree case for simplicity.
Private Bayesian Network
Do it under differential privacy!
◦ (Non-private) select the edge with maximum $I$
◦ (Private) $I$ is data-sensitive -> the best edge is also data-sensitive
Solution: randomized edge selection!
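Exponential Mechanism [FOCS’07]
Given databases $D$ and edges $e$, define a quality function $q(D, e)$: how good edge $e$ is as the result of the selection, given database $D$.
Return $e$ with probability
$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D,e)}{\Delta(q)}\right), \quad \text{where} \quad \Delta(q) = \max_{D, D', e} \lVert q(D,e) - q(D',e) \rVert_1 .$$
In the exponent, $q(D,e)$ is the info and $\Delta(q)$ is the noise.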
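A minimal sketch of this selection rule follows. The function name, the example quality scores, and the sensitivity value 0.1 are assumptions chosen for illustration; in practice the sensitivity must be a proven bound for the quality function in use.

```python
import math
import random

def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng):
    """Pick one candidate with Pr[e] proportional to exp(eps/2 * q(e) / sensitivity)."""
    scores = [quality(c) for c in candidates]
    # Subtracting the best score leaves the probabilities unchanged
    # but keeps exp() from overflowing.
    best = max(scores)
    weights = [math.exp(epsilon / 2.0 * (s - best) / sensitivity) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
# Hypothetical quality scores for three candidate edges.
edge_quality = {"A-B": 1.0, "A-C": 0.24, "A-D": 0.0}
picks = [
    exponential_mechanism(list(edge_quality), edge_quality.get, 0.1, 1.0, rng)
    for _ in range(1000)
]
print({edge: picks.count(edge) for edge in edge_quality})
```

Better edges are chosen more often, yet every edge keeps a non-zero chance. Increasing the ratio of epsilon to sensitivity concentrates the choice on the best edge, which is why a large sensitivity hurts.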
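Private Bayesian Network
Do it under differential privacy!
Select edges with the exponential mechanism
◦ define $q(\text{edge}) = I(\text{edge})$
◦ we prove $\Delta(I) = \Theta(\log n / n)$, where $n = |D|$ (Lemma 1)
$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{I(e)}{\log n / n}\right)$$
Here $I(e)$ is the info and $\log n / n$ is the noise.
Problem solved? NO
The sensitivity (noise scale) is too large for $I$.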
Function F
Basic facts
◦ the functions are compared by their range (scale of info) and their sensitivity (scale of noise)
◦ $F$ and $I$ have a strong positive correlation
IDEA: define the score $F$ to agree with $I$ at its maximum values and to interpolate linearly in between
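Function F
$\Pi$: "optimal" distributions over $(X, Y)$ that maximize $I(X,Y)$. $F$ measures how far $\Pr[x,y]$ is from the nearest one:
$$F = -\frac{1}{2} \min_{\Pi:\ \text{optimal}} \lVert \Pr[x,y] - \Pi \rVert_1$$
Range of $F$: $F \le 0$, with $F = 0$ exactly when $\Pr[x,y]$ is itself optimal
Sensitivity of $F$: at most $1/n$, since replacing one of the $n$ tuples changes $\Pr[x,y]$ by at most $2/n$ in $L_1$ distance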
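[Figure: worked example. A joint distribution $\Pr[x,y]$ with cells 0.5, 0.2, and 0.3 ($I = 0.4$) lies at $L_1$ distance 0.4 from one optimal distribution and 1.6 from the other (each has two cells of 0.5 and $I = 1$), so $F = -0.2$.]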
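The worked example can be reproduced for two binary attributes. This sketch assumes the example's joint distribution has cells 0.5 and 0.2 in one row and 0 and 0.3 in the other, a layout inferred from the slide's numbers. It also relies on the fact that, for binary $X$ and $Y$, exactly two distributions maximize the mutual information: the two that put mass 0.5 on each cell of a diagonal. The function name is hypothetical.

```python
def score_f_binary(joint):
    """F for binary X and Y, where joint[x][y] = Pr[X = x, Y = y]."""
    # The two mutual-information-maximizing distributions for binary X and Y.
    optimal = [
        [[0.5, 0.0], [0.0, 0.5]],
        [[0.0, 0.5], [0.5, 0.0]],
    ]

    def l1_distance(p, q):
        return sum(abs(p[i][j] - q[i][j]) for i in range(2) for j in range(2))

    return -0.5 * min(l1_distance(joint, pi) for pi in optimal)

example = [[0.5, 0.2], [0.0, 0.3]]  # assumed layout of the slide's example
print(score_f_binary(example))
print(score_f_binary([[0.5, 0.0], [0.0, 0.5]]))
```

For the example, the distances to the two optimal distributions are 0.4 and 1.6, so the minimum gives a score of about -0.2, in line with the slide. An optimal distribution scores 0, the largest possible value.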
$F$ vs. $I$
[Figure: scatter plot of $F$ against $I$ for random distributions, annotated with their correlation coefficient.]
$F$ vs. $I$
[Figure: comparison of $F$ and $I$ on the Adult dataset.]
Dataset
We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000
Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about
Counting Queries
[Figure: two panels of results for counting queries; in each panel the query set is all marginals of a fixed size.]
Multiple SVMs
[Figure: two panels, "Adult, gender" and "Adult, education". Query: build 4 classifiers.]
Concluding Remarks
Differential privacy can be applied effectively for data release
Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism
Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions
Thanks!
Appendix
Previous Work
Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals
Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but does not help with the signal-to-noise problem
Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domain
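$\Pi$: Optimal Distributions
Assume that $|\mathrm{dom}(X)| \le |\mathrm{dom}(Y)|$. A distribution $\Pr[X, Y]$ maximizes the mutual information between $X$ and $Y$ if and only if
◦ $\Pr[X = x] = 1/|\mathrm{dom}(X)|$, for any $x \in \mathrm{dom}(X)$;
◦ for each $y \in \mathrm{dom}(Y)$, there is at most one $x \in \mathrm{dom}(X)$ with $\Pr[X = x, Y = y] > 0$.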
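Analogy: Logarithmic vs. Linear
◦ two score functions of a real argument, one logarithmic and one linear
◦ neighboring databases correspond to nearby values of the argument
◦ sensitivity (noise): the maximum of the derivative of each function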
Interactive Model
[Figure: a user sends a query and a privacy budget to a differentially private algorithm that runs on the database and returns a noisy answer.]
1. The risk of a privacy breach accumulates after answering multiple queries
2. It requires a specific DP algorithm for every particular query
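Non-interactive Model: Data Release
[Figure: a private data release algorithm spends the privacy budget once to produce synthetic data; queries are then run on the synthetic data and return noisy answers.]
Reusability: only access the sensitive data once
Generality: support most queries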