PrivBayes: Private Data Release via Bayesian Networks


Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

Overview

The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges

The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes

Function $F$: Linear vs. Logarithmic

Experiments


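Data Release

[Figure: a company or institute holds a sensitive database and releases it to the public, which may include an adversary.]

Private Data Release

[Figure: the company converts the sensitive database into a synthetic database with similar properties, which supports accurate inference while protecting against the adversary.]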

How can we design such a private data release algorithm?


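Differential Privacy [TCC’06]

Definition of $\varepsilon$-Differential Privacy
◦ A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D$, $D'$ and for any possible synthetic data $D^*$,

$$\Pr[A(D) = D^*] \le e^{\varepsilon} \cdot \Pr[A(D') = D^*].$$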

Two neighboring datasets (they differ only in Frank's record):

Name     Has cancer?
Alice    Yes
Bob      No
Chris    Yes
Denise   Yes
Eric     No
Frank    Yes

Name     Has cancer?
Alice    Yes
Bob      No
Chris    Yes
Denise   Yes
Eric     No
Frank    No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual.

More details are in the Preliminaries section of the paper.
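The Laplace approach can be illustrated on the two neighboring tables above. This is a minimal sketch, not the paper's code. It assumes a counting query ("how many people have cancer?") with sensitivity 1. The value of `epsilon` and the helper name `laplace_count` are hypothetical choices for illustration.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return float(true_count + rng.laplace(loc=0.0, scale=scale))

# The two neighboring datasets from the slide: only Frank's record differs.
d1 = {"Alice": 1, "Bob": 0, "Chris": 1, "Denise": 1, "Eric": 0, "Frank": 1}
d2 = dict(d1, Frank=0)

count1 = sum(d1.values())  # true answer on the first dataset
count2 = sum(d2.values())  # true answer on its neighbor

rng = np.random.default_rng(0)
epsilon = 0.5  # assumed privacy budget, not a value from the talk
noisy1 = laplace_count(count1, epsilon, rng)
noisy2 = laplace_count(count2, epsilon, rng)
print(count1, count2, noisy1, noisy2)
```

The true counts differ by exactly one, which is what the noise scale `1 / epsilon` is calibrated to hide.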


Our Target

Design a data release algorithm with a differential privacy guarantee.


Challenges of Private Data Release

To build synthetic data, we need to understand the tuple distribution of the sensitive data.

[Figure: sensitive database -> (convert) full-dim tuple distribution -> (+ noise) noisy distribution -> (sample) synthetic database]

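Example: the database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute:

Scalability: the full distribution has $20^{10} \approx 10^{13}$ cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)

Signal-to-noise: the average information in each cell is $10^{7} / 10^{13} = 10^{-6}$ tuples, while the average noise per cell, whose scale grows as $1/\varepsilon$, is orders of magnitude larger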
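The arithmetic behind this example can be checked directly. The sketch below assumes one Laplace draw of scale `1 / epsilon` per cell, whose expected absolute value equals its scale. The value `epsilon = 0.1` is an illustrative assumption.

```python
n_tuples = 10_000_000        # 10M tuples
n_attributes = 10            # dimensions
values_per_attribute = 20

# Size of the full-dimensional domain and the average count per cell.
n_cells = values_per_attribute ** n_attributes
avg_count_per_cell = n_tuples / n_cells

epsilon = 0.1                         # assumed privacy budget
avg_noise_per_cell = 1.0 / epsilon    # E|Laplace(b)| = b, with b = 1 / epsilon

print(f"cells:             {n_cells:.3e}")
print(f"avg count / cell:  {avg_count_per_cell:.3e}")
print(f"avg noise / cell:  {avg_noise_per_cell:.3e}")
print(f"noise / signal:    {avg_noise_per_cell / avg_count_per_cell:.3e}")
```

Under these assumptions the noise exceeds the signal by roughly seven orders of magnitude, which is the signal-to-noise problem the slide describes.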

Previous solutions suffer from either the scalability problem or the signal-to-noise problem.


PrivBayes: Dimension Reduction

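[Figure: instead of converting the sensitive database into a full-dim tuple distribution, adding noise, and sampling, PrivBayes converts it into a set of low-dim distributions, adds noise to obtain noisy low-dim distributions, uses them to approximate the full-dim tuple distribution, and samples the synthetic database from that approximation.]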

The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise

But how do we find a set of low-dim distributions that provides a good approximation to the full distribution?


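Bayesian Network

A 5-dimensional database with attributes age, workclass, education, title, income.

[Figure: a Bayesian network over the five attributes. Each node carries one conditional distribution:]

$\Pr[\text{age}]$, $\Pr[\text{work} \mid \text{age}]$, $\Pr[\text{edu} \mid \text{age}]$, $\Pr[\text{title} \mid \text{work}]$, $\Pr[\text{income} \mid \text{work}]$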

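Multiplying the conditional distributions of the network approximates the full-dimensional distribution:

$$\Pr[*] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$$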

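A network over the same attributes with up to two parents per node gives a different approximation:

$$\Pr[*] \approx \Pr[\text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{work} \mid \text{age}, \text{edu}] \cdot \Pr[\text{title} \mid \text{edu}, \text{work}] \cdot \Pr[\text{income} \mid \text{work}, \text{title}]$$

The quality of the Bayesian network determines the quality of the approximation.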
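To make the factorization concrete, the sketch below builds the approximation Pr[a] · Pr[b | a] · Pr[c | a] for a three-attribute network a -> b, a -> c and measures how far it is from the empirical joint. The records are hypothetical toy data, not data from the talk, and no privacy noise is involved here.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records (a, b, c) over binary attributes.
rows = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1),
        (1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
n = len(rows)

def marginal(cols):
    """Empirical distribution over the given column indices."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return {key: count / n for key, count in counts.items()}

p_a = marginal([0])
p_ab = marginal([0, 1])
p_ac = marginal([0, 2])
p_abc = marginal([0, 1, 2])

def approx(a, b, c):
    """Pr[a] * Pr[b | a] * Pr[c | a] for the network a -> b, a -> c."""
    pa = p_a.get((a,), 0.0)
    if pa == 0.0:
        return 0.0
    return pa * (p_ab.get((a, b), 0.0) / pa) * (p_ac.get((a, c), 0.0) / pa)

cells = list(product((0, 1), repeat=3))
total = sum(approx(*cell) for cell in cells)
l1_error = sum(abs(approx(*cell) - p_abc.get(cell, 0.0)) for cell in cells)
print(total, l1_error)
```

The approximation always sums to 1. The L1 error is zero only when b and c are conditionally independent given a, which is why the choice of network matters.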


Outline of the Algorithm

STEP 1: Choose a suitable Bayesian network $\mathcal{N}$
◦ must be done in a differentially private way

STEP 2: Compute the conditional distributions implied by $\mathcal{N}$
◦ straightforward to do under differential privacy
◦ inject noise: Laplace mechanism

STEP 3: Generate synthetic data by sampling from $\mathcal{N}$
◦ post-processing: no privacy issues
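Steps 2 and 3 can be sketched for the smallest possible network, a single edge A -> B. This is a simplified illustration under stated assumptions, not the paper's procedure. The data are hypothetical, the network is fixed in advance (so STEP 1 is skipped), the whole budget `epsilon` goes to one 2x2 table, and the noise scale `2 / epsilon` assumes that replacing one tuple changes the counts by at most 2 in L1 norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary data for a two-attribute network A -> B.
a = rng.integers(0, 2, size=1000)
b = np.where(rng.random(1000) < 0.9, a, 1 - a)  # B copies A 90% of the time

def noisy_joint(x, y, epsilon, rng):
    """STEP 2: 2x2 joint of (x, y) with Laplace noise, clipped and renormalized."""
    counts = np.zeros((2, 2))
    np.add.at(counts, (x, y), 1)
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)  # negative counts are not meaningful
    return noisy / noisy.sum()

joint = noisy_joint(a, b, epsilon=1.0, rng=rng)
p_a = joint.sum(axis=1)              # noisy Pr[A]
p_b_given_a = joint / p_a[:, None]   # noisy Pr[B | A]

# STEP 3: sample synthetic tuples from the noisy model only (post-processing).
syn_a = rng.choice(2, size=1000, p=p_a)
syn_b = (rng.random(1000) < p_b_given_a[syn_a, 1]).astype(int)
print(joint)
print("agreement of A and B in synthetic data:", (syn_a == syn_b).mean())
```

The sampling step never touches `a` or `b` again, which is why it consumes no additional privacy budget.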

Optimal Bayesian Network

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges,

$$\sum_{(X,Y):\,\text{edge}} I(X,Y),$$

where

$$I(X,Y) = \sum_{y \in Y} \sum_{x \in X} \Pr[x,y] \log\frac{\Pr[x,y]}{\Pr[x]\,\Pr[y]}.$$

This is equivalent to finding the maximum spanning tree, where the weight of edge $(X,Y)$ is the mutual information $I(X,Y)$.

Build a Bayesian Network

Build a 1-degree BN for the database:

Name     A  B  C  D
Alan     0  0  0  0
Bob      0  0  0  0
Cykie    1  1  1  0
David    0  0  0  0
Eric     1  1  0  0
Frank    1  1  0  0
George   0  0  0  0
Helen    1  1  1  0
Ivan     0  0  0  0
Jack     1  1  0  0

Start from a random attribute.

Select the next tree edge by its mutual information: among the candidate edges that connect an attribute already in the tree to one not yet in it, pick the edge with maximum $I$. Repeat until all of A, B, C, D are connected.

[Figure: the network over nodes A, B, C, D grows by one edge per step; each step lists the candidate edges with their joint distributions.]

DONE!

$k$-degree Bayesian Network

It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04].

Most approximation algorithms are too complicated to be converted into private algorithms.

In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases.

In this talk, we focus on the 1-degree case for simplicity.

Private Bayesian Network

Do it under Differential Privacy!

◦ (Non-private) select the edge with maximum $I$
◦ (Private) $I$ is data-sensitive -> the best edge is also data-sensitive

Solution: randomized edge selection!

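Exponential Mechanism [FOCS’07]

[Figure: a quality function $q$ assigns a score to each pair of a database $D$ and an edge $e$.]

Define $q(D, e)$: how good edge $e$ is as the result of the selection, given database $D$.

Return $e$ with probability

$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D,e)}{\Delta(q)}\right),$$

where

$$\Delta(q) = \max_{D, D', e} \left\| q(D,e) - q(D',e) \right\|_1$$

over neighboring databases $D$, $D'$. In the exponent, $q(D,e)$ is the info and $\Delta(q)$ is the noise scale.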
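The selection rule above can be sketched in a few lines. The candidate edges, their scores, and the sensitivity value below are hypothetical placeholders chosen for illustration. Only the sampling formula comes from the slide.

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick index i with Pr[i] proportional to exp(eps/2 * score_i / sensitivity)."""
    scores = np.asarray(scores, dtype=float)
    logits = (epsilon / 2.0) * scores / sensitivity
    logits -= logits.max()  # stabilizes exp() without changing the probabilities
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs)), probs

# Hypothetical candidate edges and quality scores q(D, e).
edges = [("A", "B"), ("A", "C"), ("A", "D")]
scores = [1.0, 0.4, 0.0]

rng = np.random.default_rng(0)
idx, probs = exponential_mechanism(scores, epsilon=1.0, sensitivity=0.1, rng=rng)
print("selected:", edges[idx])
print("selection probabilities:", probs)
```

A larger sensitivity flattens the probabilities toward uniform, so the best edge is selected less reliably. That is the sense in which the sensitivity acts as the noise scale.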

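Do it under Differential Privacy!

Select edges with the exponential mechanism
◦ define $q(\text{edge}) = I(\text{edge})$
◦ we prove that $\Delta(I)$ is on the order of $\log n / n$, where $n$ is the number of tuples in $D$ (Lemma 1)

$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{I(e)}{\log n / n}\right)$$

Here $I(e)$ is the info and $\log n / n$ is the noise scale.

Problem solved? NO

The sensitivity (noise scale) is too large for $I$.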


Function $F$

Basic Facts

[Table: functions $I$ and $F$ compared by range (scale of info) and sensitivity (scale of noise).]

$I$ and $F$ have a strong positive correlation

IDEA: define the score $F$ to agree with $I$ at maximum values and interpolate linearly in-between

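$\Pi$: “optimal” distributions over $X$, $Y$ that maximize $I(X,Y)$

[Figure: how far is $\Pr[x,y]$ from the nearest optimal $\Pi$?]

$$F = -\frac{1}{2} \min_{\Pi:\,\text{optimal}} \left\| \Pr[x,y] - \Pi \right\|_1$$

[Also listed on the slide: the range of $F$ and the sensitivity of $F$.]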

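Example:

[Figure: the joint distribution $\Pr[x,y]$ with non-zero cells 0.5, 0.2, 0.3 has $I = 0.4$. The two optimal distributions, each with two cells of 0.5, have $I = 1$. Their $L_1$ distances to $\Pr[x,y]$ are 0.4 and 1.6, so $F = -\frac{1}{2} \cdot 0.4 = -0.2$.]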
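The numbers in this example can be recomputed. The sketch below handles only the binary case, where the maximum-$I$ joint distributions are the two "diagonal" tables with cells of 0.5. It assumes the zero cell of $\Pr[x,y]$ sits off the diagonal, which is the layout consistent with $F = -0.2$, and it measures $I$ in bits.

```python
import numpy as np

def mutual_information(p):
    """I(X, Y) in bits for a joint distribution given as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)   # column vector of Pr[x]
    py = p.sum(axis=0, keepdims=True)   # row vector of Pr[y]
    mask = p > 0                        # 0 * log(0) is treated as 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# For binary X and Y, the only joint distributions with maximum I.
OPTIMAL = [
    np.array([[0.5, 0.0], [0.0, 0.5]]),
    np.array([[0.0, 0.5], [0.5, 0.0]]),
]

def score_f(p):
    """F = -1/2 * min over optimal Pi of the L1 distance between p and Pi."""
    return -0.5 * min(float(np.abs(p - pi).sum()) for pi in OPTIMAL)

p = np.array([[0.5, 0.2], [0.0, 0.3]])
distances = [float(np.abs(p - pi).sum()) for pi in OPTIMAL]
print("L1 distances:", distances)
print("F =", score_f(p))
print("I =", mutual_information(p))
```

Both optimal tables reach $I = 1$ and $F = 0$, so the two scores agree at the maximum, as the IDEA slide requires.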

[Figure: $F$ vs. $I$ of random distributions, annotated with their correlation coefficient.]


[Figure: $F$ vs. $I$ on the Adult dataset.]

Dataset

We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about

Counting Queries

[Figure: two panels. The query in each panel is the set of all $k$-way marginals, with a different $k$ per panel.]

Multiple SVMs

Query: build 4 classifiers

[Figure panels: Adult, gender; Adult, education]

Concluding Remarks

Differential privacy can be applied effectively for data release

Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions

Thanks!

Appendix

Previous Work

Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals

Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but does not help with the signal-to-noise problem

Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domain

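Optimal Distributions

Assume that $|dom(X)| \le |dom(Y)|$. A distribution $\Pr[X,Y]$ maximizes the mutual information between $X$ and $Y$ if and only if
◦ $\Pr[X = x] = 1/|dom(X)|$, for any $x \in dom(X)$;
◦ for each $y \in dom(Y)$, there is at most one $x \in dom(X)$ with $\Pr[X = x, Y = y] > 0$.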

two score functions for real and

neighboring databases and

Sensitivity (noise) max of derivative and

Analogy: Logarithmic vs. Linear

Interactive Model

[Figure: the user sends a query and a privacy budget to a differentially private algorithm running over the database and receives a noisy answer.]

1. The risk of a privacy breach accumulates after answering multiple queries.

2. It requires a specific DP algorithm for every particular query.
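The accumulation in point 1 follows from sequential composition: answering queries with budgets $\varepsilon_1, \dots, \varepsilon_k$ on the same data costs $\varepsilon_1 + \dots + \varepsilon_k$ in total. A minimal sketch of a budget tracker follows. The class name `PrivacyBudget` and the numbers are hypothetical.

```python
class PrivacyBudget:
    """Tracks the total epsilon spent under sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one query; refuse it if the total budget would be exceeded."""
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += epsilon
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
remaining = [budget.spend(0.4), budget.spend(0.4)]  # two queries fit

try:
    budget.spend(0.4)           # a third query would exceed the budget
    third_query_allowed = True
except ValueError:
    third_query_allowed = False

print(remaining, third_query_allowed)
```

In the interactive model every query draws from this budget, so the number of queries that can be answered is limited.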

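Non-interactive Model: Data Release

[Figure: a private data release algorithm, given a privacy budget, turns the database into synthetic data; queries are then answered from the synthetic data with noisy answers.]

Reusability: only access sensitive data once

Generality: support most queries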
