PrivBayes: Private Data Release via Bayesian Networks Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao


Page 1: PrivBayes: Private Data Release via Bayesian Networks

PrivBayes: Private Data Release via Bayesian Networks

Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

Page 2: PrivBayes: Private Data Release via Bayesian Networks

Overview

The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges

The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes

Function $F$: Linear vs. Logarithmic

Experiments


Page 4: PrivBayes: Private Data Release via Bayesian Networks

Data Release

[Diagram: a sensitive database is released to a company, an institute, and the public, where an adversary may also obtain it.]

Page 5: PrivBayes: Private Data Release via Bayesian Networks

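Private Data Release

[Diagram: a synthetic database with similar properties is released in place of the sensitive database. A company can still draw accurate inferences from it, while an adversary sees only the synthetic data.]

How can we design such a private data release algorithm?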


Page 7: PrivBayes: Private Data Release via Bayesian Networks

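Differential Privacy [TCC’06]

Definition of $\varepsilon$-Differential Privacy
◦ A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D$, $D'$ and for any possible synthetic data $D^*$,

$$\Pr[A(D) = D^*] \le \exp(\varepsilon) \cdot \Pr[A(D') = D^*].$$

Example of two neighboring datasets, which differ only in Frank's record:

Name     Has cancer?
Alice    Yes
Bob      No
Chris    Yes
Denise   Yes
Eric     No
Frank    Yes

Name     Has cancer?
Alice    Yes
Bob      No
Chris    Yes
Denise   Yes
Eric     No
Frank    No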

Page 8: PrivBayes: Private Data Release via Bayesian Networks

Differential Privacy [TCC’06]

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any individual.

More details are in the Preliminaries section of the paper.
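As a rough illustration of the Laplace approach, the sketch below releases a noisy count of the "Has cancer?" column from the example table. The function names are hypothetical, and the sensitivity of 1 is an assumption that holds for a count when neighboring datasets differ in one record.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(records, epsilon, rng):
    """Count the True entries and add Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for r in records if r)
    # Assumption: changing one individual's record moves the count by at most 1.
    sensitivity = 1.0
    return true_count + laplace(sensitivity / epsilon, rng)

# "Has cancer?" column for Alice, Bob, Chris, Denise, Eric, Frank (first table).
has_cancer = [True, False, True, True, False, True]
rng = random.Random(0)
print(noisy_count(has_cancer, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, so the released count reveals less about whether Frank's answer was Yes or No.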

Page 9: PrivBayes: Private Data Release via Bayesian Networks

Our Target

Design a data release algorithm with a differential privacy guarantee.


Page 11: PrivBayes: Private Data Release via Bayesian Networks

Challenges of Private Data Release

To build a synthetic database, we need to understand the tuple distribution of the sensitive data.

[Diagram: sensitive database -> (convert) -> full-dimensional tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database]

Page 12: PrivBayes: Private Data Release via Bayesian Networks

Challenges of Private Data Release

Example: the database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute.

Scalability: the full distribution has $20^{10} \approx 10^{13}$ cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)

Signal-to-noise: the avg. information in each cell is $10^7 / 10^{13} = 10^{-6}$ tuples, while the avg. Laplace noise per cell is on the order of $1/\varepsilon$

Previous solutions suffer from either the scalability or the signal-to-noise problem.
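The arithmetic behind this example can be checked in a few lines. The value of `epsilon` and the per-cell noise scale of `1 / epsilon` are assumptions chosen for illustration, not figures taken from the slide.

```python
n_tuples = 10_000_000
n_attributes = 10
values_per_attribute = 20

# Number of cells in the full-dimensional histogram: 20^10.
cells = values_per_attribute ** n_attributes
# Average number of tuples (signal) per cell.
avg_signal = n_tuples / cells

# Assumption: Laplace noise of scale 1/epsilon is added to each cell;
# the expected absolute value of Laplace(0, b) noise is b.
epsilon = 0.1  # hypothetical privacy budget
avg_noise = 1 / epsilon

print(cells)                    # 10240000000000
print(avg_signal)               # roughly 1e-6 tuples per cell
print(avg_noise / avg_signal)   # noise exceeds signal by about seven orders of magnitude
```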


Page 14: PrivBayes: Private Data Release via Bayesian Networks

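PrivBayes: Dimension Reduction

[Diagram: the direct route is sensitive database -> (convert) -> full-dimensional tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database. PrivBayes instead takes sensitive database -> (convert) -> a set of low-dimensional distributions -> (+ noise) -> noisy low-dimensional distributions -> (approximate) -> full-dimensional tuple distribution -> (sample) -> synthetic database.]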

Page 15: PrivBayes: Private Data Release via Bayesian Networks

PrivBayes: Dimension Reduction

The advantages of using low-dimensional distributions:
◦ easy to compute
◦ small domain -> high signal density -> robust against noise

But how do we find a set of low-dimensional distributions that provides a good approximation to the full distribution?


Page 17: PrivBayes: Private Data Release via Bayesian Networks

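Bayesian Network

A 5-dimensional database with attributes age, workclass, education, title, and income.

[Diagram: a Bayesian network over the five attributes, with edges age -> workclass, age -> education, workclass -> title, workclass -> income. Each node carries one distribution:]

$\Pr[\mathit{age}]$, $\Pr[\mathit{work} \mid \mathit{age}]$, $\Pr[\mathit{edu} \mid \mathit{age}]$, $\Pr[\mathit{title} \mid \mathit{work}]$, $\Pr[\mathit{income} \mid \mathit{work}]$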

Page 18: PrivBayes: Private Data Release via Bayesian Networks

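Bayesian Network

A 5-dimensional database with attributes age, workclass, education, title, and income.

$$\Pr[*] \approx \Pr[\mathit{age}] \cdot \Pr[\mathit{work} \mid \mathit{age}] \cdot \Pr[\mathit{edu} \mid \mathit{age}] \cdot \Pr[\mathit{title} \mid \mathit{work}] \cdot \Pr[\mathit{income} \mid \mathit{work}]$$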

Page 19: PrivBayes: Private Data Release via Bayesian Networks

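Bayesian Network

[Diagram: a network over age, workclass, education, title, and income in which a node may have two parents.]

$$\Pr[*] \approx \Pr[\mathit{age}] \cdot \Pr[\mathit{edu} \mid \mathit{age}] \cdot \Pr[\mathit{work} \mid \mathit{age}, \mathit{edu}] \cdot \Pr[\mathit{title} \mid \mathit{edu}, \mathit{work}] \cdot \Pr[\mathit{income} \mid \mathit{work}, \mathit{title}]$$

The quality of the Bayesian network determines the quality of the approximation.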
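To make the factorization concrete, the sketch below evaluates the joint probability of a tuple as a product of per-attribute conditionals, using the one-parent network age -> work, age -> edu, work -> title, work -> income. The attributes are treated as binary and every probability in `CPT` is a made-up illustrative value, not data from the talk.

```python
from itertools import product

# Parents of each attribute, listed in an order where parents come first.
PARENTS = {
    "age": (),
    "work": ("age",),
    "edu": ("age",),
    "title": ("work",),
    "income": ("work",),
}

# Hypothetical conditional tables: CPT[attr][parent_values][value] = probability.
CPT = {
    "age": {(): {0: 0.6, 1: 0.4}},
    "work": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "edu": {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
    "title": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
    "income": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}},
}

def joint_probability(row):
    """Approximate Pr[row] as the product of each attribute's conditional."""
    p = 1.0
    for attr, parents in PARENTS.items():
        parent_values = tuple(row[q] for q in parents)
        p *= CPT[attr][parent_values][row[attr]]
    return p

# The factorized distribution is a proper distribution: it sums to 1 over all tuples.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product((0, 1), repeat=len(PARENTS))
)
print(joint_probability({"age": 0, "work": 0, "edu": 0, "title": 0, "income": 0}))
print(total)
```

The network stores five small tables instead of one table with a cell for every possible tuple, which is the dimension reduction the slides describe.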


Page 21: PrivBayes: Private Data Release via Bayesian Networks

Outline of the Algorithm

STEP 1: Choose a suitable Bayesian network $\mathcal{N}$
◦ must be done in a differentially private way

STEP 2: Compute the conditional distributions implied by $\mathcal{N}$
◦ straightforward to do under differential privacy
◦ inject noise with the Laplace mechanism

STEP 3: Generate synthetic data by sampling from $\mathcal{N}$
◦ post-processing: no privacy issues
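Steps 2 and 3 can be sketched as follows, under several stated assumptions: the network is taken as given (step 1 is skipped), the attributes are binary, the privacy budget is split evenly across the tables, and each count table is assumed to have L1 sensitivity 2. The toy rows, the column names A to D, and the network are illustrative only; this is a simplified sketch, not the paper's exact procedure.

```python
import random
from collections import Counter
from itertools import product

# Toy binary database; the column names A-D are assumed.
DATA = ["0000", "0000", "1110", "0000", "1100", "1100", "0000", "1110", "0000", "1100"]
ROWS = [dict(zip("ABCD", map(int, s))) for s in DATA]

# Hypothetical 1-degree network as (attribute, parent) pairs in sampling order.
NETWORK = [("A", None), ("B", "A"), ("C", "A"), ("D", "A")]

def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_table(rows, attrs, scale, rng):
    """Noisy count for every cell of the joint table over binary `attrs`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    cells = product((0, 1), repeat=len(attrs))
    # Clip negative noisy counts to zero so they can serve as sampling weights.
    return {c: max(counts[c] + laplace(scale, rng), 0.0) for c in cells}

def release(rows, network, epsilon, n_out, rng):
    # Assumption: replacing one tuple changes two cells of a count table by 1
    # each (sensitivity 2), and each of the len(network) tables gets an equal
    # share of epsilon.
    scale = 2 * len(network) / epsilon
    tables = {}
    for child, parent in network:
        attrs = (child,) if parent is None else (parent, child)
        tables[child] = noisy_table(rows, attrs, scale, rng)  # STEP 2
    synthetic = []
    for _ in range(n_out):  # STEP 3: sample each attribute given its parent
        row = {}
        for child, parent in network:
            prefix = () if parent is None else (row[parent],)
            weights = [tables[child][prefix + (v,)] for v in (0, 1)]
            if sum(weights) == 0:
                weights = [1.0, 1.0]  # fall back to uniform if all mass was clipped
            row[child] = rng.choices((0, 1), weights=weights)[0]
        synthetic.append(row)
    return synthetic

synthetic = release(ROWS, NETWORK, epsilon=1.0, n_out=5, rng=random.Random(0))
print(synthetic)
```

The sampling loop reads only the noisy tables, never the original rows, which is why step 3 counts as post-processing.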

Page 22: PrivBayes: Private Data Release via Bayesian Networks

Optimal Bayesian Network

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges,

$$\sum_{(X,Y):\ \text{edge}} I(X, Y),$$

where

$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} \Pr[x, y] \log\left(\frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}\right).$$

Page 23: PrivBayes: Private Data Release via Bayesian Networks

Optimal Bayesian Network

Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information $I$ of its edges.

This is equivalent to finding the maximum spanning tree, where the weight of edge $(X, Y)$ is the mutual information $I(X, Y)$.

Page 24: PrivBayes: Private Data Release via Bayesian Networks

Build a Bayesian Network

Build a 1-degree BN for the database:

Name     A  B  C  D
Alan     0  0  0  0
Bob      0  0  0  0
Cykie    1  1  1  0
David    0  0  0  0
Eric     1  1  0  0
Frank    1  1  0  0
George   0  0  0  0
Helen    1  1  1  0
Ivan     0  0  0  0
Jack     1  1  0  0

Page 25: PrivBayes: Private Data Release via Bayesian Networks

Build a Bayesian Network

Start from a random attribute.

[Diagram: four nodes A, B, C, D, with no edges selected yet.]

Page 26: PrivBayes: Private Data Release via Bayesian Networks

Build a Bayesian Network

Select the next tree edge by its mutual information.

[Diagram: nodes A, B, C, D with the candidate edges, each annotated with the joint distribution of its two attributes in the example database of Alan through Jack (values shown: 0.5, 0.5; 0.5, 0.2, 0.3; 0.5, 0.5).]


Page 30: PrivBayes: Private Data Release via Bayesian Networks

Build a Bayesian Network

Select the next tree edge by its mutual information.

[Diagram: the completed tree over nodes A, B, C, D.]

DONE!
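The non-private tree construction walked through above can be sketched as a Prim-style maximum spanning tree over mutual information. The column names A to D are an assumption taken from the node labels, the root is fixed rather than random, and ties are broken by iteration order.

```python
from collections import Counter
from math import log2

# Example database from the slides (Alan through Jack); column names assumed.
DATA = ["0000", "0000", "1110", "0000", "1100", "1100", "0000", "1110", "0000", "1100"]
COLUMNS = {name: [int(row[i]) for row in DATA] for i, name in enumerate("ABCD")}

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equally long columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def chow_liu_tree(columns, root):
    """Grow a tree from `root`, always adding the candidate edge with maximum I."""
    in_tree, edges = [root], []
    while len(in_tree) < len(columns):
        candidates = [(p, c) for p in in_tree for c in columns if c not in in_tree]
        best = max(
            candidates,
            key=lambda e: mutual_information(columns[e[0]], columns[e[1]]),
        )
        edges.append(best)
        in_tree.append(best[1])
    return edges

print(mutual_information(COLUMNS["A"], COLUMNS["B"]))  # A and B are identical columns
print(chow_liu_tree(COLUMNS, root="A"))
```

Each returned pair is (parent, child), so the result can be read directly as a 1-degree network.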

Page 31: PrivBayes: Private Data Release via Bayesian Networks

$k$-degree Bayesian Network

It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04].

Most approximation algorithms are too complicated to be converted into private algorithms.

In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases.

In this talk, we focus on 1-degree cases for simplicity.

Page 32: PrivBayes: Private Data Release via Bayesian Networks

Private Bayesian Network

Do it under Differential Privacy!

(Non-private) select the edge with maximum $I$
(Private) $I$ is data-sensitive -> the best edge is also data-sensitive

Solution: randomized edge selection!

Page 33: PrivBayes: Private Data Release via Bayesian Networks

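Exponential Mechanism [FOCS’07]

For databases $D$ and candidate edges $e$, define a quality function $q(D, e)$: how good edge $e$ is as the result of the selection, given database $D$.

Return $e$ with probability

$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta(q)}\right),$$

where

$$\Delta(q) = \max_{D, D', e} \left\| q(D, e) - q(D', e) \right\|_1$$

and $D$, $D'$ range over neighboring databases. In the exponent, $q(D, e)$ is the info and $\Delta(q)$ is the noise.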
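A minimal sketch of this selection rule is shown below. The candidate names, their scores, and the sensitivity value are hypothetical inputs chosen for illustration; only the sampling rule $\Pr[e] \propto \exp(\varepsilon q / (2\Delta))$ comes from the slide.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick a candidate with probability proportional to exp(eps * q / (2 * sensitivity))."""
    top = max(scores.values())
    candidates = list(scores)
    # Subtracting the top score leaves the probabilities unchanged and avoids overflow.
    weights = [
        math.exp(epsilon * (scores[c] - top) / (2 * sensitivity)) for c in candidates
    ]
    return rng.choices(candidates, weights=weights)[0]

# Hypothetical candidate edges with illustrative quality scores.
scores = {"A-B": 1.0, "A-C": 0.24, "A-D": 0.0}
rng = random.Random(0)
print(exponential_mechanism(scores, epsilon=1.0, sensitivity=0.33, rng=rng))
```

With a small sensitivity relative to the score gaps, the best edge is chosen almost always; as the sensitivity grows, the choice approaches uniform, which is the "noise" the slide refers to.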

Page 34: PrivBayes: Private Data Release via Bayesian Networks

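Private Bayesian Network

Do it under Differential Privacy!

Select edges with the exponential mechanism
◦ define $q(\text{edge}) = I(\text{edge})$
◦ we prove $\Delta(I) = \Theta(\log n / n)$, where $n = |D|$. (Lemma 1)

$$\Pr[e] \propto \exp\left(\frac{\varepsilon}{2} \cdot \frac{I(e)}{\log n / n}\right)$$

Here $I(e)$ is the info and $\log n / n$ is the noise.

Problem solved?

NO

The sensitivity (noise scale) is too large for $I$.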


Page 36: PrivBayes: Private Data Release via Bayesian Networks

Basic Facts

The score functions $I$ and $F$ are compared on two measures:
◦ Range (scale of info)
◦ Sensitivity (scale of noise)

$I$ and $F$ have a strong positive correlation.

Page 37: PrivBayes: Private Data Release via Bayesian Networks

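Function $F$

IDEA: define the score $F$ to agree with $I$ at maximum values and interpolate linearly in between.

$\Pi$: “optimal” distributions over $(X, Y)$ that maximize $I(X, Y)$. $F$ measures how far $\Pr[x, y]$ is from the nearest one:

$$F = -\frac{1}{2} \min_{\Pi:\ \text{optimal}} \left\| \Pr[x, y] - \Pi \right\|_1$$

Range of $F$:

Sensitivity of $F$: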

Page 38: PrivBayes: Private Data Release via Bayesian Networks

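Function $F$

$$F = -\frac{1}{2} \min_{\Pi:\ \text{optimal}} \left\| \Pr[x, y] - \Pi \right\|_1$$

Example: the joint distribution $\Pr[x, y]$ with rows $(0.5,\ 0.2)$ and $(0,\ 0.3)$ has $I = 0.4$. The two optimal distributions, with rows $(0.5,\ 0)$, $(0,\ 0.5)$ and rows $(0,\ 0.5)$, $(0.5,\ 0)$, each have $I = 1$. Their $L_1$ distances from $\Pr[x, y]$ are $0.4$ and $1.6$ respectively, so

$$F = -\frac{1}{2} \cdot 0.4 = -0.2.$$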
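The numbers on this slide can be reproduced with a short computation for two binary attributes. The example distribution is taken to have rows (0.5, 0.2) and (0, 0.3), which is an assumption that matches the slide's values of $I \approx 0.4$ and $F = -0.2$; the two optimal distributions are the ones that place mass 0.5 on each cell of a diagonal.

```python
from math import log2

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2-D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

def l1_distance(p, q):
    """L1 distance between two joint distributions of the same shape."""
    return sum(abs(a - b) for rp, rq in zip(p, q) for a, b in zip(rp, rq))

# The two distributions over binary (X, Y) that maximize mutual information.
OPTIMAL = (
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.0, 0.5], [0.5, 0.0]],
)

def score_f(joint):
    """F = -1/2 times the L1 distance to the nearest optimal distribution."""
    return -0.5 * min(l1_distance(joint, pi) for pi in OPTIMAL)

joint = [[0.5, 0.2], [0.0, 0.3]]
print(mutual_information(joint))                  # about 0.4
print([l1_distance(joint, pi) for pi in OPTIMAL])  # about 0.4 and 1.6
print(score_f(joint))                             # about -0.2
```

Both scores rank a distribution by how close it is to an optimal one, but $F$ changes linearly as probability mass moves between cells, while $I$ changes logarithmically.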

Page 39: PrivBayes: Private Data Release via Bayesian Networks

$F$ vs. $I$

[Plot: $F$ against $I$ for random distributions, annotated with their correlation coefficient.]


Page 41: PrivBayes: Private Data Release via Bayesian Networks

$F$ vs. $I$

[Plot: results on the Adult dataset.]

Page 42: PrivBayes: Private Data Release via Bayesian Networks

Dataset

We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about

Page 43: PrivBayes: Private Data Release via Bayesian Networks

Counting Queries

[Plots: two panels of counting-query results, each with the query set to all marginals of a fixed dimensionality.]

Page 44: PrivBayes: Private Data Release via Bayesian Networks

Multiple SVMs

Query: build 4 classifiers

[Plots: two panels, "Adult, gender" and "Adult, education".]


Page 46: PrivBayes: Private Data Release via Bayesian Networks

Concluding Remarks

Differential privacy can be applied effectively for data release

Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions

Thanks!

Page 47: PrivBayes: Private Data Release via Bayesian Networks

Appendix

Page 48: PrivBayes: Private Data Release via Bayesian Networks

Previous Work

Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals

Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but does not help with the signal-to-noise problem

Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domain

Page 49: PrivBayes: Private Data Release via Bayesian Networks

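$\Pi$: Optimal Distributions

Assume that $|dom(X)| \le |dom(Y)|$. A distribution $\Pr[X, Y]$ maximizes the mutual information between $X$ and $Y$ if and only if
◦ $\Pr[X = x] = 1/|dom(X)|$, for any $x \in dom(X)$;
◦ for each $y \in dom(Y)$, there is at most one $x \in dom(X)$ with $\Pr[X = x, Y = y] > 0$.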

Page 50: PrivBayes: Private Data Release via Bayesian Networks

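Analogy: Logarithmic vs. Linear

◦ two score functions of a real variable, one logarithmic and one linear
◦ neighboring databases correspond to two nearby inputs
◦ sensitivity (noise) corresponds to the maximum of the derivative of each function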

Page 51: PrivBayes: Private Data Release via Bayesian Networks

Interactive Model

[Diagram: a user sends a query and a privacy budget to a differentially private algorithm that holds the database, and receives a noisy answer.]

1. The risk of a privacy breach accumulates after answering multiple queries.

2. It requires a specific DP algorithm for each particular query.

Page 52: PrivBayes: Private Data Release via Bayesian Networks

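Non-interactive Model: Data Release

[Diagram: a private data release step, given a privacy budget, turns the database into synthetic data; queries are then run on the synthetic data and return noisy answers.]

Reusability: only access sensitive data once

Generality: support most queries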