Privacy by Learning the Database


Privacy by Learning the Database

Moritz Hardt DIMACS, October 24, 2012

Isn’t privacy the opposite of learning the database?

Curator

Analyst

data set D = multi-set over universe U

query set Q

privacy-preserving structure S, accurate on Q

[Figure: histogram over universe elements 1, 2, …, N]

Data set D as an N-dimensional histogram, where N = |U|

D[i] = # elements in D of type i

Normalized histogram = distribution over the universe

Statistical query q (aka linear/counting):

Vector q in [0,1]^N

q(D) := ⟨q, D⟩

q(D) in [0,1]
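As a concrete illustration (not from the slides; a minimal Python sketch), a statistical query is just an inner product with the normalized histogram:

```python
import numpy as np

def histogram(data, universe_size):
    """Normalized histogram: D[i] = fraction of records of type i."""
    hist = np.zeros(universe_size)
    for x in data:
        hist[x] += 1
    return hist / len(data)

def statistical_query(q, hist):
    """Linear/counting query: q(D) = <q, D> for q in [0,1]^N."""
    return float(np.dot(q, hist))

# Example: universe {0,...,4}; query = indicator of types {1, 3}
D = histogram([1, 1, 3, 0, 2], universe_size=5)
q = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
print(statistical_query(q, D))  # 3 of 5 records have type in {1, 3} -> 0.6
```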

Why statistical queries?

• Perceptron, ID3 decision trees, PCA/SVM, k-means clustering [BlumDworkMcSherryNissim’05]

• Any SQ-learning algorithm [Kearns’98] – includes “most” known PAC-learning algorithms

Lots of data analysis reduces to multiple statistical queries

Curator’s wildest dream:

This seems hard!

Curator’s 2nd attempt:

Intuition: Entropy implies privacy

Two pleasant surprises

Approximately solved by multiplicative weights update [Littlestone89,...]

Can easily be made differentially private

Why did learning theorists care to solve privacy problems 20 years ago?

Answer: Entropy implies generalization

Learner

example set Q

hypothesis h accurate on all examples

Maximizing entropy implies the hypothesis generalizes

Privacy                                    Learning
-----------------------------------------  --------------------------------------------------
Sensitive database                         Unknown concept
Queries labeled by answer on DB            Examples labeled by concept
Synopsis approximates DB on query set      Hypothesis approximates target concept on examples
Must preserve privacy                      Must generalize

How can we solve this?

Concave maximization s.t. linear constraints

Ellipsoid? We’ll take a different route.

Start with uniform D0

“What’s wrong with it?” Query q violates constraint!

Minimize entropy loss s.t. correction

Closed form expression for Dt+1? Well...

Closed form expression for Dt+1? YES!

Relax

Approximate

Think

Multiplicative Weights Update

[Figure, three frames: histograms Dt (current estimate) and D (true data) over universe elements 1 … N, with query q. At step t, suppose q(Dt) < q(D). After step t, Dt has shifted weight toward the coordinates where q is large.]

Multiplicative Weights Update

Algorithm:
  D0 uniform
  For t = 1...T:
    Find bad query q
    Dt+1 = Update(Dt, q)
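The update step left abstract above can be sketched as a standard multiplicative weights rule (a hypothetical minimal implementation; the learning rate `eta` and the toy data are illustrative assumptions, not the tuned values from the analysis):

```python
import numpy as np

def mw_update(d_t, q, direction, eta=0.5):
    """One multiplicative weights step on the distribution d_t.

    direction = +1 if q(D) > q(d_t) (underestimate: boost weight where q
    is large), -1 otherwise. eta is the learning rate.
    """
    d_next = d_t * np.exp(eta * direction * q)
    return d_next / d_next.sum()

# Toy run: true D concentrated on coordinate 2, query = indicator of {2}
true_d = np.array([0.0, 0.0, 1.0, 0.0])
q = np.array([0.0, 0.0, 1.0, 0.0])
d = np.full(4, 0.25)                       # D0 uniform
for _ in range(20):
    direction = 1.0 if q @ true_d > q @ d else -1.0
    d = mw_update(d, q, direction)
print(d)  # mass shifts toward coordinate 2
```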

How quickly do we run out of bad queries?

Progress Lemma: if q is bad, the update makes progress.

Put Ψt = RE(D ‖ Dt) = Σi D[i] log(D[i]/Dt[i])

Facts: Ψ0 ≤ log N, and Ψt ≥ 0 for all t

Progress Lemma: if q bad (|q(Dt) − q(D)| ≥ α), then Ψt − Ψt+1 ≥ Ω(α²)

At most O(log N / α²) steps

Error bound

Algorithm:
  D0 uniform
  For t = 1...T:
    Find bad query q
    Dt+1 = Update(Dt, q)

What about privacy?

Only step that interacts with D

Differential Privacy [Dwork-McSherry-Nissim-Smith-06]

Two data sets D,D’ are called neighboring if they differ in one element.

Definition (Differential Privacy): A randomized algorithm M(D) is called (ε,δ)-differentially private if for any two neighboring data sets D, D’ and all events S:

Pr[ M(D) ∈ S ] ≤ e^ε · Pr[ M(D’) ∈ S ] + δ

Laplacian Mechanism [DMNS’06]

Given query q:
1. Compute q(D)
2. Output q(D) + Lap(1/(ε0 n))

Fact: Satisfies ε0-differential privacy

Note: Sensitivity of q is 1/n
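A minimal sketch of the mechanism, assuming statistical queries of sensitivity 1/n (the concrete numbers are illustrative):

```python
import numpy as np

def laplace_mechanism(q_value, eps0, n, rng=np.random.default_rng()):
    """Answer a statistical query with Laplace noise.

    A statistical query on a database of size n has sensitivity 1/n,
    so noise Lap(1/(eps0 * n)) gives eps0-differential privacy.
    """
    return q_value + rng.laplace(loc=0.0, scale=1.0 / (eps0 * n))

# Example: true answer 0.6 on a database of n = 1000 records
noisy = laplace_mechanism(0.6, eps0=0.5, n=1000, rng=np.random.default_rng(0))
```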

Query selection

q1 q2 q3 … qk

For each query, compute the violation |q(D) − q(Dt)|, add Lap(1/(ε0 n)) noise, and pick the maximal noisy violation.

Lemma [McSherry-Talwar’07]: The selected index satisfies ε0-differential privacy, and w.h.p. the selected violation is within O(log k / (ε0 n)) of the maximum.
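The selection rule sketched above (report-noisy-max) can be written as follows (a sketch; the parameter values are illustrative assumptions):

```python
import numpy as np

def noisy_max_selection(violations, eps0, n, rng=np.random.default_rng()):
    """Report-noisy-max: privately select the query with largest violation.

    violations[j] = |q_j(D) - q_j(Dt)|; each has sensitivity 1/n, so
    adding Lap(1/(eps0 * n)) noise to each and taking the argmax is
    eps0-differentially private.
    """
    noise = rng.laplace(scale=1.0 / (eps0 * n), size=len(violations))
    return int(np.argmax(violations + noise))

# Example: query 2 has the largest true violation
v = np.array([0.01, 0.02, 0.30, 0.05])
j = noisy_max_selection(v, eps0=0.5, n=1000, rng=np.random.default_rng(1))
```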

Algorithm:
  D0 uniform
  For t = 1...T:
    Noisy selection of q
    Dt+1 = Update(Dt, q)

Now: Each step satisfies ε0-differential privacy!

What is the total privacy guarantee?

Also use noisy answer in update rule

New error bound:
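Putting the pieces together, the round structure of the offline algorithm, with noisy selection and a noisy answer fed into the update rule, might look like this sketch (parameter choices such as `eta`, `T`, and the toy inputs are illustrative assumptions, not the tuned values from the analysis):

```python
import numpy as np

def private_mw_offline(true_d, queries, T, eps0, n, eta=0.1,
                       rng=np.random.default_rng(0)):
    """Sketch of offline private MW: noisy selection of the worst query,
    a noisy answer, and a multiplicative weights update each round.

    true_d: normalized histogram of the sensitive data.
    queries: k x N matrix of linear queries.
    """
    N = len(true_d)
    d = np.full(N, 1.0 / N)                    # D0 uniform
    for _ in range(T):
        # Noisy selection of a bad query (report-noisy-max)
        viol = np.abs(queries @ true_d - queries @ d)
        noise = rng.laplace(scale=1.0 / (eps0 * n), size=len(viol))
        j = int(np.argmax(viol + noise))
        q = queries[j]
        # Noisy answer, also used in the update rule
        noisy_ans = q @ true_d + rng.laplace(scale=1.0 / (eps0 * n))
        sign = 1.0 if noisy_ans > q @ d else -1.0
        d = d * np.exp(eta * sign * q)
        d = d / d.sum()
    return d

# Toy run: true data concentrated on coordinate 2, indicator queries
d_hat = private_mw_offline(np.array([0.0, 0.0, 1.0, 0.0]), np.eye(4),
                           T=100, eps0=1.0, n=1000)
```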

T-fold composition of ε0-differential privacy satisfies:

Answer 1 [DMNS’06]:

ε0T-differential privacy

Answer 2 [DRV’10]: (ε,δ)-differential privacy with ε = ε0·sqrt(2T log(1/δ)) + T·ε0(e^ε0 − 1)

Note: for small enough ε0 this gives ε ≈ ε0·sqrt(T log(1/δ)), a quadratic saving over Answer 1

Composition Theorem + Error bound: optimize T and ε0 for a target (ε, δ)

Theorem 1. On databases of size n, MW achieves ε-differential privacy with

Theorem 2. MW achieves (ε, δ)-differential privacy with

Optimal dependence on |Q| and n

Offline (non-interactive): release a structure S accurate on the query set Q. ✔ [H-Ligett-McSherry12, Gupta-H-Roth-Ullman11]

Online (interactive): answer queries q1, q2, … with a1, a2, … as they arrive. ? [H-Rothblum10]

See also: Roth-Roughgarden10, Dwork-Rothblum-Vadhan10, Dwork-Naor-Reingold-Rothblum-Vadhan09, Blum-Ligett-Roth08

Algorithm: Given query qt:

• If |qt(Dt) − qt(D)| < α/2 + Lap(1/(ε0 n)): output qt(Dt)

• Otherwise: output qt(D) + Lap(1/(ε0 n)) and set Dt+1 = Update(Dt, qt)

Private MW Online [H-Rothblum’10]

Achieves same error bounds!
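The online algorithm above can be sketched as follows (a simplified illustration; `eta` and the toy inputs are assumptions, and the real algorithm also budgets the total number of updates):

```python
import numpy as np

def private_mw_online(true_d, queries, alpha, eps0, n, eta=0.1,
                      rng=np.random.default_rng(0)):
    """Sketch of online private multiplicative weights [H-Rothblum'10].

    Answers each query from the public estimate d when the noisy error
    is small; otherwise answers with a noisy true value and updates d.
    """
    N = len(true_d)
    d = np.full(N, 1.0 / N)                  # D0 uniform
    answers = []
    for q in queries:
        est, true_ans = q @ d, q @ true_d
        err = true_ans - est
        if abs(err) < alpha / 2 + rng.laplace(scale=1.0 / (eps0 * n)):
            answers.append(est)              # lazy round: answer from d
        else:
            noisy = true_ans + rng.laplace(scale=1.0 / (eps0 * n))
            answers.append(noisy)            # busy round: answer and update
            sign = 1.0 if noisy > est else -1.0
            d = d * np.exp(eta * sign * q)
            d = d / d.sum()
    return answers

# Toy run: repeat one query; answers converge to the true value 1.0
true_d = np.array([0.0, 0.0, 1.0, 0.0])
q = np.array([0.0, 0.0, 1.0, 0.0])
answers = private_mw_online(true_d, [q] * 200, alpha=0.1, eps0=1.0, n=1000)
```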

Overview: Privacy Analysis

• Offline setting: T << n steps – simple analysis using composition theorems

• Online setting: k >> n invocations of Laplace – composition theorems don’t imply a small error!

• Idea: Analyze the privacy loss like a lazy random walk (goes back to Dinur-Dwork-Nissim’03)

Privacy Loss as a lazy random walk

[Figure: privacy loss vs. number of steps. Lazy rounds contribute no privacy loss; each busy round adds a unit step to the walk.]

busy round = noisy answer close to forcing update

W.h.p. the privacy loss is bounded by O(sqrt(#busy))

Formalizing the random walk

Imagine output of PMW is 0/1 indicator vector

where vt = 1 if round t is an update, and 0 otherwise

Recall: Very few updates! Vector is sparse.

Theorem: Vector v is (ε,δ)-diffpriv.

Let D,D’ be neighboring DBs

Let P,Q be corresponding output distributions

Lemma: (3) implies (ε,δ)-diffpriv.

Approach:
1. Sample v from P
2. Consider X = log(P(v)/Q(v))
3. Argue Pr{ |X| > ε } ≤ δ

Intuition: X = privacy loss

Privacy loss in round t

We’ll show:
1. Xt = 0 if t is not busy
2. |Xt| ≤ ε0 if t is busy
3. Number of busy rounds is O(#updates)

Total privacy loss

DRV’10: E[X1 + ... + Xk] ≤ O(ε0² · #updates)

Azuma: strong concentration around the expectation

Defining the “busy” event

Update condition:

Busy event

Offline (non-interactive) ✔

Online (interactive) ✔

What we can do

• Offline/batch setting: every set of linear queries

• Online/interactive setting: every sequence of adaptive and adversarial linear queries

• Theoretical performance: nearly optimal in the worst case
  – For instance-by-instance guarantees see H-Talwar10, Nikolov-Talwar (upcoming!), different techniques

• Practical performance: compares favorably to previous work! See Katrina’s talk.

Are we done?

What we would like to do

Running time: linear dependence on |U|; |U| is exponential in the number of attributes of the data

Can we get poly(n)? No, in the worst case for synthetic data [DNRRV09], even for simple query classes [Ullman-Vadhan10]

No, in interactive setting without restricting query class [Ullman12]

What can we do about it?

Look beyond the worst case! Find meaningful assumptions on data, queries, models, etc.

Design better heuristics!

In this talk: Get more mileage out of learning theory!

Privacy                                    Learning
-----------------------------------------  --------------------------------------------------
Sensitive database                         Unknown concept
Queries labeled by answer on DB            Examples labeled by concept
Synopsis approximates DB on query set      Hypothesis approximates target concept on examples

Can we turn this into an efficient reduction?

Yes. [H-Rothblum-Servedio’12]

Informal Theorem: There is an efficient differentially private release mechanism for a query class Q provided that there is an efficient PAC-learning algorithm for a related concept class Q’.

• Interfaces nicely with existing learning algorithms:
  – Learning based on polynomial threshold functions [Klivans-Servedio]
  – Harmonic Sieve [Jackson] and extensions [Jackson, Klivans, Servedio]

Database as a function

Observation: Enough to learn Ft for t = α, 2α, ..., (1−α) in order to approximate F

Query q ↦ q(D)

High-level idea

Learning algorithm

labeled examples

Observation: If all labels are privacy-preserving, then so is the hypothesis h

Hypothesis h such that

Main hurdles

• Privacy requires noise, noise might defeat learning algorithm

• Can only generate |D| examples efficiently before running out of privacy

Learning algorithm

Threshold Oracle

Compute a = F(x) + N
If |a − t| tiny: output “fail”
Else if a > t: output 1
Else if a < t: output 0

Ensures:
1. Privacy
2. “Removes” noise
3. Complexity independent of |D|

Generate samples:
1. Pick x1, x2, ..., xm
2. Ask the threshold oracle “F(xi) > t”? and receive b1, b2, ..., bm, each b in {0, 1, fail}
3. Remove all “failed” examples
4. Pass the remaining labeled examples (y1, l1), ..., (yr, lr) on to the learner
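A sketch of the threshold oracle and its labels (the `margin` parameter standing in for the “tiny” band around t, and all concrete values, are illustrative assumptions):

```python
import numpy as np

def threshold_oracle(f_value, t, eps0, n, margin, rng=np.random.default_rng()):
    """Noisy threshold oracle for the question "F(x) > t?".

    Adds Laplace noise for privacy, and refuses to answer ("fail") when
    the noisy value lands too close to the threshold, so the 0/1 labels
    it does emit are correct w.h.p.
    """
    a = f_value + rng.laplace(scale=1.0 / (eps0 * n))
    if abs(a - t) < margin:
        return "fail"
    return 1 if a > t else 0

# Label some hypothetical values F(x); keep only the non-failed examples
rng = np.random.default_rng(2)
xs = [0.10, 0.49, 0.90]
labels = [threshold_oracle(v, t=0.5, eps0=1.0, n=1000, margin=0.05, rng=rng)
          for v in xs]
kept = [(x, b) for x, b in zip(xs, labels) if b != "fail"]
```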

Application: Boolean Conjunctions

Important class of queries in differential privacy [BCDKMT07,KRSU10,GHRU11,HMT12,...]

Salary > $50k | Syphilis | Height > 6’1 | Weight < 180 | Male
True          | False    | True         | False        | True
True          | True     | True         | True         | True
False         | False    | False        | True         | False
True          | False    | False        | True         | True
False         | False    | False        | False        | False

Example Conjunction: “(Salary > $50k) AND (Male)”
Evaluates to 3/5 on this database

Universe U = {0,1}^d
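Evaluating a conjunction over U = {0,1}^d reduces to a statistical query; a minimal sketch (the data matrix is a 0/1 encoding of the table above):

```python
import numpy as np

def conjunction_query(attrs, data):
    """Fraction of records satisfying a conjunction over {0,1}^d attributes.

    attrs: indices of attributes that must all be 1.
    data:  n x d 0/1 matrix, one row per record.
    """
    return float(np.all(data[:, attrs] == 1, axis=1).mean())

# Records from the table (Salary>$50k, Syphilis, Height>6'1, Weight<180, Male)
data = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
])
print(conjunction_query([0, 4], data))  # (Salary>$50k) AND (Male) -> 0.6
```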

Informal Corollary (Subexponential algorithm for conjunctions).There is a differentially private release algorithm with running time poly(|D|) such that for any distribution over Boolean conjunctions the algorithm is w.h.p. α-accurate provided that:

Informal Corollary (Small width).There is a differentially private release algorithm with running time poly(|D|) such that for any distribution over width-k Boolean conjunctions the algorithm is w.h.p. α-accurate provided that:

Previous: 2^O(d)

Previous: d^O(k)

Follow-up work

• Thaler-Ullman-Vadhan12: Can remove the distributional relaxation and get exp(O(d^{1/2})) complexity for all Boolean conjunctions

Idea: Use polynomial encodings from learning algorithm directly

Summary

• Derived simple and powerful private data release algorithm from first principles

• Privacy/learning analogy as a guiding principle – can be turned into an efficient reduction

• Can we use these ideas outside theory and in new settings?

Thank you

Open problems

• Is PMW close to instance optimal?
• Is there a converse to the privacy-to-learning reduction?
• No barriers for cut/spectral analysis of graphs/matrices (universe small)
• Releasing k-way conjunctions in time poly(n), error poly(d,k)
