
Data Privacy and Security

Master Degree in Data Science

Sapienza University of Rome

Academic Year 2017-2018

Instructor: Daniele Venturi (Some slides from a talk by Cynthia Dwork)


Part V: Differential Privacy

Data Exploitation


• Availability of lots of data

– Social networks, financial data, medical records…

• All these data are an asset

– We would like to exploit them

[Diagram: individuals' data 𝑥, 𝑦, … flow into a central dataset (data collection)]

Examples


• Finding statistical correlations

– Genotype/phenotype association

– Correlating medical outcomes with risk factors

• Publishing aggregate statistics

• Noticing events/outliers

– Intrusion detection

• Datamining/learning

– Update strategies based on customers' data

AOL Search Debacle


• Back in 2006, AOL published search data of over 3 million users, collected over a period of 3 months, for research purposes

• The data were anonymized in order to remove personally identifiable information

– E.g., names, social security numbers, …

• Yet after some time many people were identified

Lessons to be Learned


• Privacy is a concern when publishing datasets

• Wait: This does not apply to me!

– Don’t make the entire dataset available

– Only publish statistics

• Even if only data aggregations are published, privacy can be broken

• Overly accurate estimates of too many statistics are blatantly non-private

Data Analysis


[Diagram: the data analyst sends queries 𝑞₁, 𝑞₂, … to the database and receives answers 𝑎₁, 𝑎₂, …]

• How to define privacy?

– Intuitively, we want that published statistics do not undermine the privacy of individuals

– After all, statistics are just aggregated data about the overall population

The Statistics Masquerade


• Differential attack

– How many people in the room had XYZ last night?

– How many people, other than the speaker, had XYZ last night?

• Needle in a haystack

– Determine the presence of an individual's genomic data in a GWAS case group

• The big bang attack

– Reconstruct sensitive attributes given statistics computed over multiple overlapping subsets of the data

Privacy-Preserving Data Analysis?


[Diagram: the data analyst sends queries 𝑞₁, 𝑞₂, … to the database and receives answers 𝑎₁, 𝑎₂, …]

• Can’t learn anything new about Alice?

– Reminiscent of semantic security for encryption

– Then what is the point?

• Ideally: Learn the same thing if Alice is replaced by a random member of the population

Differential Privacy


• The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset

– Alice goes away, Bob joins, Alice is replaced by Bob (i.e., small perturbations do not matter)

• Note that, in contrast, if we completely change the dataset we get completely different answers

More Formally…


[Diagram: the data analyst sends queries 𝑞₁, 𝑞₂, … to the database; answers 𝑎₁, 𝑎₂, … are produced by a randomized mechanism 𝓜]

Mechanism 𝓜 gives ε-differential privacy if for all pairs of adjacent datasets 𝑥, 𝑦, and for all events 𝑆:

$$\Pr[\mathcal{M}(x) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(y) \in S]$$

Notes on the Definition


• Worst-case guarantee

– It holds for all datasets

– It holds even against unbounded adversaries

– Probability over the randomness of the algorithm, not over the choice of the dataset

• The roles of 𝑥 and 𝑦 can be flipped

• Randomness is in the hands of the good guys

Properties


• Immune to auxiliary information

– Current and future side information

• Automatically yields group privacy

– Privacy loss 𝑘ε for groups of size 𝑘

• Composition

– Can bound the cumulative privacy loss over multiple analyses (the epsilons add up)

– Can combine a few differentially private mechanisms to solve complex analytical tasks

Did you XYZ Last Night?


• Flip a coin

– Heads: Flip again and return YES if heads, NO otherwise

– Tails: Answer honestly

$$\frac{\Pr[\text{YES} \mid \text{Truth} = \text{YES}]}{\Pr[\text{YES} \mid \text{Truth} = \text{NO}]} = \frac{3/4}{1/4} = 3 \qquad \frac{\Pr[\text{NO} \mid \text{Truth} = \text{NO}]}{\Pr[\text{NO} \mid \text{Truth} = \text{YES}]} = 3$$

• ε = ln 3 ≈ 1.098

• Absolute error ≈ 1/√𝑛 for a single fractional query
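A minimal simulation of the protocol, as a sketch in Python (the population size and the true fraction of YES answers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth: bool) -> bool:
    """One respondent: heads -> random answer, tails -> honest answer."""
    if rng.random() < 0.5:           # first coin: heads
        return rng.random() < 0.5    # second coin decides the answer
    return truth                     # tails: answer honestly

# Hypothetical population: 10,000 people, 30% truly did XYZ
n, true_frac = 10_000, 0.30
truths = rng.random(n) < true_frac
answers = np.array([randomized_response(t) for t in truths])

# Debias: Pr[YES] = 1/4 + true_frac/2  =>  true_frac = 2*Pr[YES] - 1/2
estimate = 2 * answers.mean() - 0.5
print(f"true={true_frac:.3f} estimated={estimate:.3f}")  # error ~ 1/sqrt(n)
```

The debiasing step amplifies the sampling noise by a constant factor, which is where the ≈ 1/√𝑛 absolute error for a single fractional query comes from.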

Real-Valued Functions


• Want to compute 𝑞(𝑥)

• Adding or removing one individual pulls the answer from 𝑞(𝑥) to 𝑞(𝑦)

– Add random noise to obscure the difference 𝑞(𝑥) vs 𝑞(𝑦)

Sensitivity


• Noise depends on Δ𝑞 and ε, but not on the dataset

– Smaller Δ𝑞: Less distortion

– Smaller ε: More distortion

• Privacy in the land of plenty (i.e., not a tool for tiny datasets)

To achieve ε-differential privacy it suffices to add noise scaled to Δ𝑞/ε

$$\Delta q = \max_{\text{adjacent } x,\, y} |q(x) - q(y)|$$

The Laplace Mechanism


Theorem: Adding noise Lap(Δ𝑞/ε) yields ε-differential privacy, where

$$\Delta q = \max_{\text{adjacent } x,\, y} |q(x) - q(y)|$$

and Lap(𝑏) has density $p(z) = \frac{1}{2b}\, e^{-|z|/b}$
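A minimal sketch of the mechanism (the dataset and the counting query are illustrative; numpy's Laplace sampler takes the scale 𝑏 = Δ𝑞/ε directly):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(q_value: float, sensitivity: float, eps: float) -> float:
    """Release q(x) + Lap(b), with scale b = Δq/ε."""
    b = sensitivity / eps
    return q_value + rng.laplace(loc=0.0, scale=b)

# Counting query: "how many records exceed 50?" -- sensitivity 1
x = rng.integers(0, 100, size=1000)          # toy dataset
true_answer = float(np.sum(x > 50))
noisy_answer = laplace_mechanism(true_answer, sensitivity=1.0, eps=0.5)
print(true_answer, round(noisy_answer, 1))
```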

Why does it work?


Theorem: Adding noise Lap(Δ𝑞/ε) yields ε-differential privacy, where $\Delta q = \max_{\text{adjacent } x,\, y} |q(x) - q(y)|$

$$\frac{\Pr[\mathcal{M}(x) = t]}{\Pr[\mathcal{M}(y) = t]} = e^{\varepsilon\,(|t - q(y)| - |t - q(x)|)/\Delta q} \le e^{\varepsilon\, |q(x) - q(y)|/\Delta q} \le e^{\varepsilon}$$

(the middle step uses the triangle inequality, the last one the definition of Δ𝑞)

Example: Histogram Queries


For a histogram query, adjacent datasets change each bucket count by at most 1:

$$\Delta q = \max_{\text{adjacent } x,\, y} |q(x) - q(y)| = 1$$

So it suffices to add noise Lap(1/ε) to each count
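For example, a hypothetical noisy histogram release under this bound (the bucket counts are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.5

# Toy histogram: counts per hobby (hypothetical data)
counts = np.array([523, 310, 118, 49], dtype=float)
# Independent Lap(1/eps) noise per bucket suffices for eps-DP
noisy = counts + rng.laplace(loc=0.0, scale=1.0 / eps, size=counts.shape)
print(np.round(noisy, 1))
```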

Accuracy for a Set of Queries

• A mechanism 𝓜 has (α, β)-accuracy wrt a set of queries 𝑄 if for all 𝑥: with probability 1 − β, the outcome 𝓜(𝑥) yields for each 𝑞 ∈ 𝑄 a value in [𝑞(𝑥) − α, 𝑞(𝑥) + α]

• Accuracy of the Laplace mechanism for 𝑘 queries

– Fact: $\Pr[|\text{Lap}(b)| \ge t \cdot b] = e^{-t}$ (for us $b = k/\varepsilon$)

– Union bound: $k \cdot \Pr[|\text{Lap}(k/\varepsilon)| \ge t k/\varepsilon] \le \beta$

– Thus: $\Pr[|\text{Lap}(k/\varepsilon)| \ge t k/\varepsilon] \le \frac{\beta}{k} = e^{-t}$

– So $t = \ln(k/\beta)$ and $\alpha = \frac{k}{\varepsilon} \ln(k/\beta)$


(≈ 1/ε for 𝑘 = 1)
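The derivation is easy to check numerically; a sketch comparing the bound α = (𝑘/ε)·ln(𝑘/β) with the empirical maximum error over 𝑘 noisy queries (parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k, eps, beta = 100, 1.0, 0.05

alpha = (k / eps) * np.log(k / beta)   # theoretical bound
trials = 10_000
# max |noise| over k queries, each with scale b = k/eps, per trial
max_err = np.abs(rng.laplace(scale=k / eps, size=(trials, k))).max(axis=1)
frac = np.mean(max_err > alpha)
print(f"alpha={alpha:.0f}  Pr[max err > alpha]={frac:.4f}  (should be <= {beta})")
```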

Noisy ArgMax

• Compute 𝑔₁(𝑥), …, 𝑔ₘ(𝑥)

– How many people play ping pong, how many go running, etc.

– We want to know the most popular hobby

• Add Lap(2 maxᵢ Δ𝑔ᵢ/ε) to each value

– Do not release the outcomes, but report the index of the function with the largest noisy outcome (see the sketch below)

– Works as long as there is a gap between the two most popular choices

– Compute much more than what is released
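A minimal sketch of this rule, often called report-noisy-max, assuming every 𝑔ᵢ is a counting query with Δ𝑔ᵢ = 1 (the hobby counts are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_argmax(values, eps: float, sensitivity: float = 1.0) -> int:
    """Release only the index of the largest Lap(2*Δg/ε)-noised value."""
    noisy = np.asarray(values, dtype=float) + rng.laplace(
        scale=2.0 * sensitivity / eps, size=len(values)
    )
    return int(np.argmax(noisy))  # the noisy values themselves are NOT released

hobby_counts = [523, 310, 118, 49]          # hypothetical g_1(x), ..., g_m(x)
print(noisy_argmax(hobby_counts, eps=0.5))  # index of (likely) most popular hobby
```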


Generalization of Noisy ArgMax

• Target notions of utility

– E.g., the mechanism outputs a classifier and utility is smallness of the classification error

• Applications where adding noise makes no sense

• Goal: Maximize 𝑢(𝑥, ξ)

– Utility of ξ on database 𝑥


The Exponential Mechanism

• 𝑞(𝑥) ∈ Ξ = {ξ₁, …, ξₖ}

– Strings, prices, etc.

– Each ξ ∈ Ξ has utility 𝑢(𝑥, ξ) for 𝑥


Intuition: Output ξ with probability

$$\propto e^{\varepsilon\, u(x,\xi)/\Delta u}$$

$$\frac{e^{\varepsilon\, u(x,\xi)/\Delta u}}{e^{\varepsilon\, u(y,\xi)/\Delta u}} = e^{\varepsilon\,(u(x,\xi) - u(y,\xi))/\Delta u} \le e^{\varepsilon}$$
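A minimal sketch (the candidate utilities are made up). Note that the standard analysis samples with probability ∝ exp(ε·u(x,ξ)/(2Δu)): the extra factor 2 accounts for the normalizing constant, which also changes between 𝑥 and 𝑦 and which the intuition above ignores:

```python
import numpy as np

rng = np.random.default_rng(4)

def exponential_mechanism(utilities, eps: float, du: float) -> int:
    """Sample index i with probability ∝ exp(eps * u_i / (2*Δu))."""
    u = np.asarray(utilities, dtype=float)
    # subtract the max for numerical stability (does not change the distribution)
    w = np.exp(eps * (u - u.max()) / (2.0 * du))
    p = w / w.sum()
    return int(rng.choice(len(u), p=p))

# Toy use: pick one of four candidate outputs by utility, Δu = 1
print(exponential_mechanism([10.0, 9.5, 4.0, 1.0], eps=1.0, du=1.0))
```

With a moderate ε the two near-optimal candidates are chosen almost always, matching the intuition that high-utility outputs dominate.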

Unlimited Supply Auctions

• Bidders have demand curves, describing for each price 𝑝 ∈ [0, 1] the number of goods 𝑏ᵢ(𝑝) they wish to purchase at price 𝑝

– Total budget: 𝑝 · 𝑏ᵢ(𝑝) ≤ 1

• The auctioneer picks 𝑝 maximizing the revenue 𝑝 · Σᵢ 𝑏ᵢ(𝑝)

• Select the price using the exponential mechanism

– Utility is the revenue

– Approximately truthful: An individual can't significantly influence the chosen price by changing her bid

– Resilient to collusion (for small coalitions)!


Efficiency?

• Generating synthetic data can be hard

• Consider the following database

– Choose a single signing/verification key pair (𝑣𝑘*, 𝑠𝑘*)

– The database is 𝑛 rows: (𝑚ᵢ, Sign(𝑠𝑘*, 𝑚ᵢ), 𝑣𝑘*) for random messages 𝑚ᵢ

– One query for each 𝑣𝑘: What fraction of rows are valid signatures wrt 𝑣𝑘?

• An efficient curator cannot generate a synthetic database (yielding correct answers wrt 𝑣𝑘*) without leaking rows


Efficient Synopsis Generation?

• Trivial to find a synopsis with the same functionality

– Simply publish 𝑣𝑘∗

• Maintain functionality of the database wrt a set 𝑄 of queries

• Hide presence/absence of any individual

• We will argue that there are hard-to-sanitize distributions

– Assuming so-called traitor tracing schemes

– Converse is also true (but we won’t talk about it)


Traitor Tracing Schemes


[Diagram: the distributor encrypts a bit 𝑏 under the broadcast key 𝑏𝑘, 𝑐 ←$ Enc(𝑏𝑘, 𝑏); the same ciphertext 𝑐 is sent to all users, and each user 𝑖 decrypts with her own secret key: Dec(𝑠𝑘ᵢ, 𝑐) = 𝑏]

Stateful Pirates

• What if some users try to resell the content?

• Some users in the coalition will be traced!


[Diagram: a coalition of users combines their keys 𝑠𝑘ᵢ into a pirate decoder; the tracer, holding a tracing key 𝑡𝑘, feeds ciphertexts 𝑐₁, …, 𝑐ₜ to the decoder, observes the outputs 𝑏₁, …, 𝑏ₜ, and accuses some user 𝑖]

Intuition for the Lower Bound

• Assume traitor tracing scheme

• One universe element for each private key

– The database is a collection of 𝑛 randomly chosen keys

• One query for each ciphertext 𝑐

– For what fraction of rows 𝑖 does Dec(𝑠𝑘ᵢ, 𝑐) = 1?

• For any synopsis (i.e. the pirate):

– Answer will be 0 or 1, i.e. the decryption of 𝑐

– Tracing reveals ≥ 1 key and never falsely accuses

– Violates differential privacy!


More Lower Bounds

• Theorem: Assuming one-way functions exist, differentially private algorithms for the following require exponential time

• Synthetic data for 2-way marginals

– Proof relies on digital signatures

• Synopsis release for > 𝑛² arbitrary counting queries

– Proof relies on traitor tracing schemes


Approximate Differential Privacy


[Diagram: the data analyst sends queries 𝑞₁, 𝑞₂, … to the database; answers 𝑎₁, 𝑎₂, … are produced by a randomized mechanism 𝓜]

Mechanism 𝓜 gives (ε, δ)-differential privacy if for all pairs of adjacent datasets 𝑥, 𝑦, and for all events 𝑆:

$$\Pr[\mathcal{M}(x) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(y) \in S] + \delta$$

Benefits of the Relaxation

• Advanced composition

– Can answer 𝑘 queries with cumulative loss √𝑘 · ε instead of 𝑘ε (see the numeric sketch after this list)

• Can use cryptography to simulate a trusted center

• Gaussian noise

– Leading to better accuracy
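As a numeric sketch of the gain: the full advanced composition theorem (Dwork, Rothblum, and Vadhan) gives cumulative loss ε' = √(2𝑘 ln(1/δ'))·ε + 𝑘ε(e^ε − 1), of which the slide's √𝑘·ε is the leading term; the parameters below are illustrative:

```python
import math

def basic_composition(eps: float, k: int) -> float:
    return k * eps

def advanced_composition(eps: float, k: int, delta_slack: float) -> float:
    # eps' = sqrt(2k ln(1/δ')) * eps + k * eps * (e^eps - 1)
    return math.sqrt(2 * k * math.log(1 / delta_slack)) * eps \
        + k * eps * math.expm1(eps)

eps, k, delta_slack = 0.1, 100, 1e-6
print(f"basic:    {basic_composition(eps, k):.2f}")                  # 10.00
print(f"advanced: {advanced_composition(eps, k, delta_slack):.2f}")  # ~ 6.3
```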


Gaussian Noise


$$\mathcal{N}(\mu, \sigma^2):\quad p(z) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$$

The Gaussian Mechanism


Theorem: Adding noise Lap(Δ₁/ε) yields (ε, 0)-differential privacy, where

$$\Delta_1 = \max_{\text{adjacent } x,\, y} \|q(x) - q(y)\|_1$$

Theorem: Adding noise $\mathcal{N}\!\left(0,\; 2\ln(2/\delta)\,\Delta_2^2/\varepsilon^2\right)$ yields (ε, δ)-differential privacy, where

$$\Delta_2 = \max_{\text{adjacent } x,\, y} \|q(x) - q(y)\|_2$$

For 𝑘 counting queries (each of sensitivity 1): Δ₁ = 𝑘, while Δ₂ = √𝑘
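A minimal sketch for a vector of 𝑘 counting queries, using σ from the theorem above (the dataset is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_mechanism(q_values, l2_sensitivity: float, eps: float, delta: float):
    """Release q(x) + N(0, σ²) per coordinate, σ = sqrt(2 ln(2/δ)) * Δ₂ / ε."""
    sigma = np.sqrt(2.0 * np.log(2.0 / delta)) * l2_sensitivity / eps
    q = np.asarray(q_values, dtype=float)
    return q + rng.normal(loc=0.0, scale=sigma, size=q.shape)

# k = 100 counting queries: Δ₁ = k but Δ₂ = sqrt(k), so the per-coordinate
# Gaussian noise grows like sqrt(k)/ε instead of k/ε for Laplace
k = 100
answers = rng.integers(0, 1000, size=k).astype(float)  # toy true answers
print(gaussian_mechanism(answers, l2_sensitivity=np.sqrt(k), eps=1.0, delta=1e-5)[:5])
```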

Incentives

• Until now: Goal was designing differentially private mechanisms, but the data is assumed to be already there

• But why should someone participate in the computation?

• Why would they give their true data?

• Do we need compensation? How much?

• Any connection with game theory?


Game Theory and Mechanism Design

• Goal: Solve optimization problem

• Catch: No access to inputs

– Inputs held by self-interested agents

• Design incentives and choice of solution (mechanism) that incentivizes truth-telling

– No need for participants to strategize

– Simple to predict what will happen

– Often a non-truth-telling mechanism can be replaced by one where the coordinator does the lying on behalf of the participants


Good News

• Composition: Approximate truthfulness still satisfied under composition!

• Collusion resistance: 𝑂(𝑘ε)-approximate dominant strategy, even for coalitions of 𝑘 agents

• Both properties not immediate in game-theoretic mechanism design

• All done without money!


Bad News

• But not only truthful reporting gives an approximate dominant strategy: any report does so

– Even malicious ones

• How do we actually get people to participate truthfully?

– Perhaps need compensation

– Much harder to achieve


Differential Privacy as a Tool

• Nash equilibrium: An assignment of players to strategies so that no player would benefit by changing strategy, given how everyone else is playing

• Correlated equilibrium: Players have access to correlated signals (e.g., traffic light)

– Every Nash equilibrium is a correlated equilibrium, but not vice versa

• Differential privacy has applications to mechanism design with correlated equilibria


The Issue of Verification

• Challenging to strictly incentivize truth-telling in differentially private mechanism design

• Exceptions:

– Responses are verifiable

– Agents care about outcome

• Challenge: No observed outcome

– What is the prevalence of drug use?

– Are students cheating in class?


Bayesian Setting

• Bit-cost pairs (𝑏ᵢ, 𝑐ᵢ) drawn from a joint distribution

• If 𝑏ᵢ says I am a cheater, I tend to believe we are in a world where people cheat

• But the cost 𝑐ᵢ does not give additional information beyond what 𝑏ᵢ gives

– Privacy costs are arbitrary, but upper bounded by a cost linear in 𝑐ᵢ · ε

– Utility model: 𝑐ᵢ · ε − 𝑝ᵢ


Privacy & Game Theory

• Asymptotic truthfulness, some new mechanism design and equilibrium selection results

• Interesting challenge of modeling costs for privacy

• In order to design privacy for humans do we need to understand

– How people currently value or should value it?

– What are the right promises to give?


Not a Panacea

• The fundamental law still applies!


Overly accurate estimates of too many statistics are blatantly non-private