Data Privacy and Security
Master Degree in Data Science
Sapienza University of Rome
Academic Year 2017-2018
Instructor: Daniele Venturi (Some slides from a talk by Cynthia Dwork)
Data Exploitation
• Availability of lots of data
– Social networks, financial data, medical records…
• All these data are an asset
– We would like to exploit them
[Diagram: individuals' data 𝑥, 𝑦, … being gathered during data collection]
Examples
• Finding statistical correlations
– Genotype/phenotype association
– Correlating medical outcomes with risk factors
• Publishing aggregate statistics
• Noticing events/outliers
– Intrusion detection
• Data mining/learning
– Update strategies based on customers' data
AOL Search Debacle
• Back in 2006, AOL published search statistics of over 3 million users over a period of 3 months
• Data were anonymized in order to remove personally identifiable information
– E.g., names, social security numbers, …
• Yet, after some time, many people were identified
Lessons to be Learned
• Privacy is a concern when publishing datasets
• Wait: This does not apply to me!
– Don’t make the entire dataset available
– Only publish statistics
• Even if only data aggregations are published, privacy can be broken
• Overly accurate estimates of too many statistics are blatantly non-private
Data Analysis
[Diagram: the Data Analyst sends queries 𝑞₁, 𝑞₂, … to the Database and receives answers 𝑎₁, 𝑎₂, …]
• How to define privacy?
– Intuitively, we want that published statistics do not undermine the privacy of individuals
– After all, statistics are just aggregated data about the overall population
The Statistics Masquerade
• Differential attack
– How many people in the room had XYZ last night?
– How many people, other than the speaker, had XYZ last night? (see the sketch after this list)
• Needle in a haystack
– Determine the presence of an individual's genomic data in a GWAS case group
• The big bang attack
– Reconstruct sensitive attributes from statistics over multiple overlapping releases
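The differential attack is easy to phrase in code. A minimal sketch on a made-up three-person dataset; the `room` records and the `xyz` field are hypothetical, not from the slides:

```python
# Two innocuous-looking aggregate queries whose difference reveals
# one person's sensitive bit (hypothetical data).
room = [
    {"name": "speaker", "xyz": True},
    {"name": "alice",   "xyz": False},
    {"name": "bob",     "xyz": True},
]

q1 = sum(p["xyz"] for p in room)                            # whole room
q2 = sum(p["xyz"] for p in room if p["name"] != "speaker")  # all but speaker
print("speaker's bit:", q1 - q2)  # exact answers leak one individual
```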
Privacy-Preserving Data Analysis?
[Diagram: the Data Analyst sends queries 𝑞₁, 𝑞₂, … to the Database and receives answers 𝑎₁, 𝑎₂, …]
• Can’t learn anything new about Alice?
– Reminiscent of semantic security for encryption
– Then what is the point?
• Ideally: Learn same thing if Alice is replaced by a random member of the population
Differential Privacy
• The outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset
– Alice goes away, Bob joins, Alice is replaced by Bob (i.e., small perturbations do not matter)
• Note that, instead, if we completely change the dataset we get completely different answers
More Formally…
[Diagram: the Data Analyst's queries 𝑞₁, 𝑞₂, … are answered through a Randomized Mechanism 𝓜 sitting in front of the Database]
Mechanism 𝓜 gives ε-differential privacy if for all pairs of adjacent datasets 𝑥, 𝑦, and for all events 𝑆:
Pr[𝓜(𝑥) ∈ 𝑆] ≤ 𝑒^ε ∙ Pr[𝓜(𝑦) ∈ 𝑆]
Notes on the Definition
• Worst-case guarantee
– It holds for all datasets
– It holds even against unbounded adversaries
– Probability over the randomness of the algorithm, not over the choice of the dataset
• The roles of 𝑥 and 𝑦 can be flipped
• Randomness is in the hands of the good guys
Properties
• Immune to auxiliary information
– Current and future side information
• Automatically yields group privacy
– Privacy loss 𝑘ε for groups of size 𝑘
• Composition
– Can bound cumulative privacy loss over multiple analyses (the epsilons add up; see the sketch below)
– Can combine a few differentially private mechanisms to solve complex analytical tasks
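Basic composition suggests simple budget bookkeeping. A minimal sketch, with a hypothetical `PrivacyAccountant` helper (not from the slides) that tracks the epsilons as they add up:

```python
class PrivacyAccountant:
    """Basic composition: the cumulative privacy loss of several
    eps-DP analyses on the same data is the sum of their epsilons."""

    def __init__(self, budget: float):
        self.budget = budget   # total epsilon we are willing to spend
        self.spent = 0.0

    def charge(self, eps: float) -> None:
        if self.spent + eps > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += eps

acct = PrivacyAccountant(budget=1.0)
acct.charge(0.3)   # first analysis
acct.charge(0.3)   # second analysis; cumulative loss is now 0.6
```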
Did you XYZ Last Night?
• Flip a coin
– Heads: Flip again and return YES if heads, and NO otherwise
– Tails: Answer honestly
• Pr[YES | Truth = YES] / Pr[YES | Truth = NO] = 3
• Pr[NO | Truth = NO] / Pr[NO | Truth = YES] = 3
• ε = ln 3 ≈ 1.098
• Absolute error ≈ 1/√𝑛 for a single fractional query
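Randomized response is a one-liner to simulate. A minimal sketch of the two-fair-coin protocol above; the 30% population rate and the sample size are made up for illustration:

```python
import random

def randomized_response(truth: bool) -> bool:
    """Heads: flip again and report that flip. Tails: answer honestly.
    Each reported bit is eps = ln(3)-differentially private."""
    if random.random() < 0.5:           # first coin: heads
        return random.random() < 0.5    # second coin decides YES/NO
    return truth                        # first coin: tails -> honest answer

def estimate_fraction(responses) -> float:
    """E[yes rate] = 1/4 + (true fraction)/2, so invert that."""
    yes_rate = sum(responses) / len(responses)
    return 2 * (yes_rate - 0.25)

n = 10_000
truths = [random.random() < 0.3 for _ in range(n)]    # 30% truly did XYZ
answers = [randomized_response(t) for t in truths]
print(estimate_fraction(answers))   # ~0.3, absolute error ~ 1/sqrt(n)
```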
Real-Valued Functions
• Want to compute 𝑞(𝑥)
• Moving to an adjacent dataset pulls the answer from 𝑞(𝑥) to 𝑞(𝑦)
– Add random noise to obscure the difference between 𝑞(𝑥) and 𝑞(𝑦)
Sensitivity
• Noise depends on ∆𝑞 and ε, but not on the dataset
– Smaller ∆𝑞: Less distortion
– Smaller ε: More distortion
• Privacy in the land of plenty (i.e., not a tool for tiny datasets)
To achieve ε-differential privacy it suffices to add noise scaled to ∆𝑞/ε, where
∆𝑞 = max_{adj. 𝑥,𝑦} |𝑞(𝑥) − 𝑞(𝑦)|
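Two toy queries illustrate how ∆𝑞 is determined (the next slide shows how it is used). The database rows and field names below are hypothetical:

```python
def count_smokers(db) -> int:
    """Counting query: adding/removing one person changes the answer
    by at most 1, so Delta_q = 1."""
    return sum(1 for row in db if row["smoker"])

def clipped_income_sum(db, cap: float = 100_000.0) -> float:
    """An unbounded sum has unbounded sensitivity; clipping each value
    to [0, cap] makes Delta_q = cap, so Lap(cap/eps) noise suffices."""
    return sum(min(row["income"], cap) for row in db)

db = [{"smoker": True,  "income": 35_000.0},    # made-up rows
      {"smoker": False, "income": 250_000.0}]
print(count_smokers(db), clipped_income_sum(db))
```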
The Laplace Mechanism
Theorem: Adding noise Lap(∆𝑞/ε) yields ε-differential privacy
∆𝑞 = max_{adj. 𝑥,𝑦} |𝑞(𝑥) − 𝑞(𝑦)|
Laplace density: 𝑝(𝑧) = 𝑒^{−|𝑧|/𝑏}/(2𝑏)
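The mechanism itself is one line of NumPy. A minimal sketch; the counting-query value and ε are made up:

```python
import numpy as np

def laplace_mechanism(q_value: float, sensitivity: float, eps: float) -> float:
    """Release q(x) + Lap(b) with scale b = sensitivity / eps."""
    return q_value + np.random.laplace(loc=0.0, scale=sensitivity / eps)

# A counting query (sensitivity 1) answered with eps = 0.1
print(laplace_mechanism(4212, sensitivity=1.0, eps=0.1))
```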
Why does it work?
Theorem: Adding noise Lap(∆𝑞/ε) yields ε-differential privacy
∆𝑞 = max_{adj. 𝑥,𝑦} |𝑞(𝑥) − 𝑞(𝑦)|
Pr[𝓜(𝑥) = 𝑡] / Pr[𝓜(𝑦) = 𝑡] = 𝑒^{(|𝑡−𝑞(𝑦)| − |𝑡−𝑞(𝑥)|)/(∆𝑞/ε)} ≤ 𝑒^{|𝑞(𝑥)−𝑞(𝑦)|/(∆𝑞/ε)} ≤ 𝑒^ε
(the first inequality is the triangle inequality, the second is the definition of ∆𝑞)
Example: Histogram Queries
∆𝑞 = max_{adj. 𝑥,𝑦} ‖𝑞(𝑥) − 𝑞(𝑦)‖ = 1
So it suffices to add noise Lap(1/ε) to each count (sketched below)
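A minimal sketch of a noisy histogram; the age data, bin count, and ε are made up:

```python
import numpy as np

def dp_histogram(values, bins, eps):
    """Each person lands in exactly one bin, so adjacent datasets change
    a single count by at most 1: Lap(1/eps) noise per bin suffices."""
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + np.random.laplace(scale=1.0 / eps, size=counts.shape)
    return noisy, edges

ages = np.random.randint(18, 90, size=5000)     # hypothetical data
noisy_counts, _ = dp_histogram(ages, bins=10, eps=0.5)
print(np.round(noisy_counts))
```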
Accuracy for a Set of Queries
• A mechanism 𝓜 has (𝛼, 𝛽)-accuracy wrt a set of queries 𝑄 if for all 𝑥: with probability 1 − 𝛽 the outcome 𝓜(𝑥) yields, for every 𝑞 ∈ 𝑄, a value in [𝑞(𝑥) − 𝛼, 𝑞(𝑥) + 𝛼]
• Accuracy of the Laplace mechanism for 𝑘 queries
– Fact: Pr[|Lap(𝑏)| ≥ 𝑡𝑏] = 𝑒^{−𝑡} (for us 𝑏 = 𝑘/ε)
– Union bound: 𝑘 ∙ Pr[|Lap(𝑘/ε)| ≥ 𝑡𝑘/ε] ≤ 𝛽
– Thus: Pr[|Lap(𝑘/ε)| ≥ 𝑡𝑘/ε] ≤ 𝛽/𝑘 = 𝑒^{−𝑡}
– So, 𝑡 = ln(𝑘/𝛽) and 𝛼 = (𝑘/ε) ∙ ln(𝑘/𝛽)
(≈ 1/ε for 𝑘 = 1)
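Plugging numbers into the bound makes the cost of many queries concrete; the values of 𝑘, ε, and 𝛽 below are arbitrary examples:

```python
import math

def laplace_alpha(k: int, eps: float, beta: float) -> float:
    """alpha = (k/eps) * ln(k/beta): with probability 1 - beta, all k
    noisy answers are within alpha of the truth (union bound above)."""
    return (k / eps) * math.log(k / beta)

print(laplace_alpha(k=1,   eps=0.5, beta=0.05))   # the ~1/eps regime
print(laplace_alpha(k=100, eps=0.5, beta=0.05))   # grows like k*log(k)
```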
Noisy ArgMax
• Compute 𝑔₁(𝑥), … , 𝑔𝑚(𝑥)
– How many people play ping pong, how many go running, etc.
– We want to know the most popular hobby
• Add Lap(2 max𝑖 ∆𝑔𝑖/ε) to each value
– Do not release the outcomes, but report the index of the function with the largest noisy outcome (see the sketch below)
– Works as long as there is a gap between the two most popular choices
– Compute much more than what is released
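A minimal sketch of report-noisy-max on made-up hobby tallies (each count has sensitivity 1):

```python
import numpy as np

def noisy_argmax(values, eps, sensitivity=1.0):
    """Add Lap(2*sensitivity/eps) to every value and release only the
    index of the largest noisy value, never the values themselves."""
    noisy = values + np.random.laplace(scale=2.0 * sensitivity / eps,
                                       size=len(values))
    return int(np.argmax(noisy))

hobby_counts = np.array([412, 398, 120, 57])  # hypothetical tallies
print(noisy_argmax(hobby_counts, eps=0.5))    # index 0 wins if the gap is big
```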
Generalization of Noisy ArgMax
• Target notions of utility
– E.g., the mechanism outputs a classifier and utility is smallness of classification error
• Applications where adding noise makes no sense
• Goal: Maximize 𝑢(𝑥, ξ)
– Utility of ξ on database 𝑥
The Exponential Mechanism
• 𝑞(𝑥) ∈ 𝛯 = {ξ₁, … , ξ𝑘}
– Strings, prices, etc.
– Each ξ ∈ 𝛯 has utility 𝑢(𝑥, ξ) for 𝑥
Intuition: Output ξ with probability ∝ 𝑒^{ε𝑢(𝑥,ξ)/∆𝑢}
𝑒^{ε𝑢(𝑥,ξ)/∆𝑢} / 𝑒^{ε𝑢(𝑦,ξ)/∆𝑢} = 𝑒^{ε(𝑢(𝑥,ξ) − 𝑢(𝑦,ξ))/∆𝑢} ≤ 𝑒^ε
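A minimal sketch of the exponential mechanism over a finite outcome set; the utilities are made up, and the extra factor 2 in the exponent (which the intuition above omits) accounts for the normalizing constant also changing between adjacent datasets:

```python
import numpy as np

def exponential_mechanism(utilities, eps, delta_u):
    """Sample outcome i with probability proportional to
    exp(eps * u_i / (2 * delta_u))."""
    scores = eps * np.asarray(utilities, dtype=float) / (2 * delta_u)
    scores -= scores.max()            # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(np.random.choice(len(utilities), p=probs))

# Hypothetical utilities of four candidate outputs
print(exponential_mechanism([10.0, 42.0, 37.0, 5.0], eps=1.0, delta_u=1.0))
```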
Unlimited Supply Auctions
• Bidders have demand curves, describing for each price 𝑝 ∈ [0,1] the number of goods they wish to purchase at 𝑝
– Total budget: 𝑝 ∙ 𝑏𝑖(𝑝) ≤ 1
• The auctioneer picks the 𝑝 maximizing revenue 𝑝 ∙ Σ𝑖 𝑏𝑖(𝑝)
• Select the price using the exponential mechanism (see the sketch below)
– Utility is the revenue
– Approximately truthful: An individual can't influence the price choice by changing her bid
– Resilient to collusion (for small coalitions)!
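A minimal sketch of picking a posted price this way; the demand curves and price grid are made up. Since each bidder's spending is capped at 1, one bidder changes the revenue at any fixed price by at most 1, so ∆𝑢 = 1:

```python
import numpy as np

def dp_posted_price(demands, prices, eps):
    """Exponential mechanism with utility = revenue and Delta_u = 1."""
    revenue = np.array([p * sum(d(p) for d in demands) for p in prices])
    scores = eps * revenue / 2.0      # eps * u / (2 * Delta_u)
    scores -= scores.max()
    probs = np.exp(scores)
    probs /= probs.sum()
    return prices[np.random.choice(len(prices), p=probs)]

# Hypothetical demand: buy as many goods as a unit budget allows at price p
demands = [lambda p: int(1.0 / p) for _ in range(50)]
prices = [0.1, 0.25, 0.5, 0.75, 1.0]
print(dp_posted_price(demands, prices, eps=1.0))
```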
Efficiency?
• Generating synthetic data can be hard
• Consider the following database
– Choose a single sign/verify key pair (𝑣𝑘*, 𝑠𝑘*)
– Database is 𝑛 rows: (𝑚𝑖, 𝐒𝐢𝐠𝐧(𝑠𝑘*, 𝑚𝑖), 𝑣𝑘*) for random messages 𝑚𝑖
– One query for each 𝑣𝑘: What fraction of rows are valid signatures wrt 𝑣𝑘?
• An efficient curator cannot generate a synthetic database (yielding correct answers wrt 𝑣𝑘*) without leaking rows
Efficient Synopsis Generation?
• Trivial to find a synopsis with the same functionality
– Simply publish 𝑣𝑘∗
• Maintain functionality of the database wrt a set 𝑄 of queries
• Hide presence/absence of any individual
• We will argue that there are hard-to-sanitize distributions
– Assuming so-called traitor tracing schemes
– The converse is also true (but we won't talk about it)
Traitor Tracing Schemes
[Diagram: a broadcaster holding the broadcast key 𝑏𝑘 sends 𝑐 ←$ 𝐄𝐧𝐜(𝑏𝑘, 𝑏) to users holding secret keys 𝑠𝑘₁, 𝑠𝑘₂, …, 𝑠𝑘𝑛; each user recovers 𝐃𝐞𝐜(𝑠𝑘𝑖, 𝑐) = 𝑏]
Stateful Pirates
• What if some users try to resell the content?
• Some users in the coalition will be traced!
[Diagram: the Tracer, holding the tracing key 𝑡𝑘, feeds ciphertexts 𝑐₁, …, 𝑐𝑡 to a Pirate Decoder built from some of the keys 𝑠𝑘₁, …, 𝑠𝑘𝑛; from its answers 𝑏₁, …, 𝑏𝑡 the Tracer accuses some user 𝑖]
Intuition for the Lower Bound
• Assume traitor tracing scheme
• One universe element for each private key
– Database is collection of 𝑛 randomly chosen keys
• One query for each ciphertext 𝑐
– For what fraction of rows 𝑖 does 𝐃𝐞𝐜(𝑠𝑘𝑖, 𝑐) = 1?
• For any synopsis (i.e. the pirate):
– Answer will be 0 or 1, i.e. the decryption of 𝑐
– Tracing reveals ≥ 1 key and never falsely accuses
– Violates differential privacy!
More Lower Bounds
• Theorem: Assuming one-way functions exist, differentially private algorithms for the following require exponential time
• Synthetic data for 2-way marginals
– Proof relies on digital signatures
• Synopsis release for > 𝑛² arbitrary counting queries
– Proof relies on traitor tracing schemes
Approximate Differential Privacy
[Diagram: the Data Analyst's queries 𝑞₁, 𝑞₂, … are answered through a Randomized Mechanism 𝓜 sitting in front of the Database]
Mechanism 𝓜 gives (ε, 𝛿)-differential privacy if for all pairs of adjacent datasets 𝑥, 𝑦, and for all events 𝑆:
Pr[𝓜(𝑥) ∈ 𝑆] ≤ 𝑒^ε ∙ Pr[𝓜(𝑦) ∈ 𝑆] + 𝛿
Benefits of the Relaxation
• Advanced composition
– Can answer 𝑘 queries with cumulative loss ≈ √𝑘 ∙ ε instead of 𝑘ε (see the sketch below)
• Can use cryptography to simulate a trusted center
• Gaussian noise
– Leading to better accuracy
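A worked comparison of the two composition bounds. The advanced-composition formula below is the standard ε′ = √(2𝑘 ln(1/𝛿′)) ∙ ε + 𝑘ε(𝑒^ε − 1), which is not spelled out on the slide; the numbers are arbitrary examples:

```python
import math

def basic_composition(k: int, eps: float) -> float:
    """The epsilons simply add up."""
    return k * eps

def advanced_composition(k: int, eps: float, delta_prime: float) -> float:
    """sqrt(2k ln(1/delta')) * eps + k*eps*(e^eps - 1), at the cost of
    an extra additive delta' in the delta parameter."""
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * (math.exp(eps) - 1))

print(basic_composition(1000, 0.01))                       # 10.0
print(advanced_composition(1000, 0.01, delta_prime=1e-6))  # ~1.77
```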
The Gaussian Mechanism
Theorem: Adding noise Lap(∆₁/ε) yields (ε, 0)-differential privacy
∆₁ = max_{adj. 𝑥,𝑦} ‖𝑞(𝑥) − 𝑞(𝑦)‖₁
Theorem: Adding noise N(0, 2 ln(2/𝛿) ∙ (∆₂)²/ε²) yields (ε, 𝛿)-differential privacy
∆₂ = max_{adj. 𝑥,𝑦} ‖𝑞(𝑥) − 𝑞(𝑦)‖₂
For 𝑘 counting queries: ∆₁ = 𝑘, ∆₂ = √𝑘
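A minimal sketch of the Gaussian mechanism for a vector of 𝑘 counting queries, using the slide's variance bound; the data and parameters are made up:

```python
import numpy as np

def gaussian_mechanism(answers, eps, delta, l2_sensitivity):
    """Add N(0, sigma^2) noise per coordinate, with
    sigma = sqrt(2 * ln(2/delta)) * l2_sensitivity / eps."""
    sigma = np.sqrt(2 * np.log(2 / delta)) * l2_sensitivity / eps
    noise = np.random.normal(scale=sigma, size=len(answers))
    return np.asarray(answers, dtype=float) + noise

k = 100                                  # k counting queries: Delta_2 = sqrt(k)
true_answers = np.random.randint(0, 1000, size=k)
noisy = gaussian_mechanism(true_answers, eps=1.0, delta=1e-5,
                           l2_sensitivity=np.sqrt(k))
print(np.abs(noisy - true_answers).mean())   # far less noise than Delta_1 = k
```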
Incentives
• Until now: The goal was designing differentially private mechanisms, but the data is assumed to be already there
• But why should someone participate in the computation?
• Why would they give their true data?
• Do we need compensation? How much?
• Any connection with game theory?
Game Theory and Mechanism Design
• Goal: Solve optimization problem
• Catch: No access to inputs
– Inputs held by self-interested agents
• Design incentives and a choice of solution (a mechanism) that incentivize truth-telling
– No need for participants to strategize
– Simple to predict what will happen
– Often a non-truth-telling mechanism can be replaced by one where the coordinator does the lying on behalf of the participants
Good News
• Composition: Approximate truthfulness is still satisfied under composition!
• Collusion resistance: 𝑂(𝑘ε)-approximate dominant strategy, even for coalitions of 𝑘 agents
• Both properties not immediate in game-theoretic mechanism design
• All done without money!
Bad News
• Not only truthful reporting gives an approximate dominant strategy: any report does so
– Even malicious ones
• How do we actually get people to participate truthfully?
– Perhaps need compensation
– Much harder to achieve
Differential Privacy as a Tool
• Nash equilibrium: An assignment of players to strategies so that no player would benefit by changing strategy, given how everyone else is playing
• Correlated equilibrium: Players have access to correlated signals (e.g., a traffic light)
– Every Nash equilibrium is a correlated equilibrium, but not vice versa
• Differential privacy has applications to mechanism design with correlated equilibria
The Issue of Verification
• Challenging to strictly incentivize truth-telling in differentially private mechanism design
• Exceptions:
– Responses are verifiable
– Agents care about outcome
• Challenge: No observed outcome
– What is the prevalence of drug use?
– Are students cheating in class?
Bayesian Setting
• Bit-cost pairs (𝑏𝑖, 𝑐𝑖) drawn from a joint distribution
• If 𝑏𝑖 says I am a cheater, I tend to believe we are in a world where people cheat
• But the cost 𝑐𝑖 does not give additional information beyond what 𝑏𝑖 gives
– Privacy costs are arbitrary, but upper bounded by a linear cost 𝑐𝑖 ∙ ε
– Utility model: 𝑐𝑖 ∙ ε − 𝑝𝑖
Privacy & Game Theory
• Asymptotic truthfulness, some new mechanism design and equilibrium selection results
• Interesting challenge of modeling costs for privacy
• In order to design privacy for humans, do we need to understand
– How people currently value privacy, or should value it?
– What are the right promises to give?