Toward Privacy in Public Databases
Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Work Done at Microsoft Research
2
Database Privacy
Think “Census”: individuals provide information; the Census Bureau publishes sanitized records
Privacy is legally mandated; what utility can we achieve?
Inherent privacy vs. utility trade-off: one extreme – complete privacy, no information; the other extreme – complete information, no privacy
Goals: find a middle path – preserve macroscopic properties; “disguise” individual identifying information
Change the nature of discourse: establish a framework for meaningful comparison of techniques
3
Current solutions
Statistical approaches: alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much
Query-based approaches: disallow queries that reveal too much; output perturbation (add noise to the true answer)
Unsatisfying: ad-hoc definitions of the privacy breach; erasure can disclose information; noise can cancel (although, see work of Nissim et al.); combinations of several seemingly innocuous queries can reveal information, and refusal to answer can be revelatory
4
Everybody’s First Suggestion
Learn the distribution, then output a description of the distribution, or samples from the learned distribution
Want to reflect facts on the ground: statistically insignificant clusters can be important for allocating resources
5
Our Approach
Crypto-flavored definitions: a mathematical characterization of the adversary’s goal; a precise definition of when a sanitization procedure fails
Intuition: seeing the sanitized DB gives the adversary an “advantage”
Statistical techniques: perturbation of attribute values. Differs from previous work: perturbation amounts depend on local densities of points
Highly abstracted version of the problem: if we can’t understand this, we can’t understand real life (and we can’t…). If we get negative results here, the world is in trouble.
6
What do WE mean by privacy?
[Ruth Gavison] Protection from being brought to the attention of others: the attention is inherently a privacy loss, and it invites further privacy loss
Privacy is assured to the extent that one blends in with the crowd
Appealing definition; can be converted into a precise mathematical statement…
7
A geometric view
Abstraction: the database consists of points in high-dimensional space R^d, independent samples from some underlying distribution
Points are unlabeled: you are your collection of attributes
Distance is everything: points are similar if and only if they are close (L2 norm)
Real Database (RDB), private: n unlabeled points in d-dimensional space
Sanitized Database (SDB), public: n’ new points, possibly in a different space
8
The adversary or Isolator - Intuition
On input SDB and auxiliary information, the adversary outputs a point q ∈ R^d
q “isolates” a real DB point x if q is much closer to x than to x’s near neighbors; q fails to isolate x if q looks roughly as much like everyone in x’s neighborhood as it looks like x itself
Tightly clustered points have a smaller radius of isolation
9
Isolation – the definition
I(SDB, aux) = q. Write δx = |q − x|; x is isolated if B(q, cδx) contains fewer than T other points from RDB
T-radius of x: the distance to its T-th nearest neighbor
x is “safe” if δx > (T-radius of x)/(c−1): then B(q, cδx) contains x’s entire T-neighborhood
c is the privacy parameter; e.g., c = 4
[Figure: q at distance δx from x; the ball B(q, cδx) contains x’s neighbor p]
If |x−p| < T-radx < (c−1)δx, then |q−p| ≤ |q−x| + |x−p| < δx + T-radx < cδx
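The isolation test is easy to state in code. A minimal sketch, assuming numpy; the function names and toy data are illustrative, not from the paper:

```python
import numpy as np

def t_radius(rdb, i, T):
    """Distance from rdb[i] to its T-th nearest neighbor in RDB."""
    dists = np.linalg.norm(rdb - rdb[i], axis=1)
    return np.sort(dists)[T]  # index 0 is the point itself (distance 0)

def isolates(q, rdb, i, c=4, T=5):
    """q c-isolates x = rdb[i] if B(q, c*delta), delta = |q - x|,
    contains fewer than T points of RDB besides x itself."""
    delta = np.linalg.norm(q - rdb[i])
    in_ball = np.linalg.norm(rdb - q, axis=1) < c * delta
    return (np.sum(in_ball) - 1) < T  # -1 excludes x itself
```

With a tight cluster plus a single far-away point, a query next to the far point isolates it, while a query inside the cluster fails: the ball of radius cδ sweeps in the whole T-neighborhood.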
10
Requirements for the sanitizer
No way of obtaining privacy if AUX already reveals too much!
The sanitization procedure compromises privacy if giving the adversary access to the SDB considerably increases its probability of success
The definition of “considerably” can be forgiving, say, n^-2. Made rigorous by quantification over adversaries, distributions, auxiliary information, sanitizations, and samples:
∀I ∃I’ w.o.p. ∀D ∀aux z ∀x ∈ D: |Pr[I(SDB, z) isolates x] − Pr[I’(z) isolates x]| is small
Provides a framework for describing the power of a sanitization method, and hence for comparisons
11
The Sanitizer
The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius
x’ = San(x) ∈R B(x, T-rad(x)), i.e., x’ is drawn uniformly at random from the ball of radius T-rad(x) around x
Intuition: we are blending x in with its crowd
We are adding random noise with mean zero to x, so several macroscopic properties should be preserved.
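The perturbation step can be sketched as follows, assuming numpy (sampling uniformly from a ball via a normalized Gaussian direction and a U^(1/d) radius; names are illustrative):

```python
import numpy as np

def uniform_in_ball(center, radius, rng):
    """Uniform sample from the ball B(center, radius)."""
    d = center.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                   # uniform random direction
    r = radius * rng.random() ** (1.0 / d)   # radius law for uniform density
    return center + r * u

def sanitize(rdb, T=5, seed=0):
    """Replace each point by a uniform sample from the ball of
    radius T-rad(x) around it (the density-based perturbation)."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(rdb)
    for i, x in enumerate(rdb):
        t_rad = np.sort(np.linalg.norm(rdb - x, axis=1))[T]
        out[i] = uniform_in_ball(x, t_rad, rng)
    return out
```

Points in dense regions get small perturbations, isolated points get large ones: exactly the “blend into the crowd” intuition.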
12
Flavor of Results (Preliminary)
Assumptions: data arises from a mixture of Gaussians; the dimension d and the number of points n are large; d = ω(log n)
Results – Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^-Ω(d) (several special cases; the general result is not yet proved). Very different proof techniques from anything in the statistics or crypto literatures!
Utility: a user who does not know the Gaussians can compute the means with high probability.
13
The “simplest” interesting case
Two points – x and y – generated uniformly from the surface of a ball B(o, r)
The adversary knows x’, y’, and δ = |x−y|
We prove there are 2^Ω(d) “decoy” pairs (xi, yi) such that |xi−yi| = δ and Pr[xi, yi | x’, y’] = Pr[x, y | x’, y’]
Furthermore, the adversary can only isolate one point xi or yi at a time: they are “far apart” with respect to δ
Proof based on symmetry arguments and coding theory. High dimensionality is crucial.
14
Finding Decoy Pairs
Consider a hyperplane H through x’, y’, and o; let xH, yH be the mirror reflections of x, y through H
Note: reflections preserve distances! The world of xH, yH looks identical to the world of x, y
[Figure: x, y and their reflections xH, yH through the hyperplane H; the sanitized points x’, y’ lie on H]
Pr[xH, yH | x’, y’] = Pr[x, y | x’, y’]
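The reflection argument is easy to check numerically. A sketch with assumed stand-in points (in the talk, x’ and y’ lie on the sphere; only the isometry matters here): a reflection whose hyperplane contains o, x’, y’ fixes the sanitized points and preserves all distances, so the reflected pair is just as plausible a pre-image.

```python
import numpy as np

def reflect(p, n):
    """Reflect p through the hyperplane through the origin with unit normal n."""
    return p - 2.0 * np.dot(p, n) * n

rng = np.random.default_rng(0)
d = 50
x, y = rng.normal(size=d), rng.normal(size=d)
xp, yp = rng.normal(size=d), rng.normal(size=d)  # stand-ins for x', y'

# Build a unit normal orthogonal to span{x', y'}, so H contains o, x', y'.
b1 = xp / np.linalg.norm(xp)
b2 = yp - np.dot(yp, b1) * b1
b2 /= np.linalg.norm(b2)
n = rng.normal(size=d)
n -= np.dot(n, b1) * b1 + np.dot(n, b2) * b2
n /= np.linalg.norm(n)

xH, yH = reflect(x, n), reflect(y, n)
# Reflection is an isometry fixing x' and y':
assert np.isclose(np.linalg.norm(xH - yH), np.linalg.norm(x - y))
assert np.isclose(np.linalg.norm(xH - xp), np.linalg.norm(x - xp))
assert np.isclose(np.linalg.norm(yH - yp), np.linalg.norm(y - yp))
```

Each choice of normal n gives a different decoy pair with identical posterior probability given x’, y’.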
15
Lots of choices for H
xH, yH – reflections of x, y through H(x’, y’, o). Reflections preserve distances, so the world of xH, yH looks identical to the world of x, y
How many different H are there such that the corresponding xH are pairwise distant (and distant from x)?
[Figure: x and its reflections x1, x2 on a circle of radius r; two reflections at angle θ are at chord distance 2r sin θ; pairwise distances > 2δ/3]
Sufficient to pick r > 2δ/3 and θ = 30°
Fact: there are 2^Ω(d) vectors in d dimensions at angle 60° from each other.
Probability that the adversary wins ≤ 2^-Ω(d)
16
Towards the general case… n points
The adversary is given n−1 real points x2, …, xn and one sanitized point x’1
Symmetry does not work – too many constraints
A more direct argument: let Z = { p ∈ R^d | p is a legal pre-image for x’1 } and Q = { p | if x1 = p then x1 is isolated by q }
Show that Pr[x1 ∈ Q∩Z | x’1] ≤ 2^-Ω(d):
Pr[x1 ∈ Q∩Z | x’1] = (probability mass contributed by Q∩Z)/(mass contributed by Z) ≤ 2^(1−d)/(1/4)
17
Why does Q∩Z contribute so little mass?
Z = { p | p is a legal pre-image for x’1 }
Q = { p | if x1 = p then x1 is isolated by q }
[Figure: query point q near x’1; real points x2, …, x6; regions Z, Q, and their intersection Q∩Z]
Key observation: as |q − x’1| increases, Q becomes larger. But a larger distance from x’1 implies a smaller probability mass, as x1 is randomized over a larger area
Here T = 1 and we perturb to the 1-radius: |x’1 − x1| = 1-rad(x1)
18
The general case… n sanitized points
Initial intuition is wrong: privacy of x1 given x’1 and all the other points in the clear does not imply privacy of x1 given x’1 and sanitizations of the others!
Sanitization of the other points reveals information about x1
19
Digression: Histogram Sanitization
U = d-dimensional cube of side 2; cut it into 2^d subcubes by splitting along each axis, so each subcube has side 1
For each subcube: if the number of RDB points in it exceeds 2T, then recurse
Output: list of cells and counts
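A minimal sketch of this recursive splitting, assuming numpy; the cell representation and the minimum-side guard are illustrative choices, not from the talk:

```python
import numpy as np
from itertools import product

def histogram_cells(points, corner, side, T):
    """Recursively split the cube [corner, corner+side)^d into 2^d
    subcubes while a cell holds more than 2T points; return the list
    of leaf cells as (corner, side, count) triples."""
    if len(points) > 2 * T and side > 1e-9:  # side guard stops duplicates
        half = side / 2.0
        cells = []
        for bits in product((0, 1), repeat=corner.shape[0]):
            sub = corner + half * np.array(bits)
            mask = np.all((points >= sub) & (points < sub + half), axis=1)
            cells += histogram_cells(points[mask], sub, half, T)
        return cells
    return [(corner, side, len(points))]
```

Only the (cell, count) pairs are published, never the points themselves; every leaf cell either holds at most 2T points or has hit the minimum side length.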
20
Digression: Histogram Sanitization
Theorem: If n = 2^o(d) and points are drawn uniformly from U, then histogram sanitizations are safe with respect to 8-isolation: Pr[I(SDB) succeeds] ≤ 2^-Ω(d).
Rough intuition: for q ∈ C, the expected distance to any x ∈ C is relatively large (and even larger for x ∈ C’), and the distances are tightly concentrated. Increasing the radius by a factor of 8 captures almost all of the parent cell, which contains at least 2T points.
21
Combining the Two Sanitizations
Partition RDB into two sets A and B. Cross-training:
Compute the histogram sanitization for B
For each v ∈ A: let ρv be the side length of the cell C containing v; output GSan(v, ρv)
[Figure: RDB split into halves A and B]
22
Cross-Training Privacy
Privacy for B: only histogram information about B is used
Privacy for A: there is enough variance in enough coordinates of v, even given the cell C containing v and the sanitization v’ of v.
23
Results on privacy: the special cases
| Distribution | Num. of points | Revealed to adversary | Auxiliary information |
| Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius |
| Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution |
| Uniform over a hypercube | 2^Ω(d) | n/2 sanitized points | Distribution |
| Gaussian | 2^o(d) | n sanitized points | Distribution |
24
Learning mixtures of Gaussians - Spectral techniques
Observation: an optimal low-rank approximation to a matrix of complex data yields the underlying structure, e.g., the means [M01, VW02].
We show that McSherry’s algorithm works for clustering sanitized Gaussian data: the original distribution (mixture of Gaussians) is recovered
25
Spectral techniques for perturbed data
A sanitized point is the sum of two Gaussian variables: the sample plus the noise
W.h.p. the T-radius of a point is less than the “radius” of its Gaussian, so the variance of the noise is small and the previous techniques still work
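A toy numpy sketch of the idea (all parameters – the separation, noise scale, and clustering-by-sign step – are assumptions for illustration, not the algorithm from [M01]): project the noisy mixture onto its top singular direction, cluster by sign, and read off the empirical means.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 400
mu = np.zeros(d); mu[0] = 10.0                 # cluster means at +/- mu
labels = rng.integers(0, 2, size=n)
data = rng.normal(size=(n, d)) + np.where(labels[:, None] == 0, mu, -mu)
data += rng.normal(scale=1.0, size=(n, d))     # stand-in for sanitization noise

centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[0]                        # top singular direction
side = proj > 0                                # cluster by sign of projection

est0 = data[side].mean(axis=0)                 # empirical cluster means
est1 = data[~side].mean(axis=0)
```

With well-separated means, the top singular direction aligns with the between-cluster direction even after perturbation, so the recovered means land near ±mu (up to a label swap).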
26
Results on utility… An overview
| Distributional/Worst-case | Objective | Assumptions | Result |
| Worst-case | Find k clusters minimizing the largest diameter | – | Diameter increases by a factor of 3 |
| Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far |
27
What about the real world?
Lessons from the abstract model: high dimensionality is our friend; Gaussian perturbations seem to be the right thing to do; we need to scale different attributes appropriately, so that the data is well-rounded
Moving towards real data:
Outliers – our notion of c-isolation deals with them, but the existence of an outlier may be disclosed
Discrete attributes – convert them into real-valued attributes, e.g., convert a binary variable into a probability