Foundations of Privacy
Lecture 8
Lecturer: Moni Naor
Recap of last week’s lecture
• Counting Queries
  – Online algorithm with good accuracy
  – Started changing data
What if the data is dynamic?
• Want to handle situations where the data keeps changing
  – Not all data is available at the time of sanitization
[Figure: curator/sanitizer]
Google Flu Trends
“We've found that certain search terms are good indicators of flu activity.
Google Flu Trends uses aggregated Google search data to estimate current flu activity around the world in near real-time.”
Example of Utility: Google Flu Trends
What if the data is dynamic?
Issues
• When does the algorithm make an output?
• What does the adversary get to examine?
• How do we define an individual whom we should protect?
• Efficiency measures of the sanitizer
Data Streams
Data is a stream of items.
The sanitizer sees each item and updates its internal state.
It produces output either on the fly or at the end.
[Figure: data stream → sanitizer (with internal state) → output]
Three new issues/concepts
• Continual Observation
  – The adversary gets to examine the output of the sanitizer all the time
• Pan Privacy
  – The adversary gets to examine the internal state of the sanitizer. Once? Several times? All the time?
• “User” vs. “Event” Level Protection
  – Are the items “singletons” or are they related?
Randomized Response
• Randomized Response Technique [Warner 1965]
  – Method for polling stigmatizing questions
  – Idea: lie with known probability
    • Specific answers are deniable
    • Aggregate results are still valid
• The data is never stored “in the plain”
  – “Trust no-one”
• Popular in the DB literature [Mishra and Sandler]
[Figure: each user's bit (1, 0, 1, …) is reported with noise added]
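The "lie with known probability" idea can be sketched in a few lines. This is a minimal illustration, not Warner's exact protocol: the function names and the truth-telling probability `p_truth` are assumptions made for the example.

```python
import random

def randomized_response(true_bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise report its flip."""
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_fraction(reports, p_truth):
    """Estimate the true fraction f of 1s from the noisy reports.

    E[observed] = p_truth * f + (1 - p_truth) * (1 - f); solve for f.
    Requires p_truth != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(0)
truth = [1] * 3000 + [0] * 7000          # hypothetical population, 30% ones
reports = [randomized_response(b, 0.75, rng) for b in truth]
estimate = estimate_fraction(reports, 0.75)
```

Each individual report is deniable, yet `estimate` concentrates around the true fraction as the number of respondents grows.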
The Dynamic Privacy Zoo
Differentially Private
Continual Observation
Pan Private
User level Private
User-Level Continual Observation Pan Private
Petting
Randomized Response
Continual Output Observation
Data is a stream of items.
The sanitizer sees each item and updates its internal state.
It produces an output observable to the adversary.
[Figure: data stream → sanitizer (state) → output]
Continual Observation
• Alg: an algorithm working on a stream of data
  – Maps prefixes of data streams to outputs
  – At step i it outputs $\sigma_i$
• Alg is ε-differentially private against continual observation if, for all
  – adjacent data streams S and S’
  – prefixes of length t and outputs $\sigma_1 \sigma_2 \ldots \sigma_t$

$$e^{-\varepsilon} \le \frac{\Pr[\mathrm{Alg}(S) = \sigma_1 \sigma_2 \ldots \sigma_t]}{\Pr[\mathrm{Alg}(S') = \sigma_1 \sigma_2 \ldots \sigma_t]} \le e^{\varepsilon} \approx 1 + \varepsilon$$

Adjacent data streams: one can be obtained from the other by changing one element.
Example: S = acgtbxcde, S’ = acgtbycde
The Counter Problem
0/1 input stream 011001000100000011000000100101
Goal : a publicly observable counter, approximating the total number of 1’s so far
Continual output: each time period, output total number of 1’s
Want to hide individual increments while providing reasonable accuracy
Counters w. Continual Output Observation
Data is a stream of 0/1 values.
The sanitizer sees each x_i and updates its internal state.
It produces a value observable to the adversary.
[Figure: input stream 1 0 0 1 0 0 1 1 0 0 0 1 → sanitizer (state) → outputs 1 1 1 2 …]
Counters w. Continual Output Observation
Continual output: each time period, output the total number of 1’s.
Initial idea: at each time period, on input $x_i \in \{0, 1\}$
• Update the counter by the input x_i
• Add independent Laplace noise with magnitude 1/ε
Privacy: each increment is protected by Laplace noise, so the output is differentially private whether x_i is 0 or 1.
Accuracy: noise cancels out, error Õ(√T), where T is the total number of time periods.
For sparse streams this error is too high.
[Figure: axis from -4 to 5]
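The initial idea can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `laplace` and the function name `naive_noisy_counter` are hypothetical, and the Laplace sample is generated as a difference of two exponentials.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def naive_noisy_counter(stream, eps, rng):
    """At every step add the input bit plus fresh Laplace(1/eps) noise to the counter.

    The accumulated noise is a sum of t independent Laplace variables, so the
    error after T steps grows on the order of sqrt(T) / eps.
    """
    counter = 0.0
    outputs = []
    for x in stream:
        counter += x + laplace(1.0 / eps, rng)
        outputs.append(counter)   # publicly observable value at this step
    return outputs

# Sparse stream: only 100 ones in 10,000 steps, yet the noise is on the order of sqrt(T).
stream = [1 if i % 100 == 0 else 0 for i in range(10_000)]
outputs = naive_noisy_counter(stream, eps=1.0, rng=random.Random(0))
```

On a sparse stream like the one above, the accumulated noise can be comparable to or larger than the true count, which is the weakness the slide points out.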
Why So Inaccurate?
• Operates essentially as in randomized response
  – No utilization of the state
• Problem: we do the same operations when the stream is sparse as when it is dense
  – Want to act differently when the stream is dense
• The times at which the counter is updated are potential leakage

Main idea: update the output value only when there is a large gap between the actual count and the output.
We have a good way of outputting the value of the counter once: the actual counter + noise.
Maintain:
• Actual count A_t (+ noise)
• Current output out_t (+ noise)
Delayed Updates
D: update threshold
Out_t: current output
A_t: count since last update
D_t: noisy threshold

If A_t − D_t > fresh noise then
  Out_{t+1} ← Out_t + A_t + fresh noise
  A_{t+1} ← 0
  D_{t+1} ← D + fresh noise
Noise: independent Laplace noise with magnitude 1/ε

Accuracy:
• For threshold D: w.h.p. update about N/D times
• Total error: (N/D)^{1/2} · noise + D + noise + noise
• Setting D = N^{1/3} gives accuracy ~ N^{1/3}

Delayed Output Counter
[Figure: output counter following the true count with a delay]
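A minimal sketch of the delayed-update rule, assuming the state variables map to the slide's A_t (`pending`), D_t (`noisy_threshold`) and Out_t (`out`). The function names are hypothetical, and N in the accuracy analysis is read here as the number of 1's in the stream.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def delayed_counter(stream, eps, threshold, rng):
    """Publish a new output only when the pending count clears a noisy threshold."""
    scale = 1.0 / eps
    out = 0.0                                            # Out_t
    pending = 0                                          # A_t: count since last update
    noisy_threshold = threshold + laplace(scale, rng)    # D_t
    outputs, updates = [], 0
    for x in stream:
        pending += x
        if pending - noisy_threshold > laplace(scale, rng):
            out += pending + laplace(scale, rng)         # Out_{t+1} = Out_t + A_t + noise
            pending = 0                                  # A_{t+1} = 0
            noisy_threshold = threshold + laplace(scale, rng)  # D_{t+1} = D + noise
            updates += 1
        outputs.append(out)
    return outputs, updates

# With N = 1000 ones and D = N^(1/3) = 10, expect roughly N / D = 100 updates.
outputs, updates = delayed_counter([1] * 1000, eps=1.0, threshold=10, rng=random.Random(0))
```

Between updates the published value does not move, so the times and sizes of the jumps are the only things the observer learns; both are protected by the fresh noise.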
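Privacy of Delayed Output
Protect: update time and update value.
For any two adjacent sequences, e.g.
  101101110001
  101101010001
we can pair up noise vectors
  $\eta_1, \eta_2, \ldots, \eta_{k-1}, \eta_k, \eta_{k+1}, \ldots$
  $\eta_1, \eta_2, \ldots, \eta_{k-1}, \eta'_k, \eta_{k+1}, \ldots$
that are identical in all locations except one, where $\eta'_k = \eta_k + 1$ and k is the location of the first update after the difference occurred.
The probabilities of the paired vectors differ by a factor of ≈ e^ε.
Update rule: if A_t − D_t > fresh noise, then Out_{t+1} ← Out_t + A_t + fresh noise and D_{t+1} ← D + fresh noise.
[Figure: noisy thresholds D_t and D’_t for the two sequences]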
Dynamic from Static
Idea: apply the conversion of static algorithms into dynamic ones [Bentley-Saxe 1980]
• Run many accumulators in parallel:
  – Each accumulator counts the number of 1's in a fixed segment of time, plus noise
  – An accumulator is measured while the stream is in its time frame
  – The value of the output counter at any point in time is the sum of the accumulators of a few segments; only finished segments are used
• Accuracy: depends on the number of segments in the summation and the accuracy of the accumulators
• Privacy: depends on the number of accumulators that a point x_t influences
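The Segment Construction
Based on the bit representation: each point t is in $\lceil \log t \rceil$ segments.
$\sum_{i=1}^{t} x_i$ is the sum of at most $\log t$ accumulators.
By setting $\varepsilon' \approx \varepsilon / \log T$ we get the desired privacy.
Accuracy: with all but negligible (in T) probability, the error at every step t is at most $O((\log^{1.5} T)/\varepsilon)$; the exponent is 1.5 rather than 2 because the noise of the accumulators partially cancels.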
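The segment construction can be sketched with one accumulator per bit position of the time index. This is a sketch in the style of the standard binary-tree counter, not necessarily the lecture's exact bookkeeping; the names `segment_counter`, `exact` and `noisy` are hypothetical, and each accumulator is given privacy parameter eps / levels, with `levels` playing the role of log T.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def segment_counter(stream, eps, rng):
    """Continual counter built from noisy sums over dyadic time segments."""
    levels = max(1, len(stream).bit_length())
    scale = levels / eps              # each item influences at most `levels` accumulators
    exact = [0] * levels              # exact sum of the open segment at each level
    noisy = [0.0] * levels            # noisy value of the finished segment at each level
    outputs = []
    for t, x in enumerate(stream, start=1):
        i = (t & -t).bit_length() - 1          # index of the lowest set bit of t
        # The segment of length 2^i ending at t absorbs all lower-level segments.
        exact[i] = sum(exact[:i]) + x
        for j in range(i):
            exact[j] = 0
            noisy[j] = 0.0
        noisy[i] = exact[i] + laplace(scale, rng)
        # The prefix [1, t] is covered by one finished segment per set bit of t.
        outputs.append(sum(noisy[j] for j in range(levels) if (t >> j) & 1))
    return outputs

outputs = segment_counter([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1], eps=1.0, rng=random.Random(0))
```

Every output is a sum of at most `levels` noisy accumulators, which is where the polylogarithmic error comes from, in contrast to the sqrt(T) error of adding fresh noise at every step.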
Synthetic Counter
Can make the counter synthetic:
• Monotone
• Each round the counter goes up by at most 1
Applies to any monotone function
Lower Bound on Accuracy
Theorem: additive inaccuracy of log T is essential for ε-differential privacy, even for ε = 1.
• Consider the stream $0^T$ compared to the collection of T/b streams of the form $S_j = 0^{jb} 1^b 0^{T-(j+1)b}$
  Example: S_j = 000000001111000000000000 (a block of b ones)
• Call an output sequence correct if it is a b/3 approximation at all points in time
• Parameters: b = ½ log T, ε = ½
…Lower Bound on Accuracy
Important properties:
• For any output, the ratio of its probabilities under stream S_j and under $0^T$ should be at least $e^{-\varepsilon b}$
  – Hybrid argument from differential privacy
• Any output sequence is correct (a b/3 approximation at all points in time) for at most one of the S_j or $0^T$
  – Say the probability of a good output sequence is at least $\gamma$
  – An output that is good for S_j (e.g. S_j = 000000001111000000000000) has probability at least $\gamma e^{-\varepsilon b}$ under $0^T$
  – So $\frac{T}{b} \cdot \gamma e^{-\varepsilon b} \le 1 - \gamma$: contradiction
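Plugging the slide's parameters into the counting argument makes the contradiction explicit. The sketch below assumes natural logarithms and writes $\gamma$ for the probability of a correct output sequence (the symbol is an assumption, since the original one did not survive in the slides).

```latex
\begin{align*}
b = \tfrac12 \ln T,\quad \varepsilon = \tfrac12
  &\;\Longrightarrow\; e^{-\varepsilon b} = e^{-\frac14 \ln T} = T^{-1/4}, \\
\Pr_{0^T}\bigl[\text{output is correct for some } S_j\bigr]
  &\;\ge\; \frac{T}{b}\,\gamma\,e^{-\varepsilon b}
  \;=\; \frac{2\gamma\,T^{3/4}}{\ln T}.
\end{align*}
```

Since no output sequence is correct for two different streams, these events are disjoint from each other and from the event that the output is correct for $0^T$, so the left-hand side is at most $1 - \gamma$. For any constant $\gamma > 0$ the right-hand side exceeds 1 once T is large enough, which is the contradiction.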
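Hybrid Proof
Want to show that for any event B:
$$\frac{\Pr[A(S_j) \in B]}{\Pr[A(0^T) \in B]} \ge e^{-\varepsilon b}$$
Let $S_j^i = 0^{jb} 1^i 0^{T-jb-i}$, so that $S_j^0 = 0^T$ and $S_j^b = S_j$.
Consecutive hybrids are adjacent, so
$$\frac{\Pr[A(S_j^{i+1}) \in B]}{\Pr[A(S_j^{i}) \in B]} \ge e^{-\varepsilon}$$
and the product telescopes:
$$\frac{\Pr[A(S_j^{b}) \in B]}{\Pr[A(S_j^{0}) \in B]} = \frac{\Pr[A(S_j^{1}) \in B]}{\Pr[A(S_j^{0}) \in B]} \cdots \frac{\Pr[A(S_j^{b}) \in B]}{\Pr[A(S_j^{b-1}) \in B]} \ge e^{-\varepsilon b}$$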
What shall we do with the counter?
Privacy-preserving counting is a basic building block in more complex environments.
• General characterizations and transformations
  – Event-level pan-private continual-output algorithm for any low-sensitivity function
• Following expert advice privately
  – Track experts over time, choose whom to follow
  – Need to track how many times each expert was correct
Following Expert Advice [Hannan 1957; Littlestone-Warmuth 1989]
n experts; in every time period each gives 0/1 advice
• Pick which expert to follow
• Then learn the correct answer, say in 0/1
Goal: over time, be competitive with the best expert in hindsight
[Figure: table of 0/1 advice from Expert 1, Expert 2 and Expert 3 alongside the correct answers over several time periods]
Following Expert Advice
Goal: #mistakes of chosen experts ≈ #mistakes made by the best expert in hindsight
• Want a 1+o(1) approximation
Following Expert Advice, Privately
New concern: protect the privacy of the experts’ opinions and outcomes
• User-level privacy: lower bound, no non-trivial algorithm
• Event-level privacy: counting gives a 1+o(1)-competitive algorithm
Was the expert consulted at all?
Algorithm for Following Expert Advice
Follow the perturbed leader [Kalai-Vempala]
• For each expert: keep a perturbed # of mistakes
• Follow the expert with the lowest perturbed count
Idea: use a counter to count # of mistakes in a privacy-preserving way
Problem: not every perturbation works
• Need a counter with a well-behaved noise distribution
Theorem [Follow the Privacy-Perturbed Leader]
For n experts, over T time periods, # mistakes is within ≈ poly(log n, log T, 1/ε) of the best expert
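The shape of the algorithm can be illustrated with a deliberately simplified sketch. It keeps one noisy mistake counter per expert using the simple "add fresh Laplace noise at every step" counter, which is an assumption made for brevity; it is not the counter with a well-behaved noise distribution that the theorem requires, and the function names are hypothetical.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def follow_noisy_leader(advice, outcomes, eps, rng):
    """Follow the expert whose noisy mistake count is currently lowest.

    advice[t][i] is expert i's 0/1 advice at time t; outcomes[t] is the correct
    answer, revealed only after the choice at time t has been made.
    """
    n = len(advice[0])
    noisy_mistakes = [laplace(1.0 / eps, rng) for _ in range(n)]  # initial perturbation
    my_mistakes = 0
    for t, answer in enumerate(outcomes):
        leader = min(range(n), key=lambda i: noisy_mistakes[i])
        if advice[t][leader] != answer:
            my_mistakes += 1
        for i in range(n):
            wrong = 1 if advice[t][i] != answer else 0
            # Simplified noisy counter: every update carries fresh noise.
            noisy_mistakes[i] += wrong + laplace(1.0 / eps, rng)
    return my_mistakes

rng = random.Random(0)
outcomes = [rng.randint(0, 1) for _ in range(400)]
advice = [[o, 1 - o, 1 - o] for o in outcomes]   # expert 0 is always right
mistakes = follow_noisy_leader(advice, outcomes, eps=1.0, rng=rng)
```

The choice of leader depends only on the noisy counts, which is what lets the privacy of the counters carry over to the sequence of chosen experts.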
List Update Problem
There are n distinct elements A = {a_1, a_2, …, a_n}
Have to maintain them in a list (some permutation)
  – Given a request sequence r_1, r_2, …
    • Each $r_i \in A$
  – For request r_i: the cost is how far r_i is in the current permutation
  – Can rearrange the list between requests
• Want to minimize the total cost for the request sequence
  – Sequence not known in advance
Our goal: do it while providing privacy for the request sequence, assuming the list order is public
• For each request r_i: cannot tell whether r_i is in the sequence or not
List Update Problem
In general: cost can be very high
First problem to be analyzed in the competitive framework, by Sleator and Tarjan (1985)
• Compared to the best algorithm that knows the sequence in advance
Best algorithms: 2-competitive deterministic; better randomized, ~1.5
Assume free rearrangements between requests
Bad news: cannot be better than (1/ε)-competitive if we want to keep privacy
• Cannot act until 1/ε requests to an element appear
Lower bound for Deterministic Algorithms
• Bad schedule: always ask for the last element in the list
• Cost of online: $n \cdot t$
• Cost of best fixed list: sort the list according to popularity
  – Average cost: $\le \tfrac12 n$
  – Total cost: $\le \tfrac12 n \cdot t$
List Update Problem: Static Optimality
A more modest performance goal: compete with the best algorithm that fixes the permutation in advance
Blum-Chawla-Kalai: can be 1+o(1) competitive w.r.t. the best static algorithm (probabilistic)
The BCK algorithm is based on the number of times each element has been requested.
Algorithm:
  – Start with random weights r_i in range [1, c]
  – At all times w_i = r_i + c_i
    • c_i is the # of times element a_i was requested
  – At any point in time: arrange the elements according to weights
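The weight rule above can be sketched as follows. Two details are assumptions not stated on the slide: the initial weights are drawn uniformly from [1, c], and the list is kept in decreasing order of weight so that frequently requested elements move toward the front. The function names are hypothetical.

```python
import random

def bck_order(elements, request_counts, initial_weights):
    """Arrange elements by decreasing weight w_i = r_i + c_i."""
    return sorted(elements,
                  key=lambda a: initial_weights[a] + request_counts[a],
                  reverse=True)

def serve(requests, elements, c, rng):
    """Serve a request sequence; the cost of a request is its position in the list."""
    initial = {a: rng.uniform(1, c) for a in elements}   # random weights r_i in [1, c]
    counts = {a: 0 for a in elements}                    # c_i: requests seen so far
    total_cost = 0
    for r in requests:
        order = bck_order(elements, counts, initial)
        total_cost += order.index(r) + 1
        counts[r] += 1
    return total_cost, bck_order(elements, counts, initial)

cost, final_order = serve(["d"] * 50 + ["a"] * 10, ["a", "b", "c", "d"], c=4, rng=random.Random(0))
```

Because the ordering depends on the request sequence only through the counts c_i, replacing them with privacy-preserving counters is the natural next step.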
Privacy with Static Optimality
Algorithm (run with private counters):
  – Start with random weights r_i in range [1, c]
  – At any point in time w_i = r_i + c_i
    • c_i is the # of times element a_i was requested
  – Arrange the elements according to weights
Properties:
  – Privacy: from the privacy of the counters
    • The list depends on the counters plus randomness
  – Accuracy: can show that the BCK proof can be modified to handle approximate counts as well
  – What about efficiency?
The multi-counter problem
How to run n counters for T time steps
• In each round: few counters are incremented
  – Identity of the incremented counter is kept private
• Work per increment: logarithmic in n and T
• Idea: arrange the n counters in a binary tree with n leaves
  – Output counters are associated with the leaves
  – For each internal node: maintain a counter corresponding to the sum of the leaves in its subtree; it determines when to update the subtree
The multi-counter problem
• Idea: arrange the n counters in a binary tree with n leaves
  – Output counters are associated with the leaves
• For each internal node maintain:
  – A counter corresponding to the sum of the leaves in its subtree
  – A register with the number of increments since the last output update
• When a leaf counter is updated:
  – All log n nodes on the path to the root are incremented
  – The internal state of the root is updated
  – If the output of a parent node is updated, the internal state of its children is updated
[Figure: tree of counters, with node labels (internal, output) and (counter, register); output counters at the leaves]
The multi-counter problem
• Work per increment:
  – log n increments + the number of counters that need to be updated
  – Amortized complexity is O(n log n / k)
    • k is the number of times we expect to increment a counter until its output is updated
• Privacy: each increment of a leaf counter affects log n counters
• Accuracy: we have introduced some delay:
  – After $t \ge k \log n$ increments, all nodes on the path have been updated
Pan-Privacy
In the privacy literature: the data curator is trusted
In reality: even a well-intentioned curator is subject to mission creep, subpoena, security breach…
  – Pro baseball anonymous drug tests
  – Facebook policies to protect users from application developers
  – Google accounts hacked
Goal: the curator accumulates statistical information, but never stores sensitive data about individuals
Pan-privacy: the algorithm is private inside and out
• The internal state is privacy-preserving
“think of the children”
Randomized Response [Warner 1965]
Method for polling stigmatizing questions
Idea: participants lie with known probability
  – Specific answers are deniable
  – Aggregate results are still valid
Data is never stored “in the clear”; popular in the DB literature [MiSa06]
[Figure: each user's data bit (1, 0, 1, …) has noise added to produce the user's response]
Strong guarantee: no trust in the curator
Makes sense when each user’s data appears only once; otherwise limited utility
New idea: the curator aggregates statistical information, but never stores sensitive data about individuals
Aggregation Without Storing Sensitive Data?
Streaming algorithms: small storage
  – The information stored can still be sensitive
  – “My data”: many appearances, arbitrarily interleaved with those of others
Pan-Private Algorithm
  – Private “inside and out”
  – Even the internal state completely hides the appearance pattern of any individual: presence, absence, frequency, etc. (“user level”)
Can also consider multiple intrusions
Pan-Privacy Model
Data is a stream of items; each item belongs to a user
Data of different users is interleaved arbitrarily
The curator sees the items, updates its internal state, and produces output at the end of the stream
Pan-Privacy: for every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private
[Figure: curator with internal state and output]
Adjacency: User Level
Universe U of users whose data is in the stream; $x \in U$
• Streams are x-adjacent if they have the same projections of users onto U \ {x}
  Example: axbxcxdxxxex and abcdxe are x-adjacent
  – Both project to abcde
  – Notion of “corresponding locations” in x-adjacent streams
• U-adjacent: $\exists x \in U$ for which they are x-adjacent
  – Simply “adjacent” if U is understood
Note: streams of different lengths can be adjacent
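User-level adjacency is easy to check mechanically: remove every occurrence of x from both streams and compare what is left. A small sketch, with hypothetical function names, treating a stream as a sequence of user identifiers:

```python
def project_out(stream, x):
    """Projection of the stream onto U \\ {x}: drop every item belonging to user x."""
    return [item for item in stream if item != x]

def x_adjacent(s1, s2, x):
    """Two streams are x-adjacent if they agree once user x is removed."""
    return project_out(s1, x) == project_out(s2, x)

def adjacent(s1, s2, universe):
    """U-adjacent: x-adjacent for some user x in the universe."""
    return any(x_adjacent(s1, s2, x) for x in universe)

same = x_adjacent("axbxcxdxxxex", "abcdxe", "x")   # the slide's example
other = adjacent("abc", "abd", "abcd")             # one element changed to another user
```

Here `same` is True, while `other` is False: replacing one item by an item of a different user changes the streams of two users at once, so it is not a single-user change under this definition.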
Example: Stream Density or # Distinct Elements
Universe U of users; estimate how many distinct users in U appear in the data stream
Application: # distinct users who searched for “flu”
Ideas that don’t work:
• Naïve
  – Keep a list of the users that appeared (bad privacy and space)
• Streaming
  – Track a random sub-sample of users (bad privacy)
  – Hash each user, track the minimal hash (bad privacy)
Pan-Private Density Estimator
Inspired by randomized response.
Store for each user $x \in U$ a single bit b_x
• Initially, every b_x is 0 w.p. ½ and 1 w.p. ½
• When encountering x, redraw b_x: 0 w.p. ½ − ε and 1 w.p. ½ + ε
Final output: [(fraction of 1’s in table − ½)/ε] + noise
Pan-Privacy
If a user never appeared: the entry is drawn from D_0
If a user appeared any # of times: the entry is drawn from D_1
D_0 and D_1 are 4ε-differentially private
[Figure: distributions D_0 and D_1]
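The basic estimator can be sketched directly from its description. The scale of the final output noise is not given on the slides; the value used below, 1/(ε·|U|), is an assumption for illustration, as are the function names.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) sample, drawn as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def pan_private_density(stream, universe, eps, rng):
    """Estimate the fraction of users in `universe` that appear in `stream`."""
    # Initially every bit is drawn from D0: 0 or 1 with probability 1/2 each.
    bits = {x: (1 if rng.random() < 0.5 else 0) for x in universe}
    for x in stream:
        # On every appearance, redraw from D1: 1 with probability 1/2 + eps.
        bits[x] = 1 if rng.random() < 0.5 + eps else 0
    frac_ones = sum(bits.values()) / len(bits)
    # E[frac_ones] = 1/2 + eps * density, so invert and add output noise.
    noise = laplace(1.0 / (eps * len(bits)), rng)   # assumed noise scale
    return (frac_ones - 0.5) / eps + noise

universe = range(20_000)
stream = list(range(10_000)) * 2        # half of the users appear, twice each
estimate = pan_private_density(stream, universe, eps=0.25, rng=random.Random(0))
```

The table never records who appeared: a user's bit has distribution D1 whether that user showed up once or a thousand times, and D0 otherwise, so an intruder who inspects the state sees only bits drawn from two close distributions.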
Pan-Private Density Estimator
Improved accuracy and storage
• Multiplicative accuracy using hashing
• Small storage using sub-sampling
Theorem [density estimation streaming algorithm]
ε pan-privacy, multiplicative error α, space is poly(1/α, 1/ε)
The Dynamic Privacy Zoo
Differentially Private Outputs
Privacy under Continual Observation
Pan Privacy
User level Privacy
Continual Pan Privacy
Petting
Sketch vs. Stream