Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
BU CS591 S1 NEU CS 7880
Foundations of Private Data AnalysisSpring 2021
Lecture 01: Introduction
Jonathan Ullman Adam SmithNEU BU
Today
โข Course Introโข A taste of the syllabus
รAttacks on information computed from private dataรA first private algorithm: randomized response
2
This Courseโข Intro to research on privacy in ML and statistics
รMathematical modelsโข How do we formulate nebulous concepts?โข How do we assess and critique these formulations?
รAlgorithmic techniques
โข Skill sets you will work onรTheoretical analysisรCritical reading of research literature in CS and beyondรProgramming
โข PrerequisitesรComfort writing proofs about probability, linear algebra,
algorithmsรUndergrads: discuss your background with instructor.
3
Administriviaโข Web page: https://dpcourse.github.io
รCommunication via PiazzaรLectures on GatherรCourse work on Gradescope
โข Your jobsรLecture preparation, attendance, participationรHomework รProject
4
Every lectureโข Ahead of time
รWatch videoโข Engage actively and take notes by hand as you watch
รRead lecture notesรAnswer Gradescope pre-class questions
โข In classรBe present with camera on
โข Let us know on Piazza if that is an issue in general or for specific lectures. Default is attendance at every class
รActively participate in problem-solvingโข Problems will be posted ahead of time
รTake notes on your work
โข After classรSubmit your notes (photo or electronic) on Gradescope
5
Courseworkโข Lecture prep and in-class workโข Homework
ร Due Fridays every 2 weeksร Limited collaboration is permitted
โข Groups of size โค 4ร Academic honesty: You must
โข Acknowledge collaborators (or write โcollaborators: noneโ)โข Write your solutions yourself, and be ready to explain them orally
โ Rule of thumb: walk away from collaboration meetings with no notes.โข Use only course materials (except for reading general background, e.g.,
on probability, calculus, etc)
โข Project (details TBA)โข Read and summarize a set of 2-3 related papersโข Identify open questionsโข Develop new material (application of a technique to a new data set, work
on open question, show some assumption is necessary, ...)โข Presentation in last week of class
6
To do list for this weekโข Make sure you have access to Piazza, Gradescope
โข Read the syllabus
โข Fill Gradescope background surveyรBy Thursday
โข Watch videos, read notes, answer questions for Lecture 2รBy next lecture (Thu/Fri)
7
Today
โข Course Introโข A taste of the syllabus
รAttacks on information computed from private dataรA first private algorithm: randomized response
8
โข Census data used to apportion congressional seatsรThink about citizenship question
โข Also enforce Voting Rights Act, allocate Title I funds, design state districts, โฆ
10Image ยฉ Lumen Learning
Machine learning on your devicesโข Statistical models trained using data from your phones
โข Statistical models trained from other personal data
11
Machine learning on your devicesโข Statistical models trained using data from your phones
รOffer sentence completionรConvert voice to speechรSelect, for you and others to see,
โข Content (e.g. FB newsfeed)โข Adsโข Recommendations for products (โYou might also likeโฆโ)
โข Statistical models trained from other personal dataรAdvise judgesโ bail decisionsรAllocate police resourcesรAdvise doctors on diagnosis/treatment
12
Privacy in Statistical Databases
Large collections of personal informationโข census dataโข medical/public healthโข social networksโข education
14
Individuals
queries
answers
Researchers
Valuable because they reveal so much
about our lives
Statistical analysis benefits society
โAgencyโ
Two conflicting goalsโข Utility: release aggregate statisticsโข Privacy: individual information stays hidden
How do we define โprivacyโ?โข Studied since 1960โs in
รStatisticsรDatabases & data miningรCryptography
โข This course section: Rigorous foundations and analysis 15
Utility Privacy
First attempt: Remove obvious identifiers
โข Everything is an identifierโข Attacker has external
informationโข โAnonymizationโ schemes
are regularly broken16Images: whitehouse.gov, genesandhealth.org, medium.com
[Ganta Kasiviswanathan S โ08]
โAI recognizes blurred facesโ[McPherson Shokri Shmatikov โ16]
Name:Ethnicity:
[Gymrek McGuire Golan Halperin Erlich โ13]
[Pandurangan โ14]
Reidentification attack example
17
AnonymizedNetFlix data
AliceBobCharlieDanielleEricaFrank
Public, incomplete IMDB data
Identified NetFlix Data
=AliceBobCharlieDanielleEricaFrank
On average, four movies uniquely identify user
Image credit: Arvind Narayanan
[Narayanan, Shmatikov 2008]
Is the problem granularity?What if we only release aggregate information?
Problem 1: Models leak informationโข Support vector machine output
reveals individual data pointsโข Deep learning models reveal even more
18
Models Leak Information
โข Example: Smart Compose in Gmail รHavenโt seen you in a while.
Hope you are doing well
ร John Doeโs SSN is 920-24-1930 [Carlini et al. 2018]
โข Modern deep learning algorithms often โmemorizeโ inputs
19Sources: Top 2- https://www.vice.com/en_us/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies (inspiration: Ilya Mironov)
Models can leak information about training dataunexpected waysin unexpected ways
Is the problem granularity?What if we only release aggregate information?
Problem 1: Models leak information
Problem 2: Statistics together may encode dataโข Example: Average salary before/after resignationโข More generally:
Too many, โtoo accurateโ statistics reveal individual information
ร Reconstruction attacksโข Reconstruct all or part of data
ร Membership attacksโข Determine if a target individual is in (part of) the data set
20
Cannot release everything everyone would want to know
โข Robust notion of โprivacyโ for algorithmic outputsรMeaningful in the presence of arbitrary side information
โข Several current deployments
โข Burgeoning field of research
22
Apple Google US Census
Algorithms Crypto,security
Statistics,learning
Game theory,economics
Databases,programming
languages
Law,policy
Differential Privacy
Differential Privacy
โข Data set ๐ฅ = ๐ฅ!, โฆ , ๐ฅ" โ ๐ณรDomain ๐ณ can be numbers, categories, tax formsรThink of x as fixed (not random)
โข ๐ด = probabilistic procedureร๐ด(๐ฅ) is a random variableรRandomness might come from adding noise, resampling, etc.
23
random bits
A A(x)
โข A thought experimentรChange one personโs data (or add or remove them)รWill the probabilities of outcomes change?
24
A A(xโ)A A(x)
For any set of outcomes, (e.g. I get denied health insurance)about the same probability in both worlds
Differential Privacy
random bits
random bits
Randomized Response (Warner 1965)
โข Say we want to release the proportion of diabetics in a data setร Each personโs data is 1 bit: ๐ฅ! = 0 or ๐ฅ! = 1
โข Randomized response: each individual rolls a dieร 1, 2, 3 or 4: Report true value ๐ฅ!ร 5 or 6: Report opposite value 1 โ ๐ฅ!
โข Output is list of reported values ๐#, โฆ , ๐$ร It turns out that we can estimate fraction of ๐ฅ! โs that are 1 when ๐
is large
26
local random coins
A ๐ด ๐ฅ = ๐!, โฆ , ๐"
๐ฅ!๐ฅ#
๐ฅ"
โฎ
Randomized Response
27
What sort of privacy does this provide? โข Many possible answersOne approach: Plausible deniability
ร๐ฅ!$ could have been 0ร๐ฅ% could have been 1
โข Suppose we fix everyone elseโs data ๐ฅ!, โฆ , ๐ฅ#โฆโข What is $% ๐!& = ๐๐ ๐ฅ!& = 1$% ๐!& = ๐๐ ๐ฅ!& = 0 ?
๐ ๐ฅ! Die roll ๐!1 0 5 yes2 1 1 yes3 1 3 yes4 1 2 yes5 0 6 yes6 0 4 no7 1 2 yes8 0 3 no9 1 2 yes10 1 5 no
10 0 3 no
โข A thought experimentรChange one personโs data (or add or remove them)รWill the probabilities of outcomes change?
28
A A(xโ)A A(x)
For any set of outcomes, (e.g. I get denied health insurance)about the same probability in both worlds
Differential Privacy
random bits
random bits
Plausible deniability and RRA bit more generallyโฆโข Fix any data set ๏ฟฝโ๏ฟฝ โ 0,1 $, and any neighboring data set ๏ฟฝโ๏ฟฝ%
ร Let ๐ be the position where ๐ฅ! โ ๐ฅ!"
ร (Recall ๐ฅ# = ๐ฅ#" for all ๐ โ ๐)
โข Fix an output ๏ฟฝโ๏ฟฝ โ 0,1 $
Pr ๐ด ๏ฟฝโ๏ฟฝ = ๏ฟฝโ๏ฟฝ =23
# ':)$*+$ 13
# ':)$,+$
(because decisions made independently)โข When we change one output, one term in the product
changes (from -.
to #.
or vice versa)
โข So /0 1 )โ *+/0 1 )โ% *+
โ #-, 2 .
29
Recall basic probability factsโข Random variables have expectations and variances
๐ผ ๐ =2)
๐ฅ โ Pr ๐ = ๐ฅ
๐๐๐ ๐ = ๐ผ ๐ โ ๐ผ ๐ -
โข Expectations are linear: For any rvโs ๐#, โฆ , ๐$ and constants ๐#, โฆ , ๐$:
๐ผ 23
๐3๐3 =23
๐3๐ผ(๐3)
โข Variances add over independent random variables. If ๐#, โฆ , ๐$ are independent, then
๐๐๐ 23
๐3๐3 =23
๐3-๐๐๐(๐3)
โข The standard deviation is ๐๐๐ ๐3
30
Exercise 1: sums of random variablesโข Say ๐!, ๐', โฆ , ๐" are independent with, for all ๐,
๐ผ ๐( = ๐๐๐๐ ๐( = ๐
โข Then what are the expectation and variance of the average 5๐ = !
"โ()!" ๐(?
a) ๐ผ 5๐ = ๐๐ and ๐๐๐ 5๐ = ๐๐b) ๐ผ 5๐ = ๐ and ๐๐๐ 5๐ = ๐c) ๐ผ 5๐ = ๐ and ๐๐๐ 5๐ = ๐/ ๐d) ๐ผ 5๐ = ๐ and ๐๐๐ 5๐ = *
"
e) ๐ผ 5๐ = ๐/๐ and ๐๐๐ 5๐ = *"
31
Exercise 2: Estimating โ๐๐๐ from RRโข Show there is a procedure which, given ๐!, โฆ , ๐",
produces an estimate ๐ด such that
๐ผ ๐ด โ9()!
"
๐ฅ(
'
= ๐ ๐ .
Equivalently, ๐ผ +"โ 5๐
'= ๐ !
"
รHint: What are the mean and variance of 3๐) โ 1?
32
Standard deviation of estimate
Randomized response for other ratiosโข Each person has data ๐ฅ3 โ ๐ณ
ร Normally data is more complicated than bitsโข Tax records, medical records, Instagram profiles, etc
ร Use ๐ณ to denote the set of possible records
โข Analyst wants to know sum of ๐: ๐ณ โ 0,1 over ๐ร Here ๐ captures the property we want to sumร E.g. โwhat is the number of diabetics?โ
โข ๐ ๐ด๐๐๐, 168 ๐๐๐ . , 17, ๐๐๐ก ๐๐๐๐๐๐ก๐๐ = 0โข ๐ ๐ด๐๐, 142 ๐๐๐ . , 47, ๐๐๐๐๐๐ก๐๐ = 1โข We want to learn โ!"#
$ ๐(๐ฅ!)
โข Randomization operator takes ๐ง โ {0,1}:
๐ ๐ง = C๐ง ๐ค. ๐. P%
P%Q#
1 โ ๐ง ๐ค. ๐. #P%Q#
34
Ratio is ๐% (think 1 + ๐ for small ๐)
For each person ๐,๐) = ๐ ๐ ๐ฅ)
Randomized response for other ratiosโข Each person has data ๐ฅ) โ ๐ณ
ร Analyst wants to know sum of ๐:๐ณ โ 0,1 over ๐
โข Randomization operator takes ๐ง โ {0,1}:
๐ ๐ง = 6๐ง ๐ค. ๐. *!
*!+!
1 โ ๐ง ๐ค. ๐. !*!+!
โข How can we estimate a proportion?ร๐ด ๐ฅ!, โฆ . , ๐ฅ" :
โข For each ๐, let ๐! = ๐ ๐ ๐ฅ!โข Return ๐ด = โ! ๐๐! โ ๐
รWhat values for ๐, ๐ make ๐ผ ๐ด = โ)๐ ๐ฅ) ?
โข Proposition: ๐ผ ๐ด โโ)๐ ๐ฅ)' = *!+!
*!,!๐.
35
We can do much better than this!Coming up โฆ
โ # "-
when ๐ small