33
BU CS591 S1 NEU CS 7880 Foundations of Private Data Analysis Spring 2021 Lecture 01: Introduction Jonathan Ullman Adam Smith NEU BU

BU CS591 S1 NEU CS 7880 Foundations of Private Data

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

BU CS591 S1 NEU CS 7880

Foundations of Private Data AnalysisSpring 2021

Lecture 01: Introduction

Jonathan Ullman Adam SmithNEU BU

Today

โ€ข Course Introโ€ข A taste of the syllabus

ร˜Attacks on information computed from private dataร˜A first private algorithm: randomized response

2

This Courseโ€ข Intro to research on privacy in ML and statistics

ร˜Mathematical modelsโ€ข How do we formulate nebulous concepts?โ€ข How do we assess and critique these formulations?

ร˜Algorithmic techniques

โ€ข Skill sets you will work onร˜Theoretical analysisร˜Critical reading of research literature in CS and beyondร˜Programming

โ€ข Prerequisitesร˜Comfort writing proofs about probability, linear algebra,

algorithmsร˜Undergrads: discuss your background with instructor.

3

Administriviaโ€ข Web page: https://dpcourse.github.io

ร˜Communication via Piazzaร˜Lectures on Gatherร˜Course work on Gradescope

โ€ข Your jobsร˜Lecture preparation, attendance, participationร˜Homework ร˜Project

4

Every lectureโ€ข Ahead of time

ร˜Watch videoโ€ข Engage actively and take notes by hand as you watch

ร˜Read lecture notesร˜Answer Gradescope pre-class questions

โ€ข In classร˜Be present with camera on

โ€ข Let us know on Piazza if that is an issue in general or for specific lectures. Default is attendance at every class

ร˜Actively participate in problem-solvingโ€ข Problems will be posted ahead of time

ร˜Take notes on your work

โ€ข After classร˜Submit your notes (photo or electronic) on Gradescope

5

Courseworkโ€ข Lecture prep and in-class workโ€ข Homework

ร˜ Due Fridays every 2 weeksร˜ Limited collaboration is permitted

โ€ข Groups of size โ‰ค 4ร˜ Academic honesty: You must

โ€ข Acknowledge collaborators (or write โ€œcollaborators: noneโ€)โ€ข Write your solutions yourself, and be ready to explain them orally

โ€“ Rule of thumb: walk away from collaboration meetings with no notes.โ€ข Use only course materials (except for reading general background, e.g.,

on probability, calculus, etc)

โ€ข Project (details TBA)โ€ข Read and summarize a set of 2-3 related papersโ€ข Identify open questionsโ€ข Develop new material (application of a technique to a new data set, work

on open question, show some assumption is necessary, ...)โ€ข Presentation in last week of class

6

To do list for this weekโ€ข Make sure you have access to Piazza, Gradescope

โ€ข Read the syllabus

โ€ข Fill Gradescope background surveyร˜By Thursday

โ€ข Watch videos, read notes, answer questions for Lecture 2ร˜By next lecture (Thu/Fri)

7

Today

โ€ข Course Introโ€ข A taste of the syllabus

ร˜Attacks on information computed from private dataร˜A first private algorithm: randomized response

8

Data are everywhereโ€ข Decisions increasingly automated using rules based on

personal data

9

โ€ข Census data used to apportion congressional seatsร˜Think about citizenship question

โ€ข Also enforce Voting Rights Act, allocate Title I funds, design state districts, โ€ฆ

10Image ยฉ Lumen Learning

Machine learning on your devicesโ€ข Statistical models trained using data from your phones

โ€ข Statistical models trained from other personal data

11

Machine learning on your devicesโ€ข Statistical models trained using data from your phones

ร˜Offer sentence completionร˜Convert voice to speechร˜Select, for you and others to see,

โ€ข Content (e.g. FB newsfeed)โ€ข Adsโ€ข Recommendations for products (โ€œYou might also likeโ€ฆโ€)

โ€ข Statistical models trained from other personal dataร˜Advise judgesโ€™ bail decisionsร˜Allocate police resourcesร˜Advise doctors on diagnosis/treatment

12

Privacy in Statistical Databases

Large collections of personal informationโ€ข census dataโ€ข medical/public healthโ€ข social networksโ€ข education

14

Individuals

queries

answers

Researchers

Valuable because they reveal so much

about our lives

Statistical analysis benefits society

โ€œAgencyโ€

Two conflicting goalsโ€ข Utility: release aggregate statisticsโ€ข Privacy: individual information stays hidden

How do we define โ€œprivacyโ€?โ€ข Studied since 1960โ€™s in

ร˜Statisticsร˜Databases & data miningร˜Cryptography

โ€ข This course section: Rigorous foundations and analysis 15

Utility Privacy

First attempt: Remove obvious identifiers

โ€ข Everything is an identifierโ€ข Attacker has external

informationโ€ข โ€œAnonymizationโ€ schemes

are regularly broken16Images: whitehouse.gov, genesandhealth.org, medium.com

[Ganta Kasiviswanathan S โ€™08]

โ€œAI recognizes blurred facesโ€[McPherson Shokri Shmatikov โ€™16]

Name:Ethnicity:

[Gymrek McGuire Golan Halperin Erlich โ€™13]

[Pandurangan โ€˜14]

Reidentification attack example

17

AnonymizedNetFlix data

AliceBobCharlieDanielleEricaFrank

Public, incomplete IMDB data

Identified NetFlix Data

=AliceBobCharlieDanielleEricaFrank

On average, four movies uniquely identify user

Image credit: Arvind Narayanan

[Narayanan, Shmatikov 2008]

Is the problem granularity?What if we only release aggregate information?

Problem 1: Models leak informationโ€ข Support vector machine output

reveals individual data pointsโ€ข Deep learning models reveal even more

18

Models Leak Information

โ€ข Example: Smart Compose in Gmail ร˜Havenโ€™t seen you in a while.

Hope you are doing well

ร˜ John Doeโ€™s SSN is 920-24-1930 [Carlini et al. 2018]

โ€ข Modern deep learning algorithms often โ€œmemorizeโ€ inputs

19Sources: Top 2- https://www.vice.com/en_us/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies (inspiration: Ilya Mironov)

Models can leak information about training dataunexpected waysin unexpected ways

Is the problem granularity?What if we only release aggregate information?

Problem 1: Models leak information

Problem 2: Statistics together may encode dataโ€ข Example: Average salary before/after resignationโ€ข More generally:

Too many, โ€œtoo accurateโ€ statistics reveal individual information

ร˜ Reconstruction attacksโ€ข Reconstruct all or part of data

ร˜ Membership attacksโ€ข Determine if a target individual is in (part of) the data set

20

Cannot release everything everyone would want to know

Differential privacy

21

โ€ข Robust notion of โ€œprivacyโ€ for algorithmic outputsร˜Meaningful in the presence of arbitrary side information

โ€ข Several current deployments

โ€ข Burgeoning field of research

22

Apple Google US Census

Algorithms Crypto,security

Statistics,learning

Game theory,economics

Databases,programming

languages

Law,policy

Differential Privacy

Differential Privacy

โ€ข Data set ๐‘ฅ = ๐‘ฅ!, โ€ฆ , ๐‘ฅ" โˆˆ ๐’ณร˜Domain ๐’ณ can be numbers, categories, tax formsร˜Think of x as fixed (not random)

โ€ข ๐ด = probabilistic procedureร˜๐ด(๐‘ฅ) is a random variableร˜Randomness might come from adding noise, resampling, etc.

23

random bits

A A(x)

โ€ข A thought experimentร˜Change one personโ€™s data (or add or remove them)ร˜Will the probabilities of outcomes change?

24

A A(xโ€™)A A(x)

For any set of outcomes, (e.g. I get denied health insurance)about the same probability in both worlds

Differential Privacy

random bits

random bits

A First Algorithm: Randomized Response

25

Randomized Response (Warner 1965)

โ€ข Say we want to release the proportion of diabetics in a data setร˜ Each personโ€™s data is 1 bit: ๐‘ฅ! = 0 or ๐‘ฅ! = 1

โ€ข Randomized response: each individual rolls a dieร˜ 1, 2, 3 or 4: Report true value ๐‘ฅ!ร˜ 5 or 6: Report opposite value 1 โˆ’ ๐‘ฅ!

โ€ข Output is list of reported values ๐‘Œ#, โ€ฆ , ๐‘Œ$ร˜ It turns out that we can estimate fraction of ๐‘ฅ! โ€™s that are 1 when ๐‘›

is large

26

local random coins

A ๐ด ๐‘ฅ = ๐‘Œ!, โ€ฆ , ๐‘Œ"

๐‘ฅ!๐‘ฅ#

๐‘ฅ"

โ‹ฎ

Randomized Response

27

What sort of privacy does this provide? โ€ข Many possible answersOne approach: Plausible deniability

ร˜๐‘ฅ!$ could have been 0ร˜๐‘ฅ% could have been 1

โ€ข Suppose we fix everyone elseโ€™s data ๐‘ฅ!, โ€ฆ , ๐‘ฅ#โ€ฆโ€ข What is $% ๐‘Œ!& = ๐‘›๐‘œ ๐‘ฅ!& = 1$% ๐‘Œ!& = ๐‘›๐‘œ ๐‘ฅ!& = 0 ?

๐‘– ๐‘ฅ! Die roll ๐‘Œ!1 0 5 yes2 1 1 yes3 1 3 yes4 1 2 yes5 0 6 yes6 0 4 no7 1 2 yes8 0 3 no9 1 2 yes10 1 5 no

10 0 3 no

โ€ข A thought experimentร˜Change one personโ€™s data (or add or remove them)ร˜Will the probabilities of outcomes change?

28

A A(xโ€™)A A(x)

For any set of outcomes, (e.g. I get denied health insurance)about the same probability in both worlds

Differential Privacy

random bits

random bits

Plausible deniability and RRA bit more generallyโ€ฆโ€ข Fix any data set ๏ฟฝโƒ—๏ฟฝ โˆˆ 0,1 $, and any neighboring data set ๏ฟฝโƒ—๏ฟฝ%

ร˜ Let ๐‘– be the position where ๐‘ฅ! โ‰  ๐‘ฅ!"

ร˜ (Recall ๐‘ฅ# = ๐‘ฅ#" for all ๐‘— โ‰  ๐‘–)

โ€ข Fix an output ๏ฟฝโƒ—๏ฟฝ โˆˆ 0,1 $

Pr ๐ด ๏ฟฝโƒ—๏ฟฝ = ๏ฟฝโƒ—๏ฟฝ =23

# ':)$*+$ 13

# ':)$,+$

(because decisions made independently)โ€ข When we change one output, one term in the product

changes (from -.

to #.

or vice versa)

โ€ข So /0 1 )โƒ— *+/0 1 )โƒ—% *+

โˆˆ #-, 2 .

29

Recall basic probability factsโ€ข Random variables have expectations and variances

๐”ผ ๐‘‹ =2)

๐‘ฅ โ‹… Pr ๐‘‹ = ๐‘ฅ

๐‘‰๐‘Ž๐‘Ÿ ๐‘‹ = ๐”ผ ๐‘‹ โˆ’ ๐”ผ ๐‘‹ -

โ€ข Expectations are linear: For any rvโ€™s ๐‘‹#, โ€ฆ , ๐‘‹$ and constants ๐‘Ž#, โ€ฆ , ๐‘Ž$:

๐”ผ 23

๐‘Ž3๐‘‹3 =23

๐‘Ž3๐”ผ(๐‘‹3)

โ€ข Variances add over independent random variables. If ๐‘‹#, โ€ฆ , ๐‘‹$ are independent, then

๐‘‰๐‘Ž๐‘Ÿ 23

๐‘Ž3๐‘‹3 =23

๐‘Ž3-๐‘‰๐‘Ž๐‘Ÿ(๐‘‹3)

โ€ข The standard deviation is ๐‘‰๐‘Ž๐‘Ÿ ๐‘‹3

30

Exercise 1: sums of random variablesโ€ข Say ๐‘‹!, ๐‘‹', โ€ฆ , ๐‘‹" are independent with, for all ๐‘–,

๐”ผ ๐‘‹( = ๐œ‡๐‘‰๐‘Ž๐‘Ÿ ๐‘‹( = ๐œŽ

โ€ข Then what are the expectation and variance of the average 5๐‘‹ = !

"โˆ‘()!" ๐‘‹(?

a) ๐”ผ 5๐‘‹ = ๐œ‡๐‘› and ๐‘‰๐‘Ž๐‘Ÿ 5๐‘‹ = ๐‘›๐œŽb) ๐”ผ 5๐‘‹ = ๐œ‡ and ๐‘‰๐‘Ž๐‘Ÿ 5๐‘‹ = ๐œŽc) ๐”ผ 5๐‘‹ = ๐œ‡ and ๐‘‰๐‘Ž๐‘Ÿ 5๐‘‹ = ๐œŽ/ ๐‘›d) ๐”ผ 5๐‘‹ = ๐œ‡ and ๐‘‰๐‘Ž๐‘Ÿ 5๐‘‹ = *

"

e) ๐”ผ 5๐‘‹ = ๐œ‡/๐‘› and ๐‘‰๐‘Ž๐‘Ÿ 5๐‘‹ = *"

31

Exercise 2: Estimating โˆ‘๐’Š๐’™๐’Š from RRโ€ข Show there is a procedure which, given ๐‘Œ!, โ€ฆ , ๐‘Œ",

produces an estimate ๐ด such that

๐”ผ ๐ด โˆ’9()!

"

๐‘ฅ(

'

= ๐‘‚ ๐‘› .

Equivalently, ๐”ผ +"โˆ’ 5๐‘‹

'= ๐‘‚ !

"

ร˜Hint: What are the mean and variance of 3๐‘Œ) โˆ’ 1?

32

Standard deviation of estimate

Randomized response for other ratiosโ€ข Each person has data ๐‘ฅ3 โˆˆ ๐’ณ

ร˜ Normally data is more complicated than bitsโ€ข Tax records, medical records, Instagram profiles, etc

ร˜ Use ๐’ณ to denote the set of possible records

โ€ข Analyst wants to know sum of ๐œ‘: ๐’ณ โ†’ 0,1 over ๐’™ร˜ Here ๐œ‘ captures the property we want to sumร˜ E.g. โ€œwhat is the number of diabetics?โ€

โ€ข ๐œ‘ ๐ด๐‘‘๐‘Ž๐‘š, 168 ๐‘™๐‘๐‘ . , 17, ๐‘›๐‘œ๐‘ก ๐‘‘๐‘–๐‘Ž๐‘๐‘’๐‘ก๐‘–๐‘ = 0โ€ข ๐œ‘ ๐ด๐‘‘๐‘Ž, 142 ๐‘™๐‘๐‘ . , 47, ๐‘‘๐‘–๐‘Ž๐‘๐‘’๐‘ก๐‘–๐‘ = 1โ€ข We want to learn โˆ‘!"#

$ ๐œ‘(๐‘ฅ!)

โ€ข Randomization operator takes ๐‘ง โˆˆ {0,1}:

๐‘… ๐‘ง = C๐‘ง ๐‘ค. ๐‘. P%

P%Q#

1 โˆ’ ๐‘ง ๐‘ค. ๐‘. #P%Q#

34

Ratio is ๐‘’% (think 1 + ๐œ– for small ๐œ–)

For each person ๐‘–,๐‘Œ) = ๐‘… ๐œ‘ ๐‘ฅ)

Randomized response for other ratiosโ€ข Each person has data ๐‘ฅ) โˆˆ ๐’ณ

ร˜ Analyst wants to know sum of ๐œ‘:๐’ณ โ†’ 0,1 over ๐’™

โ€ข Randomization operator takes ๐‘ง โˆˆ {0,1}:

๐‘… ๐‘ง = 6๐‘ง ๐‘ค. ๐‘. *!

*!+!

1 โˆ’ ๐‘ง ๐‘ค. ๐‘. !*!+!

โ€ข How can we estimate a proportion?ร˜๐ด ๐‘ฅ!, โ€ฆ . , ๐‘ฅ" :

โ€ข For each ๐‘–, let ๐‘Œ! = ๐‘… ๐œ‘ ๐‘ฅ!โ€ข Return ๐ด = โˆ‘! ๐‘Ž๐‘Œ! โˆ’ ๐‘

ร˜What values for ๐‘Ž, ๐‘ make ๐”ผ ๐ด = โˆ‘)๐œ‘ ๐‘ฅ) ?

โ€ข Proposition: ๐”ผ ๐ด โˆ’โˆ‘)๐œ‘ ๐‘ฅ)' = *!+!

*!,!๐‘›.

35

We can do much better than this!Coming up โ€ฆ

โ‰ˆ # "-

when ๐œ– small