77
Privacy in D.A.T.A. Latanya Sweeney, Ph.D. Assistant Professor of Computer Science, Technology & Policy Carnegie Mellon University [email protected] http://privacy.cs.cmu.edu/

Privacy in D.A.T.A. - ALADDIN Center, Carnegie Mellon ... · Latanya Sweeney Ben Vernot Aaron White Joseph Barrett, JD Sylvia Barrett, JD Joseph Lombardo Deanna Mool, JD Julie Pavlin,

  • Upload
    doantu

  • View
    221

  • Download
    2

Embed Size (px)

Citation preview

Privacy in D.A.T.A.

Latanya Sweeney, Ph.D.Assistant Professor of Computer Science, Technology & Policy

Carnegie Mellon [email protected]

http://privacy.cs.cmu.edu/

The Question in this Talk

Can computer scientistsprovide both safety andprivacy to society?

The Question in this Talk

Can computer scientistsprovide both safety andprivacy to society?

Answer:YES. Three goals: (1) understand the nature of real

privacy threats; (2) design technical solutions to integratewith policy to avoid a setting in which society is forced tochoose; and, (3) construct technical solutions that addressthese threats while keeping data useful.

Data Privacy Laboratory atCMU

Ralph GrossYiheng LiBradley MalinElaine NewtonMichael ShamosLatanya SweeneyBen VernotAaron White

http://privacy.cs.cmu.edu/

Joseph Barrett, JDSylvia Barrett, JDJoseph LombardoDeanna Mool, JDJulie Pavlin, MDUniversity of PittsburghLaw Students

Laboratory for InternationalData Privacy at CMU

Work with real-world stakeholders:- public health agencies- government agencies- private corporations

Kinds of projects currently underway:- health data- web data- video surveillance data- genetic data- census surveys- crime data- grocery data, and so on…

http://privacy.cs.cmu.edu/

Laboratory for InternationalData Privacy at CMU

http://privacy.cs.cmu.edu/

Data Linkage (“data detectives”):

combining disparate pieces of entity-specificinformation to learn more about an entity

Privacy Protection (“data protectors”):

release information such that certain entity-specific properties (such as identity) cannotbe inferred; restrict what can be learned

“Can’t release data”

Distortion, anonymity

Holder

RecipientConfidentiality, Privacy, Liability concerns

Accuracy, quality, risk

“Privacy is dead, get over it”

Ann 10/2/61 02139 cardiacAbe 7/14/61 02139 cancerAl 3/8/61 02138 liver

Distortion, anonymity

Recipient

HolderResearchers need data

Accuracy, quality, risk

“Share data while guaranteeinganonymity”

Accuracy, quality, risk Distortion, anonymity

Holder

A* 1961 0213* cardiacA* 1961 0213* cancerA* 1961 0213* liver

Recipient Computational solutions

This talk

� Data investigations

�Lots of data out there

�Use innocent looking data to learnsensitive information

� Data protection

� Surveillance

0

50

100

150

200

250

300

350

400

450

500

1983 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003

Year

GD

SP

(MB

/per

son)

0

5

10

15

20

25

30

35

1983 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003

Sew

rver

s(in

Mill

ions

)Technically-empowered Society

1993 FirstWWWconference

2001

Growth inavailablediskstorage

Growth inactive webservers

19961991

Typical Birth Certificate Fields, post 1925Field nameChild's first nameChild's middle name (sometimes or initial)Child's last nameDay, month and year of birthCity and/or County of birth (sometimes hospital)Father's nameMother's name (including maiden name)Place of birth (address and town/city)Mother's age and addressMother's birthplace (town/city, state, county)Mother's occupationMother, number of previous childrenFather's age and addressFather's birthplace (town/city, state, county)Father's occupation

Typical Electronic Birth Certificate Fieldsin 1999-starting fields 1-15

Field# Size Field name1 1 File Status2 50 Baby’s First Name3 50 Baby’s Middle Name4 50 Baby’s Last Name5 1 Baby’s Suffix Code6 3 Baby’s Suffix Text7 8 Baby’s Date of Birth8 5 Baby’s Time of Birth9 1 AM/PM Indicator

10 1 Baby’s Sex11 3 Blood Type12 1 Born Here?13 40 Place of Birth14 1 Facility Type

Typical Electronic Birth Certificate Fieldsin 1999-starting fields 16-30Field# Size Field name

16 20 County of Birth17 6 Certifier’s Code18 30 Certifier’s Name19 1 Certifier’s Title20 30 Attendant’s Name21 1 Attendant’s Title22 23 Attendant’s Address23 19 Attendant’s City24 2 Attendant’s State25 10 Attendant’s Zip Code26 50 Mother’s First Name27 50 Mother’s Middle Name28 50 Mother’s Last Name29 9 Mother’s Social Security Number30 8 Mother’s Date of Birth

Typical Electronic Birth Certificate Fieldsin 1999-starting fields 31-45

field# Size Field name31 3 Mother’s State of Birth32 7 Mother’s Residence Address33 2 Mother’s Residence Direction34 20 Residence Street Address35 10 Residence Type36 2 Residence Extension37 10 Residence Apartment #38 20 Mother’s Town of Residence39 1 Mother’s Residence in City Limits40 14 Mother’s County of Residence41 3 Mother’s State of Residence42 10 Mother’s Residence Zip Code43 38 Mother’s Mailing Address44 19 Mother’s Mailing City45 2 Mother’s Mailing State

Typical Electronic Birth Certificate Fieldsin 1999-starting fields 46-60

Field# Size Field name46 10 Mother’s Mailing Zip Code47 1 Mother Married?48 50 Father’s First Name49 50 Father’s Middle Name50 50 Father’s Last Name51 1 Father’s Suffix Code52 9 Father’s Suffix Text53 9 Father’s Social Security Number54 8 Father’s Date of Birth55 3 Father’s State of Birth56 14 Mother’s Origin57 14 Mother’s Race58 2 Mother’s Elementary Education59 2 Mother’s College Education60 11 Mother’s Occupation

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 61-75

Field# Size Field name61 11 Mother’s Industry62 14 Father’s Origin63 14 Father’s Race64 2 Father’s Elementary Education65 2 Father’s College Education66 11 Father’s Occupation67 11 Father’s Industry68 1 Plurality69 1 Birth Order70 2 Live Births Still Living71 2 Live Births Now Dead72 4 Month/Year Last Live Birth73 2 Number of Terminations74 4 Month/Year Last Termination75 1 Baby’s Weight Unit

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 76-90

Field# Size Field name76 5 Baby’s Weight77 6 Date of Last Normal Menses78 1 Month Prenatal Care Began79 2 Total Number of Visits80 2 Apgar Score – 1 Minute81 2 Apgar Score – 5 Minute82 2 Estimate of Gestation83 6 Date of Blood Test84 22 Laboratory85 1 Mother Transferred In86 30 Facility Mother Transferred From87 1 Baby Transferred Out88 30 Facility Baby Transferred To89 1 Tobacco Use During Pregnancy90 3 Number of Cigarettes/Day

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 91-105

Field# Size Field name91 1 Alcohol Use During Pregnancy92 3 Number of Drinks/Week93 3 Mother’s Weight Gain94 1 Release Info For SSN95 6 Operator Code96 12 Hospital ID97 1 Sent to Romans98 1 Sent to APORS99 16 Other Certifier Specify

100 12 Temporary Audit Number101 16 Other Facility Specify102 16 Other Attendant Specify103 1 Mother’s Race104 1 Father’s Race105 2 Mother’s Origin

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 106-120

Field# Size Field name106 2 Father’s Origin107 1 Attendant Same YN108 1 Mailing Address Same YN109 1 Capture Father’s Info YN110 2 Mother’s Age111 2 Father’s Age112 12 Baby’s Hospital Med. Rec.113 1 High Risk Pregnancy YN114 1 Care Giver (For Chicago)115 1 Record Selected For Download116 1 Downloaded117 1 Printed118 12 Form Number

MEDICAL RISK FACTORS119 1 Anemia120 1 Cardiac Disease

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 121-135

Field# Size Field name121 1 Acute/Chronic Lung Disease122 1 Diabetes123 1 Genital Herpes124 1 Hydramnios/Oligohydramnios125 1 Hemoglobinopathy126 1 Hypertension, Chronic127 1 Hypertension, Preg. Assoc.128 1 Eclampsia129 1 Incompetent Cervix130 1 Previous Infant 4000+ Grams131 1 Previous Preterm or SGA Infant132 1 Renal Disease133 1 Rh Sensitization134 1 Uterine Bleeding135 1 No Medical Risk Factors

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 136-150

Field# Size Field name136 40 Other Medical Risk Factors

OBSTETRIC PROCEDURES137 1 Amniocentesis138 1 Electronic Fetal Monitoring139 1 Induction of Labor140 1 Stimulation of Labor141 1 Tocolysis142 1 Ultrasound143 1 No Obstetric Procedures144 40 Other Obstetric Procedures

COMPLICATIONS OF LABOR & D145 1 Febrile (>100 or 38C)146 1 Meconium Moderate, Heavy147 1 Premature Rupture (>12 Hrs)148 1 Abruptio Placenta149 1 Placenta Previa150 1 Other Excessive Bleeding

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 151-165

Field# Size Field name151 1 Seizures During Labor152 1 Precipitous Labor (<3 Hrs)153 1 Prolonged Labor (>20 Hrs)154 1 Dysfunctional Labor155 1 Breech/Malpresentation156 1 Cephalopelvic Disproportion157 1 Cord Prolapse158 1 Anesthetic Complications159 1 Fetal Distress160 1 No Complications of L&D161 40 Other Complications of L&D

METHOD OF DELIVERY162 1 Vaginal163 1 Vaginal After Previous C-Section164 1 Primary C-Section165 1 Repeat C-Section

Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 166-180

Field# Size Field name166 1 Forceps167 1 Vacuum

ABNORMAL CONDITIONS OF NEWBO168 1 Anemia169 1 Birth Injury170 1 Fetal Alcohol Syndrome171 1 Hyaline Membrane Disease/RDS172 1 Meconium Aspiration Syndrome173 1 Assisted Ventilation <30174 1 Assisted Ventilation >30175 1 Seizures176 1 No Abnormal Conditions of Newborn177 40 Other Abnormal Condition of Newborn

CONGENITAL ANOMALIES OF CHILD178 1 Anencephalus179 1 Spina Bifida/Meningocele180 1 Hydrocephalus

Typical Electronic Birth Certificate Fieldsin 1999-continued fields 181-195

Field# Size Field name181 1 Microcephalus182 40 Other CNS Anomalies183 1 Heart Malformations184 40 Other Circ./Resp. Anomalies185 1 Rectal Atresia/Stenosis186 1 Tracheo-Esophageal Fistula/Esophag187 1 Omphalocele/Gastroschisis188 40 Other Gastrointestinal Ano.189 1 Malformed Genitalia190 1 Renal Agenesis191 40 Other Urogenital Anomalies192 1 Cleft Lip/Palate193 1 Polydactyly/Syndactyly/Adactyly194 1 Club Foot195 1 Diaphragmatic Hernia

Typical Electronic Birth Certificate Fieldsin 1999-continued fields 196-210

Field# Size Field name196 40 Other Musculoskeletal/Integumental A197 1 Down’s Syndrome198 40 Other Chromosomal Anomalies199 1 No Congenital Anomalies200 40 Other Congenital Anomalies

CODE STRIP201 1 Record Complete YN202 1 Record Type203 4 Facility ID204 4 City of Birth205 3 County of Birth206 2 Mother’s State of Birth207 2 Mother’s State of Residence208 4 Mother’s Town of Residence209 3 Mother’s County of Residence210 2 Father’s State of Birth

Typical Electronic Birth Certificate Fieldsin 1999-continued fields 211-226.

Field# Size Field name211 14 Certifier’s License Number212 6 Laboratory ID Number213 4 Mother Xfer Code214 3 Mother Xfer County Code215 4 Baby Xfer Code216 3 Baby Xfer County Code217 4 Year of Birth218 7 Certificate #219 1 Unique Code220 8 File Date221 2 Community Area222 4 Census Tract223 2 Century of Last Live Birth224 2 Century of Last Termination225 2 Century of Last Menses226 2 Century of Blood Test

On-line birth certificates(some California counties)

Numerous Efforts Underway to FuseAvailable Data Together on Individuals

health

schools

web use

groceries

marriages

real estate

entertainment

criminal data

death,family records

employment

This talk

� Data investigations

�Lots of data out there

�Use innocent looking data to learnsensitive information

� Data protection

� Surveillance

Health data (GIC example)

Ethnicity

Visit date

Diagnosis

Procedure

Medication

Total charge

ZIP

Birthdate

Sex

Medical Data

Population data (GIC example)

ZIP

Birthdate

Sex

Name

Address

Dateregistered

Partyaffiliation

Date lastvoted

Voter List

Linking to re-identify data

Ethnicity

Visit date

Diagnosis

Procedure

Medication

Total charge

ZIP

Birthdate

Sex

Name

Address

Dateregistered

Partyaffiliation

Date lastvoted

Medical Data Voter List

Uniqueness in Cambridge Voters

Birth date alone 12%Birth date & gender 29%Birth date & 5-digit ZIP 69%Birth date & full postal code 97%

Birth date includes month, day and year.Total 54,805 voters.

JLME 97

Few characteristics make a person unique

Birth includes month, day and year:

365 days x 100 years = 36,500 possibilities

Two genders and Five ZIP (5-digit) codes:

2 * 5 * 36,500 =365,000 possibilities

But the Cambridge Voter list had:

54,805 voters

So in general, using(birth[mon,day,yr], gender, ZIP[5-digit])provides aunique quasi-identifier.

JLME 97

{ date of birth, gender, 5-digit ZIP}uniquely identifies 87.1% of USA pop.

{ date of birth, gender, 5-digit ZIP}uniquely identifies 87.1% of USA pop.

ZIP 60623,112,167 people,11%, not 0%insufficient #above the age of55 living there.

{ date of birth, gender, 5-digit ZIP}uniquely identifies 87.1% of USA pop.

ZIP 11794, 5418people, primarilybetween 19 and24 (4666 of 5418or 86%), only13%.

Disclosure ScenarioPrivate (Single Hospital ) Public (Various Users )

DNADatabase

DGZ

ANONYMIZE

HospitalDischargeDatabase

ACTG

CombinedIdentified

Information

Genotype-Phenotype Relations

Can infer genotype-phenotype relationshipsout of both DNA and medical databases

MedicalDatabase

PhenotypeWith

Genetic Trait

GenomicDNA

DiseasePhenotype

DiseaseSequences

DNADatabase

Example: Huntington’s Disease

MedicalDatabase

ICD9 code3334

HD GeneMutation

ICD9 code3334

HD GeneMutation

DNADatabase

Uniqueness in Trails

P1: 00000000000000000000000000000000000000000000000000010000000

P2: 00000000000000000000100000000000000000000000000000000000000

P3: 00000000000000000000100000000001000000000000000000000000000

PA: 00000000000000000000000000000000000000000000000000010000000

PB: 00000000000000000000000000000001000000000000000000000000000

PC: 00000000000000000000100000000000000000000000000000000000000

Uniqueness of audit trails with large numbers ofpeople and locations.

logs

names

Uniqueness in Trails(Web logs)

P1: 00000000000000000000000000000000000000000000000000010000000

P2: 00000000000000000000100000000000000000000000000000000000000

P3: 00000000000000000000100000000001000000000000000000000000000

PA: 00000000000000000000000000000000000000000000000000010000000

PB: 00000000000000000000000000000001000000000000000000000000000

PC: 00000000000000000000100000000000000000000000000000000000000

logs

names

Bradley Malin will talk about re-identifying people from the trails ofdata the leave behind.

Computer Security & Data Sharing

Authentication: login with passwordAuthorization: allowed to read/write dataEncryption: to avoid eavesdropping

BUT data can re-identify individual!

This talk

�Data investigations

� Data protection

�Formal protection models

�Effort-based models (evolving)

� Surveillance

Idea ofk-map andk-anonymity

Sweeney 97 and 98

For every record released, there will be at leastkindividuals to whom the record indistinctly refers.

In k-map, thek individuals exist in the world.

In k-anonymity, thek individuals appear in therelease.

Sample population register of 6 people

Gil Hal Jim

Ken Len MelPopulation

Re-identification Example

�l�a��

There are 3green figuresand 2 figureshaving the sameprofile as therelease.

But only Halisgreen and hasthe same figuretype as theprofile in therelease. It is aunique match.

Gil Hal Jim

Ken Len MelPopulation

Re-identification Example

�l�a��

There are twomatches for thisprofile, JimandMel. There is nounique match.

Gil Hal Jim

Ken Len MelPopulation

Re-identification Example

�l�a��

To achievek-mapwherek=2, agentsfor Gil, Hal andKenagree tomerge theirinformationtogether.

Informationreleased aboutany of themresults in thesame mergedimage.

Gil Hal Jim

Ken Len MelPopulation

+ =

privacy.cs.cmu.edu/datafly/

privacy.cs.cmu.edu/datafly/

Ryan Williams will show thatk-anonymity using generalization andsuppression is NP hard in the generalcase.

This talk

�Data investigations

� Data protection

�Formal protection models

�Effort-based models (evolving)

� Surveillance

Video Surveillance Cameras inLower Manhattan

From http://www.appliedautonomy.com/isee

De-identification of Faces

Example.

Captured images are below.Here is a known image ofBob. Which person is Bob?

De-identification: T-maskExample continued...

Captured images are de-identified below. Here is aknown image of Bob.Which person is Bob?

(a) (e) (f) (g)

(h)

(d)(c)(b)

(l) (m) (n)(k)(j)(i)

Ralph Gross (for Elaine Newton)will show how faces can be de-identified tothwart any face recognition system yetpreserving many details in the face.

Video Surveillance Cameras inLower Manhattan

Yan Ke will talk about de-identifying imagesof people in networks of cameras.

This talk

�Data investigations

�Data protection

� Surveillance

Detect Early using Onset,Coordinate Deaths & Hospital Admits

Based on results reported in Guillemin, 1999.

1979 Sverdlovsk Anthrax Outbreak

0

10

20

30

40

50

60

70

0 10 20 30 40 50

Time (in Days)

Cum

ulat

ive

(cas

es)

OnsetHospital AdmitsDeaths

How can we detect onset?How early on each can we predict?How does coordination help?

1979 Sverdlovsk Anthrax Outbreak

0

10

20

30

40

50

60

70

0 10 20 30 40 50

Time (in Days)

Pre

vale

nce

(cas

es)

OnsetHospital AdmitsDeaths

Cum

ulat

ive

Cas

es

Ted Senator will talk about the big picture of earlydetection bio-terrorism surveillance systems.

Continuously Observe Behaviorsto Detect Onset of Symptoms

Prodromic surveillance:

How many are acting ill?

Unusualbehaviors→syndromes?

Not confirmeddiagnoses!

Andrew Moore will talk describe anomalydetection algorithms used in real-worldbio-terrorism surveillance systems.

Centralized Surveillance of Secondary Data

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Centralized Surveillance of Secondary Data

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Doug Dyer will talk about the surveillance fordetecting terrorists.

Access Instruments

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

HIPAA

educationlaws

contract contract contract

contract

contract

HIPAA HIPAA contract

*Not includingpublic health law

Policy Matters…

� FOIA versus Privacy

� Law enforcement

� Intellectual property

� Medical privacy legislation

� Internet privacy

� Bio-terrorism surveillance

Mike Shamos will describehow these laws, regulationsand policies frame themathematics behindsolutions.

Automated Privacy Module

hospitals

schools

labs

groceries

physicians

animals

prescriptions

assisted living

deaths

businesses

detect

Mechanical distortion decisionstypically renders data useless

Gross overview

Sufficiently de-identified

Identifiable

Explicitly identified

Readily identifiable

Sufficiently anonymous

Unusual activity

Suspicious activity

Outbreak detected

Outbreak suspected

Normal operation

Datafly Idenifiability 0..1 Detection Status 0..1

Dynamically Augment the Model WhenSurveillance Detects Possible Attack� Lower the privacy threshold when potential attack detected

– Take advantage of disease-specific processing– Need to flush out early suspicions by looking at more detailed

data

“How manyx occurred yesterday?”

detect

store1store2

store3

store4

store5store6store7

store8

store9

store10 5 12

1

8

0117

14

3

0

“How manyx occurred yesterday?”

detect

store1store2

store3

store4

store5store6store7

store8

store9

store10 5 12

1

8

0117

14

3

0

But this kind ofreporting can poseconfidentiality concernsfor the store and somany stores may refuseto participate.

“How manyx occurred yesterday?”

detect

store1store2

store3

store4

store5store6store7

store8

store9

store10 5 12

1

8

0117

14

3

0

Joe Kilian willprovide a survey ofmethods in multi-party computation.

Total count: “How manyx occurred?”

detect

store1

store2

store3

store4

store5store6store7

store8

store9

store10

r2= r 1+1

r3= r 2+8

r4= r 3+0r5= r 4+11r6= r 5+7

r7= r 6+14

r8= r 7+3

r9= r 8+0

r9+5

r1= r 0+12

r0

Samuel Edoho-Eketwill provide somesolutions to thesebasic questions.

Other presentations

Privacy-preserving data mining:Rafail OstrovskyBenny PinkasJohannes Gehrke

Query restriction problem:Susmit Sarkar

Statistical approaches:Steve FienbergRebecca Wright

The Question in this Talk

Can computer scientistsprovide both safety andprivacy to society?

Answer:YES. Three goals: (1) understand the nature of real

privacy threats; (2) design technical solutions to integratewith policy to avoid a setting in which society is forced tochoose; and, (3) construct technical solutions that addressthese threats while keeping data useful.

This talk

�Data investigations

�Data protection

�Surveillance

Latanya Sweeney, PhD.Assistant Professor of Computer Science,

Technology and [email protected]

http://privacy.cs.cmu.edu/