Upload
doantu
View
221
Download
2
Embed Size (px)
Citation preview
Privacy in D.A.T.A.
Latanya Sweeney, Ph.D.Assistant Professor of Computer Science, Technology & Policy
Carnegie Mellon [email protected]
http://privacy.cs.cmu.edu/
The Question in this Talk
Can computer scientistsprovide both safety andprivacy to society?
Answer:YES. Three goals: (1) understand the nature of real
privacy threats; (2) design technical solutions to integratewith policy to avoid a setting in which society is forced tochoose; and, (3) construct technical solutions that addressthese threats while keeping data useful.
Data Privacy Laboratory atCMU
Ralph GrossYiheng LiBradley MalinElaine NewtonMichael ShamosLatanya SweeneyBen VernotAaron White
http://privacy.cs.cmu.edu/
Joseph Barrett, JDSylvia Barrett, JDJoseph LombardoDeanna Mool, JDJulie Pavlin, MDUniversity of PittsburghLaw Students
Laboratory for InternationalData Privacy at CMU
Work with real-world stakeholders:- public health agencies- government agencies- private corporations
Kinds of projects currently underway:- health data- web data- video surveillance data- genetic data- census surveys- crime data- grocery data, and so on…
http://privacy.cs.cmu.edu/
Laboratory for InternationalData Privacy at CMU
http://privacy.cs.cmu.edu/
Data Linkage (“data detectives”):
combining disparate pieces of entity-specificinformation to learn more about an entity
Privacy Protection (“data protectors”):
release information such that certain entity-specific properties (such as identity) cannotbe inferred; restrict what can be learned
“Can’t release data”
Distortion, anonymity
Holder
RecipientConfidentiality, Privacy, Liability concerns
Accuracy, quality, risk
“Privacy is dead, get over it”
Ann 10/2/61 02139 cardiacAbe 7/14/61 02139 cancerAl 3/8/61 02138 liver
Distortion, anonymity
Recipient
HolderResearchers need data
Accuracy, quality, risk
“Share data while guaranteeinganonymity”
Accuracy, quality, risk Distortion, anonymity
Holder
A* 1961 0213* cardiacA* 1961 0213* cancerA* 1961 0213* liver
Recipient Computational solutions
This talk
� Data investigations
�Lots of data out there
�Use innocent looking data to learnsensitive information
� Data protection
� Surveillance
0
50
100
150
200
250
300
350
400
450
500
1983 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003
Year
GD
SP
(MB
/per
son)
0
5
10
15
20
25
30
35
1983 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003
Sew
rver
s(in
Mill
ions
)Technically-empowered Society
1993 FirstWWWconference
2001
Growth inavailablediskstorage
Growth inactive webservers
19961991
Typical Birth Certificate Fields, post 1925Field nameChild's first nameChild's middle name (sometimes or initial)Child's last nameDay, month and year of birthCity and/or County of birth (sometimes hospital)Father's nameMother's name (including maiden name)Place of birth (address and town/city)Mother's age and addressMother's birthplace (town/city, state, county)Mother's occupationMother, number of previous childrenFather's age and addressFather's birthplace (town/city, state, county)Father's occupation
Typical Electronic Birth Certificate Fieldsin 1999-starting fields 1-15
Field# Size Field name1 1 File Status2 50 Baby’s First Name3 50 Baby’s Middle Name4 50 Baby’s Last Name5 1 Baby’s Suffix Code6 3 Baby’s Suffix Text7 8 Baby’s Date of Birth8 5 Baby’s Time of Birth9 1 AM/PM Indicator
10 1 Baby’s Sex11 3 Blood Type12 1 Born Here?13 40 Place of Birth14 1 Facility Type
Typical Electronic Birth Certificate Fieldsin 1999-starting fields 16-30Field# Size Field name
16 20 County of Birth17 6 Certifier’s Code18 30 Certifier’s Name19 1 Certifier’s Title20 30 Attendant’s Name21 1 Attendant’s Title22 23 Attendant’s Address23 19 Attendant’s City24 2 Attendant’s State25 10 Attendant’s Zip Code26 50 Mother’s First Name27 50 Mother’s Middle Name28 50 Mother’s Last Name29 9 Mother’s Social Security Number30 8 Mother’s Date of Birth
Typical Electronic Birth Certificate Fieldsin 1999-starting fields 31-45
field# Size Field name31 3 Mother’s State of Birth32 7 Mother’s Residence Address33 2 Mother’s Residence Direction34 20 Residence Street Address35 10 Residence Type36 2 Residence Extension37 10 Residence Apartment #38 20 Mother’s Town of Residence39 1 Mother’s Residence in City Limits40 14 Mother’s County of Residence41 3 Mother’s State of Residence42 10 Mother’s Residence Zip Code43 38 Mother’s Mailing Address44 19 Mother’s Mailing City45 2 Mother’s Mailing State
Typical Electronic Birth Certificate Fieldsin 1999-starting fields 46-60
Field# Size Field name46 10 Mother’s Mailing Zip Code47 1 Mother Married?48 50 Father’s First Name49 50 Father’s Middle Name50 50 Father’s Last Name51 1 Father’s Suffix Code52 9 Father’s Suffix Text53 9 Father’s Social Security Number54 8 Father’s Date of Birth55 3 Father’s State of Birth56 14 Mother’s Origin57 14 Mother’s Race58 2 Mother’s Elementary Education59 2 Mother’s College Education60 11 Mother’s Occupation
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 61-75
Field# Size Field name61 11 Mother’s Industry62 14 Father’s Origin63 14 Father’s Race64 2 Father’s Elementary Education65 2 Father’s College Education66 11 Father’s Occupation67 11 Father’s Industry68 1 Plurality69 1 Birth Order70 2 Live Births Still Living71 2 Live Births Now Dead72 4 Month/Year Last Live Birth73 2 Number of Terminations74 4 Month/Year Last Termination75 1 Baby’s Weight Unit
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 76-90
Field# Size Field name76 5 Baby’s Weight77 6 Date of Last Normal Menses78 1 Month Prenatal Care Began79 2 Total Number of Visits80 2 Apgar Score – 1 Minute81 2 Apgar Score – 5 Minute82 2 Estimate of Gestation83 6 Date of Blood Test84 22 Laboratory85 1 Mother Transferred In86 30 Facility Mother Transferred From87 1 Baby Transferred Out88 30 Facility Baby Transferred To89 1 Tobacco Use During Pregnancy90 3 Number of Cigarettes/Day
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 91-105
Field# Size Field name91 1 Alcohol Use During Pregnancy92 3 Number of Drinks/Week93 3 Mother’s Weight Gain94 1 Release Info For SSN95 6 Operator Code96 12 Hospital ID97 1 Sent to Romans98 1 Sent to APORS99 16 Other Certifier Specify
100 12 Temporary Audit Number101 16 Other Facility Specify102 16 Other Attendant Specify103 1 Mother’s Race104 1 Father’s Race105 2 Mother’s Origin
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 106-120
Field# Size Field name106 2 Father’s Origin107 1 Attendant Same YN108 1 Mailing Address Same YN109 1 Capture Father’s Info YN110 2 Mother’s Age111 2 Father’s Age112 12 Baby’s Hospital Med. Rec.113 1 High Risk Pregnancy YN114 1 Care Giver (For Chicago)115 1 Record Selected For Download116 1 Downloaded117 1 Printed118 12 Form Number
MEDICAL RISK FACTORS119 1 Anemia120 1 Cardiac Disease
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 121-135
Field# Size Field name121 1 Acute/Chronic Lung Disease122 1 Diabetes123 1 Genital Herpes124 1 Hydramnios/Oligohydramnios125 1 Hemoglobinopathy126 1 Hypertension, Chronic127 1 Hypertension, Preg. Assoc.128 1 Eclampsia129 1 Incompetent Cervix130 1 Previous Infant 4000+ Grams131 1 Previous Preterm or SGA Infant132 1 Renal Disease133 1 Rh Sensitization134 1 Uterine Bleeding135 1 No Medical Risk Factors
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 136-150
Field# Size Field name136 40 Other Medical Risk Factors
OBSTETRIC PROCEDURES137 1 Amniocentesis138 1 Electronic Fetal Monitoring139 1 Induction of Labor140 1 Stimulation of Labor141 1 Tocolysis142 1 Ultrasound143 1 No Obstetric Procedures144 40 Other Obstetric Procedures
COMPLICATIONS OF LABOR & D145 1 Febrile (>100 or 38C)146 1 Meconium Moderate, Heavy147 1 Premature Rupture (>12 Hrs)148 1 Abruptio Placenta149 1 Placenta Previa150 1 Other Excessive Bleeding
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 151-165
Field# Size Field name151 1 Seizures During Labor152 1 Precipitous Labor (<3 Hrs)153 1 Prolonged Labor (>20 Hrs)154 1 Dysfunctional Labor155 1 Breech/Malpresentation156 1 Cephalopelvic Disproportion157 1 Cord Prolapse158 1 Anesthetic Complications159 1 Fetal Distress160 1 No Complications of L&D161 40 Other Complications of L&D
METHOD OF DELIVERY162 1 Vaginal163 1 Vaginal After Previous C-Section164 1 Primary C-Section165 1 Repeat C-Section
Typical Electronic Birth Certificate Fieldsin 1999 -continued fields 166-180
Field# Size Field name166 1 Forceps167 1 Vacuum
ABNORMAL CONDITIONS OF NEWBO168 1 Anemia169 1 Birth Injury170 1 Fetal Alcohol Syndrome171 1 Hyaline Membrane Disease/RDS172 1 Meconium Aspiration Syndrome173 1 Assisted Ventilation <30174 1 Assisted Ventilation >30175 1 Seizures176 1 No Abnormal Conditions of Newborn177 40 Other Abnormal Condition of Newborn
CONGENITAL ANOMALIES OF CHILD178 1 Anencephalus179 1 Spina Bifida/Meningocele180 1 Hydrocephalus
Typical Electronic Birth Certificate Fieldsin 1999-continued fields 181-195
Field# Size Field name181 1 Microcephalus182 40 Other CNS Anomalies183 1 Heart Malformations184 40 Other Circ./Resp. Anomalies185 1 Rectal Atresia/Stenosis186 1 Tracheo-Esophageal Fistula/Esophag187 1 Omphalocele/Gastroschisis188 40 Other Gastrointestinal Ano.189 1 Malformed Genitalia190 1 Renal Agenesis191 40 Other Urogenital Anomalies192 1 Cleft Lip/Palate193 1 Polydactyly/Syndactyly/Adactyly194 1 Club Foot195 1 Diaphragmatic Hernia
Typical Electronic Birth Certificate Fieldsin 1999-continued fields 196-210
Field# Size Field name196 40 Other Musculoskeletal/Integumental A197 1 Down’s Syndrome198 40 Other Chromosomal Anomalies199 1 No Congenital Anomalies200 40 Other Congenital Anomalies
CODE STRIP201 1 Record Complete YN202 1 Record Type203 4 Facility ID204 4 City of Birth205 3 County of Birth206 2 Mother’s State of Birth207 2 Mother’s State of Residence208 4 Mother’s Town of Residence209 3 Mother’s County of Residence210 2 Father’s State of Birth
Typical Electronic Birth Certificate Fieldsin 1999-continued fields 211-226.
Field# Size Field name211 14 Certifier’s License Number212 6 Laboratory ID Number213 4 Mother Xfer Code214 3 Mother Xfer County Code215 4 Baby Xfer Code216 3 Baby Xfer County Code217 4 Year of Birth218 7 Certificate #219 1 Unique Code220 8 File Date221 2 Community Area222 4 Census Tract223 2 Century of Last Live Birth224 2 Century of Last Termination225 2 Century of Last Menses226 2 Century of Blood Test
Numerous Efforts Underway to FuseAvailable Data Together on Individuals
health
schools
web use
groceries
marriages
real estate
entertainment
criminal data
death,family records
employment
This talk
� Data investigations
�Lots of data out there
�Use innocent looking data to learnsensitive information
� Data protection
� Surveillance
Health data (GIC example)
Ethnicity
Visit date
Diagnosis
Procedure
Medication
Total charge
ZIP
Birthdate
Sex
Medical Data
Population data (GIC example)
ZIP
Birthdate
Sex
Name
Address
Dateregistered
Partyaffiliation
Date lastvoted
Voter List
Linking to re-identify data
Ethnicity
Visit date
Diagnosis
Procedure
Medication
Total charge
ZIP
Birthdate
Sex
Name
Address
Dateregistered
Partyaffiliation
Date lastvoted
Medical Data Voter List
Uniqueness in Cambridge Voters
Birth date alone 12%Birth date & gender 29%Birth date & 5-digit ZIP 69%Birth date & full postal code 97%
Birth date includes month, day and year.Total 54,805 voters.
JLME 97
Few characteristics make a person unique
Birth includes month, day and year:
365 days x 100 years = 36,500 possibilities
Two genders and Five ZIP (5-digit) codes:
2 * 5 * 36,500 =365,000 possibilities
But the Cambridge Voter list had:
54,805 voters
So in general, using(birth[mon,day,yr], gender, ZIP[5-digit])provides aunique quasi-identifier.
JLME 97
{ date of birth, gender, 5-digit ZIP}uniquely identifies 87.1% of USA pop.
ZIP 60623,112,167 people,11%, not 0%insufficient #above the age of55 living there.
{ date of birth, gender, 5-digit ZIP}uniquely identifies 87.1% of USA pop.
ZIP 11794, 5418people, primarilybetween 19 and24 (4666 of 5418or 86%), only13%.
Disclosure ScenarioPrivate (Single Hospital ) Public (Various Users )
DNADatabase
DGZ
ANONYMIZE
HospitalDischargeDatabase
ACTG
CombinedIdentified
Information
Genotype-Phenotype Relations
Can infer genotype-phenotype relationshipsout of both DNA and medical databases
MedicalDatabase
PhenotypeWith
Genetic Trait
GenomicDNA
DiseasePhenotype
DiseaseSequences
DNADatabase
Example: Huntington’s Disease
MedicalDatabase
ICD9 code3334
HD GeneMutation
ICD9 code3334
HD GeneMutation
DNADatabase
Uniqueness in Trails
P1: 00000000000000000000000000000000000000000000000000010000000
P2: 00000000000000000000100000000000000000000000000000000000000
P3: 00000000000000000000100000000001000000000000000000000000000
PA: 00000000000000000000000000000000000000000000000000010000000
PB: 00000000000000000000000000000001000000000000000000000000000
PC: 00000000000000000000100000000000000000000000000000000000000
Uniqueness of audit trails with large numbers ofpeople and locations.
logs
names
Uniqueness in Trails(Web logs)
P1: 00000000000000000000000000000000000000000000000000010000000
P2: 00000000000000000000100000000000000000000000000000000000000
P3: 00000000000000000000100000000001000000000000000000000000000
PA: 00000000000000000000000000000000000000000000000000010000000
PB: 00000000000000000000000000000001000000000000000000000000000
PC: 00000000000000000000100000000000000000000000000000000000000
logs
names
Bradley Malin will talk about re-identifying people from the trails ofdata the leave behind.
Computer Security & Data Sharing
Authentication: login with passwordAuthorization: allowed to read/write dataEncryption: to avoid eavesdropping
BUT data can re-identify individual!
This talk
�Data investigations
� Data protection
�Formal protection models
�Effort-based models (evolving)
� Surveillance
Idea ofk-map andk-anonymity
Sweeney 97 and 98
For every record released, there will be at leastkindividuals to whom the record indistinctly refers.
In k-map, thek individuals exist in the world.
In k-anonymity, thek individuals appear in therelease.
Re-identification Example
�l�a��
There are 3green figuresand 2 figureshaving the sameprofile as therelease.
But only Halisgreen and hasthe same figuretype as theprofile in therelease. It is aunique match.
Gil Hal Jim
Ken Len MelPopulation
Re-identification Example
�l�a��
There are twomatches for thisprofile, JimandMel. There is nounique match.
Gil Hal Jim
Ken Len MelPopulation
Re-identification Example
�l�a��
To achievek-mapwherek=2, agentsfor Gil, Hal andKenagree tomerge theirinformationtogether.
Informationreleased aboutany of themresults in thesame mergedimage.
Gil Hal Jim
Ken Len MelPopulation
+ =
privacy.cs.cmu.edu/datafly/
Ryan Williams will show thatk-anonymity using generalization andsuppression is NP hard in the generalcase.
This talk
�Data investigations
� Data protection
�Formal protection models
�Effort-based models (evolving)
� Surveillance
De-identification of Faces
Example.
Captured images are below.Here is a known image ofBob. Which person is Bob?
De-identification: T-maskExample continued...
Captured images are de-identified below. Here is aknown image of Bob.Which person is Bob?
(a) (e) (f) (g)
(h)
(d)(c)(b)
(l) (m) (n)(k)(j)(i)
Ralph Gross (for Elaine Newton)will show how faces can be de-identified tothwart any face recognition system yetpreserving many details in the face.
Video Surveillance Cameras inLower Manhattan
Yan Ke will talk about de-identifying imagesof people in networks of cameras.
Detect Early using Onset,Coordinate Deaths & Hospital Admits
Based on results reported in Guillemin, 1999.
1979 Sverdlovsk Anthrax Outbreak
0
10
20
30
40
50
60
70
0 10 20 30 40 50
Time (in Days)
Cum
ulat
ive
(cas
es)
OnsetHospital AdmitsDeaths
How can we detect onset?How early on each can we predict?How does coordination help?
1979 Sverdlovsk Anthrax Outbreak
0
10
20
30
40
50
60
70
0 10 20 30 40 50
Time (in Days)
Pre
vale
nce
(cas
es)
OnsetHospital AdmitsDeaths
Cum
ulat
ive
Cas
es
Ted Senator will talk about the big picture of earlydetection bio-terrorism surveillance systems.
Continuously Observe Behaviorsto Detect Onset of Symptoms
Prodromic surveillance:
How many are acting ill?
Unusualbehaviors→syndromes?
Not confirmeddiagnoses!
Andrew Moore will talk describe anomalydetection algorithms used in real-worldbio-terrorism surveillance systems.
Centralized Surveillance of Secondary Data
hospitals
schools
labs
groceries
physicians
animals
prescriptions
assisted living
deaths
businesses
detect
Centralized Surveillance of Secondary Data
hospitals
schools
labs
groceries
physicians
animals
prescriptions
assisted living
deaths
businesses
detect
Doug Dyer will talk about the surveillance fordetecting terrorists.
Access Instruments
hospitals
schools
labs
groceries
physicians
animals
prescriptions
assisted living
deaths
businesses
HIPAA
educationlaws
contract contract contract
contract
contract
HIPAA HIPAA contract
*Not includingpublic health law
Policy Matters…
� FOIA versus Privacy
� Law enforcement
� Intellectual property
� Medical privacy legislation
� Internet privacy
� Bio-terrorism surveillance
Mike Shamos will describehow these laws, regulationsand policies frame themathematics behindsolutions.
Automated Privacy Module
hospitals
schools
labs
groceries
physicians
animals
prescriptions
assisted living
deaths
businesses
detect
Mechanical distortion decisionstypically renders data useless
Gross overview
Sufficiently de-identified
Identifiable
Explicitly identified
Readily identifiable
Sufficiently anonymous
Unusual activity
Suspicious activity
Outbreak detected
Outbreak suspected
Normal operation
Datafly Idenifiability 0..1 Detection Status 0..1
Dynamically Augment the Model WhenSurveillance Detects Possible Attack� Lower the privacy threshold when potential attack detected
– Take advantage of disease-specific processing– Need to flush out early suspicions by looking at more detailed
data
“How manyx occurred yesterday?”
detect
store1store2
store3
store4
store5store6store7
store8
store9
store10 5 12
1
8
0117
14
3
0
“How manyx occurred yesterday?”
detect
store1store2
store3
store4
store5store6store7
store8
store9
store10 5 12
1
8
0117
14
3
0
But this kind ofreporting can poseconfidentiality concernsfor the store and somany stores may refuseto participate.
“How manyx occurred yesterday?”
detect
store1store2
store3
store4
store5store6store7
store8
store9
store10 5 12
1
8
0117
14
3
0
Joe Kilian willprovide a survey ofmethods in multi-party computation.
Total count: “How manyx occurred?”
detect
store1
store2
store3
store4
store5store6store7
store8
store9
store10
r2= r 1+1
r3= r 2+8
r4= r 3+0r5= r 4+11r6= r 5+7
r7= r 6+14
r8= r 7+3
r9= r 8+0
r9+5
r1= r 0+12
r0
Samuel Edoho-Eketwill provide somesolutions to thesebasic questions.
Other presentations
Privacy-preserving data mining:Rafail OstrovskyBenny PinkasJohannes Gehrke
Query restriction problem:Susmit Sarkar
Statistical approaches:Steve FienbergRebecca Wright
The Question in this Talk
Can computer scientistsprovide both safety andprivacy to society?
Answer:YES. Three goals: (1) understand the nature of real
privacy threats; (2) design technical solutions to integratewith policy to avoid a setting in which society is forced tochoose; and, (3) construct technical solutions that addressthese threats while keeping data useful.
This talk
�Data investigations
�Data protection
�Surveillance
Latanya Sweeney, PhD.Assistant Professor of Computer Science,
Technology and [email protected]
http://privacy.cs.cmu.edu/