2015.9.28
Differential Privacy
A preliminary story
- A hospital has a database of patient records, each record containing a binary value indicating whether or not the patient has some form of cancer.
- We want to know the total number of patients with cancer. Easy: sum these binary values.

Patient   Has cancer
Amy       0
Tom       1
Jack      1

- But what if an attacker knows who is on the list, or knows that a particular person is the last entry? Then whether Jack has cancer is revealed by S(3) - S(2), where S(i) is the sum over the first i records.
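The differencing attack in the story can be sketched in a few lines. This is only an illustration on the three toy records from the slide; the function name `S` follows the slide's notation, and the variable names are made up for the example.

```python
# Toy records from the slide: (name, has_cancer)
records = [("Amy", 0), ("Tom", 1), ("Jack", 1)]

def S(i):
    """Exact count of cancer patients among the first i records."""
    return sum(flag for _, flag in records[:i])

total = S(3)         # count over all three patients
without_last = S(2)  # count over everyone except the last record (Jack)

# Subtracting two exact counts isolates Jack's private bit.
jack_has_cancer = total - without_last

print(total, without_last, jack_has_cancer)
```

Both queries look harmless on their own; the leak comes from combining them.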
A preliminary story (continued)
- Suppose instead that f is a randomized query function, for example f(i) = count(i) + noise.
- Then f(5): {2, 2, 5, 3} and f(4): {2, 2, 5, 3}, with the same probability.
- The difference f(5) - f(4) is useless to the attacker!
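A minimal sketch of such a noisy query, under stated assumptions: the slide only says "noise", so Laplace noise with scale 1 is assumed here, and the five 0/1 records are hypothetical. The helper names `count`, `laplace_noise`, and `f` are chosen for the example.

```python
import random

# Hypothetical 0/1 cancer flags (not from the slides).
records = [0, 1, 1, 0, 1]

def count(i):
    """Exact number of positive flags among the first i records."""
    return sum(records[:i])

def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def f(i, rng, scale=1.0):
    """Noisy count: the exact count plus assumed Laplace noise."""
    return count(i) + laplace_noise(rng, scale)

rng = random.Random(0)
# Each difference is the true bit (1) buried under fresh noise.
diffs = [f(5, rng) - f(4, rng) for _ in range(5)]
print([round(d, 2) for d in diffs])
```

A single difference tells the attacker very little, because the noise is comparable in size to the one bit being hidden. Averaging many repeated answers would recover the bit, which is one reason repeated queries have to be accounted for.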
GIC Incident [Sweeney 2002]
• Group Insurance Commission (GIC, Massachusetts)
  – Collected patient data for ~135,000 state employees.
  – Gave the data to researchers and sold it to industry.
  – The medical record of the former state governor was identified.
[Figure: records from Patient 1, Patient 2, ..., Patient n are collected into the GIC, MA database (DB).]

Name      Age   Sex   Zip code   Disease
Bob       69    M     47906      Cancer
Carl      65    M     47907      Cancer
Daisy     52    F     47902      Flu
Emily     43    F     46204      Gastritis
Flora     42    F     46208      Hepatitis
Gabriel   47    F     46203      Bronchitis
Re-identification occurs!
Topic 21: Data Privacy
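One way re-identification of this kind can happen is a linkage attack: joining the released rows with some other list on shared attributes. The sketch below assumes names were stripped before release and uses a HYPOTHETICAL public list (`public`, with invented entries "Person A" and "Person B"); only the age, sex, zip code, and disease values come from the slide's table.

```python
# Released medical rows with names stripped (values from the slide's table).
released = [
    {"age": 69, "sex": "M", "zip": "47906", "disease": "Cancer"},
    {"age": 65, "sex": "M", "zip": "47907", "disease": "Cancer"},
    {"age": 52, "sex": "F", "zip": "47902", "disease": "Flu"},
]

# HYPOTHETICAL public list with names; invented purely for illustration.
public = [
    {"name": "Person A", "age": 69, "sex": "M", "zip": "47906"},
    {"name": "Person B", "age": 30, "sex": "F", "zip": "47901"},
]

QUASI_IDS = ("age", "sex", "zip")

def key(row):
    """Tuple of quasi-identifier values used to join the two lists."""
    return tuple(row[k] for k in QUASI_IDS)

def link(released, public):
    """Return (name, disease) pairs where the quasi-identifiers match uniquely."""
    by_key = {}
    for row in released:
        by_key.setdefault(key(row), []).append(row)
    matches = []
    for person in public:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((person["name"], candidates[0]["disease"]))
    return matches

print(link(released, public))
```

Removing the name column is not enough when the remaining columns are unique enough to act as a fingerprint.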
Definitions
Let $A$ be a randomized algorithm. Let $D_1, D_2$ be two datasets that differ in at most one entry (we call these neighboring databases).

[Figure: neighboring databases $D_1$ and $D_2$, differing only in one entry, $x_i$ versus $x_i'$.]
Definition 1. Let $\epsilon > 0$. Define a randomized algorithm $A$ to be $\epsilon$-differentially private if for all neighboring databases $D_1, D_2$, and for all (measurable) subsets $S$ of the outputs of $A$, we have

$$\Pr[A(D_1) \in S] \le e^{\epsilon} \, \Pr[A(D_2) \in S],$$

where the probability is taken over the coin tosses of $A$.
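For an algorithm with finitely many outputs, the condition in Definition 1 can be checked directly. The sketch below is an illustration only: the function name `is_eps_dp` and the two output distributions are hypothetical, and the check is run in both directions because either database of a neighboring pair can play the role of the first one.

```python
import math

def is_eps_dp(p1, p2, eps):
    """Check Pr[A(D1)=o] <= e^eps * Pr[A(D2)=o], and the reverse, for every outcome o.

    p1 and p2 map each outcome to its probability under D1 and D2.
    With a finite output space, the per-outcome bound implies the bound
    for every subset S, since probabilities of subsets are sums over outcomes.
    """
    bound = math.exp(eps)
    tol = 1e-12  # guard against floating-point rounding
    outcomes = set(p1) | set(p2)
    return all(
        p1.get(o, 0.0) <= bound * p2.get(o, 0.0) + tol
        and p2.get(o, 0.0) <= bound * p1.get(o, 0.0) + tol
        for o in outcomes
    )

# Hypothetical output distributions of A on two neighboring databases.
p_d1 = {"yes": 0.6, "no": 0.4}
p_d2 = {"yes": 0.4, "no": 0.6}

print(is_eps_dp(p_d1, p_d2, eps=0.5))  # worst ratio 1.5 vs e^0.5 (about 1.649)
print(is_eps_dp(p_d1, p_d2, eps=0.3))  # worst ratio 1.5 vs e^0.3 (about 1.350)
```

This only tests one pair of databases; the definition requires the bound for every neighboring pair.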
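Observation 2. Because we can switch $D_1$ and $D_2$ interchangeably, Definition 1 implies that, for neighboring $D_1, D_2$ and any output subset $S$,

$$e^{-\epsilon} \, \Pr[A(D_2) \in S] \le \Pr[A(D_1) \in S] \le e^{\epsilon} \, \Pr[A(D_2) \in S].$$

Since $e^{\epsilon} \approx 1 + \epsilon$ for small $\epsilon$, we have roughly

$$(1 - \epsilon) \, \Pr[A(D_2) \in S] \lesssim \Pr[A(D_1) \in S] \lesssim (1 + \epsilon) \, \Pr[A(D_2) \in S].$$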
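The step behind Observation 2 can be written out explicitly. The notation here is an assumption where the slides lost their symbols: $A$ is the randomized algorithm, $D_1, D_2$ are neighboring databases, and $S$ is any measurable output subset.

```latex
\begin{align*}
\Pr[A(D_1) \in S] &\le e^{\epsilon} \Pr[A(D_2) \in S]
  && \text{(Definition 1)} \\
\Pr[A(D_2) \in S] &\le e^{\epsilon} \Pr[A(D_1) \in S]
  && \text{(Definition 1 with $D_1$, $D_2$ swapped)} \\
\Longrightarrow \quad
e^{-\epsilon} \Pr[A(D_2) \in S] &\le \Pr[A(D_1) \in S]
  \le e^{\epsilon} \Pr[A(D_2) \in S] \\
e^{\pm\epsilon} &= 1 \pm \epsilon + O(\epsilon^2)
  && \text{(Taylor expansion, small $\epsilon$)}
\end{align*}
```

The lower bound comes from dividing the swapped inequality by $e^{\epsilon}$; the first-order Taylor terms then give the approximate factors $1 - \epsilon$ and $1 + \epsilon$.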
satisfies
Laplace distribution