Task 1: Privacy Preserving Genomic Data Sharing
Presented by Noman Mohammed
School of Computer Science
McGill University
24 March 2014
Reference
N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 493-501, 2011.
Outline
Privacy Models
Algorithm for Relational Data
Algorithm for Genomic Data
Conclusion
Overview
Privacy model
Anonymization algorithm
Data utility
k-Anonymity [Samarati & Sweeney, PODS 1998]
Quasi-identifier (QID): The set of re-identification attributes.
k-anonymity: Each record is indistinguishable from at least k-1 other records in the table with respect to the QID.
3-anonymous patient table
Job           Sex     Age      Disease
Professional  Male    [36-40]  Cancer
Professional  Male    [36-40]  Cancer
Professional  Male    [36-40]  Cancer
Artist        Female  [30-35]  Flu
Artist        Female  [30-35]  Hepatitis
Artist        Female  [30-35]  Fever
Artist        Female  [30-35]  Hepatitis
Raw patient table
Job Sex Age Disease
Engineer Male 36 Cancer
Engineer Male 38 Cancer
Lawyer Male 38 Cancer
Musician Female 30 Flu
Musician Female 30 Hepatitis
Dancer Female 30 Fever
Dancer Female 30 Hepatitis
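The k-anonymity condition can be checked mechanically by grouping records on the QID and confirming that every group has at least k members. The following is a minimal sketch using the two tables from this slide; the function name `is_k_anonymous` and the tuple layout are illustrative choices, not part of the talk.

```python
from collections import Counter

# Column order: Job, Sex, Age, Disease. The first three form the QID.
QID_COLUMNS = (0, 1, 2)

RAW = [
    ("Engineer", "Male", "36", "Cancer"),
    ("Engineer", "Male", "38", "Cancer"),
    ("Lawyer", "Male", "38", "Cancer"),
    ("Musician", "Female", "30", "Flu"),
    ("Musician", "Female", "30", "Hepatitis"),
    ("Dancer", "Female", "30", "Fever"),
    ("Dancer", "Female", "30", "Hepatitis"),
]

ANONYMIZED = [
    ("Professional", "Male", "[36-40]", "Cancer"),
    ("Professional", "Male", "[36-40]", "Cancer"),
    ("Professional", "Male", "[36-40]", "Cancer"),
    ("Artist", "Female", "[30-35]", "Flu"),
    ("Artist", "Female", "[30-35]", "Hepatitis"),
    ("Artist", "Female", "[30-35]", "Fever"),
    ("Artist", "Female", "[30-35]", "Hepatitis"),
]


def is_k_anonymous(rows, qid_columns, k):
    """Return True if every QID value combination occurs in at least k rows."""
    groups = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(size >= k for size in groups.values())


print(is_k_anonymous(ANONYMIZED, QID_COLUMNS, 3))  # True
print(is_k_anonymous(RAW, QID_COLUMNS, 3))         # False
```

The raw table fails because, for example, only one record has the QID (Engineer, Male, 36).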
Differential Privacy [DMNS, TCC 06]
A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighboring databases D and D', and for any possible sanitized database D*,
Pr[A(D) = D*] ≤ exp(ε) × Pr[A(D') = D*],
where the probability is taken over the randomness of A.
D and D' are neighbors if they differ in at most one record.
Laplace Mechanism
For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) maintains ε-differential privacy.
Sensitivity: Δf = max_{D,D'} ||f(D) - f(D')||_1, where the maximum is taken over all neighboring D and D'.
For a counting query f: Δf = 1
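A minimal sketch of the counting-query example, assuming only the Python standard library. Laplace noise is drawn as the difference of two exponential variates, which is one standard way to sample it; the names `laplace_noise` and `noisy_count` and the sample records are illustrative, not from the talk.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale (sensitivity / epsilon)."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Illustrative data: the Disease column of the raw patient table.
diseases = ["Cancer", "Cancer", "Cancer", "Flu", "Hepatitis", "Fever", "Hepatitis"]
rng = random.Random(0)
answer = noisy_count(diseases, lambda d: d == "Cancer", epsilon=1.0, rng=rng)
print(answer)  # a noisy value around the true count of 3
```

A smaller ε gives a larger noise scale 1/ε, so stronger privacy comes at the cost of a less accurate answer.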
Non-interactive Framework
[Figure: a count of 0 released as 0 + Lap(1/ε)]
For high-dimensional data, the noise is too large: most counts are small or zero, so the added Lap(1/ε) noise dominates them.
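A back-of-the-envelope sketch of why per-cell noise overwhelms the data as the number of attributes grows. The dataset size and domain sizes below are hypothetical, and the helper names are illustrative. The only fact used about the noise is that the expected magnitude of Lap(b) is b.

```python
def number_of_cells(domain_sizes):
    """Number of cells in the full contingency table over all attributes."""
    cells = 1
    for size in domain_sizes:
        cells *= size
    return cells


def average_cell_count(n_records, domain_sizes):
    """Average true count per cell if n_records are spread over all cells."""
    return n_records / number_of_cells(domain_sizes)


def expected_noise_magnitude(epsilon):
    """E|Lap(1/epsilon)| = 1/epsilon for a count with sensitivity 1."""
    return 1.0 / epsilon


# Hypothetical setting: 1000 records, 10 attributes with 5 values each.
domains = [5] * 10
print(number_of_cells(domains))           # 9765625
print(average_cell_count(1000, domains))  # about 0.0001 records per cell
print(expected_noise_magnitude(1.0))      # 1.0
```

In this setting the typical noise per cell is roughly four orders of magnitude larger than the typical true count.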
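Anonymization Algorithm
Taxonomy trees:
Job: Any_Job -> {Professional, Artist}; Professional -> {Engineer, Lawyer}; Artist -> {Dancer, Writer}
Age: [18-65) -> {[18-40), [40-65)}; [18-40) -> {[18-30), [30-40)}

Job           Age      Class  Count
Any_Job       [18-65)  4Y4N   8
Specialize Job: Any_Job -> {Professional, Artist}
Artist        [18-65)  2Y2N   4
Professional  [18-65)  2Y2N   4
Specialize Age: [18-65) -> {[18-40), [40-65)}
Artist        [18-40)  2Y2N   4
Artist        [40-65)  0Y0N   0
Professional  [18-40)  2Y1N   3
Professional  [40-65)  0Y1N   1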
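The specialization steps on this slide can be sketched as repeated re-partitioning of the records under a progressively more specific cut through the taxonomy trees. The raw records below are hypothetical, chosen only so that the counts match the slide. The sketch covers the deterministic partitioning only and omits any noise that a differentially private release would add to the final counts.

```python
from collections import Counter

# Taxonomy trees as shown on the slide: parent value -> child values.
JOB_TREE = {
    "Any_Job": ["Professional", "Artist"],
    "Professional": ["Engineer", "Lawyer"],
    "Artist": ["Dancer", "Writer"],
}
AGE_TREE = {
    "[18-65)": ["[18-40)", "[40-65)"],
    "[18-40)": ["[18-30)", "[30-40)"],
}

# Hypothetical raw records (job, age interval, class); not from the source.
RECORDS = [
    ("Dancer", "[18-30)", "Y"), ("Dancer", "[30-40)", "N"),
    ("Writer", "[18-30)", "Y"), ("Writer", "[30-40)", "N"),
    ("Engineer", "[18-30)", "Y"), ("Engineer", "[30-40)", "Y"),
    ("Lawyer", "[30-40)", "N"), ("Lawyer", "[40-65)", "N"),
]


def covers(tree, value, leaf):
    """True if `leaf` equals `value` or lies below it in the taxonomy tree."""
    if value == leaf:
        return True
    return any(covers(tree, child, leaf) for child in tree.get(value, []))


def generalize(tree, cut, leaf):
    """Return the value in the current cut that covers a raw leaf value."""
    for value in cut:
        if covers(tree, value, leaf):
            return value
    raise ValueError(f"{leaf!r} is not covered by cut {cut!r}")


def partition(records, job_cut, age_cut):
    """Group records by generalized (Job, Age) and tally the class labels."""
    groups = {(j, a): Counter() for j in job_cut for a in age_cut}
    for job, age, cls in records:
        key = (generalize(JOB_TREE, job_cut, job),
               generalize(AGE_TREE, age_cut, age))
        groups[key][cls] += 1
    return groups


root = partition(RECORDS, ["Any_Job"], ["[18-65)"])
after_job = partition(RECORDS, ["Professional", "Artist"], ["[18-65)"])
after_age = partition(RECORDS, ["Professional", "Artist"], ["[18-40)", "[40-65)"])
for key, classes in after_age.items():
    print(key, dict(classes), sum(classes.values()))
```

Empty partitions such as (Artist, [40-65)) are kept with a count of 0, matching the 0Y0N row on the slide.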
Candidate Selection
We favor the specialization with the maximum Score value.
First utility function:
∆u =
Second utility function:
∆u = 1
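The slide does not say how the favored candidate is actually drawn. One common way to select a high-scoring candidate under differential privacy is the exponential mechanism, which picks candidate c with probability proportional to exp(ε × u(c) / (2Δu)). The sketch below assumes that mechanism; the candidate names and scores are hypothetical, and Δu = 1 is taken from the second utility function above.

```python
import math
import random


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Pick one candidate with probability proportional to exp(eps*u/(2*du))."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical candidate specializations and scores (not from the source).
candidates = ["Any_Job", "[18-65)"]
scores = [4.0, 2.0]

rng = random.Random(0)
picks = [exponential_mechanism(candidates, scores, epsilon=1.0, sensitivity=1.0, rng=rng)
         for _ in range(5000)]
print(picks.count("Any_Job"), picks.count("[18-65)"))
```

Higher-scoring candidates are chosen more often but not always, and a larger ε concentrates the choice on the maximum score.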
Anonymization Algorithm
[Figure: algorithm listing annotated with the per-step costs below]
O(A_pr × |D| log |D|)
O(|candidates|)
O(|D|)
O(|D| log |D|)
O(1)
Total: O((A_pr + h) × |D| log |D|)
case_chr2_29504091_30044866
rs11686243 AG AG AA AG GG AA AG AA GG AG AA AA AA AA AA AA AA GG GG AG AG AA AG GG AA AA GG AG AG AG GG AG AA AA AG AG AG AG AG AG AA AG GG AG AA GG GG GG GG AG AG AG AG AA GG GG GG AG AA AG GG AG AA GG GG AG AG AG AG AG AA AA AG AG AG AA AG AG AG AG GG AG AG AG GG GG AG AG GG AG AG AG AA AA GG AG AA GG AA AA AG GG AG AG AG AG AG AG AG AG GG GG AA AG AG AG AG AA AG GG AG GG AA AG GG AG AG AG AA AG AG AG GG AG GG AG GG AG AG AG GG AG AG GG GG AG AG GG AA GG AA AG AG AG AG GG AG AA AG GG GG AG AG AG AG AG GG AG AG AA AG AA AA AG GG AA AG AG GG AG GG AG AG GG GG AG AG AA AG AG AG GG AG GG GG AG AG GG AG GG
rs4426491 CC CC CC CT CT CC CT CC CT CT CC CC CC CC CC CC CC CT CT CT CC CC CT CT CC CC CT CC CT CC CT CC CC CC CT ….
rs4305230 CC CC CC CT CT CC CT CC TT CT CC CC CC CC CC CC CC TT TT CT CC CC CT CT CC CC CT CC CT CC TT CC CC CC CT ….
Raw Data
Case  rs11686243  rs4426491  rs4305230  rs4630725  …
1     AG          CC         …
2     AG          CC         …
3     AA          CC         …
4     AG          CT         …
5     GG          CT         …
…     …           …          …
198   GG          TT         …
199   AG          CT         …
200   GG          CT         …
Blocks/Attributes
Case  rs11686243  rs4426491
1     AG          CC
2     AG          CC
3     AA          CC
4     AG          CT
5     GG          CT
Unique Combinations: AG CC, AA CC, AG CT, GG CT
Taxonomy tree for the block: Any -> {AG CC, AA CC, AG CT, GG CT}
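The unique combinations for a block can be collected by reading the block's SNP columns row by row and keeping each distinct genotype combination once. A minimal sketch using the five cases shown on this slide; the function name `block_taxonomy` and the dictionary layout are illustrative.

```python
# Genotypes per SNP for cases 1-5, as shown on the slide.
GENOTYPES = {
    "rs11686243": ["AG", "AG", "AA", "AG", "GG"],
    "rs4426491": ["CC", "CC", "CC", "CT", "CT"],
}


def block_taxonomy(genotypes_by_snp, block_snps):
    """Build a two-level tree: 'Any' at the root, unique combinations as leaves."""
    columns = [genotypes_by_snp[snp] for snp in block_snps]
    combos = (" ".join(row) for row in zip(*columns))
    # dict.fromkeys drops duplicates while keeping first-seen order
    return {"Any": list(dict.fromkeys(combos))}


tree = block_taxonomy(GENOTYPES, ["rs11686243", "rs4426491"])
print(tree)  # {'Any': ['AG CC', 'AA CC', 'AG CT', 'GG CT']}
```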
Taxonomy Trees for Attributes
SNP data was split evenly into N/6 blocks (attributes), where N is the number of SNPs.
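A sketch of the even split, assuming the blocks are contiguous runs of SNPs and that, when N is not a multiple of 6, the leftover SNPs are spread so block sizes differ by at most one. Both are assumptions, since the slide does not say how SNPs are ordered or how remainders are handled. The SNP identifiers are placeholders.

```python
def split_into_blocks(snp_ids, divisor=6):
    """Split the SNP list into N // divisor contiguous, near-equal blocks."""
    n = len(snp_ids)
    n_blocks = max(1, n // divisor)
    base, extra = divmod(n, n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        blocks.append(snp_ids[start:start + size])
        start += size
    return blocks


# Placeholder identifiers, not real SNP ids.
snps = [f"snp{i}" for i in range(12)]
blocks = split_into_blocks(snps)
print(len(blocks), [len(b) for b in blocks])  # 2 [6, 6]
```

Each resulting block is then treated as one attribute with its own taxonomy tree.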
Hierarchy Tree for Chr2
Hierarchy Tree for Chr10
Genomic Data
Block 1  Block 2  Block 3  Count
Any      Any      Any      200
Specialize Block 1: Any -> {AG CC, AA CC}
AA CC    Any      Any      130
AG CC    Any      Any      70
Specialize Block 3: Any -> {CC GG, CT AG}
AA CC    Any      CC GG    60
AA CC    Any      CT AG    70
AG CC    Any      CC GG    30
AG CC    Any      CT AG    40
Anonymized Data
Case  rs11686243  rs4426491  rs4305230  rs4630725  …
1     AG          CC         Any        Any
2     AG          CC         Any        Any
3     …           …
4     …           …
5     AA          CC         Any        Any
…     AA          CC         Any        Any
…     …           …          …
Heterogeneous Healthcare Data
ID  Job       Age  rs4305230  rs4630725  …  …
1   Engineer  50   AG         CC         …  …
2   Doctor    45   AA         CT         …  …
3   …         …    …          …          …  …
Relational Data: Job, Age
Genomic Data: rs4305230, rs4630725, …
Conclusions
Privacy-Preserving Genomic Data Release
  Tree-based approach is promising
Future work
  Partitioning the SNPs to generate blocks
  Utility function for specialization
  Two-level tree vs. multi-level hierarchy trees
  Single-dimensional vs. multi-dimensional partitioning
Thank You!