Task 1: Privacy Preserving Genomic Data Sharing

Presented by Noman Mohammed
School of Computer Science
McGill University

24 March 2014

Reference

N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 493-501, 2011.

Outline

Privacy Models
Algorithm for Relational Data
Algorithm for Genomic Data
Conclusion

Overview

Privacy model

Anonymization algorithm

Data utility

k-Anonymity [Samarati & Sweeney, PODS 1998]

Quasi-identifier (QID): The set of re-identification attributes.

k-anonymity: Each record cannot be distinguished from at least k-1 other records in the table with respect to the QID.

3-anonymous patient table

Job           Sex     Age      Disease
Professional  Male    [36-40]  Cancer
Professional  Male    [36-40]  Cancer
Professional  Male    [36-40]  Cancer
Artist        Female  [30-35]  Flu
Artist        Female  [30-35]  Hepatitis
Artist        Female  [30-35]  Fever
Artist        Female  [30-35]  Hepatitis

Raw patient table

Job       Sex     Age  Disease
Engineer  Male    36   Cancer
Engineer  Male    38   Cancer
Lawyer    Male    38   Cancer
Musician  Female  30   Flu
Musician  Female  30   Hepatitis
Dancer    Female  30   Fever
Dancer    Female  30   Hepatitis
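The k-anonymity condition can be checked mechanically by grouping records on the QID and looking at the smallest group. The sketch below is only an illustration: the function name `is_k_anonymous`, the helper `make`, and the choice of (Job, Sex, Age) as the QID are assumptions, and the two tables are the ones shown on the slide.

```python
from collections import Counter

# Assumed quasi-identifier: the three non-sensitive columns of the slide's tables.
QID = ("job", "sex", "age")

def is_k_anonymous(records, k, qid=QID):
    """Return True if every QID combination is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return all(count >= k for count in groups.values())

def make(rows):
    """Turn (job, sex, age, disease) tuples into dictionaries."""
    return [dict(zip(("job", "sex", "age", "disease"), row)) for row in rows]

raw = make([
    ("Engineer", "Male", 36, "Cancer"),
    ("Engineer", "Male", 38, "Cancer"),
    ("Lawyer", "Male", 38, "Cancer"),
    ("Musician", "Female", 30, "Flu"),
    ("Musician", "Female", 30, "Hepatitis"),
    ("Dancer", "Female", 30, "Fever"),
    ("Dancer", "Female", 30, "Hepatitis"),
])

anonymized = make(
    [("Professional", "Male", "[36-40]", "Cancer")] * 3
    + [
        ("Artist", "Female", "[30-35]", "Flu"),
        ("Artist", "Female", "[30-35]", "Hepatitis"),
        ("Artist", "Female", "[30-35]", "Fever"),
        ("Artist", "Female", "[30-35]", "Hepatitis"),
    ]
)

print(is_k_anonymous(raw, 3))         # False: e.g. (Engineer, Male, 36) is unique
print(is_k_anonymous(anonymized, 3))  # True: groups of size 3 and 4
```

Note that the check says nothing about the sensitive column: all three records in the Professional group share the same disease.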


Differential Privacy [DMNS, TCC 06]

A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighboring databases D and D', and for any possible sanitized database D*,

Pr_A[A(D) = D*] ≤ exp(ε) × Pr_A[A(D') = D*]

where the probability is taken over the randomness of A.

D and D' are neighbors if they differ on at most one record.

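Laplace Mechanism

Sensitivity of a function f, taken over all neighbors D and D':

∆f = max_{D,D'} ||f(D) − f(D')||_1

For a counting query f: ∆f = 1

Adding Laplace noise with scale ∆f/ε to the output of f maintains ε-differential privacy. For example, for a single counting query Q over a dataset D, returning Q(D) + Lap(1/ε) maintains ε-differential privacy.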
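A minimal sketch of the Laplace mechanism for a counting query follows. It uses only the Python standard library; the function names, the example counts 42 and 43 (two neighboring datasets), and the value of ε are illustrative assumptions. The second half checks numerically that the ratio of output densities on the two neighbors never exceeds exp(ε), which is the differential privacy inequality for this mechanism.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return the count plus Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution centered at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)

def worst_ratio(count_d, count_d_prime, epsilon, outputs):
    """Largest density ratio between the two neighbors over the given outputs."""
    scale = 1.0 / epsilon
    return max(
        laplace_pdf(o, count_d, scale) / laplace_pdf(o, count_d_prime, scale)
        for o in outputs
    )

epsilon = 0.5                                    # illustrative privacy budget
outputs = [x / 10.0 for x in range(-100, 300)]   # grid of possible released values
ratio = worst_ratio(42, 43, epsilon, outputs)    # counts on two neighboring datasets

print(ratio <= math.exp(epsilon) + 1e-9)         # the bound holds on the whole grid
print(noisy_count(42, epsilon, rng=random.Random(0)))
```

A smaller ε means a larger noise scale 1/ε, so stronger privacy is paid for with less accurate counts.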


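Non-interactive Framework

[Figure: table of released counts, each perturbed with Laplace noise, e.g. 0 + Lap(1/ε)]

For high-dimensional data, the noise is too big: many counts are small or zero, so the added Lap(1/ε) noise dominates them.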
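The effect of dimensionality can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the setup from the talk: records are spread uniformly at random over the cells of a contingency table built from binary attributes, Lap(1/ε) noise is added to every cell, and the total absolute error is compared with the number of records.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def relative_error(num_records, num_binary_attrs, epsilon, rng):
    """Total absolute noise over all cells, divided by the number of records."""
    num_cells = 2 ** num_binary_attrs
    counts = [0] * num_cells
    for _ in range(num_records):
        # Toy assumption: each record lands in a uniformly random cell.
        counts[rng.randrange(num_cells)] += 1
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in counts]
    total_error = sum(abs(n - c) for n, c in zip(noisy, counts))
    return total_error / num_records

rng = random.Random(0)
for d in (3, 8, 14):
    print(d, round(relative_error(1000, d, 1.0, rng), 3))
```

With 3 attributes there are only 8 cells, so the total error is a tiny fraction of the 1000 records. With 14 attributes there are 16384 cells, almost all of them empty, and the accumulated noise is many times larger than the data itself.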


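Anonymization Algorithm

Job           Age      Class  Count

Root:
Any_Job       [18-65)  4Y4N   8

After specializing Job (Any_Job → Artist, Professional):
Artist        [18-65)  2Y2N   4
Professional  [18-65)  2Y2N   4

After specializing Age ([18-65) → [18-40), [40-65)):
Artist        [18-40)  2Y2N   4
Artist        [40-65)  0Y0N   0
Professional  [18-40)  2Y1N   3
Professional  [40-65)  0Y1N   1

Taxonomy trees:

Job: Any_Job → Professional, Artist
     Professional → Engineer, Lawyer
     Artist → Dancer, Writer

Age: [18-65) → [18-40), [40-65)
     [18-40) → [18-30), [30-40)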
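The top-down partitioning above can be sketched in a few lines. This is only an illustration of the specialization step: the eight raw records are hypothetical, chosen so that the class counts match the slide, and the function names are assumptions. The choice of which specialization to apply and any noise added to the final counts are left out.

```python
from collections import Counter

# Taxonomy trees from the slide: each value maps to its child values.
JOB_TREE = {
    "Any_Job": ["Professional", "Artist"],
    "Professional": ["Engineer", "Lawyer"],
    "Artist": ["Dancer", "Writer"],
}
AGE_TREE = {
    (18, 65): [(18, 40), (40, 65)],
    (18, 40): [(18, 30), (30, 40)],
}

# Hypothetical raw records (job, age, class), picked to reproduce the slide's counts.
RECORDS = [
    ("Engineer", 34, "Y"), ("Lawyer", 50, "N"), ("Engineer", 38, "N"),
    ("Lawyer", 33, "Y"), ("Dancer", 20, "Y"), ("Writer", 37, "N"),
    ("Writer", 32, "Y"), ("Dancer", 25, "N"),
]

def job_leaves(value):
    """Return the set of leaf jobs covered by a (possibly generalized) job value."""
    children = JOB_TREE.get(value)
    if not children:
        return {value}
    return set().union(*(job_leaves(c) for c in children))

def covers(partition, record):
    """True if the generalized (job, age range) partition contains the record."""
    job, (low, high) = partition
    return record[0] in job_leaves(job) and low <= record[1] < high

def specialize(partitions, attr_index, tree):
    """Replace one attribute value by its children in every partition that has them."""
    result = []
    for part in partitions:
        children = tree.get(part[attr_index])
        if children is None:
            result.append(part)
            continue
        for child in children:
            new_part = list(part)
            new_part[attr_index] = child
            result.append(tuple(new_part))
    return result

def class_counts(partition, records):
    """Return (number of Y, number of N) among the records in the partition."""
    counts = Counter(r[2] for r in records if covers(partition, r))
    return counts["Y"], counts["N"]

parts = [("Any_Job", (18, 65))]
parts = specialize(parts, 0, JOB_TREE)   # Any_Job -> Professional, Artist
parts = specialize(parts, 1, AGE_TREE)   # [18-65) -> [18-40), [40-65)
table = {p: class_counts(p, RECORDS) for p in parts}
for partition, (yes, no) in table.items():
    print(partition, f"{yes}Y{no}N", yes + no)
```

The printed rows correspond to the four leaf partitions of the slide: 2Y1N and 0Y1N for Professional, 2Y2N and 0Y0N for Artist.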

Candidate Selection

We favor the specialization with the maximum Score value.

First utility function: ∆u = [value missing in the source]

Second utility function: ∆u = 1
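The slide does not name the selection mechanism. One standard way to favor high scores while remaining differentially private is the exponential mechanism, which picks candidate c with probability proportional to exp(ε × u(c) / (2∆u)), where ∆u is the sensitivity of the utility function. The sketch below assumes that mechanism; the candidate names and scores are hypothetical.

```python
import math
import random

def selection_probabilities(scores, epsilon, sensitivity):
    """Exponential-mechanism probabilities: proportional to exp(eps * u / (2 * sensitivity))."""
    top = max(scores.values())  # subtracting the maximum avoids overflow in exp
    weights = {
        cand: math.exp(epsilon * (score - top) / (2.0 * sensitivity))
        for cand, score in scores.items()
    }
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}

def select_candidate(scores, epsilon, sensitivity, rng=random):
    """Draw one candidate according to the exponential-mechanism probabilities."""
    probs = selection_probabilities(scores, epsilon, sensitivity)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0]

# Hypothetical Score values for two candidate specializations.
scores = {"specialize Job": 4.0, "specialize Age": 5.0}
probs = selection_probabilities(scores, epsilon=1.0, sensitivity=1.0)
print(probs)
print(select_candidate(scores, 1.0, 1.0, rng=random.Random(0)))
```

A utility function with a smaller sensitivity ∆u makes the distribution sharper for the same ε, so the best candidate is chosen more often.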

Running time (the algorithm listing was a figure; only the per-step costs survive):

O(A_pr × |D| log |D|)
O(|candidates|)
O(|D|)
O(|D| log |D|)
O(1)

Overall: O((A_pr + h) × |D| log |D|)

case_chr2_29504091_30044866

rs11686243 AG AG AA AG GG AA AG AA GG AG AA AA AA AA AA AA AA GG GG AG AG AA AG GG AA AA GG AG AG AG GG AG AA AA AG AG AG AG AG AG AA AG GG AG AA GG GG GG GG AG AG AG AG AA GG GG GG AG AA AG GG AG AA GG GG AG AG AG AG AG AA AA AG AG AG AA AG AG AG AG GG AG AG AG GG GG AG AG GG AG AG AG AA AA GG AG AA GG AA AA AG GG AG AG AG AG AG AG AG AG GG GG AA AG AG AG AG AA AG GG AG GG AA AG GG AG AG AG AA AG AG AG GG AG GG AG GG AG AG AG GG AG AG GG GG AG AG GG AA GG AA AG AG AG AG GG AG AA AG GG GG AG AG AG AG AG GG AG AG AA AG AA AA AG GG AA AG AG GG AG GG AG AG GG GG AG AG AA AG AG AG GG AG GG GG AG AG GG AG GG

rs4426491 CC CC CC CT CT CC CT CC CT CT CC CC CC CC CC CC CC CT CT CT CC CC CT CT CC CC CT CC CT CC CT CC CC CC CT …

rs4305230 CC CC CC CT CT CC CT CC TT CT CC CC CC CC CC CC CC TT TT CT CC CC CT CT CC CC CT CC CT CC TT CC CC CC CT …

Raw Data

Case  rs11686243  rs4426491  rs4305230  rs4630725  …
1     AG          CC         …          …          …
2     AG          CC         …          …          …
3     AA          CC         …          …          …
4     AG          CT         …          …          …
5     GG          CT         …          …          …
…     …           …          …          …          …
198   GG          TT         …          …          …
199   AG          CT         …          …          …
200   GG          CT         …          …          …

Blocks/Attributes

Case  rs11686243  rs4426491
1     AG          CC
2     AG          CC
3     AA          CC
4     AG          CT
5     GG          CT

Unique Combinations: AG CC, AA CC, AG CT, GG CT

Taxonomy tree for the block: Any → AG CC, AA CC, AG CT, GG CT
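Treating a block of SNPs as one attribute amounts to concatenating the genotypes of each case and collecting the distinct combinations under a single root. The sketch below shows this for the five cases on the slide; the function names and the dictionary representation of the two-level tree are assumptions made for illustration.

```python
def block_values(cases, snp_ids):
    """Concatenate each case's genotypes for the SNPs of one block."""
    return [" ".join(case[snp] for snp in snp_ids) for case in cases]

def two_level_taxonomy(values):
    """Build a tree with root 'Any' whose children are the unique combinations."""
    unique = list(dict.fromkeys(values))  # drops duplicates, keeps first-seen order
    return {"Any": unique}

# The five cases shown on the slide.
cases = [
    {"rs11686243": "AG", "rs4426491": "CC"},
    {"rs11686243": "AG", "rs4426491": "CC"},
    {"rs11686243": "AA", "rs4426491": "CC"},
    {"rs11686243": "AG", "rs4426491": "CT"},
    {"rs11686243": "GG", "rs4426491": "CT"},
]

block = block_values(cases, ["rs11686243", "rs4426491"])
tree = two_level_taxonomy(block)
print(tree)  # {'Any': ['AG CC', 'AA CC', 'AG CT', 'GG CT']}
```

With this representation a block behaves like a categorical attribute, so the same specialization step used for relational attributes can be applied to it.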

Taxonomy Trees for Attributes

SNP data was split evenly into N/6 blocks (attributes), where N is the number of SNPs.

Hierarchy Tree for Chr2 [figure not included]

Hierarchy Tree for Chr10 [figure not included]

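Genomic Data

Block 1  Block 2  Block 3  Count

Root:
Any      Any      Any      200

After specializing Block 1 (Any → AA CC, AG CC):
AA CC    Any      Any      130
AG CC    Any      Any      70

After specializing Block 3 (Any → CC GG, CT AG):
AA CC    Any      CC GG    60
AA CC    Any      CT AG    70
AG CC    Any      CC GG    30
AG CC    Any      CT AG    40

Taxonomy trees:

Block 1: Any → AG CC, AA CC
Block 3: Any → CC GG, CT AG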

Anonymized Data

Case  rs11686243  rs4426491  rs4305230  rs4630725  …
1     AG          CC         Any        Any        …
2     AG          CC         Any        Any        …
3     …           …          …          …          …
4     …           …          …          …          …
5     AA          CC         Any        Any        …
…     AA          CC         Any        Any        …
…     …           …          …          …          …

Heterogeneous Healthcare Data

      Relational Data   Genomic Data
ID    Job       Age     rs4305230  rs4630725  …  …
1     Engineer  50      AG         CC         …  …
2     Doctor    45      AA         CT         …  …
3     …         …       …          …          …  …

Conclusions

Privacy-Preserving Genomic Data Release
  Tree-based approach is promising

Future work
  Partitioning the SNPs to generate blocks
  Utility function for specialization
  Two-level tree vs. multi-level hierarchy trees
  Single-dimensional vs. multi-dimensional partitioning


Thank You!