Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong


Privacy Preserving Data Publication

Yufei Tao

Department of Computer Science and Engineering

Chinese University of Hong Kong


Centralized publication

Assume that a hospital wants to publish the following table, called the microdata.

The publication must preserve the privacy of patients: prevent an adversary from knowing who-contracted-what.

[Table: Microdata]


Centralized publication (cont.)

A simple solution: remove the column ‘Name’ and publish the rest. It does not work; see the next slide.


Linking attacks

[Figure: an adversary links the published table to a voter registration list through the quasi-identifier (QI) attributes that the two tables share.]


These are real threats

Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}.

A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] found the medical record of an ex-governor of Massachusetts.


Objectives

Publish a distorted version of the dataset so that:

[Privacy] the privacy of all individuals is “adequately” protected;

[Utility] the dataset is useful for analyzing the characteristics of the microdata.

Paradox: Privacy protection ↑, utility ↓.


Issues

Privacy principle: What is adequate privacy protection?

Distortion approach: How to achieve the privacy principle?

The literature has discussed other issues as well: complexities, improving the utility of the published data, etc.


Principle 1: k-anonymity

[Figure: a 2-anonymous generalization with 4 QI groups, with its QI attributes and its sensitive attribute marked, shown next to a voter registration list.]

[Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]
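As a minimal sketch of what k-anonymity asks for, the check below groups rows by their QI values and verifies that every QI group holds at least k rows. The function name `is_k_anonymous` and the four sample rows are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def is_k_anonymous(rows, qi_indexes, k):
    """Return True if every combination of QI values occurs in at least k rows."""
    groups = Counter(tuple(row[i] for i in qi_indexes) for row in rows)
    return all(size >= k for size in groups.values())

# Hypothetical generalized rows: (Age, Sex, Zipcode, Disease).
published = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
]

print(is_k_anonymous(published, qi_indexes=(0, 1, 2), k=2))  # True
print(is_k_anonymous(published, qi_indexes=(0, 1, 2), k=3))  # False
```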


Defects of k-anonymity

What is the disease of Joe?

No “diversity” in this QI group.

[Figure: the published table next to a voter registration list.]


Principle 2: l-diversity

Each QI group should have at least l “well-represented” sensitive values.

Different ways to interpret “well-represented”.

[Machanavajjhala et al., ICDE, 2006]


Naive interpretation

Each QI-group has l different sensitive values.

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
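The naive interpretation can be checked mechanically. The sketch below (the helper name `distinct_diversity` is an assumption for illustration) returns the smallest number of distinct sensitive values found in any QI group of the table above; the table is l-diverse in the naive sense when that number is at least l.

```python
from collections import defaultdict

def distinct_diversity(rows):
    """Smallest number of distinct sensitive values in any QI group.

    Each row is (Age, Sex, Zipcode, Disease); the last field is sensitive.
    """
    values_per_group = defaultdict(set)
    for *qi, sensitive in rows:
        values_per_group[tuple(qi)].add(sensitive)
    return min(len(values) for values in values_per_group.values())

table = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
]

print(distinct_diversity(table))  # 2
```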


Defects of the naive interpretation

Assume that Joe is identified in the QI group. What is the probability that he contracted HIV?

Implication: The most frequent sensitive value in a QI group cannot be too frequent.

But accomplishing only this is still vulnerable to attacks with background knowledge.

[Figure: the Disease column of a QI group with 100 tuples; 98 tuples carry HIV, and the rest include pneumonia and bronchitis.]


Background knowledge attack

Let Joe be an individual in the QI group having HIV. A friend of Joe has the background knowledge: “Joe does not have pneumonia”. How likely would this friend assume that Joe had HIV?

[Figure: the Disease column of a QI group with 100 tuples; 50 tuples carry HIV, 49 carry pneumonia, and the rest carry bronchitis.]


Controlling also the 2nd most frequent value

Even if an adversary can eliminate pneumonia, s/he can only assume that Joe has HIV with 40 / 70 probability.

[Figure: the Disease column of a QI group with 100 tuples; 40 tuples carry HIV, 30 carry pneumonia, and 30 carry bronchitis.]
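The effect of background knowledge on the two QI groups discussed above can be worked out with a few lines of arithmetic. This is a sketch only: the helper `posterior` is hypothetical, and the single leftover tuple in the 50/49 group is assumed to be bronchitis.

```python
from fractions import Fraction

def posterior(counts, target, eliminated):
    """Probability of `target` once the adversary rules out the `eliminated` values."""
    remaining = {value: n for value, n in counts.items() if value not in eliminated}
    return Fraction(remaining[target], sum(remaining.values()))

# 100-tuple QI groups; the 1 bronchitis tuple in `skewed` is an assumption.
skewed = {"HIV": 50, "pneumonia": 49, "bronchitis": 1}
balanced = {"HIV": 40, "pneumonia": 30, "bronchitis": 30}

print(posterior(skewed, "HIV", {"pneumonia"}))    # 50/51
print(posterior(balanced, "HIV", {"pneumonia"}))  # 4/7
```

The second result is the 40 / 70 figure quoted above in lowest terms.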


An example of 4-diversity

[Figure: the Disease column of a QI group, partitioned from top to bottom into the most frequent value, the 2nd, 3rd and 4th most frequent values, and the other values.]


An example of 4-diversity (cont.)

[Figure: the same QI group; the tuples with the most frequent value and the tuples with the other values are marked as having the same cardinality.]


An example of 4-diversity (cont.)

Assume that Joe is a person in the QI group. Property: If an adversary can eliminate only 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability.

[Figure: the Disease column of the QI group, partitioned into HIV, pneumonia, bronchitis, cancer, and the other values.]


l-diversity

Consider a QI group.

m is the number of sensitive values in the group.
$r_1$ is the number of tuples having the most frequent sensitive value.
$r_2$ is the number of tuples having the 2nd most frequent sensitive value.
…
$r_m$ is the number of tuples having the m-th most frequent sensitive value.

Then, $r_1 \le c\,(r_l + \dots + r_m)$, where c is a constant.

If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most 1 / (c + 1).

Called (c, l)-diversity precisely.
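The condition can be tested directly on the frequency counts of one QI group. The sketch below assumes that the comparison lost in the formula is "at most" (≤); the function name `satisfies_c_l_diversity` and the sample counts are illustrative.

```python
def satisfies_c_l_diversity(counts, c, l):
    """Check r1 <= c * (r_l + ... + r_m) for one QI group.

    counts: number of tuples per sensitive value, in any order.
    Assumption: the comparison in the definition is "at most".
    """
    r = sorted(counts, reverse=True)  # r[0] is r1, r[1] is r2, ...
    tail = sum(r[l - 1:])             # r_l + ... + r_m (0 if fewer than l values)
    return r[0] <= c * tail

print(satisfies_c_l_diversity([40, 30, 30], c=2, l=2))  # True:  40 <= 2 * 60
print(satisfies_c_l_diversity([98, 1, 1], c=2, l=2))    # False: 98 >  2 * 2
```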


Defects of l-diversity

Andy does not want anyone to know that he had a stomach problem. Sarah does not mind at all if others find out that she had flu.

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu


Defects of l-diversity (cont.)

Does not work if an individual can have multiple tuples in the microdata.

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu


Defects of l-diversity (cont.)

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu


Principle 3: Personalized anonymity

Key ideas: guarding node + sensitive attribute (SA) generalization.

Assume a publicly-known hierarchy on the sensitive attribute:

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

[Xiao and Tao, SIGMOD, 2006]


Guarding node

[Figure: the hierarchy on the sensitive attribute Disease, rooted at “any illness”.]

Andy does not want anyone to know that he had a stomach problem. He can specify “stomach disease” as the guarding node for his tuple.

Protect Andy from being conjectured to have any disease in the subtree of the guarding node.

Name Age Sex Zipcode Disease guarding node

Andy 4 M 12000 gastric ulcer stomach disease


Guarding node (cont.)

[Figure: the hierarchy on the sensitive attribute Disease, rooted at “any illness”.]

Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.

Name Age Sex Zipcode Disease guarding node

Sarah 28 F 37000 flu Ø


Guarding node (cont.)

[Figure: the hierarchy on the sensitive attribute Disease, rooted at “any illness”.]

Bill does not have any special preference. He sets the guarding node of his tuple to be the same as his sensitive value.

Name Age Sex Zipcode Disease guarding node

Bill 5 M 14000 dyspepsia dyspepsia


A personalized approach

[Figure: the hierarchy on the sensitive attribute Disease, rooted at “any illness”.]

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu
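A guarding node protects every value in its subtree, which can be sketched as a walk up the hierarchy. The parent map below is one reading of the flattened hierarchy figure and should be treated as an assumption; `in_subtree` is a hypothetical helper, and `None` stands in for Ø.

```python
# Child -> parent map; the exact shape of the hierarchy is an assumption.
PARENT = {
    "respiratory system problem": "any illness",
    "digestive system problem": "any illness",
    "respiratory infection": "respiratory system problem",
    "stomach disease": "digestive system problem",
    "flu": "respiratory infection",
    "pneumonia": "respiratory infection",
    "bronchitis": "respiratory infection",
    "gastric ulcer": "stomach disease",
    "dyspepsia": "stomach disease",
    "gastritis": "stomach disease",
}

def in_subtree(value, guarding_node):
    """True if `value` is the guarding node or one of its descendants."""
    if guarding_node is None:  # Ø: the owner asked for no protection
        return False
    while value is not None:
        if value == guarding_node:
            return True
        value = PARENT.get(value)  # climb one level; None above the root
    return False

print(in_subtree("gastric ulcer", "stomach disease"))  # True  (Andy)
print(in_subtree("flu", "stomach disease"))            # False
print(in_subtree("flu", None))                         # False (Sarah)
```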


Personalized anonymity

No adversary should be able to breach the privacy requirement of any guarding node with a probability above pbreach.

If pbreach = 0.3, then no adversary can have more than 30% probability to find out that:
Andy had a stomach disease
Bill had dyspepsia
…

[Table: the microdata with guarding nodes, repeated from the previous slide.]


Why SA generalization?

How many female patients are there with age above 30?
Estimate from pure QI generalization: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) ≈ 3
Real answer: 1

Pure QI generalization

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
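The estimate above assumes that ages are spread uniformly inside each generalized interval. A small sketch of that computation follows; `estimate_count` is a hypothetical helper, and the query range [30, 60] is taken from the formula on the slide.

```python
def estimate_count(groups, lo, hi):
    """Estimate how many tuples have an age in [lo, hi].

    groups: (age_low, age_high, n_tuples) per QI group, with integer ages
    assumed to be uniformly distributed inside each interval.
    """
    total = 0.0
    for age_low, age_high, n in groups:
        overlap = max(0, min(age_high, hi) - max(age_low, lo) + 1)
        total += n * overlap / (age_high - age_low + 1)
    return total

# The two female QI groups of the pure QI generalization.
female_groups = [(11, 20, 2), (21, 60, 4)]
female_ages = [12, 19, 21, 25, 28, 56]  # from the microdata

estimate = estimate_count(female_groups, 30, 60)
exact = sum(30 <= age <= 60 for age in female_ages)
print(round(estimate, 2), exact)  # 3.1 1
```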


SA generalization (cont.)

With SA generalization

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

[Table: the pure QI generalization from the previous slide, shown for comparison.]

[Figure: the hierarchy on the sensitive attribute Disease, rooted at “any illness”.]


Evaluation of disclosure risk

What is the probability that the adversary can find out that “Andy had a stomach disease”?

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The published data

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection


Combinatorial reconstruction (cont.)

Can each individual appear more than once?
No = the primary case
Yes = the non-primary case

Some possible reconstructions:

[Figure: two example assignments of the individuals Andy, Bill, Ken, Nash and Mike to the tuples gastric ulcer, dyspepsia, pneumonia and bronchitis, one for the primary case and one for the non-primary case.]



Breach probability (primary)

Totally 120 possible reconstructions.

If Andy is associated with a stomach disease in nb reconstructions, the probability that the adversary should associate Andy with some stomach problem is nb / 120.

Andy is associated with:
gastric ulcer in 24 reconstructions
dyspepsia in 24 reconstructions
gastritis in 0 reconstructions

nb = 48

The breach probability for Andy’s tuple is 48 / 120 = 2 / 5.

[Figure: the individuals Andy, Bill, Ken, Nash and Mike, the tuples gastric ulcer, dyspepsia, pneumonia and bronchitis, and the hierarchy on Disease.]
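The counts for the primary case can be reproduced by brute-force enumeration. In this sketch, a reconstruction assigns a distinct owner to each of the four tuples of the QI group; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of "stomach disease"

# Primary case: every tuple gets a different owner (ordered choices of 4 out of 5).
reconstructions = list(permutations(people, len(diseases)))

# A reconstruction breaches Andy's guarding node if it gives him a stomach disease.
breaches = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

print(len(reconstructions), breaches, Fraction(breaches, len(reconstructions)))
# 120 48 2/5
```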


Breach probability (non-primary)

Totally 625 possible reconstructions.

Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.

nb = 225

The breach probability for Andy’s tuple is 225 / 625 = 9 / 25.

[Figure: the individuals Andy, Bill, Ken, Nash and Mike, the tuples gastric ulcer, dyspepsia, pneumonia and bronchitis, and the hierarchy on Disease.]
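In the non-primary case an individual may own several tuples, so each of the four tuples independently picks one of the five individuals. The following sketch enumerates those assignments; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of "stomach disease"

# Non-primary case: owners may repeat, giving 5 ** 4 assignments.
reconstructions = list(product(people, repeat=len(diseases)))

# Andy is breached if at least one stomach-disease tuple is assigned to him.
breaches = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

print(len(reconstructions), breaches, Fraction(breaches, len(reconstructions)))
# 625 225 9/25
```

The same number follows from counting the complement: 625 – 4 ∙ 4 ∙ 5 ∙ 5 = 225.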


A defect of personalized anonymity

Does not guard against background knowledge. Recall that l-diversity can achieve this purpose.

But it seems possible to adapt the personalized approach to tackle background knowledge. Future work?


Other privacy principles

k-gather. Due to [Aggarwal et al., PODS, 2006]. Suffers from the problems of k-anonymity.

(α, k)-anonymity. Due to [Wong et al., KDD, 2006].

t-closeness. Recently proposed by [Li and Li, ICDE, 2007].


Issues

Privacy principle: What is adequate privacy protection?

Distortion approach: How to achieve the privacy principle?


Three approaches

Suppression
We do not discuss it because the utility of the resulting table is low; it can be regarded as a special case of generalization.

Generalization
Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]

Anatomy (also called “bucketization”)
Due to [Xiao and Tao, VLDB, 2006]

Each of the above approaches can be integrated with all the privacy principles discussed earlier.


A multidimensional view of generalization

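[Figure: tuples 1 to 8 plotted as points with x = Age (ticks 20 to 70) and y = Zipcode (ticks 10k to 60k); tuples 6 and 7 coincide; two rectangles, R1 and R2, are drawn over the points.]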


Taxonomy of generalization

Local recoding: (generalized) rectangles may overlap. Suppression is a special case of local recoding.

Global recoding: all rectangles are disjoint.

[LeFevre et al., SIGMOD, 2005]


Taxonomy of generalization (cont.)

Global recoding can be further divided.

Single-dimension recoding: rectangles form a grid.

Multi-dimension recoding: the opposite of single-dimension recoding.


Taxonomy of generalization (cont.)

Single-dimension recoding can be further divided:
Full-domain recoding
Full-subtree recoding

Both assume a hierarchy on each QI attribute. Example: A hierarchy on Age

[1, 90]
[1, 30]  [31, 60]  [61, 90]
[1, 10] [11, 20] [21, 30]  [31, 40] [41, 50] [51, 60]  [61, 70] [71, 80] [81, 90]
1, 2, 3, …, 10 ...


Taxonomy of generalization (cont.)

Full-domain recoding: all Age values must be generalized to the same level of the hierarchy.

[Figure: the hierarchy on Age, from the exact values up to [1, 90].]
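Full-domain recoding on the Age hierarchy can be sketched as a single function applied with one level to every value. The interval widths per level (1, 10, 30, 90) are read off the hierarchy; the names `LEVEL_WIDTH`, `generalize_age` and `full_domain` are assumptions for illustration.

```python
# Interval width at each level of the Age hierarchy (level 0 = exact value).
LEVEL_WIDTH = {0: 1, 1: 10, 2: 30, 3: 90}

def generalize_age(age, level):
    """Map an age in 1..90 to its ancestor interval at the given level."""
    width = LEVEL_WIDTH[level]
    low = (age - 1) // width * width + 1
    return (low, low + width - 1)

def full_domain(ages, level):
    """Full-domain recoding: every value is generalized to the same level."""
    return [generalize_age(age, level) for age in ages]

print(full_domain([23, 27, 35, 59], level=2))
# [(1, 30), (1, 30), (31, 60), (31, 60)]
```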


Taxonomy of generalization (cont.)

Full-subtree recoding: the subtrees of all generalized values must be disjoint.

Permissible generalization: [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].

Illegal generalization: [1, 10], [1, 30], [31, 60], [61, 90].

[Figure: the hierarchy on Age, from the exact values up to [1, 90].]
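Because every node of the Age hierarchy is an interval, disjoint subtrees correspond to non-overlapping intervals. The sketch below checks only that property and assumes each interval passed in is already a node of the hierarchy; `subtrees_disjoint` is a hypothetical helper.

```python
def subtrees_disjoint(intervals):
    """True if no two generalized values (closed intervals) overlap."""
    ordered = sorted(intervals)
    return all(prev_high < low
               for (_, prev_high), (low, _) in zip(ordered, ordered[1:]))

permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

print(subtrees_disjoint(permissible))  # True
print(subtrees_disjoint(illegal))      # False: [1, 10] lies inside [1, 30]
```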


Why all these generalization types?

Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).


Why all these generalization types?

Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.

level 3: [1, 90]
level 2: [1, 30]  [31, 60]  [61, 90]
level 1: [1, 10] [11, 20] [21, 30]  [31, 40] [41, 50] [51, 60]  [61, 70] [71, 80] [81, 90]
level 0: 1, 2, 3, …, 10 ...


Why all these generalization types?

Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.


Generalization algorithms

Operate on a quality metric. Examples:
The generalization level (for full-domain recoding)
Total rectangle size (for local recoding)
…

Mostly heuristics-based. Finding the optimal generalization is often NP-hard.

[Figure: a generalization hierarchy with levels 0 to 3.]


Defect of generalization

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

Age Sex Zipcode Disease

[21, 60] M [10001, 60000] pneumonia

[21, 60] M [10001, 60000] dyspepsia

[21, 60] M [10001, 60000] dyspepsia

[21, 60] M [10001, 60000] pneumonia

[61, 70] F [10001, 60000] flu

[61, 70] F [10001, 60000] gastritis

[61, 70] F [10001, 60000] flu

[61, 70] F [10001, 60000] bronchitis

Estimated answer: 2p, where p is the probability that each of the two tuples satisfies the query conditions on the Age and Zipcode.


Defect of generalization (cont.)

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05

Estimated answer for Query A: 2p = 0.1

Age Sex Zipcode Disease

[21, 60] M [10001, 60000] pneumonia

[21, 60] M [10001, 60000] pneumonia
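The value p = 0.05 can be reproduced by intersecting the rectangle R1 with the query region, one dimension at a time. This is a sketch under the uniformity assumption implied by the area ratio; `overlap_fraction` is a hypothetical helper, and ranges are treated as closed integer intervals.

```python
def overlap_fraction(rect, query):
    """Fraction of `rect` covered by `query`; both are lists of (low, high) ranges."""
    fraction = 1.0
    for (low, high), (q_low, q_high) in zip(rect, query):
        overlap = max(0, min(high, q_high) - max(low, q_low) + 1)
        fraction *= overlap / (high - low + 1)
    return fraction

r1 = [(21, 60), (10001, 60000)]  # generalized Age and Zipcode of the QI group
q = [(0, 30), (10001, 20000)]    # Age and Zipcode conditions of Query A

p = overlap_fraction(r1, q)
print(round(p, 4), round(2 * p, 4))  # 0.05 0.1
```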


Defect of generalization (cont.)

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

Estimated answer = 0.1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

The exact answer = 1


Defect of generalization (cont.)

Cause of inaccuracy: QI distribution inside each QI group is lost!

Age Sex Zipcode Disease

[21, 60] M [10001, 60000] pneumonia

[21, 60] M [10001, 60000] pneumonia


Anatomy

Releases a quasi-identifier table (QIT) and a sensitive table (ST).

Quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Age Sex Zipcode Disease

23 M 11000 pneumonia

27 M 13000 dyspepsia

35 M 59000 dyspepsia

59 M 12000 pneumonia

61 F 54000 flu

65 F 25000 gastritis

65 F 25000 flu

70 F 30000 bronchitis

Microdata


Anatomy (cont.)

1. Decide an l-diverse partition of the tuples.

Age Sex Zipcode Disease

23 M 11000 pneumonia

27 M 13000 dyspepsia

35 M 59000 dyspepsia

59 M 12000 pneumonia

61 F 54000 flu

65 F 25000 gastritis

65 F 25000 flu

70 F 30000 bronchitis

QI group 1

QI group 2

A 2-diverse partition


Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.

         quasi-identifier table (QIT)  |  sensitive table (ST)
         Age  Sex  Zipcode             |  Disease
group 1  23   M    11000               |  pneumonia
group 1  27   M    13000               |  dyspepsia
group 1  35   M    59000               |  dyspepsia
group 1  59   M    12000               |  pneumonia
group 2  61   F    54000               |  flu
group 2  65   F    25000               |  gastritis
group 2  65   F    25000               |  flu
group 2  70   F    30000               |  bronchitis


Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis
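Step 2 can be sketched in a few lines once the partition from step 1 is given. The function name `anatomize` is an assumption, and the ST is produced in its final form, with one row per (group, disease) pair and a count.

```python
from collections import Counter

def anatomize(groups):
    """Build the QIT and the ST from a partition of the microdata.

    groups: list of QI groups, each a list of (age, sex, zipcode, disease).
    Returns (qit, st): qit rows are (age, sex, zipcode, group_id),
    st rows are (group_id, disease, count).
    """
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        qit.extend((age, sex, zipcode, group_id) for age, sex, zipcode, _ in group)
        counts = Counter(disease for *_, disease in group)
        st.extend((group_id, disease, n) for disease, n in sorted(counts.items()))
    return qit, st

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]

qit, st = anatomize(partition)
for row in st:
    print(row)
```

The QIT keeps the exact QI values, which is what later allows more accurate query estimates than generalization.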


Privacy preservation

Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

A voter registration record

Name  Age  Sex  Zipcode
Bob   23   M    11000


Accuracy of data analysis

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

Quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1


Accuracy of data analysis

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

2 patients contracted pneumonia.
2 out of 4 patients satisfy the query conditions on Age and Zipcode.
Estimated answer = 2 * 2 / 4 = 1.

Tuple  Age  Sex  Zipcode  Group-ID
t1     23   M    11000    1
t2     27   M    13000    1
t3     35   M    59000    1
t4     59   M    12000    1
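The estimate above generalizes to any count query of this shape: for each group that contains the queried disease, multiply its count by the fraction of the group's QIT rows that satisfy the QI conditions. The sketch below uses a hypothetical function `estimate` over the QIT and ST shown earlier.

```python
qit = [  # (age, sex, zipcode, group_id)
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [  # (group_id, disease, count)
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease with closed ranges on Age and Zipcode."""
    total = 0.0
    for group_id, group_disease, count in st:
        if group_disease != disease:
            continue
        members = [row for row in qit if row[3] == group_id]
        hits = sum(
            age_range[0] <= age <= age_range[1] and zip_range[0] <= zipcode <= zip_range[1]
            for age, _, zipcode, _ in members
        )
        total += count * hits / len(members)
    return total

print(estimate(qit, st, "pneumonia", (0, 30), (10001, 20000)))  # 1.0
```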


A defect of anatomy

Existence breach: Does an individual exist in the microdata?


Future work

Re-publication

Tackle stronger background knowledge
Recent work [Martin et al., ICDE, 2007]

Improving utility
Pioneering work [Kifer and Gehrke, SIGMOD, 2006]

Application to specific (non-trivial) applications
Location privacy
Pioneering work [Mokbel et al., VLDB, 2006]