HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION Presented by: Michael Cheng Supervisor: Dr. William Cheung Co-Supervisor: Dr. Byron Choi


Page 1: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION
Presented by: Michael Cheng
Supervisor: Dr. William Cheung
Co-Supervisor: Dr. Byron Choi

Page 2: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Presentation Flow

Privacy-Preserving Data Publishing
Introduction to Emerging Patterns (EPs)
Introduction to Equivalence Class
Introduction to Generalization
Proposed Problem and Motivation
Heuristic for the Problem
Experimental Results
Future Research Plan

Page 3: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Privacy-Preserving Data Publishing - Introduction

Organizations often need to publish or share their data for legitimate reasons.

Sensitive information (e.g. personal identities, restrictive patterns) may be inferred from the published data.

Page 4: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Privacy-Preserving Data Publishing - Objective

Transform the dataset before publishing, such that:
1. Sensitive information is hidden. In our case: Emerging Patterns (EPs).
2. Subsequent analysis remains possible. In our case: Frequent Itemset (FIS) mining.

Page 5: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Introduction to Emerging Patterns (EPs)

Emerging Patterns (EPs) are itemsets, defined over a pair of datasets, whose support is significant in one dataset but insignificant in the other.

Edu   Occup     Marital
BA    Exec      Married
BA    Exec      Married
BA    Exec      Married
BA    Exec      Married
MSE   Worker    Never

Edu   Occup     Marital
MSE   Exec      Married
MSE   Exec      Married
BA    Exec      Married
BA    Manager   Married
BA    Repair    Never

Class labels of the two datasets: Income >= 50k, Income < 50k

{MSE, Exec} is an Emerging Pattern.

Page 6: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Introduction to Emerging Patterns (EPs)

Formally, growth rate and EPs are defined as follows:
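The formula on this slide did not survive extraction. For reference, the definition commonly used in the emerging-pattern literature is sketched below; the symbols D_1, D_2, supp and the threshold ρ are notation assumed here, not taken from the slide.

```latex
% Growth rate of an itemset X from dataset D_1 to dataset D_2
\[
GR_{D_1 \to D_2}(X) =
\begin{cases}
0 & \text{if } supp_{D_1}(X) = 0 \text{ and } supp_{D_2}(X) = 0,\\[4pt]
\infty & \text{if } supp_{D_1}(X) = 0 \text{ and } supp_{D_2}(X) > 0,\\[4pt]
\dfrac{supp_{D_2}(X)}{supp_{D_1}(X)} & \text{otherwise.}
\end{cases}
\]
% X is an emerging pattern (a rho-EP) from D_1 to D_2 when its growth rate
% reaches a user-given threshold rho > 1:
\[
X \text{ is a } \rho\text{-EP} \iff GR_{D_1 \to D_2}(X) \ge \rho .
\]
```

Under this reading, the "Growth rate of EP" values quoted later in the slides act as the threshold ρ.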

Page 7: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Introduction to Equivalence Class

Tuples are said to be in the same equivalence class w.r.t. a set of attributes A if they take the same values on A.

ID   Edu   Occup     Marital
1    MSE   Exec      Married
2    MSE   Exec      Married
3    BA    Exec      Married
4    BA    Manager   Married
5    BA    Repair    Never

Tuples {1, 2, 3} are in the same equivalence class w.r.t. {Occup, Marital}.
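A minimal sketch of how the equivalence classes on this slide could be computed, using the five example tuples. The names `ATTRS`, `TUPLES` and `equivalence_classes` are illustrative, not from the presentation.

```python
from collections import defaultdict

# Example table from the slide: ID -> (Edu, Occup, Marital)
ATTRS = ("Edu", "Occup", "Marital")
TUPLES = {
    1: ("MSE", "Exec", "Married"),
    2: ("MSE", "Exec", "Married"),
    3: ("BA", "Exec", "Married"),
    4: ("BA", "Manager", "Married"),
    5: ("BA", "Repair", "Never"),
}


def equivalence_classes(tuples, attrs):
    """Group tuple IDs by the values they take on the attributes in `attrs`."""
    positions = [ATTRS.index(a) for a in attrs]
    classes = defaultdict(list)
    for tid, row in tuples.items():
        key = tuple(row[i] for i in positions)
        classes[key].append(tid)
    return dict(classes)


classes = equivalence_classes(TUPLES, ("Occup", "Marital"))
print(classes[("Exec", "Married")])  # [1, 2, 3]
```

Changing the attribute set changes the partition: w.r.t. {Edu} alone, tuples 1 and 2 form one class and tuples 3 to 5 another.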

Page 8: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Introduction to Generalization

Extensively studied for achieving k-anonymity
Not studied before for hiding itemsets

Modify the original values in the dataset into more general values according to a user-given hierarchy, so that more tuples share the same set of attribute values.

Example: in Adult, “BA” and “MSE” may be generalized to “Degree Holder”.
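A generalization hierarchy can be sketched as a child-to-parent map. Only the BA/MSE to "Degree Holder" edges come from the slide; the root value "Any Education" and the function name `generalize` are assumptions made for illustration.

```python
# Hypothetical child -> parent hierarchy. Only the two edges into
# "Degree Holder" are from the slide; "Any Education" is made up.
HIERARCHY = {
    "BA": "Degree Holder",
    "MSE": "Degree Holder",
    "Degree Holder": "Any Education",
}


def generalize(value, levels=1, hierarchy=HIERARCHY):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in hierarchy:
            break  # already at the most general value
        value = hierarchy[value]
    return value


print(generalize("BA"))     # Degree Holder
print(generalize("MSE"))    # Degree Holder
print(generalize("BA", 5))  # Any Education
```

After one step, "BA" and "MSE" tuples become indistinguishable on Edu, which is what lets more tuples share the same attribute values.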

Page 9: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Types of Generalization

Single-Dimensional Global Recoding
Multi-Dimensional Global Recoding
Multi-Dimensional Local Recoding

Page 10: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Single-Dimensional Global Recoding

If we decide to generalize some values to a single value, all tuples that contain these values are affected.

Occup (before)   Occup (after single-dimensional global recoding)
Exec             Occupation
Exec             Occupation
Exec             Occupation
Manager          Occupation
Repair           Occupation

Page 11: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Multi-Dimensional Global Recoding

If we decide to generalize some values to a single value, all tuples in the same equivalence class that contain those values are affected.

Occup (before)   Occup (after multi-dimensional global recoding)
Exec             Occupation
Exec             Occupation
Exec             Occupation
Manager          Manager
Repair           Repair

Page 12: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Multi-Dimensional Local Recoding

Same as multi-dimensional global recoding, but without the equivalence class constraint: tuples in the same equivalence class may be recoded differently.

Occup (before)   Occup (after multi-dimensional local recoding)
Exec             Occupation
Exec             Occupation
Exec             Exec
Manager          Manager
Repair           Repair
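The three recoding types can be contrasted on one small table. This is a sketch under assumptions: the slides show only the Occup column, so the Edu and Marital values are borrowed from the earlier equivalence-class example, and all function names are illustrative.

```python
# Occup values as on the slides; Edu/Marital are assumed from the earlier example.
ROWS = [
    {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]


def single_dim_global(rows, attr, values, general):
    """Recode `attr` in every tuple whose value is one of `values`."""
    return [dict(r, **{attr: general}) if r[attr] in values else dict(r)
            for r in rows]


def multi_dim_global(rows, attr, key_attrs, key, general):
    """Recode `attr` in every tuple of one equivalence class w.r.t. `key_attrs`."""
    return [dict(r, **{attr: general})
            if tuple(r[a] for a in key_attrs) == key else dict(r)
            for r in rows]


def local_recoding(rows, attr, positions, general):
    """Recode `attr` only in the tuples at the chosen row positions."""
    return [dict(r, **{attr: general}) if i in positions else dict(r)
            for i, r in enumerate(rows)]


def occup(rows):
    return [r["Occup"] for r in rows]


single = single_dim_global(ROWS, "Occup", {"Exec", "Manager", "Repair"}, "Occupation")
multi = multi_dim_global(ROWS, "Occup", ("Occup", "Marital"), ("Exec", "Married"), "Occupation")
local = local_recoding(ROWS, "Occup", {0, 1}, "Occupation")

print(occup(single))  # ['Occupation', 'Occupation', 'Occupation', 'Occupation', 'Occupation']
print(occup(multi))   # ['Occupation', 'Occupation', 'Occupation', 'Manager', 'Repair']
print(occup(local))   # ['Occupation', 'Occupation', 'Exec', 'Manager', 'Repair']
```

Local recoding is the most flexible of the three: it can touch two of the three Exec tuples and leave the third untouched, which is what the later "Intuition of Local Recoding" slides rely on.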

Page 13: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Proposed Problem - Why EP and FIS?

Emerging patterns may reveal sensitive information.

E.g. in the Adult dataset from the UCI Repository, we found that {Never-Married, Own-Child} is an EP from the class “Income < 50k” to the class “Income >= 50k”, with growth rate 35.

Frequent itemset mining is a popular data mining task and is supported by commercial data-mining software.

Page 14: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Proposed Problem - Why Generalization?

Other methods studied in PPDP (for example adding unknowns, removing tuples, or adding fake tuples randomly) yield either:
  incomplete information, or
  fake information.

In some applications, completeness and truthfulness of the data are important.

By using generalization, we can preserve both the completeness and the truthfulness of the data.

Page 15: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Proposed Problem - Problem Illustration

[Diagram: D is transformed into D’ by local recoding; labels: Emerging Patterns, Frequent Itemsets]

Page 16: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Intuition of Local Recoding

Support of FIS = 40%
Growth rate of EP = 3

Frequent Itemset = {Exec, Married}
Emerging Pattern = {MSE, Exec}

Edu   Occup     Marital
MSE   Exec      Married
MSE   Exec      Married
BA    Exec      Married
BA    Manager   Married
BA    Repair    Never

Edu   Occup     Marital
BA    Exec      Married
BA    Exec      Married
BA    Exec      Married
BA    Worker    Married
MSE   Manager   Never

Class labels of the two datasets: Income >= 50k, Income < 50k

Page 17: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Intuition of Local Recoding

Before local recoding:

Edu   Occup     Marital
MSE   Exec      Married
MSE   Exec      Married
BA    Exec      Married
BA    Manager   Married
BA    Repair    Never

Edu   Occup     Marital
BA    Exec      Married
BA    Exec      Married
BA    Exec      Married
BA    Worker    Married
MSE   Manager   Never

After local recoding:

Edu   Occup       Marital
MSE   White Col   Married
MSE   White Col   Married
BA    Exec        Married
BA    Manager     Married
BA    Repair      Never

Edu   Occup       Marital
BA    Exec        Married
BA    Exec        Married
BA    Exec        Married
BA    Worker      Married
MSE   White Col   Never

Class labels of the two datasets (before and after): Income >= 50k, Income < 50k
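The effect of this local recoding can be checked numerically. The sketch below makes several assumptions: `d_a` is the table that contains the {MSE, Exec} tuples and `d_b` is the other one (the slides do not say unambiguously which income class each table belongs to), the growth rate follows the standard support-ratio definition, and the support of the frequent itemset is computed over both tables together.

```python
import math


def support(dataset, itemset):
    """Fraction of tuples that contain every item of `itemset`."""
    return sum(itemset <= row for row in dataset) / len(dataset)


def growth_rate(d_from, d_to, itemset):
    """Support ratio from d_from to d_to (0 if absent in both, inf if absent in d_from only)."""
    s_from, s_to = support(d_from, itemset), support(d_to, itemset)
    if s_from == 0:
        return 0.0 if s_to == 0 else math.inf
    return s_to / s_from


def rows(*tuples):
    return [frozenset(t) for t in tuples]


# Tables before local recoding
d_a = rows(("MSE", "Exec", "Married"), ("MSE", "Exec", "Married"),
           ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
           ("BA", "Repair", "Never"))
d_b = rows(("BA", "Exec", "Married"), ("BA", "Exec", "Married"),
           ("BA", "Exec", "Married"), ("BA", "Worker", "Married"),
           ("MSE", "Manager", "Never"))

# Tables after local recoding (Exec / Manager -> White Col in three tuples)
d_a2 = rows(("MSE", "White Col", "Married"), ("MSE", "White Col", "Married"),
            ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
            ("BA", "Repair", "Never"))
d_b2 = rows(("BA", "Exec", "Married"), ("BA", "Exec", "Married"),
            ("BA", "Exec", "Married"), ("BA", "Worker", "Married"),
            ("MSE", "White Col", "Never"))

gr_before = growth_rate(d_b, d_a, {"MSE", "Exec"})          # inf: absent from d_b
gr_after = growth_rate(d_b2, d_a2, {"MSE", "White Col"})    # 2.0: below the threshold 3
fis_before = support(d_a + d_b, {"Exec", "Married"})        # 0.6
fis_after = support(d_a2 + d_b2, {"Exec", "Married"})       # 0.4: still meets 40%

print(gr_before, gr_after, fis_before, fis_after)
```

Under these assumptions the generalized itemset {MSE, White Col} has growth rate 2, below the threshold of 3, so the emerging pattern is hidden, while {Exec, Married} keeps a support of 40% and remains frequent.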

Page 18: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Heuristic for the Problem - Greedy Approach

[Flow chart] Repeat until all emerging patterns are removed:
D -> Emerging Patterns Mining -> EPs (EP 1, EP 2, EP 3, EP 4) -> Equivalence Classes with their utility gain -> Applying the generalization -> D

Equivalence Class   Utility Gain
Class 1             40
Class 2             90
Class 3             60
Class 4             20
Class 5             15
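The greedy loop on this slide can be sketched as a generic driver. The mining, candidate-generation, scoring and recoding steps are passed in as functions because the slides do not specify them; every name below is hypothetical.

```python
def greedy_hide(data, mine_eps, candidates, utility_gain, apply_recoding,
                max_rounds=1000):
    """Repeat: mine EPs, pick the candidate with the largest utility gain, apply it."""
    for _ in range(max_rounds):
        eps = mine_eps(data)
        if not eps:
            break  # all emerging patterns are removed
        options = candidates(data, eps)
        if not options:
            break  # nothing left to generalize
        best = max(options, key=lambda c: utility_gain(data, eps, c))
        data = apply_recoding(data, best)
    return data


# Selection step alone, with the utility gains shown on the slide.
GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60, "Class 4": 20, "Class 5": 15}
best_class = max(GAINS, key=GAINS.get)
print(best_class)  # Class 2
```

With the gains from the slide, the greedy choice is always Class 2; the next slide explains why always taking the maximum can be a drawback.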

Page 19: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Heuristic for the Problem - Greedy Approach

Drawback: the search may be trapped in a local minimum.
Solution: a simulated-annealing-style approach for choosing the equivalence class.

Page 20: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Heuristic for the Problem - Simulated Annealing Style Approach

Choose the equivalence class probabilistically.

Two parameters:
  Initial temperature (T0)
  Cooling rate (α)

Acceptance probability: exp(Utility Gain / Temperature), normalized over the candidate equivalence classes
Temperature updating: T_n = α × T_(n-1)

Acceptance probability for different utility gains and temperatures:

Utility Gain   T=1000   T=100   T=10
90             0.209    0.302   0.945
60             0.203    0.223   0.047
40             0.199    0.183   0.006
20             0.195    0.150   0.0009
15             0.194    0.142   0.0005
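The table values are consistent with normalizing exp(gain / T) over the five candidates, so that each column sums to 1. The sketch below recomputes the table under that reading; the function name `selection_probabilities` is illustrative.

```python
import math


def selection_probabilities(gains, temperature):
    """exp(gain / T) for each candidate, normalized so the values sum to 1."""
    weights = [math.exp(g / temperature) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


GAINS = [90, 60, 40, 20, 15]
TABLE = {t: selection_probabilities(GAINS, t) for t in (1000, 100, 10)}

for t, probs in TABLE.items():
    # Printed values match the slide's table up to rounding.
    print(f"T={t}:", [round(p, 4) for p in probs])
```

At a high temperature the choice is close to uniform, and as the temperature drops the probability mass concentrates on the candidate with the largest utility gain, so the procedure gradually approaches the greedy one.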

Page 21: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Heuristic for the Problem - Simulated Annealing Style Approach

[Flow chart] Repeat until all emerging patterns are removed:
D -> Emerging Patterns Mining -> EPs (EP 1, EP 2, EP 3, EP 4) -> Equivalence Classes with their selection probability -> Applying the generalization and decreasing the temperature -> D

Equivalence Class   Probability
Class 1             0.2
Class 2             0.4
Class 3             0.1
Class 4             0.25
Class 5             0.05
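One round of the simulated-annealing-style loop differs from the greedy one in two places: the equivalence class is drawn at random according to its probability, and the temperature is lowered afterwards. A minimal sketch of both pieces, using the probabilities on this slide and the initial temperature 10 and cooling rate 0.4 quoted in the experimental setup; the helper names and the fixed seed are assumptions.

```python
import random


def temperatures(t0, alpha, steps):
    """Yield T_0, T_1, ... following T_n = alpha * T_(n-1)."""
    t = t0
    for _ in range(steps):
        yield t
        t *= alpha


def pick_class(probabilities, rng):
    """Draw one equivalence class according to its selection probability."""
    names = list(probabilities)
    weights = [probabilities[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


PROBS = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.1,
         "Class 4": 0.25, "Class 5": 0.05}

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
draws = [pick_class(PROBS, rng) for _ in range(10_000)]
schedule = list(temperatures(10, 0.4, 4))  # roughly 10, 4, 1.6, 0.64

print(schedule)
print({name: draws.count(name) for name in PROBS})
```

Class 2 is drawn most often but not always, which is what lets the search leave a local minimum that the greedy approach would stay in.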

Page 22: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Two questions

How to choose an EP for generalization?
How to calculate the utility gain?

Page 23: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

How to choose an EP for generalization?

Choose the EP that overlaps the most with the remaining EPs; generalizing it is more likely to hide other EPs simultaneously.

Emerging Patterns
{MSE, Never-Married}
{BA, Divorced}
{BA, Divorced, Worker}
{BA, Divorced, Repairman}
{BA, Divorced, Own-Child}
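The slide does not say how overlap is measured. As one possible reading, the sketch below scores each EP by the sum of its Jaccard similarities with all other EPs and picks the highest score; the Jaccard measure and the function names are assumptions, not the authors' definition.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def most_overlapping(eps):
    """Return the EP whose summed similarity to all other EPs is largest."""
    def score(i):
        return sum(jaccard(eps[i], other)
                   for j, other in enumerate(eps) if j != i)
    return eps[max(range(len(eps)), key=score)]


EPS = [
    frozenset({"MSE", "Never-Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]

chosen = most_overlapping(EPS)
print(sorted(chosen))  # ['BA', 'Divorced']
```

With this measure {BA, Divorced} scores highest, since it is contained in three of the other four patterns; generalizing its items is therefore likely to affect those three as well.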

Page 24: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

How to calculate utility gain?

Utility gain is a function of:
  Recoding Distance (RD)
  Reduction of Growth Rate (RG)

Page 25: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

How to calculate utility gain? - Recoding Distance (RD)

The detailed derivation is given in the paper. Intuitively, RD measures:
  How many FIS have been generalized, and by how much?
  How many FIS have disappeared?

High-level definition of RD:

θq × (generalized FIS) + (1 - θq) × (disappeared FIS)

where θq is a user-defined parameter. The larger the value of RD, the greater the distortion of the frequent itemsets.

Page 26: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

How to calculate utility gain? - Reduction of Growth Rate (RG)

After a local recoding is applied, RG is defined as the reduction in the growth rate of all EPs.

Before local recoding:

Emerging Pattern        Growth Rate
{Executive, Married}    10
{BA, Divorced}          20
{Executive}             30
Sum of growth rates     60

After local recoding:

Emerging Pattern        Growth Rate
{White Col, Married}    5
{BA, Divorced}          20
Sum of growth rates     25

RG = 60 - 25 = 35
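The arithmetic on this slide, written out as a small sketch. The dictionary layout and the function name are illustrative; the growth rates are the ones shown on the slide.

```python
def reduction_of_growth_rate(before, after):
    """RG = sum of EP growth rates before the recoding minus the sum after it."""
    return sum(before.values()) - sum(after.values())


BEFORE = {
    ("Executive", "Married"): 10,
    ("BA", "Divorced"): 20,
    ("Executive",): 30,
}
AFTER = {
    ("White Col", "Married"): 5,
    ("BA", "Divorced"): 20,
}

rg = reduction_of_growth_rate(BEFORE, AFTER)
print(rg)  # 35
```

An EP that is no longer listed after the recoding, such as {Executive} here, contributes its whole former growth rate to RG.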

Page 27: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

How to calculate utility gain?

Putting all these together, utility gain is defined as:

θp × RG - (1 - θp) × RD

where θp is a user-defined parameter.

It favors local recodings that greatly reduce the growth rate.
It penalizes local recodings that generate large distortion on the FIS.
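The two high-level formulas combine as follows. This is only a sketch: the inputs for the generalized and disappeared FIS terms are made-up numbers, since the detailed definition of RD is deferred to the paper, and the parameter values 0.5 are arbitrary.

```python
def recoding_distance(generalized_fis, disappeared_fis, theta_q):
    """High-level RD: weighted mix of generalized and disappeared FIS terms."""
    return theta_q * generalized_fis + (1 - theta_q) * disappeared_fis


def utility_gain(rg, rd, theta_p):
    """Reward reduction of growth rate (RG), penalize recoding distance (RD)."""
    return theta_p * rg - (1 - theta_p) * rd


# Hypothetical inputs: only RG = 35 comes from the slides.
rd = recoding_distance(generalized_fis=4.0, disappeared_fis=2.0, theta_q=0.5)
gain = utility_gain(rg=35, rd=rd, theta_p=0.5)
print(rd, gain)  # 3.0 16.0
```

Setting θp close to 1 makes the heuristic care mostly about hiding EPs, while a value close to 0 makes it care mostly about keeping the frequent itemsets intact.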

Page 28: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Experimental Setup

Dataset: Adult dataset from the UCI Repository
  Popular benchmark dataset used for generalization
  Total number of records: 30162
    Income > 50k: 7508
    Income <= 50k: 22654
  Only 8 categorical attributes are used in the experiment
  A well-accepted hierarchy is defined

Parameters:
  Support of FIS: 40%
  Growth rate of EP: 5
  Initial temperature: 10
  Cooling rate: 0.4

Page 29: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Performance

[Chart: RD / no. of FIS disappeared for the greedy approach]
[Chart: RD / no. of FIS disappeared for the simulated-annealing-style approach (best of 5)]

Maximum RD: 623.1

Page 30: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Runtime (in minutes)

[Chart: greedy approach vs. simulated-annealing-style approach (best of 5)]

Page 31: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Future Research Plan

Hide EPs in temporal datasets
Consider multi-level FIS
Hide a group of emerging patterns at a time

Page 32: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Q & A

Any Questions?