Data Anonymization (1)

Outline
• Problem
• Concepts
• Algorithms on domain generalization hierarchy
• Algorithms on numerical data

The Massachusetts Governor Privacy Breach

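[Figure: Medical Data and Voter List overlap on three shared attributes]

Medical Data
• Name • SSN • Visit Date • Diagnosis • Procedure • Medication • Total Charge

Voter List
• Name • Address • Date Registered • Party affiliation • Date last voted

Shared attributes (quasi-identifier)
• Zip • Birth date • Sex

• The Governor of MA was uniquely identified using Zip code, Birth Date, and Sex, so the Name in the voter list could be linked to the Diagnosis in the medical data.

• 87% of the US population is uniquely identified by these three attributes.

Sweeney, IJUFKS 2002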

Definition
• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• k-anonymity: every combination of QI values in the table appears in at least k records
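
The k-anonymity definition can be checked mechanically. The following is a minimal sketch, assuming records are Python dictionaries and that the quasi-identifier is given as a list of column names; the function name `is_k_anonymous` and the sample rows are illustrative and not from the slides.

```python
from collections import Counter

def is_k_anonymous(records, qi, k):
    """Return True if every QI value combination occurs in at least k records."""
    # Count how many records share each combination of quasi-identifier values.
    counts = Counter(tuple(r[a] for a in qi) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical sample table: the third record is alone in its QI group.
records = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "sex"], 2))
```

Dropping (suppressing) the third record would leave a table that passes the check for k = 2.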

Basic techniques
• Generalization
  • Zip {02138, 02139} → 0213*
  • Domain generalization hierarchy: A0 → A1 → … → An
  • E.g. {02138, 02139} → 0213* → 021* → 02* → 0**
  • This hierarchy is a tree structure
• Suppression
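
Generalization and suppression can be sketched together for a single Zip column. This is an illustration under stated assumptions: level i of the hierarchy masks the last i digits (written here with one `*` per masked digit, e.g. `021**`), and records whose generalized value still has fewer than k members are suppressed (returned as `None`). The helper names are hypothetical.

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code with '*'."""
    keep = max(len(zip_code) - level, 0)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def anonymize_zips(zips, k, level):
    """Generalize every zip to `level`, then suppress values in groups smaller than k."""
    generalized = [generalize_zip(z, level) for z in zips]
    counts = Counter(generalized)
    # None marks a suppressed record.
    return [g if counts[g] >= k else None for g in generalized]

zips = ["02138", "02139", "02141", "94305"]
print(anonymize_zips(zips, k=2, level=1))
print(anonymize_zips(zips, k=2, level=2))
```

Moving from level 1 to level 2 trades precision for fewer suppressed records, which is the balance the algorithms try to optimize.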

Balance
• Better privacy guarantee vs. lower data utility
• Many schemes satisfy the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.
• Suppression is required if we cannot find a k-anonymity group for a record.

Criteria
• Minimal generalization: a minimal generalization that satisfies the k-anonymization specification
• Minimal table distortion: a minimal generalization with minimal utility loss
  • Use precision to evaluate the loss [Sweeney papers]
  • Application-specific utility

Complexity of finding the optimal generalization
• NP-hard (Bayardo, ICDE05)
• So all proposed algorithms are approximate algorithms

Shared features in different solutions
• Always satisfy the k-anonymity specification
• If some records do not, suppress them

Differences are in the utility loss/cost function
• Sweeney's precision metric
• Discernibility and classification metrics
• Information-privacy metric

Algorithms
• Assume the domain generalization hierarchy is given
• Efficiency
• Utility maximization

Metrics to be optimized
• Two cost metrics that we want to minimize (Bayardo, ICDE05)
  • Discernibility: penalty based on the # of items in the k-anonymity group
  • Classification: the dataset has a class label column and we want to preserve the classification model; penalty based on the # of records in minority classes in the group
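
The two cost metrics can be sketched as follows. This is an assumption-laden reading rather than the exact formulas from the cited paper: discernibility charges each record the size of its group (so a group of size n contributes n²) and each suppressed record the size of the whole table; the classification penalty counts suppressed records plus records whose label is not the majority label of their group, divided by the number of records. Groups are represented as lists of class labels and are assumed non-empty.

```python
from collections import Counter

def discernibility(groups, suppressed=0):
    """Sum of squared group sizes, plus table size for each suppressed record."""
    total = sum(len(g) for g in groups) + suppressed
    return sum(len(g) ** 2 for g in groups) + suppressed * total

def classification_penalty(groups, suppressed=0):
    """Fraction of records that are suppressed or in a minority class of their group."""
    total = sum(len(g) for g in groups) + suppressed
    # Records outside the majority class of each (non-empty) group.
    minority = sum(len(g) - max(Counter(g).values()) for g in groups)
    return (minority + suppressed) / total

# Hypothetical k-anonymity groups, each a list of class labels.
groups = [["a", "a", "b"], ["b", "b"]]
print(discernibility(groups), classification_penalty(groups))
```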

Metrics
• Information-privacy metric (Wang, ICDE04): a combination of information loss and anonymity gain

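Metrics: information loss
• The dataset has class labels
• Entropy: for a set S labeled by different classes, entropy measures the impurity of the labels
  $Info(S) = -\sum_i p_i \log p_i$, where $p_i$ is the percentage of label i in S
• Information loss of a generalization G: {c1, c2, …, cn} → p
  $I(G) = Info(S_p) - \sum_i \frac{|S_{c_i}|}{|S_p|} Info(S_{c_i})$
  where $S_p$ is the set of records generalized to p and $S_{c_i}$ is the set of records with value ci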
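
A small worked example of the entropy-based information loss. This sketch assumes base-2 logarithms and the usual size-weighted sum over the child values; each child is represented only by the list of class labels of its records. The function names are illustrative.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels (base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """Entropy of the merged parent minus the size-weighted entropy of the children."""
    parent = [label for child in children for label in child]
    weighted = sum(len(child) / len(parent) * info(child) for child in children)
    return info(parent) - weighted

# Merging two pure children with different labels loses one full bit.
print(information_loss([["y", "y"], ["n", "n"]]))
# Merging children with identical label mixes loses nothing.
print(information_loss([["y", "n"], ["y", "n"]]))
```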

Anonymity gain
• A(VID): # of records with the VID
• $A_G(VID)$: # of records with the VID after generalization G; $A_G(VID) \ge A(VID)$, i.e. generalization improves or does not change A(VID)
• Anonymity gain:
  $P(G) = x - A(VID)$, where $x = A_G(VID)$ if $A_G(VID) \le K$, and $x = K$ otherwise
• As long as k-anonymity is satisfied, further generalization of the VID does not gain anything

Information-privacy combined metric
  IP = info loss / anonymity gain = I(G) / P(G)
• We want to minimize IP
• If P(G) = 0, use I(G) only
• Either a small I(G) or a large P(G) will reduce IP
• If the P(G) values are the same, pick the one with minimum I(G)
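
The anonymity gain and the combined metric can be sketched as below, following the formulas on the slides. The candidate generalizations and their numbers are made up for illustration, and the sketch assumes the count before generalization does not exceed K.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at k (assumes a_before <= k)."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """I(G) / P(G), falling back to I(G) alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain

# Hypothetical candidates: name -> (I(G), A(VID) before, A_G(VID) after).
candidates = {"zip": (0.4, 2, 5), "age": (0.3, 2, 3)}
k = 4
scores = {
    name: ip_score(loss, anonymity_gain(before, after, k))
    for name, (loss, before, after) in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the zip generalization loses more information but gains more anonymity, so it has the lower IP and would be picked first.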

Domain-hierarchy based algorithms
• Sweeney's algorithm
• Bayardo's tree pruning algorithm
• Wang's top-down and bottom-up algorithms
• They are all dimension-by-dimension methods

Multidimensional techniques
• Categorical data?
  • Categories are mapped to numbers ("numerized")
  • Bayardo 05 paper
  • Does the order matter? (no research on that)

Numerical data
• k-anonymization becomes n-dimensional space partitioning
• Many existing techniques can be applied

Single-dimensional vs. multidimensional

The evolving procedure
• Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
• Numerized categories, single-dimensional [Bayardo 05]
• Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

Method 1: Mondrian
• Numerize categorical data
• Apply a top-down partitioning process

[Figure: partitioning steps (Step 2.1, Step 2.2) and an allowable cut]
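
The top-down partitioning can be sketched as a recursive median split. This is a simplified illustration, not the published Mondrian algorithm: it tries dimensions from widest to narrowest raw range (no normalization), cuts at the median, and treats a cut as allowable only when both sides keep at least k records. The sample points are hypothetical (age, salary in thousands), and the input is assumed non-empty.

```python
def mondrian(points, k):
    """Recursively split points; return a list of groups, each with >= k points."""
    dims = range(len(points[0]))

    def width(d):
        return max(p[d] for p in points) - min(p[d] for p in points)

    # Try the widest dimension first, then fall back to narrower ones.
    for d in sorted(dims, key=width, reverse=True):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        # Allowable cut: both halves still satisfy k-anonymity.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [points]  # no allowable cut remains

points = [(25, 30), (27, 35), (45, 80), (50, 90), (52, 95)]
groups = mondrian(points, k=2)
print(groups)
```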

Method 2: spatial indexing
• Multidimensional spatial techniques
  • Kd-tree (similar to the Mondrian algorithm)
  • R-tree and its variations

[Figure: R-tree and R+-tree, showing the leaf layer and the upper layer]

Compacting bounds

Example:
• Uncompacted: age [1-80], salary [10k-100k]
• Compacted: age [20-40], salary [10k-50k]

The original Mondrian does not consider compacting bounds. For the R+-tree, it is done automatically.

Information is better preserved.
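
Compacting amounts to publishing the minimum bounding box of the records that actually fall in a group instead of the partition's full range. A minimal sketch, assuming each record is a tuple of numeric attributes; the sample group is hypothetical and chosen to reproduce the compacted ranges of the example.

```python
def compact_bounds(group):
    """Return the tightest (min, max) range per attribute for a non-empty group."""
    dims = range(len(group[0]))
    return [(min(r[d] for r in group), max(r[d] for r in group)) for d in dims]

# Hypothetical group of (age, salary) records.
group = [(20, 10_000), (35, 50_000), (40, 30_000)]
print(compact_bounds(group))
```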

Benefits of using the R+-tree
• Scalable: originally designed for indexing large disk-based data
• Multi-granularity k-anonymity: layers
• Better performance
• Better quality

[Figure: Performance (Mondrian)]

Utility metrics
• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions, one of them the anonymized data distribution
• Certainty penalty
  T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
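
The formula for the certainty penalty did not survive on the slide, so the following is a sketch under an assumption: each record contributes, for each of its m attributes, the width of its generalized range t.Ai divided by the width of the attribute's total range T.Ai, summed over all records. The KL divergence function uses the standard discrete definition with natural logarithms and assumes q is non-zero wherever p is. All names and sample values are illustrative.

```python
import math

def certainty_penalty(generalized, total_ranges):
    """Sum over records and attributes of (generalized width / total width)."""
    return sum(
        (hi - lo) / (t_hi - t_lo)
        for record in generalized
        for (lo, hi), (t_lo, t_hi) in zip(record, total_ranges)
    )

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two records generalized to age [20-40], salary [10-50] (in thousands).
total_ranges = [(0, 80), (10, 100)]
generalized = [[(20, 40), (10, 50)], [(20, 40), (10, 50)]]
print(certainty_penalty(generalized, total_ranges))
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))
```

A lower certainty penalty means tighter published ranges; a KL divergence of zero means the two distributions are identical.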

Other issues
• Sparse high-dimensionality
  • Transactional data → boolean matrix
  • "On the anonymization of sparse high-dimensional data", ICDE08
  • Relates to the clustering problem of transactional data!
  • The above paper uses matrix-based clustering; item-based clustering (?)

Other issues
• Effect of numerizing categorical data
  • The ordering of categories may have a certain impact on quality
• General-purpose utility metrics vs. special task-oriented utility metrics
• Attacks on the k-anonymity definition
