

DE-IDENTIFICATION OF PERSONALLY IDENTIFIABLE INFORMATION

Dr. Khaled El Emam

Founder & CEO, Privacy Analytics

Canada Research Chair in Electronic Health Information & Associate Professor in the Faculty of Medicine - University of Ottawa

Senior Scientist at the Children’s Hospital of Eastern Ontario Research Institute

Director, Electronic Health Information Laboratory (EHIL)

Abstract

Organizations, both in the private and public sectors, are sitting on large amounts of valuable data.

The value in this data can be unlocked to improve operational efficiencies and create new business opportunities if it can be used and disclosed for secondary purposes.

Legislation in different industries and jurisdictions requires de-identification mechanisms to be used before data can be used and disclosed without authorization or consent.

The healthcare industry has applied de-identification methods for more than a decade now to disclose very complex data sets, and has considerable experience with addressing the technical, regulatory, and practical issues that arise.

This talk describes de-identification methodologies that are being used for health data and that can be applied in other industries such as Financial Services, and explains how to use these methodologies to de-identify financial data in a defensible way.

Attendees will:

• Get an introduction to the general concepts around the de-identification of PII

• Learn through examples how data sets can be de-identified and still retain significant utility for sophisticated analytics


Trends

• Increasing demands for (health) data for secondary purposes

• Stronger enforcement of regulations with non-trivial financial penalties

• Increasing unease by consumers about how their personal data is being collected, used, and disclosed (the ‘creepy’ factor and public trust)

• One consequence is the need for more defensible methods for the de-identification of the data

Cost of a Breach - I

• Total costs of a breach: detection and escalation, notification, ex-post response (e.g., credit monitoring), penalties, litigation, and lost business

Year   Industry     $ per person (average)
2008   Healthcare   $282
2008   Finance      $240
2009   Healthcare   $294
2009   Finance      $249
2010   Healthcare   $301
2010   Finance      $353
2011   Healthcare   $240
2011   Finance      $247

Source: Ponemon Institute

Cost of a Breach - II

Country   $ per person (average)
US        $204
UK        $98
DE        $177
FR        $199
AU        $114

Source: Javelin (2009 figures)


Variable Distinctions

• Directly Identifying

– Can uniquely identify an individual by itself or in conjunction with other readily available information

• Quasi-Identifiers

– Can identify an individual by itself or in conjunction with other information

• Sensitive Variables


Examples of Direct Identifiers

• Name, full address, telephone number, fax number, username, medical record number, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number

Step 1: Masking - I

• Masking protects the direct identifiers

• Masking methods that work:

– Variable suppression

– Randomization

– Pseudonymization

• A number of masking methods are used in practice that are not protective and should not be used:

– Character scrambling: 60% of census names are unique

– Character masking/truncation: 67% of census names were still unique after a 1-character mask, and 46% after a 2-character mask

– Deterministic encoding and frequency attacks

– Reversing the addition of noise through filters or using statistical and data mining techniques to ‘average out’ the noise
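As a concrete illustration of the protective masking methods listed above (variable suppression and pseudonymization; randomization is omitted for brevity), here is a minimal Python sketch. The table, column names, and key handling are hypothetical assumptions for illustration, not the presenter's tooling.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical extract containing direct identifiers; column names are assumptions.
records = pd.DataFrame({
    "name": ["Alice Roy", "Bob Smith"],
    "ssn":  ["123-45-6789", "987-65-4321"],
    "mrn":  ["MRN001", "MRN002"],
    "age":  [34, 55],
})

SECRET_KEY = b"store-and-rotate-this-key-securely"  # assumed key management

def pseudonymize(value: str) -> str:
    """Keyed (HMAC-SHA256) pseudonym: stable, so records can still be linked,
    but not reversible without the key -- unlike simple deterministic encoding,
    which is exposed to frequency attacks."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

masked = records.drop(columns=["name", "ssn"])     # variable suppression
masked["mrn"] = masked["mrn"].map(pseudonymize)    # pseudonymization
print(masked)
```

Note that, as the next slide stresses, this only addresses the direct identifiers; quasi-identifiers such as age still need Step 2.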

Step 1: Masking - II

• It is still possible to re-identify individuals from masked data – even when done properly, masking is not enough

– Indirect identifiers can be sufficiently unique to make individuals easy to re-identify

– Masking is a necessary but insufficient step in protecting data against identifiability attacks

– Step 2 – de-identification of the indirect identifiers

• Regulators are increasingly examining identifiability risks from data that has been masked


Examples of Quasi-Identifiers

• Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter, travel), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality


Attribute vs Identity Disclosure

• The distinction between attribute and identity disclosure is critical for understanding what is important to protect against identifiability risks, what de-identification will achieve, and where governance needs to come in to manage residual risk.

• Attribute disclosure: discovering something new about an individual in the database without knowing which record belongs to that individual

• Identity disclosure: determining which record in the database belongs to a particular individual (for example, that record number 7 belongs to Bob Smith)

Attribute Disclosure - I

                 HPV Vaccinated   Not HPV Vaccinated
Religion A              5                 40
Not Religion A         40                  5

• Statistically significant relationship (chi-square, p < 0.05)

• High risk of attribute disclosure



Attribute Disclosure - II

                 HPV Vaccinated   Not HPV Vaccinated
Religion A              5                  6
Not Religion A          6                  5

• No statistically significant relationship (chi-square)

• Low risk of attribute disclosure
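The two contingency tables above can be checked directly; a minimal sketch using scipy's chi-square test, with the counts copied from the slides:

```python
from scipy.stats import chi2_contingency

# Rows: Religion A / Not Religion A; columns: HPV Vaccinated / Not HPV Vaccinated.
skewed_table   = [[5, 40], [40, 5]]   # Attribute Disclosure - I
balanced_table = [[5, 6], [6, 5]]     # Attribute Disclosure - II

for label, table in [("skewed", skewed_table), ("balanced", balanced_table)]:
    chi2, p, dof, _expected = chi2_contingency(table)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.4f}")

# The skewed table gives p < 0.05 (high risk of attribute disclosure);
# the balanced table does not (low risk).
```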

Attribute Disclosure - III

• Attribute disclosure is an important outcome of analytics – whether it is acceptable to ask certain questions, or to discover certain things about individuals with certain characteristics, is arguably more an ethics question than an identifiability issue

• Privacy regulations do not require one to address risks from attribute disclosure – only identity disclosure risks need to be addressed

• All known re-identification attacks were identity disclosure

Stigmatizing Analytics


Step 2: De-Identification Standards

• No universal standards for de-identification (yet). The closest we have are the high-level standards in the HIPAA Privacy Rule

• Even though these are focused on healthcare, the principles should be applicable to any kind of data set

• The HIPAA Privacy Rule specifies two de-identification standards:

– Safe Harbor

– Statistical method

HIPAA Safe Harbor - I

Safe Harbor Direct Identifiers and Quasi-Identifiers

1. Names
2. ZIP codes (except first three digits)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code

Applicability of Safe Harbor

• Safe Harbor ensures that the risk of re-identification is low if:

– The data is a random sample from the US population – the justifiability of this assumption will depend on the data set

– The only quasi-identifiers in the data are dates and ZIP codes – no detailed demographic and socioeconomic data fields

– The adversary does not know who is in the data set – the data set is a random sample

– The data is cross-sectional – not suitable for transactional data
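For reference, a minimal sketch of the Safe Harbor treatment of the quasi-identifiers it allows (ZIP codes kept only to the first three digits, dates reduced to year, and ages over 89 aggregated); the DataFrame and column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical record-level extract; column names are assumptions.
df = pd.DataFrame({
    "zip":        ["11201", "13440", "90210"],
    "admit_date": pd.to_datetime(["2011-03-07", "2011-06-21", "2012-01-15"]),
    "age":        [42, 91, 30],
})

sh = df.copy()
sh["zip3"] = sh["zip"].str[:3]               # keep only the first three ZIP digits
sh["admit_year"] = sh["admit_date"].dt.year  # drop all date elements except year
sh.loc[sh["age"] > 89, "age"] = 90           # ages over 89 become a single "90+" category
sh = sh.drop(columns=["zip", "admit_date"])
print(sh)
```

Safe Harbor additionally requires that three-digit ZIP areas containing 20,000 or fewer people be replaced with 000; that lookup is omitted here.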


High Risk Safe Harbor - II

If an adversary knows that Bob is in the database:

Gender   Age   ZIP   Lab Test
M        55    112   Albumin, Serum
F        53    114   Alkaline Phosphatase
M        24    134   Creatine Kinase

High Risk Safe Harbor - III

• Longitudinal (or transactional) data can have a high risk of re-identification even if it meets the stipulations of Safe Harbor

• For example, using the State Inpatient Database for NY (2007) with 2 million visits:

Quasi-identifiers                         % of patients unique
age, gender, ZIP3                         1.8%
age, gender, ZIP3, LOS (length of stay)   20.75%
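The uniqueness percentages in this table come from counting equivalence classes of size one; a minimal sketch of that computation (the file name and column names are placeholders, not the actual SID extract):

```python
import pandas as pd

def pct_unique(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Percentage of records that are the only member of their equivalence
    class on the given quasi-identifier combination."""
    class_sizes = df.groupby(quasi_identifiers).size()
    unique_records = int((class_sizes == 1).sum())
    return 100.0 * unique_records / len(df)

# visits = pd.read_csv("sid_ny_2007.csv")   # hypothetical file name
# print(pct_unique(visits, ["age", "gender", "zip3"]))          # ~1.8% on the slide
# print(pct_unique(visits, ["age", "gender", "zip3", "los"]))   # ~20.75% on the slide
```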

Statistical Method

• A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:

I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and

II. Documents the methods and results of the analysis that justify such determination


What is “very small” Risk?

• With reference to:

– The data set only

– The overall context of the disclosure and use

• The disclosure control community has in practice taken into account the overall context of the disclosure and use when deciding what is acceptable risk; this is evident in writings and guidance going back a number of decades, and it is also a more realistic assessment of risk

• Acceptable risk will also depend on the nature of the attack on the data (e.g., re-identifying a famous person, a nosey neighbor, and/or linking with a public or semi-public registry)

De-identification Process

• Set Risk Threshold: Based on the characteristics of the data recipient, the data, and precedents, a quantitative threshold is set.

• Measure Risk: Based on plausible re-identification attacks, appropriate metrics are selected and used to measure the actual re-identification risk from the data.

• Apply Transformations: If the measured risk is above the threshold, specific transformations (such as generalization and suppression) are applied to reduce the risk.
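One common way to make the "measure risk" step concrete is through equivalence class sizes on the quasi-identifiers: under an attacker who knows the target is in the data, the worst-case record-level risk is 1 divided by the size of the smallest class. The sketch below uses that metric with an assumed threshold of 0.2 (every class must have at least 5 records); neither the metric nor the threshold is prescribed by the slides.

```python
import pandas as pd

def max_risk(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Worst-case re-identification risk: 1 / (smallest equivalence class size)."""
    class_sizes = df.groupby(quasi_identifiers).size()
    return 1.0 / class_sizes.min()

RISK_THRESHOLD = 0.2   # assumed threshold: every class must contain at least 5 records

def risk_is_acceptable(df: pd.DataFrame, quasi_identifiers: list) -> bool:
    return max_risk(df, quasi_identifiers) <= RISK_THRESHOLD

# If the measured risk exceeds the threshold, apply transformations
# (generalization, suppression, ...) and measure again.
```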

Re-Identification Risk Spectrum


Managing Re-Identification Risk

Common Transformations

• Commonly used transformations to de-identify data:

– Generalize – reducing the precision of the fields (e.g., date of birth to month and year of birth)

– Suppress – selectively suppressing individual fields in the database to hide outliers (e.g., a 55-year-old woman giving birth)

– Sub-sample – disclosing a smaller sample of the original data to introduce uncertainty as to whether an individual is in the data set or not

– Truncate – removing transactions to ensure that no individual is an outlier in terms of the number of transactions that they have (e.g., too many hospital visits or too many insurance claims compared to others in the database)

• By balancing these techniques it is possible to produce a data set that still has high utility and where it is possible to make strong claims about identifiability
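A minimal pandas sketch of the four transformations; the column names, cut-offs, sampling fraction, and visit cap are illustrative assumptions rather than recommended values:

```python
import pandas as pd

def apply_transformations(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Generalize: reduce date of birth to month and year.
    out["dob"] = out["dob"].dt.to_period("M").astype(str)

    # Suppress: blank out outlier values (e.g., a 55-year-old giving birth).
    out.loc[out["maternal_age"] >= 55, "maternal_age"] = None

    # Sub-sample: release only a fraction of the records, adding uncertainty
    # about whether a given individual is in the data set at all.
    out = out.sample(frac=0.5, random_state=0)

    # Truncate: cap the number of transactions per person so that nobody is
    # an outlier on the number of visits/claims they have.
    out = out.groupby("person_pseudonym", group_keys=False).head(10)

    return out
```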

Example: Birth Registry

• Context:

– Provincial birth registry with close to 1 million records

– Has maternal and infant health information on all births in Ontario

– Data disclosed for secondary purposes (research and public health)

• Plausible attacks:

– A risk assessment determined that there were three plausible attacks: a deliberate attack by data users, an inadvertent “spontaneous” attack, and a data breach where the data set is exposed publicly

– The highest risk was computed to be from “spontaneous re-identification” by someone performing analysis on the data

• Quasi-identifiers:

– The kinds of fields that are included in analyses are demographic and socioeconomic, depending on the data request


Multiple Options

Option 1

Quasi-identifier   Generalization     Suppression
Baby DoB           week/year          0%
Mother DoB         5-year interval    0%
Postal Code        first character    1.023%
Baby Sex           none               0.061%

Option 2

Quasi-identifier   Generalization     Suppression
Baby DoB           year               3.063%
Mother DoB         5-year interval    3.063%
Postal Code        three characters   4.08%
Baby Sex           none               3.066%
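To compare recoding schemes like these two options, the generalizations can be applied and the resulting equivalence class structure inspected; a rough sketch under assumed column names (the registry data itself is not public, so the loader is hypothetical, and this is not how the options above were actually evaluated):

```python
import pandas as pd

def generalize(df: pd.DataFrame, option: int) -> pd.DataFrame:
    g = pd.DataFrame(index=df.index)
    iso = df["baby_dob"].dt.isocalendar()
    if option == 1:
        g["baby_dob"] = iso["year"].astype(str) + "-W" + iso["week"].astype(str)  # week/year
        g["postal"] = df["postal_code"].str[:1]                                   # first character
    else:
        g["baby_dob"] = df["baby_dob"].dt.year                                    # year only
        g["postal"] = df["postal_code"].str[:3]                                   # first three characters
    g["mother_dob"] = (df["mother_dob"].dt.year // 5) * 5                         # 5-year interval
    g["baby_sex"] = df["baby_sex"]                                                # not generalized
    return g

def pct_needing_suppression(g: pd.DataFrame, k: int = 5) -> float:
    """Share of records in equivalence classes smaller than k -- roughly the
    records a suppression step would have to touch."""
    sizes = g.groupby(list(g.columns)).size()
    return 100.0 * sizes[sizes < k].sum() / len(g)

# births = load_birth_registry()   # hypothetical loader
# for option in (1, 2):
#     print(option, pct_needing_suppression(generalize(births, option)))
```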

Further Information

[email protected]

www.ehealthinformation.ca

www.ehealthinformation.ca/knowledgebase

www.privacyanalytics.ca

www.privacyanalytics.ca/knowledgebase