Problems in Technology to Use Anonymized Personal Data

Hiroshi Nakagawa

Information Technology Center

The University of Tokyo

The OECD guidelines will be revised, and one of the important points is:

A right to be forgotten

Google lost in the EU Court and agreed to erase links to personal data upon consumers’ request.

In Japan, too, Google lost in court and erases links to personal data upon consumers’ request.

This is a legal issue, but it involves some technical issues.

Current Situation around Privacy

EU Data Protection Directive → Regulation

2014/3/12

Notice and Consent may not work with big data (Schoenberger)

Accountability of Database provider

Notice and Consent should work (Cavoukian)

Putting you in control

Data protection first, not an afterthought

Privacy Data Ecosystem Trust Network

Technical Issues

Current Situation around Privacy

The EU does not deem the Japanese personal data protection law adequate by EU standards, and prohibits exporting EU citizens’ personal data to Japan.

The Japanese government is moving toward a revision of the Japanese personal data protection law. One of the purposes is to obtain adequacy status.

Personal data can be transferred to a third party without consent if the risk of re-identification is reduced.

Technical Issue

• OECD guideline revision, EU data protection regulation,….

• A right to be forgotten:

– When you no longer want your data to be processed and there are no legitimate grounds for retaining it, the data will be deleted.

– This is about empowering individuals, not about restricting freedom of the press.

– Legally, the balance between these two is the issue.

• Easier access to your own data:

– A much more technical issue

Then, three technical problems of anonymized data are:

◆ When does a DB without personal IDs work as an anonymized DB?

◆ Can the data source person access or erase his/her own data in an anonymized DB without personal ID?

◆ Does anonymization have side effects?

Part 1 ◆When does a DB without personal IDs work as an anonymized DB?

“Anonymize” means deleting the personal ID, and possibly applying something like k-anonymity.

Here, personal data consists of (ID, quasi ID, other data (including sensitive data)).
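As a rough illustration of what "something like k-anonymity" means for data of the form (ID, quasi ID, other data), the following minimal sketch checks whether every combination of quasi-ID values occurs at least k times. The function name `is_k_anonymous`, the field names, and the sample records are assumptions made for this example, not taken from the slides.

```python
from collections import Counter


def is_k_anonymous(records, qid_fields, k):
    """Return True if every quasi-ID combination occurs at least k times."""
    counts = Counter(tuple(r[f] for f in qid_fields) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical records: the personal ID (name) has already been deleted,
# and the quasi IDs have been generalized.
records = [
    {"address": "Bunkyo", "age": "30s", "sex": "M", "disease": "flu"},
    {"address": "Bunkyo", "age": "30s", "sex": "M", "disease": "obesity"},
    {"address": "Bunkyo", "age": "30s", "sex": "M", "disease": "diabetes"},
]
QID = ["address", "age", "sex"]

print(is_k_anonymous(records, QID, 3))
print(is_k_anonymous(records, QID, 4))
```

The check only looks at the quasi-ID columns; whether the remaining data can itself act as a quasi ID is a separate question.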

◆When does anonymity work? Classic categorizations

• Personal data: (ID, quasi ID, other data).

• Quasi ID (QID): address, age, sex, etc.

| | No QIDs | QIDs |
|---|---|---|
| Whose data is stored in DB is unknown | Unknown & no QID | Unknown & QID |
| Whose data is stored in DB is known | Known & no QID | Known & QID |

New Categories

– Suppose that the personal ID, such as the name, is deleted.

• Known DB: Whether a specified person’s personal data is stored in the DB is definitely known.

• P Known DB: Whether a specified person’s personal data is stored in the DB is known only probabilistically.

• Unknown DB: Whether a specified person’s personal data is stored in the DB is not known.

– This categorization has not received enough attention.

Known, Probabilistically Known (P Known), Unknown

• If some outsider is able to observe the personal data gathering process, then the observed person’s personal data is known to be stored in the DB. Examples are using a train boarding pass or buying wine at a liquor shop.

• A Known DB is a DB that consists of observable personal actions. If someone opts out of a Known DB, it becomes P Known.

• A P Known DB is built from personal data sampled from the original DB. We know only probabilistically whether a specified person’s data is stored in the DB.

Known／P Known: sampling and k-anonymity

• To protect private data in personal data from the third party, either:

– (1) Transfer a DB of randomly sampled data, or statistics of the whole Known DB, to the third party, or

– (2) Transfer a k-anonymized DB to the third party.

(1) The whole Known DB → sampled DB: randomly sampled data ＝ P Known

(2) The whole Known DB → k-anonymized DB: k-anonymized data ＝ Known
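The difference between a Known DB and a P Known DB can be sketched as follows: after uniform random sampling, an outsider who knows a person is in the original DB only knows that the person is in the transferred DB with probability equal to the sampling rate. The helper names `sample_db` and `membership_probability` and the DB size are assumptions for illustration only.

```python
import random


def sample_db(known_db, rate, seed=None):
    """Draw a uniform random sample (without replacement) from a Known DB."""
    rng = random.Random(seed)
    n = round(len(known_db) * rate)
    return rng.sample(known_db, n)


def membership_probability(known_db_size, sample_size):
    """Probability that a specified person of the Known DB is in the sample."""
    return sample_size / known_db_size


# Hypothetical Known DB of 1000 records, sampled at a rate of 10%.
known_db = list(range(1000))
sampled = sample_db(known_db, rate=0.1, seed=0)
print(len(sampled), membership_probability(len(known_db), len(sampled)))
```

A lower sampling rate lowers the membership probability, which is why the risk for a P Known DB depends on the sampling rate.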

Other personal data makes things worse, because other personal data can be used as a quasi ID.

Two aspects

Traditional view: QID + personal data whose gathering process is not observed by other people.

Current view: QID + personal data whose gathering process can be observed by other people. It is problematic to transfer this type of data to a third party even without ID and QID.

When does an anonymized DB work?

| | No ID & no quasi ID | No ID but some quasi IDs |
|---|---|---|
| Whose data is stored in DB is unknown (Unknown DB) | Not personal data. | Unknown & QID: k-anonymity works. |
| Whose data is stored in DB is probabilistically known (P Known DB), such as a sampled DB | P Known & no QID: the risk depends on the sampling rate. | P Known & QID: k-anonymity may work. The risk depends on both the sampling rate and the granularity of the QID, such as the data gathering frequency. |
| Whose data is stored in DB is known (Known DB) | Known & no QID: if personal history of location is used as PID, k-anonymity degrades the value of the data too much. | Known & QID: quite risky. |

Summary

If the personal data gathering action can be observed by other people, k-anonymity severely degrades the value of the data.

If the personal data gathering action cannot be observed by other people:

– When no QID is included, k-anonymity is not needed.

– When a QID is included, k-anonymity of the QID may work.

Part 2 ◆Can the data source person access or erase his/her own data in an anonymized DB without personal ID?

Traditional view: QID + unobserved personal data

| ID | QID | Sensitive data | Other data |
|---|---|---|---|
| name | Address, age, sex | Disease, … | |

split

| ID | pseudonym |
|---|---|
| name | a123x |

| pseudonym | QID | Sensitive data | Other data |
|---|---|---|---|
| a123x | Address, age, sex | Disease, … | |

Other DB including ID, QID: matching these two DBs (the pseudonymized DB and the other DB) may make it possible to link sensitive data and ID even without the pseudonym.

Access request from

To keep privacy stricter, the pseudonym is changed frequently. But access is still possible with the pseudonym database.

Original table: ID (name, etc.) | Other personal data

It is divided into two tables:

• ID (name, etc.) | Pseudonym (ex. A123B): this table is strictly controlled.

• Pseudonym (ex. A123B) | Other personal data: data mining is done only on this data, so it is safe.

If access is required, the DB manager connects ID and other personal data with the pseudonym table.
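The split into a strictly controlled pseudonym table and a pseudonymized data table, with a fresh pseudonym per record, might look roughly like the sketch below. The function names, the use of `secrets.token_hex` to generate pseudonyms, and the sample rows are assumptions for illustration; the slides do not specify how pseudonyms are generated.

```python
import secrets


def split_with_pseudonyms(rows):
    """Split (person_id, data) rows into a pseudonym table and a data table."""
    pseudonym_table = {}  # pseudonym -> ID; strictly controlled, never shared
    data_table = {}       # pseudonym -> other personal data; used for mining
    for person_id, data in rows:
        pseudonym = secrets.token_hex(4)  # fresh pseudonym for every record
        while pseudonym in pseudonym_table:
            pseudonym = secrets.token_hex(4)
        pseudonym_table[pseudonym] = person_id
        data_table[pseudonym] = data
    return pseudonym_table, data_table


def access(person_id, pseudonym_table, data_table):
    """DB manager side: reconnect an ID with its data via the pseudonym table."""
    return [data_table[p] for p, pid in pseudonym_table.items() if pid == person_id]


# Hypothetical rows: Bob appears twice and gets two unrelated pseudonyms.
rows = [("Bob", "data:1"), ("Bill", "data:2"), ("Bob", "data:3")]
pseudonym_table, data_table = split_with_pseudonyms(rows)
print(sorted(access("Bob", pseudonym_table, data_table)))
```

Only `data_table` would be handed out for data mining; `access` needs `pseudonym_table`, so it can only be run by the DB manager.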

The DB is split into an ID table, which lists the four pseudonyms of one person (ID: name, etc.), and a data table:

| Pseudonym | Other personal data |
|---|---|
| A123B4 | 1 |
| C1263B | 2 |
| X91234 | 3 |
| Z12345 | 4 |

No k-anonymity cases

DB manager

What is distributed to third parties is the DB without ID, but…

When this person requests access to his data, the DB manager requests the data for these four pseudonyms. Then the third party realizes that these four records are the same person’s data!

This table is not transferred to anyone outside:

| ID | pseudo |
|---|---|
| name | A123B4 |
| name | C1263B |
| name | X91234 |
| name | Z12345 |

Third parties only receive this part of the DB:

| pseudo | Personal data |
|---|---|
| A123B4 | 1 |
| C1263B | 2 |
| X91234 | 3 |
| Z12345 | 4 |

To remedy this situation, the DB manager adds many other unrelated persons’ pseudonyms to the access request.

But rectification and erasure requests are more difficult, because, obviously, adding unrelated persons’ pseudonyms does not work.

In the erasure case, if the third party is malicious, we do not have any protection method that works.
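The remedy for access requests, mixing unrelated persons' pseudonyms into the request so that the third party cannot tell which ones belong together, can be sketched as follows. The function names, the decoy pseudonyms `D0`…`D19`, and the number of decoys are assumptions for illustration.

```python
import random


def padded_access_request(own_pseudonyms, all_pseudonyms, n_decoys, seed=None):
    """Build an access request that hides the requester's pseudonyms among decoys."""
    rng = random.Random(seed)
    own = set(own_pseudonyms)
    others = [p for p in all_pseudonyms if p not in own]
    decoys = rng.sample(others, min(n_decoys, len(others)))
    request = list(own_pseudonyms) + decoys
    rng.shuffle(request)  # order must not reveal which entries are real
    return request


def answer_for_requester(returned, own_pseudonyms):
    """DB manager side: discard decoy answers, keep only the requester's data."""
    own = set(own_pseudonyms)
    return {p: d for p, d in returned.items() if p in own}


own = ["A123B4", "C1263B", "X91234", "Z12345"]
all_pseudonyms = own + [f"D{i}" for i in range(20)]       # hypothetical decoys
third_party_db = {p: f"Personal data of {p}" for p in all_pseudonyms}

request = padded_access_request(own, all_pseudonyms, n_decoys=8, seed=1)
returned = {p: third_party_db[p] for p in request}        # third party's answer
print(sorted(answer_for_requester(returned, own)))
```

The same trick cannot be reused for erasure or rectification: every pseudonym in such a request is actually erased or modified, so adding decoys would damage unrelated persons' data.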

Access is possible in k-anonymity

DB manager A holds:

| ID | Pseudo |
|---|---|
| Bob | a12 |
| Bill | b23 |
| Chris | c34 |

Service provider B, who received 3-anonymized data from A, holds:

| Pseudo | QID | sensitive |
|---|---|---|
| a12 | xxx | flu |
| b23 | xxx | obesity |
| c34 | xxx | diabetes |

① Request for access (Bob to A)

② Request for access to personal data about (a12, b23, c34) (A to B)

③ 3 persons’ sensitive data (B to A)

④ Show Bob the data corresponding to his data = a12’s data (A to Bob)
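The access flow for the 3-anonymized DB (steps ② to ④) can be sketched with the example data of this slide. The function and variable names are assumptions; the point is that A always asks B for the whole k-anonymous group, so B cannot tell which of the three pseudonyms triggered the request.

```python
# Held only by DB manager A.
id_to_pseudo = {"Bob": "a12", "Bill": "b23", "Chris": "c34"}

# Held by service provider B (the 3-anonymized DB received from A).
provider_db = {
    "a12": {"qid": "xxx", "sensitive": "flu"},
    "b23": {"qid": "xxx", "sensitive": "obesity"},
    "c34": {"qid": "xxx", "sensitive": "diabetes"},
}


def provider_lookup(pseudonyms):
    """B's side: return the records for all requested pseudonyms (step 3)."""
    return {p: provider_db[p] for p in pseudonyms}


def access_request(name):
    """A's side: ask for the whole group, then keep only the requester's record."""
    group = list(id_to_pseudo.values())   # step 2: request (a12, b23, c34)
    returned = provider_lookup(group)     # step 3: 3 persons' sensitive data
    return returned[id_to_pseudo[name]]   # step 4: show only a12's data to Bob


print(access_request("Bob"))
```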

An erasure request for a k-anonymized DB causes trouble

DB manager, who makes the 2-anonymity DB, holds:

| ID | pseudo |
|---|---|
| Bob | b23 |
| Bill | c34 |

Third party, who has only the 2-anonymized DB, holds:

| pseudo | QID | sensitive |
|---|---|---|
| b23 | xxx | High blood pressure |
| c34 | xxx | Cancer |

① Request for erasure: “It’s Bill. Erase my data.”

② Request to erase c34’s data

2-anonymity collapses. 1-anonymity? No kidding!

Re-build k-anonymity? Oh, NO!

Three solutions

• Erasing one person’s data collapses k-anonymity.

Solution 1: k-anonymize the DB again. But this consumes too much time, and the new k-anonymized DB must be redistributed. Too costly! (X)

Solution 2: Erase k persons’ data altogether if one of them is erased. Seemingly OK, but this degrades the quality of the DB, or the accuracy of data mining from the DB.

Solution 3: If we use (k+α)-anonymity beforehand, then the DB is still k-anonymous after erasing α persons’ data. Probably OK. However, if α is not small, the quality of the (k+α)-anonymized DB is degraded.
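Solutions 2 and 3 can be contrasted in a small sketch in which each k-anonymous group is simply a list of pseudonyms. The function names, the extra pseudonyms, and the choice k = 2, α = 1 are assumptions for illustration.

```python
def erase_solution2(groups, pseudonym):
    """Solution 2: drop the whole group that contains the erased record."""
    return [g for g in groups if pseudonym not in g]


def erase_solution3(groups, pseudonym, k):
    """Solution 3: remove only the record; fails once a group falls below k."""
    new_groups = []
    for g in groups:
        if pseudonym in g:
            g = [p for p in g if p != pseudonym]
            if len(g) < k:
                raise ValueError("margin alpha exhausted: group fell below k")
        new_groups.append(g)
    return new_groups


# Hypothetical (k + alpha)-anonymous DB with k = 2 and alpha = 1: groups of 3.
groups = [["b23", "c34", "d45"], ["e56", "f67", "g78"]]

after = erase_solution3(groups, "c34", k=2)   # still 2-anonymous
print(after)
print(erase_solution2(groups, "c34"))         # the whole first group is lost
```

With solution 3, a second erasure in the same group raises an error here, which corresponds to the point where the margin α is used up and the DB would have to be re-anonymized.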

Part 3 ◆Does anonymization have side effects?

k-anonymity of Location and False Light

| name | age | gen | Address (number, street name, ward name) | Location at some time |
|---|---|---|---|---|
| Alex | 35 | M | 101 Hongo, Bunkyo | consumer finance: K |
| Bill | 30 | M | 120 Yushima, Bunkyo | University T |
| Ken | 33 | M | 312 Yayoi, Bunkyo | University T |
| Paul | 39 | M | 421 Sendagi, Bunkyo | Hospital Y |

4-anonymize

| Name (anonym) | age | gen | address | Location at some time |
|---|---|---|---|---|
| Alex | 30 | M | Bunkyo | consumer finance: K |
| Bill | 30 | M | Bunkyo | University T |
| Ken | 30 | M | Bunkyo | University T |
| Paul | 30 | M | Bunkyo | Hospital Y |

A, B, K, P are not regarded as distinct persons. Then all four are suspected of visiting consumer finance K (meaning they are not in good financial shape).
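The 4-anonymization of this table can be reproduced with a minimal sketch that generalizes age to its decade and the address to the ward name. The generalization rules and the function name `generalize` are assumptions chosen to match the example; the slides do not state how the generalization was computed.

```python
people = [
    {"name": "Alex", "age": 35, "gen": "M", "address": "101 Hongo, Bunkyo",
     "location": "consumer finance: K"},
    {"name": "Bill", "age": 30, "gen": "M", "address": "120 Yushima, Bunkyo",
     "location": "University T"},
    {"name": "Ken", "age": 33, "gen": "M", "address": "312 Yayoi, Bunkyo",
     "location": "University T"},
    {"name": "Paul", "age": 39, "gen": "M", "address": "421 Sendagi, Bunkyo",
     "location": "Hospital Y"},
]


def generalize(person):
    """Generalize quasi IDs: age to its decade, address to the ward name."""
    return {
        "age": person["age"] // 10 * 10,
        "gen": person["gen"],
        "address": person["address"].split(", ")[-1],
        "location": person["location"],
    }


anonymized = [generalize(p) for p in people]
qids = {(r["age"], r["gen"], r["address"]) for r in anonymized}

# All four share one quasi-ID tuple, so any of them may be the K visitor.
k = len(anonymized)
at_k = sum(r["location"] == "consumer finance: K" for r in anonymized)
print(qids, at_k / k)
```

Only one of the four actually visited consumer finance K, yet after generalization an observer cannot exclude any of them, which is the false light effect.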

Side effect of k-anonymity

Location k-anonymization can trigger false light

k-anonymized area: k persons in it

consumer finance shop: C

This student is seeking a job now. If he is suspected of going to a consumer finance shop, it hurts his job search.

False Light

The false light triggered by location k-anonymization is remedied by dividing shop C into 4 areas

Only one person is at consumer finance shop C among all k persons in each k-anonymized area

Suspecting any particular person of having been at shop C is not reasonable



False Light

[Figure: horizontal axis: (# of persons at shop C)/k, from 0 to 1. Vertical axis: subjective probability of suspecting that a person went to shop C, up to 1, and expected damage. Labels in the figure: Something Wrong; Expected damage estimated by the third person; Needed money for precaution.]

This area is almost free from false light. The problem is how to select k so as to stay confined within this area!
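The quantity on the horizontal axis, (# of persons at shop C)/k, can be computed per k-anonymized area as in the sketch below. The area sizes and the number of persons at shop C are made-up numbers, and `suspicion_ratio` and `make_area` are hypothetical helper names.

```python
from fractions import Fraction


def suspicion_ratio(area):
    """(# of persons at shop C) / k for one k-anonymized area."""
    k = len(area)
    if k == 0:
        raise ValueError("an area must contain at least one person")
    at_shop = sum(1 for person in area if person["at_shop_c"])
    return Fraction(at_shop, k)


def make_area(k, n_at_shop):
    """Hypothetical area with k persons, n_at_shop of whom are at shop C."""
    return [{"at_shop_c": i < n_at_shop} for i in range(k)]


# Before: one area in which 4 of 8 persons are at shop C.
before = make_area(k=8, n_at_shop=4)
# After reorganizing: 4 areas of 8 persons, each with one person at shop C.
after = [make_area(k=8, n_at_shop=1) for _ in range(4)]

print(suspicion_ratio(before), [suspicion_ratio(a) for a in after])
```

Reorganizing the areas lowers the ratio in each area, which moves it toward the region of the figure described as almost free from false light.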

Summary

• k-anonymity has a side effect, the so-called false light.

• For k-anonymity of location, the side effect is reduced by reorganizing the k-anonymized areas.
