Privacy is a big issue these days. Legally, the EU data protection directive will be revised, and Google was defeated in an EU court and forced to erase data links upon users’ request. However, we still face various technical problems, even if we limit ourselves to anonymization or k-anonymity. In these slides, we describe three of these problems.
Problems in Technology to Use Anonymized Personal Data
Hiroshi Nakagawa
Information Technology Center
The University of Tokyo
The OECD guidelines will be revised, and one of the important points is:
A right to be forgotten
Google was defeated in an EU court and agreed to erase links to personal data upon consumers’ request.
In Japan, Google was also defeated and erases links to personal data upon consumers’ request.
This is a legal issue, but it involves some technical issues
Current Situation around Privacy
Current Situation around Privacy EU Data Protection Directive Regulation
2014/3/12
Notice and Consent may not work for big data (Schoenberger)
Accountability of Database provider
Notice and Consent should work (Cavoukian)
Putting you in control
Data protection first, not an afterthought
Privacy Data Ecosystem Trust Network
Technical Issues
Current Situation around Privacy
The EU does not deem Japanese personal data law adequate by EU standards, and prohibits exporting EU citizens’ personal data to Japan.
The Japanese government is moving towards a revision of the Japanese personal data protection law. One of the purposes is to obtain adequacy.
Personal data can be transferred to a third party without consent if the risk of re-identification is reduced.
Technical Issue
• OECD guideline revision, EU data protection regulation,….
• A right to be forgotten:
– When you no longer want your data to be processed and there are no legitimate grounds for retaining it, the data will be deleted.
– This is about empowering individuals, not about restricting freedom of the press.
– Legally, the balance between these two is the issue
• Easier access to your own data:
– This is a much more technical issue
Three of the technical problems of anonymized data are:
◆ When does a DB without personal IDs work as an anonymized DB?
◆ Can the data source person access or erase his/her own data in an anonymized DB without a personal ID?
◆ Does anonymization have side effects?
Part 1 ◆ When does a DB without personal IDs work as an anonymized DB?
“Anonymize” means deleting the personal ID and possibly applying something like k-anonymity.
Here, personal data consists of (ID, Quasi ID, other data (including sensitive data)).
◆ When does anonymity work? Classic categorizations
• (ID, Quasi ID, other data).
• Quasi ID(address, age, sex, etc.)
                                         No QIDs             QIDs
Whose data is stored in DB is unknown    Unknown & no QID    Unknown & QID
Whose data is stored in DB is known      Known & no QID      Known & QID
New Categories
– Suppose that the personal ID, such as the name, is deleted.
• Known DB: Whether a specified person’s personal data is stored in the DB is definitely known.
• P Known DB: Whether a specified person’s personal data is stored in the DB is probabilistically known.
• Unknown DB: Whether a specified person’s personal data is stored in the DB is not known.
– This categorization has not received enough attention.
Known, Probabilistically Known (P Known), Unknown
• If some outsider is able to observe the personal data gathering process, then the observed person’s personal data is known to be stored in the DB. Examples are using a train boarding pass or buying wine at a liquor shop.
• A Known DB is a DB that consists of observable personal actions. If someone opts out of a Known DB, it becomes P Known.
• A P Known DB is built with personal data sampled from the original DB. We only know probabilistically whether a specified person’s data is stored in the DB.
Known/P Known: sampling and k-anonymity
• To protect private data in personal data from third parties:
– (1) Transfer a DB of randomly sampled data, or statistics of the whole Known DB, to the third party
– (2) Transfer a k-anonymized DB to the third party
[Figure: the whole Known DB is either (1) randomly sampled into a Sampled DB, which is P Known, or (2) k-anonymized into a k-anonymized DB, which is still Known]
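As a minimal sketch of option (1), the following shows why random sampling turns a Known DB into a P Known DB. The record layout, the function name `sample_db`, and the sampling rate of 0.1 are assumptions made for illustration only.

```python
import random

def sample_db(known_db, rate, seed=0):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [rec for rec in known_db if rng.random() < rate]

# Hypothetical Known DB: every observed purchase is stored.
known_db = [{"person": f"p{i}", "item": "wine"} for i in range(10_000)]
sampled_db = sample_db(known_db, rate=0.1)

# An observer who saw p42 buy wine knows p42 is in known_db (Known),
# but can only say p42 is in sampled_db with probability about `rate` (P Known).
observed_fraction = len(sampled_db) / len(known_db)
```

A lower sampling rate makes membership less certain for the observer, at the cost of transferring less data.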
Other personal data makes things worse
This is because other personal data can be used as a Quasi ID.
Two aspects
Traditional view: QID + personal data whose gathering process is not observed by other people
Current view: QID + personal data whose gathering process can be observed by other people. It is problematic to transfer this type of data to a third party even without ID and QID.
When does an anonymized DB work?
Two cases for each kind of DB: (a) no ID and no quasi ID, (b) no ID but some quasi IDs

Whose data is stored in DB is unknown (Unknown DB)
  (a) Unknown & no QID: not personal data
  (b) Unknown & QID: k-anonymity works

Whose data is stored in DB is probabilistically known (P Known DB), such as a Sampled DB
  (a) P Known & no QID: the risk depends on the sampling rate.
  (b) P Known & QID: k-anonymity may work. The risk depends both on the sampling rate and on the granularity of the QID, such as the data gathering frequency.

Whose data is stored in DB is known (Known DB)
  (a) Known & no QID: if personal location history is used as PID, k-anonymity degrades the value of the data too much.
  (b) Known & QID: quite risky
Summary
If the personal data gathering action can be observed by other people, k-anonymity severely degrades the value of the data.
If the personal data gathering action cannot be observed by other people:
– if no QID is included, k-anonymity is not needed;
– if QIDs are included, k-anonymity of the QIDs may work.
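To make "k-anonymity of the QIDs" concrete, here is a minimal sketch of a check that every combination of quasi-ID values is shared by at least k records. The function name `is_k_anonymous` and the sample records are hypothetical and are not taken from the slides.

```python
from collections import Counter

def is_k_anonymous(records, qid_columns, k):
    """True if every combination of QID values appears in at least k records."""
    group_sizes = Counter(tuple(rec[c] for c in qid_columns) for rec in records)
    return all(size >= k for size in group_sizes.values())

# Hypothetical records: two people share the QIDs ("30s", "Bunkyo"), one does not.
records = [
    {"age": "30s", "ward": "Bunkyo", "disease": "flu"},
    {"age": "30s", "ward": "Bunkyo", "disease": "obesity"},
    {"age": "40s", "ward": "Bunkyo", "disease": "diabetes"},
]

two_anonymous = is_k_anonymous(records, ["age", "ward"], k=2)  # the "40s" group has only 1 record
one_anonymous = is_k_anonymous(records, ["age", "ward"], k=1)
```

This only checks the QID columns. As the slides point out, other personal data can itself act as a quasi ID when its gathering is observable, which such a check does not capture.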
Part 2 ◆ Can the data source person access or erase his/her own data in an anonymized DB without a personal ID?
Traditional view: QID + unobserved personal data
Original DB:
  ID      QID                  Sensitive data   Other data
  name    address, age, sex    disease, …
The original DB is split into two tables:
  ID      Pseudonym
  name    a123x

  Pseudonym   QID                  Sensitive data   Other data
  a123x       address, age, sex    disease, …
Another DB may include ID and QID. Matching it against the pseudonymized DB may make it possible to link sensitive data to the ID even without the pseudonym table.
Access request from
To keep privacy stricter, the pseudonym is changed frequently. Access is still possible through the pseudonym database.
The original DB (ID (name, etc.), other personal data) is split into two tables:
– Pseudonym table: (ID (name, etc.), pseudonym (e.g. A123B)). This table is strictly controlled.
– Data table: (pseudonym (e.g. A123B), other personal data). Data mining is done only on this table, so it is safe.
If access is required, the DB manager connects the ID and the other personal data through the pseudonym table.

Example of the split, where one person has four pseudonyms:
  Pseudonym table
  ID (name, etc.)   Pseu:A123B4, Pseu:C1263B, Pseu:X91234, Pseu:Z12345

  Data table
  Pseu:A123B4   Other personal data:1
  Pseu:C1263B   Other personal data:2
  Pseu:X91234   Other personal data:3
  Pseu:Z12345   Other personal data:4
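The split into a pseudonym table and a data table can be sketched as follows. This is an illustrative sketch under assumptions: the function names, the use of `secrets.token_hex` to generate pseudonyms, and the record fields are not from the slides.

```python
import secrets

def split_with_pseudonyms(records, id_field="name"):
    """Split records into a pseudonym table and a data table without IDs."""
    pseudonym_table = {}  # pseudonym -> ID; kept by the DB manager only
    data_table = {}       # pseudonym -> other personal data; the part used for mining
    for rec in records:
        pseudonym = secrets.token_hex(4)
        while pseudonym in pseudonym_table:  # regenerate on the (unlikely) collision
            pseudonym = secrets.token_hex(4)
        pseudonym_table[pseudonym] = rec[id_field]
        data_table[pseudonym] = {k: v for k, v in rec.items() if k != id_field}
    return pseudonym_table, data_table

def access(person_id, pseudonym_table, data_table):
    """DB manager side: join the two tables to answer an access request."""
    return [data_table[p] for p, owner in pseudonym_table.items() if owner == person_id]

# Hypothetical records: one pseudonym per record, so Bob ends up with two pseudonyms.
records = [
    {"name": "Bob", "data": "record 1"},
    {"name": "Bob", "data": "record 2"},
    {"name": "Bill", "data": "record 3"},
]
pseudonym_table, data_table = split_with_pseudonyms(records)
bobs_data = access("Bob", pseudonym_table, data_table)
```

Issuing a fresh pseudonym per record corresponds to changing the pseudonym frequently; only the holder of the pseudonym table can join the records back to a person.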
No k-anonymity cases
What is distributed to third parties is the DB without ID, but…
DB manager’s pseudonym table (this table is not transferred to anyone outside):
  ID (name)   pseudo:A123B4, pseudo:C1263B, pseudo:X91234, pseudo:Z12345
Third parties only receive this part of the DB:
  pseudo:A123B4   Personal data:1
  pseudo:C1263B   Personal data:2
  pseudo:X91234   Personal data:3
  pseudo:Z12345   Personal data:4
When this person requests access to his data, the DB manager requests the data for these four pseudonyms. The third party then realizes that these four records belong to the same person!
To remedy this situation, the DB manager adds many other unrelated persons’ pseudonyms to the request.
But rectification and erasure requests are more difficult, because, obviously, adding unrelated persons’ pseudonyms does not work.
In the erasure case, if the third party is malicious, we do not have any protection method that works.
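The remedy of adding unrelated persons' pseudonyms to an access request might look like the sketch below. The function names, the number of dummies, and the dummy pseudonyms `Q00001` to `Q00006` are hypothetical; the slides do not specify how the padding is chosen.

```python
import random

def padded_access_request(own_pseudonyms, all_pseudonyms, n_dummies, seed=None):
    """Mix the requester's pseudonyms with randomly chosen unrelated ones."""
    rng = random.Random(seed)
    own = set(own_pseudonyms)
    others = [p for p in all_pseudonyms if p not in own]
    dummies = rng.sample(others, min(n_dummies, len(others)))
    request = list(own_pseudonyms) + dummies
    rng.shuffle(request)  # hide which entries belong together
    return request

def filter_response(response, own_pseudonyms):
    """DB manager side: keep only the requester's records from the reply."""
    own = set(own_pseudonyms)
    return {p: data for p, data in response.items() if p in own}

own = ["A123B4", "C1263B", "X91234", "Z12345"]
everyone = own + [f"Q0000{i}" for i in range(1, 7)]  # hypothetical other pseudonyms
request = padded_access_request(own, everyone, n_dummies=4, seed=1)
```

The same padding cannot be used for erasure or rectification, since the dummy entries would be erased or modified as well.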
Access is possible in k-anonymity
DB manager A holds the pseudonym table:
  ID      Pseudo
  Bob     a12
  Bill    b23
  Chris   c34
Service provider B, who received 3-anonymized data from A, holds:
  Pseudo   QID   Sensitive
  a12      xxx   flu
  b23      xxx   obesity
  c34      xxx   diabetes
① Bob sends a request for access to DB manager A.
② A requests from B access to the personal data for (a12, b23, c34).
③ B returns the three persons’ sensitive data.
④ A shows Bob the data corresponding to him, i.e. a12’s data.
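The access flow above can be sketched with the example values from the slide. The function names and the `REQUEST_LOG` list, which records what service provider B gets to see, are assumptions for illustration.

```python
# DB manager A keeps the pseudonym table; B holds the 3-anonymized records.
PSEUDONYM_TABLE = {"Bob": "a12", "Bill": "b23", "Chris": "c34"}
PROVIDER_DB = {
    "a12": {"qid": "xxx", "sensitive": "flu"},
    "b23": {"qid": "xxx", "sensitive": "obesity"},
    "c34": {"qid": "xxx", "sensitive": "diabetes"},
}
REQUEST_LOG = []  # what B observes

def provider_lookup(pseudonyms):
    """Service provider B: return the records for the requested pseudonyms."""
    REQUEST_LOG.append(list(pseudonyms))
    return {p: PROVIDER_DB[p] for p in pseudonyms}

def access_own_data(person):
    """DB manager A: ask B for the whole group, then show only the requester's record."""
    group = sorted(PSEUDONYM_TABLE.values())  # here the whole table is one 3-anonymity group
    response = provider_lookup(group)
    return response[PSEUDONYM_TABLE[person]]

bobs_record = access_own_data("Bob")
```

B sees a request for all three pseudonyms and so cannot tell which of the three persons asked for access.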
An erasure request for a k-anonymized DB causes trouble
DB manager, who makes the 2-anonymized DB, holds:
  ID     Pseudo
  Bob    b23
  Bill   c34
Third party, who has only the 2-anonymized DB, holds:
  Pseudo   QID   Sensitive
  b23      xxx   high blood pressure
  c34      xxx   cancer
① Bill sends an erasure request to the DB manager: “It’s Bill. Erase my data.”
② The DB manager requests the third party to erase c34’s data.
Result: 2-anonymity collapses. 1-anonymity? No kidding! Re-build k-anonymity? Oh, no!
Three solutions
• Erasing one person’s data collapses k-anonymity.
Solution 1: k-anonymize the DB again. This takes too much time and requires distributing the new k-anonymized DB: too costly. ✗
Solution 2: Erase all k persons’ data together if one of them is erased. Seemingly OK, but it degrades the quality of the DB and the accuracy of data mining on the DB.
Solution 3: Use (k+α)-anonymity beforehand, so that the DB is still k-anonymous after erasing up to α persons’ data. Probably OK. However, if α is not small, the quality of the (k+α)-anonymized DB is degraded.
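Solutions 2 and 3 can be compared with a small sketch. The helper names, the choice K = 2 and ALPHA = 1, and the three sample records are assumptions for illustration only.

```python
from collections import Counter

def is_k_anonymous(records, qid_columns, k):
    """True if every combination of QID values appears in at least k records."""
    sizes = Counter(tuple(r[c] for c in qid_columns) for r in records)
    return all(n >= k for n in sizes.values())

def erase_one(records, pseudonym):
    """Plain erasure: drop only the requested record."""
    return [r for r in records if r["pseudo"] != pseudonym]

def erase_group(records, qid_columns, pseudonym):
    """Solution 2: drop every record sharing the QID values of the requested one."""
    targets = {tuple(r[c] for c in qid_columns) for r in records if r["pseudo"] == pseudonym}
    return [r for r in records if tuple(r[c] for c in qid_columns) not in targets]

K, ALPHA = 2, 1
# Solution 3: the DB is built (K + ALPHA)-anonymous, i.e. 3 records per QID group.
db = [
    {"pseudo": "a12", "qid": "xxx", "sensitive": "flu"},
    {"pseudo": "b23", "qid": "xxx", "sensitive": "obesity"},
    {"pseudo": "c34", "qid": "xxx", "sensitive": "diabetes"},
]
after_one = erase_one(db, "c34")         # still K-anonymous
after_two = erase_one(after_one, "b23")  # more than ALPHA erasures: K-anonymity is lost
```

With Solution 2, `erase_group(db, ["qid"], "c34")` removes all three records, which keeps anonymity intact but loses the most data.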
Part 3
◆ Does anonymization have side effects?
k-anonymity of Location and False Light
Original data:
  Name   Age   Gender   Address (number, street name, ward name)   Location at some time
  Alex   35    M        101 Hongo, Bunkyo                          consumer finance: K
  Bill   30    M        120 Yushima, Bunkyo                        University T
  Ken    33    M        312 Yayoi, Bunkyo                          University T
  Paul   39    M        421 Sendagi, Bunkyo                        Hospital Y
After 4-anonymization:
  Name (anonymized)   Age   Gender   Address   Location at some time
  Alex                30s   M        Bunkyo    consumer finance: K
  Bill                30s   M        Bunkyo    University T
  Ken                 30s   M        Bunkyo    University T
  Paul                30s   M        Bunkyo    Hospital Y
Alex, Bill, Ken and Paul can no longer be distinguished from one another, so all four are suspected of having visited consumer finance K (which suggests they are in financial trouble).
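The generalization step in this example can be sketched as follows, using the four records from the slide. The function name `generalize`, the rule of coarsening age to its decade (written here as "30s"), and the rule of keeping only the ward name are assumptions about how the 4-anonymization was done.

```python
from collections import Counter

original = [
    {"name": "Alex", "age": 35, "gender": "M", "address": "101 Hongo, Bunkyo", "location": "consumer finance: K"},
    {"name": "Bill", "age": 30, "gender": "M", "address": "120 Yushima, Bunkyo", "location": "University T"},
    {"name": "Ken", "age": 33, "gender": "M", "address": "312 Yayoi, Bunkyo", "location": "University T"},
    {"name": "Paul", "age": 39, "gender": "M", "address": "421 Sendagi, Bunkyo", "location": "Hospital Y"},
]

def generalize(record):
    """Drop the name and coarsen the QIDs: age to its decade, address to the ward."""
    decade = record["age"] // 10 * 10
    ward = record["address"].split(",")[-1].strip()
    return {"age": f"{decade}s", "gender": record["gender"],
            "address": ward, "location": record["location"]}

anonymized = [generalize(r) for r in original]

# All four records now share one QID combination, so the table is 4-anonymous.
qid_groups = {(r["age"], r["gender"], r["address"]) for r in anonymized}

# Share of the group seen at each location: anyone in the group may be suspected.
location_share = {loc: n / len(anonymized)
                  for loc, n in Counter(r["location"] for r in anonymized).items()}
```

An observer who only knows that someone belongs to this group can attribute the consumer finance visit to any of the four, which is the false light effect.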
Side effect of k-anonymity
Location k-anonymization can trigger false light
[Figure: a k-anonymized area with k persons in it, containing consumer finance shop C]
This student is looking for a job now. If he is suspected of going to a consumer finance shop, it harms his job search. This is false light.
The false light triggered by location k-anonymization is remedied by dividing shop C among 4 k-anonymized areas
[Figure: four k-anonymized areas, each with k persons in it, each covering part of consumer finance shop C]
Only one person among all k persons in a k-anonymized area is at consumer finance shop C, so suspecting a particular person of being at shop C is not reasonable.
[Figure: subjective probability of suspecting something wrong (0 to 1) plotted against (# of persons at shop C)/k (0 to 1)]
[Figure: expected damage estimated by the third person, and money needed for precaution, plotted against the subjective probability of suspecting that a person went to shop C]
One region of the plot is almost free from false light. The problem is how to select k so as to stay within this region!
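The quantity (# of persons at shop C)/k can be computed directly. The sketch below compares one k-anonymized area in which everyone is at shop C with a reorganized area in which only one of the k persons is at the shop. The function name and the value k = 4 are assumptions, and the sketch does not model the mapping from this ratio to subjective probability or expected damage.

```python
def shop_ratio(persons_at_shop, k):
    """Fraction of the k persons in a k-anonymized area who are at shop C."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 <= persons_at_shop <= k:
        raise ValueError("persons_at_shop must be between 0 and k")
    return persons_at_shop / k

k = 4
before = shop_ratio(4, k)  # the area coincides with shop C: everyone is suspected
after = shop_ratio(1, k)   # reorganized area: only one of the k persons is at shop C
```

A smaller ratio gives an observer less reason to suspect any particular person in the area.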
Summary
• k-anonymity has a side effect, so-called false light.
• For k-anonymity of location, the side effect is reduced by reorganizing the k-anonymized areas.