Anonymity and Privacy Issues --- re-identification Yimeng Zhang 12/4/07

Index

• Views on Privacy of Social Media
• Overview of Re-identification
• You Are What You Say: Privacy Risks of Public Mentions, Frankowski et al., SIGIR06

Improper Use of Personal Information Online

Top Privacy Concerns

Remaining Anonymous

True Information Provided While Registering

Ability to Remain Anonymous

Importance of Controlling Personal Information

Specifying Who Can View Personal Information

Conclusion

• Around 40% of people would like to remain anonymous on social media or social networking sites

• Most people provide their true personal information while registering

• Most people think it is important to have control over their personal information online

Re-identification Techniques can identify the users of an anonymous dataset

Privacy Loss through Re-identification

• Re-identification: Linkage of datasets with explicit identifiers with datasets without explicit identifiers through common attributes

• Datasets without explicit identifiers
  – Public data which are made anonymous by users
  – Public data by research groups (after suitable anonymizing)
  – Public data from government agencies (census)

People wish to keep private

Example of Re-identification

Made public by the Group Insurance Commission of Massachusetts

Voter registration list of Massachusetts, purchased for only $20

Sweeney, 2002

87% of the US population in 1990 are likely to be uniquely identified based only on ZIP code, birth date, and sex

The Rebus Form

+ =

Governor’s medical records!
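The linkage in this example can be sketched as a join on the shared attributes. This is a minimal sketch in Python, assuming two tiny made-up tables; every record, name, and field name below is hypothetical and is not taken from Sweeney's data.

```python
from collections import defaultdict

# Hypothetical "anonymous" medical records: no names, but quasi-identifiers remain.
medical = [
    {"zip": "02138", "birth": "1950-01-01", "sex": "M", "diagnosis": "A"},
    {"zip": "02139", "birth": "1962-03-04", "sex": "F", "diagnosis": "B"},
]

# Hypothetical voter list: names together with the same quasi-identifiers.
voters = [
    {"name": "Person One", "zip": "02138", "birth": "1950-01-01", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth": "1970-05-06", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth", "sex")


def link(anonymous_rows, identified_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, anonymous_row) pairs whose key values match exactly one identified row."""
    index = defaultdict(list)
    for row in identified_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    matches = []
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row))
    return matches


linked = link(medical, voters)
```

Only the first medical record is linked here, because the second has no voter with the same ZIP code, birth date, and sex.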

From Frankowski, SIGIR06

Example of face identification

Facebook: profiles with explicit identifiers

Friendster: profiles without explicit identifiers

Face Recognizer

Gross and Acquisti, WPES 05

Identity violation!

You Are What You Say: Privacy Risks of Public Mentions

Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl

University of Minnesota

SIGIR 2006

Main Idea

• People can be identified by their preferences and what they talk about
  – Reviews of books, movies, songs
  – Mentions on forums or blogs
  – Friend list on Facebook
  – Wish or purchase list on Amazon

• Method for Re-identification
  – Datasets are represented in Sparse Relation Spaces
  – Re-identification can be done by matching two Sparse Relation Spaces

Sparse Relation Space

• Relates people to items
• Sparse: has few relationships recorded per person

• A dataset that can be represented in a Sparse Relation Space is vulnerable

i1 i2 i3 …

p1 X

p2 X

p3 X

…
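One way to hold such a space in memory is a mapping from each person to the small set of items they relate to. This is a minimal sketch, assuming made-up person and item labels; the helper names `density` and `people_per_item` are hypothetical.

```python
# Hypothetical people and items; each person relates to only a few items.
relations = {
    "p1": {"i2"},
    "p2": {"i1", "i3"},
    "p3": {"i3"},
}


def density(space, n_items):
    """Fraction of the person-by-item matrix that is filled in."""
    filled = sum(len(items) for items in space.values())
    return filled / (len(space) * n_items)


def people_per_item(space):
    """Invert the relation: item -> set of people related to it."""
    inverted = {}
    for person, items in space.items():
        for item in items:
            inverted.setdefault(item, set()).add(person)
    return inverted


# With a catalogue of 1000 items (an assumed size), almost every cell is empty.
sparse_density = density(relations, n_items=1000)
by_item = people_per_item(relations)
```

Storing only the recorded relationships keeps memory proportional to the number of X marks rather than to the full matrix.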

Research Questions

• Risks of dataset release
  – What are the risks to user privacy when releasing a dataset?

• Altering the dataset
  – How can dataset owners alter the dataset to preserve user privacy?

• Self defense
  – How can users protect their own privacy?

Experiment Dataset: MovieLens

Dataset 1: Movie ratings

Users have not agreed to have them revealed

Released for research use as an “anonymous dataset”

Dataset 2: Movie reviews

Public

Feature of the dataset

• Both ratings and mentions follow a power law

• Important feature for real world sparse relation space

[Figure: number of ratings of an item by percentile. X-axis: item percentile (0% to 100%); y-axis: number of ratings (0 to 60000).]

Frankowski, SIGIR 06

Evaluation Measure

Input: ratings dataset and mentions by user t

Re-identification algorithm

Output: top k ratings users, ranked by the likelihood that they are user t

k-identified: t is among the k users returned by the algorithm

k-identification rate: the fraction of users who are k-identified
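The two measures can be computed directly from the ranked lists an algorithm returns. This is a minimal sketch, assuming the rankings are already available; the function names and the sample rankings are hypothetical.

```python
def is_k_identified(target, ranked_users, k):
    """True if the target user is among the first k users of the ranking."""
    return target in ranked_users[:k]


def k_identification_rate(rankings, k):
    """rankings maps each target user t to the ranked list returned for t."""
    if not rankings:
        return 0.0
    hits = sum(is_k_identified(t, ranked, k) for t, ranked in rankings.items())
    return hits / len(rankings)


# Hypothetical algorithm output for three users who made mentions.
rankings = {
    "t1": ["t1", "u7", "u9"],
    "t2": ["u3", "t2", "u5"],
    "t3": ["u4", "u8", "u2"],
}
rate_at_1 = k_identification_rate(rankings, k=1)
rate_at_2 = k_identification_rate(rankings, k=2)
```

In this toy data one of three users is 1-identified and two of three are 2-identified.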

Set Intersection Algorithm for Re-identification

• Likely list: users in the rating database who have rated every movie mentioned by user t

• Problem
  – Users mention movies but do not rate them
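The likely list is a subset test per user. This is a minimal sketch, assuming ratings are stored as a set of movies per user; the function name `likely_list` and the sample data are hypothetical.

```python
def likely_list(mentions, ratings):
    """Users whose rated movies include every movie mentioned by the target."""
    mentioned = set(mentions)
    return sorted(user for user, rated in ratings.items() if mentioned <= rated)


ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "C"},
    "u3": {"B"},
}
candidates = likely_list({"A", "C"}, ratings)     # u1 and u2 both rated A and C
no_candidates = likely_list({"A", "Z"}, ratings)  # nobody rated Z
```

The second call illustrates the problem noted above: a single mention of an unrated movie empties the list.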

TF-IDF Algorithm

• Mentions of a user: vector of the movies the user mentioned

• Ratings of a user: vector of the movies the user rated

• Likelihood: TF-IDF cosine similarity
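A minimal sketch of this ranking, assuming binary vectors (a movie is either present or not) and an inverse document frequency of log(N / number of users who rated the movie). The exact weighting used in the paper may differ; the function names and data are hypothetical.

```python
import math


def idf_weights(ratings):
    """idf(m) = log(N / number of users who rated m); rare movies weigh more."""
    n_users = len(ratings)
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    return {movie: math.log(n_users / c) for movie, c in counts.items()}


def cosine(movies_a, movies_b, idf):
    """Cosine similarity of two binary movie vectors weighted by idf."""
    def norm(movies):
        return math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in movies))

    dot = sum(idf.get(m, 0.0) ** 2 for m in movies_a & movies_b)
    denom = norm(movies_a) * norm(movies_b)
    return dot / denom if denom else 0.0


def rank_users(mentions, ratings):
    """Ratings users sorted from most to least similar to the mentions."""
    idf = idf_weights(ratings)
    return sorted(ratings, key=lambda u: cosine(mentions, ratings[u], idf), reverse=True)


ratings = {
    "u1": {"popular", "rare"},
    "u2": {"popular"},
    "u3": {"popular", "other"},
}
ranking = rank_users({"popular", "rare"}, ratings)
```

A movie rated by everyone gets weight zero, so only the rarely rated movie separates the candidates here.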

Scoring Algorithm

Scoring:

• emphasize the mentions of rarely rated movies

• de-emphasize the number of ratings a user has

Score for one mention m of a user:

The fraction of users who have not rated the mentioned movie m

Score for a user:

The product of the scores for all mentions of this user
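A minimal sketch of the scoring described above. Two details are assumptions, not taken from the slide: the fraction is computed over the other users, and a mention the user has not rated receives a small constant (`UNRATED_SUBSCORE`, a hypothetical name and value) instead of zero, so one missing rating does not eliminate the user.

```python
UNRATED_SUBSCORE = 0.05  # assumption: small penalty when the user did not rate the mention


def subscore(user, movie, ratings):
    """Fraction of the other users who have not rated the movie, if this user rated it."""
    if movie not in ratings[user]:
        return UNRATED_SUBSCORE
    others = [u for u in ratings if u != user]
    if not others:
        return 1.0
    not_rated = sum(1 for u in others if movie not in ratings[u])
    return not_rated / len(others)


def score(user, mentions, ratings):
    """Product of the subscores over all mentions."""
    result = 1.0
    for movie in mentions:
        result *= subscore(user, movie, ratings)
    return result


ratings = {
    "u1": {"popular", "rare"},
    "u2": {"popular"},
    "u3": {"popular"},
    "u4": {"other"},
}
mentions = {"popular", "rare"}
scores = {u: score(u, mentions, ratings) for u in ratings}
best = max(scores, key=scores.get)
```

Because the score depends only on the mentioned movies, a user with many ratings gains nothing from the ratings that were never mentioned.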

Scoring Algorithm with Ratings

• Suppose we have a magic analyzer which can guess the rating of a movie from the mention
  – E.g. using the context of that mention

• Algorithms
  – ExactRating: the analyzer can perfectly determine the rating
  – FuzzyRating: the analyzer can guess the rating value within +/-1
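The difference between the two variants comes down to how strictly a guessed rating must agree with a stored one. This is a minimal sketch of that comparison used as a simple filter, assuming ratings are integers; it is not the paper's scoring, and the function names and data are hypothetical.

```python
def rating_matches(guessed, actual, tolerance):
    """tolerance=0 mimics the exact variant; tolerance=1 mimics the +/-1 variant."""
    return abs(guessed - actual) <= tolerance


def consistent_users(guessed_ratings, ratings, tolerance):
    """Users whose stored ratings agree with every guessed rating."""
    result = []
    for user, rated in ratings.items():
        if all(
            movie in rated and rating_matches(guess, rated[movie], tolerance)
            for movie, guess in guessed_ratings.items()
        ):
            result.append(user)
    return result


ratings = {
    "u1": {"A": 5, "B": 2},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1, "B": 2},
}
guessed = {"A": 5, "B": 2}
exact = consistent_users(guessed, ratings, tolerance=0)
fuzzy = consistent_users(guessed, ratings, tolerance=1)
```

The exact comparison leaves a single candidate, while the +/-1 comparison also admits the user whose rating of A is one point lower.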

Percent of users identified by different algorithms

1-identification rate

RQ2: Altering the dataset

• How can dataset owners alter the dataset they release to preserve user privacy

• Data Suppression
  – Algorithm: drop rarely rated movies
  – Not a big problem for industry, but harmful for research
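Dropping rarely rated movies can be sketched as a count followed by a filter. This is a minimal sketch, assuming a threshold on the number of raters; the function name, the `min_raters` parameter, and the data are hypothetical.

```python
def suppress_rare_items(ratings, min_raters):
    """Drop every movie rated by fewer than min_raters users."""
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    keep = {movie for movie, c in counts.items() if c >= min_raters}
    return {user: rated & keep for user, rated in ratings.items()}


ratings = {
    "u1": {"popular", "rare"},
    "u2": {"popular"},
    "u3": {"popular", "other"},
}
released = suppress_rare_items(ratings, min_raters=2)
```

After suppression all three users look identical in the released data, which also shows why the loss of rare items hurts research use.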

Dataset-level Suppression

Does not work!

RQ3: Self Defence

• How can users protect their own privacy

• Suppression
  – Do not mention rarely rated movies

• Misdirection
  – Mention items they have not rated
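Both defenses change only what the user mentions. This is a minimal sketch, assuming the user can see how many ratings each movie has; the function names, thresholds, and counts are hypothetical.

```python
def suppress_rare_mentions(mentions, rating_counts, min_raters):
    """Suppression: keep only mentions of movies with at least min_raters ratings."""
    return {m for m in mentions if rating_counts.get(m, 0) >= min_raters}


def misdirect(mentions, own_ratings, rating_counts, n_extra):
    """Misdirection: add the n_extra most rated movies the user has not rated."""
    unrated = [m for m in rating_counts if m not in own_ratings]
    unrated.sort(key=lambda m: (-rating_counts[m], m))
    return set(mentions) | set(unrated[:n_extra])


rating_counts = {"hit": 900, "classic": 500, "niche": 3, "obscure": 1}
own_ratings = {"classic", "niche"}
mentions = {"classic", "niche"}

safer = suppress_rare_mentions(mentions, rating_counts, min_raters=10)
decoyed = misdirect(mentions, own_ratings, rating_counts, n_extra=1)
```

The decoy chosen here is the most rated movie the user has not rated, which is the popular-item case of misdirection.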

User-level Suppression

Does not work!

Misdirection

Works when users mention popular items

Conclusion

• Simple data mining algorithms can identify users who make mentions in a sparse relation space and believe they are anonymous
  – Uses of the algorithms: e.g. finding paper reviewers (future work of Frankowski)
  – Privacy risks for users on social media sites

• Privacy is hard to preserve
  – Do not reveal private information, even where the setting seems to be anonymous
