VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Belagavi, Karnataka
A Seminar Report on
“BIG DATA PRIVACY ISSUES IN PUBLIC SOCIAL MEDIA”
Submitted in partial fulfillment of the requirements for the award of the Degree of
Master of Technology
In
Computer Science and Engineering
Submitted
by
SUPRIYA R
1VI14SCS11
Guided By
Mrs. MARY VIDYA JOHN
Assistant Professor, Dept. of CSE
Vemana IT, Bengaluru – 34.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
VEMANA INSTITUTE OF TECHNOLOGY
Koramangala, Bengaluru – 560 034.
2014-2015
Karnataka Reddyjana Sangha®
VEMANA INSTITUTE OF TECHNOLOGY Koramangala, Bengaluru – 34
(Affiliated to Visvesvaraya Technological University, Belagavi)
Department of Computer Science & Engineering
Certificate
This is to certify that SUPRIYA R, bearing U.S.N 1VI14SCS11, a student of I semester,
Computer Science & Engineering has completed the seminar entitled “BIG DATA
PRIVACY ISSUES IN PUBLIC SOCIAL MEDIA” in partial fulfillment of the
requirements for the award of the degree of Master of Technology in Computer Science and Engineering
of the Visvesvaraya Technological University, Belagavi during the academic year 2014-2015. The
seminar report has been approved as it satisfies the academic requirements in respect of seminar
work prescribed for the said degree.
________________ ______________
Signature of the Guide Signature of HOD
MRS. MARY VIDYA JOHN DR.NARASIMHA MURTHY K.N
Name of the Examiner Signature with date
1.____________________ ________________
2.____________________ ________________
ACKNOWLEDGEMENT
I am indeed very happy to greatly acknowledge the numerous personalities involved in
lending their help to make my seminar “Big Data Privacy Issues in Public Social Media” a
successful one.
Salutations to my beloved and esteemed institute “Vemana Institute of Technology” for
having well-qualified staff and labs furnished with the necessary equipment.
I express my gratitude to DR. P. GIRIDHARA REDDY, Principal, Vemana IT, for
providing the best facilities; without his constructive criticism and suggestions I would not
have been able to complete this work.
I express my sincere gratitude to DR.NARASIMHA MURTHY K.N, Professor and
H.O.D (PG), Dept of Computer Science and Engineering, a great motivator and think-tank. I am
very much grateful to him for providing an opportunity to present my seminar.
I also express my gratitude to my internal guide Mrs. MARY VIDYA JOHN, Assistant
Professor, for her encouragement, spirit, and useful guidance in preparing the report and
presenting the seminar.
I also thank my parents and friends, whose moral support and constant guidance made
my efforts fruitful.
Date: SUPRIYA R
Place: Bangalore 1VI14SCS11
ABSTRACT
Big Data is a new label given to a diverse field of data-intensive information in
which the datasets are so large that they are difficult to work with effectively. The term has
been mainly used in two contexts, firstly as a technological challenge when dealing with
data-intensive domains such as high energy physics, astronomy or internet search, and
secondly as a sociological problem when data about us is collected and mined by companies
such as Facebook and Google, by mobile phone operators, and by governments.
The focus is on the second issue from a new perspective, namely how the
user can gain awareness of the personally relevant part of the Big Data that is publicly available
in the social web. The amount of user-generated media uploaded to the web is expanding
rapidly and it is beyond the capabilities of any human to sift through it all to see which
media impacts our privacy. Based on an analysis of social media in Flickr, Locr, Facebook
and Google+, privacy implications and potential of the emerging trend of geo-tagged social
media are discussed. Finally, a concept is presented with which users can stay informed
about which parts of this social Big Data are relevant to them.
CONTENTS
CHAPTER NO. TITLE PAGE NO
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES v
1
1 INTRODUCTION
1.1 Overview 1
2 LITERATURE SURVEY 3
2.1 Location privacy in pervasive computing 3
2.2 Facebook’s trainwreck: exposure and invasion 6
2.3 Facebook’s privacy settings: Who cares? 8
2.4 Protecting location privacy with K-Anonymity 12
3
PROBLEM STATEMENT 14
4 SYSTEM ANALYSIS 16
4.1 Existing system 16
4.2 Proposed system 17
5 THREAT ANALYSIS 19
5.1 Awareness of damaging media in big datasets 19
5.2 Analysis of Service Privacy 20
6 SURVEY OF METADATA IN SOCIAL MEDIA 24
7 CONCLUSION AND FUTURE WORK 27
REFERENCES 30
LIST OF FIGURES
FIGURE NO.
TITLE
PAGE. NO
Figure 1 Mixed zone arrangement 05
Figure 2 Facebook message in Dec 2009 10
Figure 3 Facebook’s simplified setting 11
Figure 4 Public Privacy metadata from Flickr 22
Figure 5 Public Privacy metadata from Locr 23
Big Data Privacy Issues in Public Social Media
M.Tech, 1st sem, Dept.of CSE, VIT 2014-15 1
CHAPTER 1
INTRODUCTION
In this electronic age, an increasing number of organizations face an explosion of
data, and the size of the databases used in today’s enterprises has been growing at an
exponential rate. Data is generated through many sources such as business processes,
transactions, social networking sites, web servers, etc., and exists in structured as well as
unstructured form. Today’s business applications have enterprise features: they are large-
scale, data-intensive, web-oriented, and accessed from diverse devices, including mobile
devices. Processing and analyzing this huge amount of data to extract meaningful
information is a challenging task.
1.1 Overview
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process them using traditional data processing
applications. It is becoming a hot topic in many areas where datasets are so large that they
can no longer be handled effectively or even completely. Or put differently, any task that
is comparatively easy to execute when operating on a small set of data, but becomes
unmanageable when dealing with the same problem with a large dataset can be classified
as a Big Data problem. Data sets grow in size in part because they are increasingly being
gathered by ubiquitous information-sensing mobile devices, remote sensing, software logs,
cameras, microphones and wireless sensor networks.
Typical problems that are encountered when dealing with Big Data include capture,
storage, dissemination, search, analytics and visualization. The traditional data-intensive
sciences such as astronomy, high energy physics, meteorology, and genomics, biological
and environmental research in which peta- and exabytes of data are generated are common
domain examples. Here even the capture and storage of the data is a challenge.
But there are also new domains emerging on the Big Data paradigm such as data
warehousing, Internet and social web search, finance and business informatics. Here
datasets can be small compared to the previous domains, however the complexity of the
data can still lead to the classification as a Big Data problem.
The traditional Big Data applications such as astronomy and other e-sciences usually
operate on non-personal information and as such usually do not have significant privacy
issues. The privacy critical Big Data applications lie in the new domains of the social web,
consumer and business analytics and governmental surveillance. In these domains Big Data
research is being used to create and analyze profiles of users, for example for market
research, targeted advertisement, workflow improvement or national security.
These are very contentious issues, since it is entirely up to the controller of the Big Data
sets whether or not the information obtained from various sources is used for criminal purposes.
In particular in the context of the social web there is an increasing awareness of the value,
potential and risk of the personal data which users voluntarily upload to the web.
However, the Big Data issue in this area has focused almost entirely on what the
controlling companies do with this information. These concerns are being addressed by
calls for regulatory intervention, i.e. regulating what companies are allowed to do with the
data users give them or what data they are allowed to gather about users.
This seminar examines the social media Big Data issue. The growing proliferation and
capabilities of mobile devices are creating a deluge of social media that affects our privacy.
Due to the vast amounts of data being uploaded every day, it is next to impossible
to be aware of everything that affects us. A concept is then discussed which can be used to
regain control over some of the Big Data deluge created by other social web users.
CHAPTER 2
LITERATURE SURVEY
2.1 Location Privacy in Pervasive Computing
Mix zones
Most theoretical models of anonymity and pseudonymity originate from the work
of David Chaum, whose pioneering contributions include two constructions for anonymous
communications (the mix network and the dining cryptographers algorithm) and the
notion of anonymity sets. This work introduces a new entity for location systems, the mix
zone, which is analogous to a mix node in communication systems. Using the mix zone to
model the spatiotemporal problem, useful techniques from the anonymous communications
field can be adopted.
Mix networks and mix nodes
A mix network is a store-and-forward network that offers anonymous
communication facilities. The network contains normal message-routing nodes alongside
special mix nodes. Even hostile observers who can monitor all the links in the network
cannot trace a message from its source to its destination without the collusion of the mix
nodes. In its simplest form, a mix node collects n equal-length packets as input and reorders
them by some metric (for example, lexicographically or randomly) before forwarding them,
thus providing unlinkability between incoming and outgoing messages.
For brevity, some essential details about the layered encryption of packets are
omitted; interested readers should consult Chaum’s original work.
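The batch-and-reorder behaviour of a mix node can be sketched in a few lines of Python. The packet contents are hypothetical, and the layered encryption is omitted here, as in the text:

```python
import secrets

def mix_node(batch):
    """Toy mix node: accepts a batch of equal-length packets and outputs
    them in a randomized order, breaking the link between arrival order
    and departure order (layered decryption omitted)."""
    if len({len(p) for p in batch}) > 1:
        raise ValueError("packets must be of equal length")
    shuffled = list(batch)
    # Fisher-Yates shuffle driven by a cryptographic RNG
    for i in range(len(shuffled) - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

packets = [b"msg-a", b"msg-b", b"msg-c", b"msg-d"]
out = mix_node(packets)  # same packets, unlinkable order
```

An observer who sees the batch enter and leave cannot tell which output position corresponds to which input, which is exactly the unlinkability property the text describes.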
The number of distinct senders in the batch provides a measure of the unlinkability
between the messages coming in and going out of the mix. Chaum later abstracted this last
observation into the concept of an anonymity set. As paraphrased by Andreas Pfitzmann
and Marit Köhntopp, whose recent survey generalizes and formalizes the terminology on
anonymity and related topics, the anonymity set is “the set of all possible subjects who
might cause an action.”
In anonymous communications, the action is usually that of sending or receiving a
message. The larger the anonymity set’s size, the greater the anonymity offered.
Conversely, when the anonymity set reduces to a singleton—the anonymity set cannot
become empty for an action that was actually performed—the subject is completely
exposed and loses all anonymity.
Mix zones
A mix zone for a group of users is defined as a connected spatial region of maximum size in
which none of these users has registered any application callback; for
a given group of users there might be several distinct mix zones. In contrast, an application
zone is an area where a user has registered for a callback. The middleware system can
define the mix zones and calculate them separately for each group of users as the spatial
areas currently not in any application zone. Because applications do not receive any
location information when users are in a mix zone, the identities are “mixed.”
Assuming users change to a new, unused pseudonym whenever they enter a mix
zone, applications that see a user emerging from the mix zone cannot distinguish that user
from any other who was in the mix zone at the same time and cannot link people going into
the mix zone with those coming out of it. If a mix zone has a diameter much larger than the
distance the user can cover during one location update period, it might not mix users
adequately.
Figure 1. A sample mix zone arrangement with three application zones. The airline agency
(A) is much closer to the bank (B) than the coffee shop (C). Users leaving A and C at the
same time might be distinguishable on arrival at B.
Figure 1 provides a plan view of a single mix zone with three application zones around the
edge: an airline agency (A), a bank (B), and a coffee shop (C). Zone A is much closer to B
than C, so if two users leave A and C at the same time and a user reaches B at the next
update period, an observer will know the user emerging from the mix zone at B is not the
one who entered the mix zone at C. Furthermore, if nobody else was in the mix zone at the
time, the user can only be the one from A. If the maximum size of the mix zone exceeds
the distance a user covers in one period, mixing will be incomplete. The degree to which
the mix zone anonymizes users is therefore smaller than one might believe from looking at
the anonymity set size.
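The timing attack illustrated by Figure 1 can be sketched as follows. The zones, pseudonyms, times, and cross-zone travel times are all hypothetical:

```python
# Entry/exit events observed at the mix-zone border: (pseudonym, time, zone).
entries = [("p1", 0.0, "A"), ("p2", 0.0, "C")]
exits = [("q1", 10.0, "B")]

# Hypothetical minimum travel times across the mix zone.
travel_time = {("A", "B"): 10.0, ("C", "B"): 40.0}

def feasible_predecessors(exit_event, entries, travel_time, tolerance=5.0):
    """Return the entry pseudonyms that could plausibly correspond to an
    exit event, given minimum travel times across the zone. The exit's
    effective anonymity set is this list, not all users in the zone."""
    _, t_out, z_out = exit_event
    result = []
    for name, t_in, z_in in entries:
        needed = travel_time[(z_in, z_out)]
        if t_out - t_in >= needed - tolerance:
            result.append(name)
    return result

# q1 appears at B only 10 time units after the entries: too soon to have
# come from C, so an observer links q1 to the user who entered at A.
candidates = feasible_predecessors(exits[0], entries, travel_time)
```

Even though two users were inside the zone, the feasible set for the exit at B shrinks to one, which is the reduced effective anonymity the text warns about.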
2.2 Facebook’s Privacy Trainwreck: Exposure, Invasion, and Social Convergence
The tech world has a tendency to view the concept of ‘private’ as a single bit that is
either 0 or 1. Data are either exposed or not. When companies make a decision to make
data visible in a more ‘efficient’ manner, it is often startling, prompting users to speak of a
disruption of ‘privacy’.
Facebook did not make anything public that was not already public, but the News Feed
disrupted the social dynamics. The reason for this is that privacy is not simply about the
state of an inanimate object or set of bytes; it is about the sense of vulnerability that an
individual experiences when negotiating data. On Facebook, people were wandering
around accepting others as Friends, commenting on others’ pages, checking out what others
posted, and otherwise participating in the networked environment. Now, imagine that
everyone involved notices all this because it is displayed when they login.
By aggregating all this information and projecting it to everyone’s News Feed,
Facebook made what was previously obscure difficult to miss (and even harder to forget).
Those data were all there before but were not efficiently accessible; they were not
aggregated. The acoustics changed and the social faux pas was suddenly very visible.
In the absence of finer privacy controls, participants had to reconsider each action they took
because they knew it would be broadcast to all their Friends.
With Facebook, participants have to consider how others might interpret their
actions, knowing that any action will be broadcast to everyone with whom they consented
to digital Friendship. For many, it is hard to even remember whom they listed as Friends,
let alone assess the different ways in which they may interpret information. Without being
able to see their Friends’ reactions, they are not even aware of when their posts have been
misinterpreted.
Facebook implies that people can maintain infinite numbers of friends provided they
have digital management tools. While having hundreds of Friends on social network sites
is not uncommon, users are not actually keeping up with the lives of all of those people.
These digital Friends are not necessarily close friends and friendship management tools are
not good enough for building and maintaining close ties. While Facebook assumes that all
Friends are friends, participants have varied reasons for maintaining Friendship ties on the
site that have nothing to do with daily upkeep.
The News Feed does not distinguish between these: all Friends are treated equally and
updates come from all Friends, not just those whom an individual deems to be close friends.
To complicate matters, when data are there, people want to pay attention, even if it doesn’t
help them. People relish personal information because it is the currency of social hierarchy
and connectivity. Cognitive addiction to social information is great for Facebook because
News Feeds makes Facebook sticky.
Privacy is not simply about zero’s and ones, it is about how people experience their
relationship with others and with information. Privacy is a sense of control over
information, the context where sharing takes place, and the audience who can gain access.
Information is not private because no one knows it; it is private because the knowing is
limited and controlled. In most scenarios, the limitations are often more social than
structural. Secrets are considered the most private of social information because keeping
knowledge private is far more difficult than spreading it. There is an immense gray area
between secrets and information intended to be broadcast as publicly as possible. By and
large, people treated Facebook as being in that gray zone. Participants were not likely to
post secrets, but they often posted information that was only relevant in certain contexts.
The assumption was that if you were visiting someone’s page, you could access information
in context. When snippets and actions were broadcast to the News Feed, they were taken
out of context and made far more visible than seemed reasonable. In other words, with
News Feeds, Facebook obliterated the gray zone. Social convergence occurs when
disparate social contexts are collapsed into one. Even in public settings, people are
accustomed to maintaining discrete social contexts separated by space. How one behaves
is typically dependent on the norms in a given social context.
How one behaves in a pub differs from how one behaves in a family park, even
though both are ostensibly public. Social convergence requires people to handle disparate
audiences simultaneously without a social script. While social convergence allows
information to be spread more efficiently, this is not always what people desire. As with
other forms of convergence, control is lost with social convergence. Can we celebrate
people’s inability to control private information in the same breath as we celebrate
mainstream media’s inability to control what information is broadcast to us?
Privacy is not an inalienable right – it is a privilege that must be protected socially
and structurally in order to exist. The question remains as to whether or not privacy is
something that society wishes to support. When Facebook launched News Feeds, they
altered the architecture of information flow. This was their goal, and with a few tweaks they
were able to convince their users that the advantages of the News Feed outweighed security
through obscurity. Users quickly adjusted to the new architecture; they began taking actions
solely so that they would be broadcast to Friends’ News Feeds.
2.3 Facebook privacy settings: Who cares?
Facebook’s approach to privacy was initially network–centric. By default, students’
content was visible to all other students on the same campus, but no one else. Through a
series of redesigns, Facebook provided users with controls for determining what could be
shared with whom, initially allowing them to share with “No One”, “Friends”, “Friends–
of–Friends”, or a specific “Network”. When Facebook became a platform upon which other
companies could create applications, users’ content could then be shared with third–party
developers who used Facebook data as part of the “Apps” that they provided. The company
introduced privacy settings to allow users to determine which third parties could access
what content; users encountered a message to this effect whenever they chose to add an application.
Over time, Facebook introduced the ability to share content with “Everyone” (inside
Facebook or not). Increasingly, the controls got more complex and media reports suggest
that users found themselves uncertain about what they meant (Bilton, 2010). Recognizing
the validity of this point, Facebook eventually simplified its privacy settings page
(Zuckerberg, 2010).
At each point when Facebook introduced new options for sharing content, the
default was to share broadly. Following a series of redesigns in December 2009, Facebook
added a prompt when users logged on asking them to reconsider their privacy settings for
various types of content on the site, including “Posts I Create: Status Updates, Likes,
Photos, Videos, and Notes“ (see Figure 2). For each item, users were given two options:
“Everyone” or “Old Settings.” The toggle buttons defaulted to “Everyone.”
This message appeared when users logged into their account and it was impossible to go to
the rest of the site without addressing the prompt. Faced with this obligatory prompt, many
users may well have just clicked through, accepting the defaults that Facebook had chosen.
In doing so, these users made much of their content more accessible than was previously
the case. As part of these changes, everyone’s basic profile and Friends list became
available to the public, regardless of whether or not they had previously chosen to restrict
access; this decision was later reversed.
When Facebook was challenged by the Federal Trade Commission to defend its
decision about this approach, the company representative noted that 35 percent of users
who had never before edited their settings did so when prompted. Facebook used these data
to highlight that more people engaged with Facebook privacy settings than the industry
average of 5–10 percent. While 35 percent may be significantly more than the industry
average (and Facebook did not specify what percentage of users had never adjusted their
settings), a sizeable majority still likely accepted the site’s defaults each time
changes were implemented.
Figure 2: The message Facebook users saw in December 2009.
As privacy advocates and regulators investigated what these changes meant,
Facebook moved toward another series of modifications that depended on users’
willingness to share information.
In April 2010, at the f8 conference, Facebook announced Instant Personalization and
Social Plugins, two services that allowed partners to leverage the social graph — the
information about one’s relationships on the site that the user makes available to the system
— and provide a channel for sharing information between Facebook and third parties. For
example, Web sites could implement a Like button on their own pages that enables users
to share content from that site with the user’s connections on Facebook. Sites could also
implement an Activity Feed that publicizes what a person’s Friends do on that site. These
tools were built for developers and likely few users, journalists, or regulators had any sense
of what data from their accounts and actions on Facebook were being shared with whom
under what circumstances. As the discussions became more vociferous, Facebook was
increasingly pressured to respond.
On 26 May 2010, Zuckerberg announced that Facebook heard the concerns and
believed that the major issue on the table was Facebook’s confusing privacy settings page.
Facebook unveiled a new privacy settings page that, while simpler, also removed many of
the fine-grained controls that had allowed users to restrict specific content (see Figure 3). This
quelled much of the news coverage, but it was unlikely to end the controversy: in July 2010,
a Canadian law firm filed a class-action suit against Facebook over privacy issues.
Figure 3: Facebook’s “simplified” privacy settings, July 2010.
2.4 Protecting Location Privacy with Personalized k-Anonymity: Architecture and
Algorithms
There are two popular approaches to protect location privacy in the context of
Location Based Services (LBS) usage: policy-based and anonymity-based approaches.
In policy-based approaches, mobile clients specify their location privacy
preferences as policies and completely trust that the third-party LBS providers adhere to
these policies.
In the anonymity-based approaches, the LBS providers are assumed to be semi-honest
instead of completely trusted. The paper presents a k-anonymity-preserving management
of location information, developing efficient and scalable system-level facilities for
protecting location privacy by ensuring location k-anonymity. It is assumed that anonymous
location-based applications do not require user identities for providing service.
The concept of k-anonymity was originally introduced in the context of relational
data privacy. It addresses the question of “how a data holder can release its private data
with guarantees that the individual subjects of the data cannot be identified whereas the
data remain practically useful”. For instance, a medical institution may want to release a
table of medical records with the names of the individuals replaced with dummy identifiers.
However, some set of attributes can still lead to identity breaches. These attributes are
referred to as the quasi-identifier. For instance, the combination of birth date, zip code, and
gender attributes in the disclosed table can uniquely determine an individual. By joining
such a medical record table with some publicly available information source like a voters
list table, the medical information can be easily linked to individuals. K-anonymity
prevents such a privacy breach by ensuring that each individual record can only be released
if there are at least k - 1 distinct individuals whose associated records are indistinguishable
from the former in terms of their quasi-identifier values.
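The k-anonymity condition on a released table can be sketched in a few lines. The table, attribute names, and quasi-identifier below are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Check whether every combination of quasi-identifier values in the
    table is shared by at least k records, i.e. each individual is
    indistinguishable from at least k - 1 others on those attributes."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical released medical table (names already replaced).
table = [
    {"birth_year": 1980, "zip": "560034", "gender": "F", "diagnosis": "flu"},
    {"birth_year": 1980, "zip": "560034", "gender": "F", "diagnosis": "cold"},
    {"birth_year": 1975, "zip": "560001", "gender": "M", "diagnosis": "flu"},
]
qi = ("birth_year", "zip", "gender")
result = is_k_anonymous(table, qi, 2)  # False: the 1975/M record is unique
```

The third record is the only one with its quasi-identifier combination, so it could be re-identified by joining with a public source such as a voters list, which is exactly the linkage attack the text describes.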
In the context of LBSs and mobile clients, location k-anonymity refers to the k-
anonymity usage of location information. A subject is considered location k-anonymous if
and only if the location information sent from a mobile client to an LBS is indistinguishable
from the location information of at least k - 1 other mobile clients. The paper proposes that
location perturbation is an effective technique for supporting location k-anonymity and
dealing with location privacy breaches exemplified by the location inference attack
scenarios. If the location information sent by each mobile client is perturbed by replacing
the position of the mobile client with a coarser-grained spatial range such that there are
k - 1 other mobile clients within that range (k > 1), then the adversary will have uncertainty in
matching the mobile client to a known location-identity association or an external
observation of the location-identity binding. This uncertainty increases with the
value of k, providing a higher degree of privacy for mobile clients. Even though the LBS
provider knows that person A was reported to visit location L, it cannot match a message
to this user with certainty, because there are at least k different mobile clients within
the same location L. As a result, it cannot perform further tracking without ambiguity.
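A minimal sketch of this perturbation idea grows a square cloaking region around a client until it covers at least k - 1 other clients. The positions and parameters are hypothetical, and the actual paper uses a more sophisticated cloaking algorithm:

```python
def cloak(client_pos, other_positions, k, step=1.0, max_radius=100.0):
    """Replace a client's exact position with the smallest square region
    (grown in `step` increments) that also contains at least k - 1 other
    clients, so the LBS cannot single the client out within the region."""
    x, y = client_pos
    r = step
    while r <= max_radius:
        inside = sum(1 for (ox, oy) in other_positions
                     if abs(ox - x) <= r and abs(oy - y) <= r)
        if inside >= k - 1:
            return (x - r, y - r, x + r, y + r)
        r += step
    return None  # cannot satisfy k-anonymity within max_radius

others = [(2, 0), (0, 3), (8, 8)]
region = cloak((0, 0), others, k=3)  # grows until two others are inside
```

The LBS receives only the region, inside which at least k clients are indistinguishable; if too few clients are nearby, the request is suppressed rather than sent with a weaker guarantee.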
CHAPTER 3
PROBLEM STATEMENT
The amount of social media being uploaded into the web is growing rapidly and
there is still no end to this trend in sight. The ease-of-use of modern smartphones and the
proliferation of high-speed mobile networks is facilitating a culture of spontaneous and
carefree uploading of user-generated content. To give an idea of the scale of this
phenomenon: Just in the last two years the number of photos uploaded to Facebook per
month has risen from 2 billion to over 6 billion. From a personal perspective an
overwhelming majority of these photos have no privacy relevance for oneself. Finding the
few that are relevant is a daunting task.
While one’s own media is uploaded consciously, the flood of media uploaded by
others is so huge that it is almost impossible to stay aware of all media in which one might
be depicted. This can be classified as a Big Data problem on the user’s side, however not
on the provider side. Current social networks and photo-sharing sites mainly focus on the
privacy of users’ own media in terms of access control, but do little to deal with the privacy
implications created by other users’ media. There are ever more complex settings allowing
a user to decide who is allowed to see what content, but only for content owned by the user.
However, the issue of staying on top of what others are uploading (mostly in good faith)
that might also be relevant to the user is still very much outside that user’s control.
Social networks which allow tagging of users usually inform affected users when they are
tagged. However, if no such tagging is done by the uploader or a third party, there are
currently no mechanisms to inform users of relevant media.
A second important emerging trend is the capability of many modern devices to
embed geo-data and other metadata into the created content. While the privacy issues of
location-based services such as Foursquare or Qype have been discussed at great length,
the privacy issues of location information embedded into uploaded media have not yet
received much attention. There is one very significant difference between these two
categories. In the first, users reveal their current location to access online services such as
Google Maps, Yelp or Qype, or the user actually publishes his location on a social network
site like Foursquare, Google Latitude or Facebook Places. In this category, the user mainly
affects his own privacy. There is a large body of work examining privacy preserving
techniques to protect a user’s own privacy, ranging from solutions which are installed
locally on the user’s mobile device, to solutions which use online services relying on group-
based anonymisation algorithms, as for instance mix zones or k-anonymity.
The second category is created by media which contain location information. This
can have all the same privacy implications for the creator of the media; however, a critical
and hitherto often overlooked issue is the fact that the location and other metadata
contained in pictures and videos can also affect people other than the uploader himself.
This is a critical oversight and an issue which will gain importance as the mobile smart
device boom continues.
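To illustrate how precisely embedded metadata can locate a person, the sketch below decodes an EXIF-style GPS tag, where degrees, minutes, and seconds are stored as (numerator, denominator) rationals, into decimal coordinates. The sample values are hypothetical:

```python
def gps_to_decimal(dms, ref):
    """Convert an EXIF-style GPS coordinate, stored as three
    (numerator, denominator) rationals for degrees, minutes and
    seconds, into signed decimal degrees."""
    deg, minutes, seconds = (n / d for n, d in dms)
    value = deg + minutes / 60.0 + seconds / 3600.0
    # South and West hemispheres are negative by convention.
    return -value if ref in ("S", "W") else value

# Hypothetical geotag as it might appear in a photo's EXIF GPSInfo block.
lat = gps_to_decimal(((12, 1), (56, 1), (2400, 100)), "N")
lon = gps_to_decimal(((77, 1), (37, 1), (1200, 100)), "E")
```

Any bystander captured in such a photo is placed at these coordinates at the capture time recorded alongside, whether or not they ever used the uploading service.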
CHAPTER 4
SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
Privacy issues are categorized into two classes. Firstly, homegrown problems:
Someone uploads a piece of compromising media of himself with insufficient protection or
forethought which causes damage to his own privacy. A prime example of this category is
someone uploading compromising pictures of himself into a public album instead of a
private one or onto his Timeline instead of a message. The damage done in these cases is
very obvious since the link between the content and the user is direct and the audience
(often the peer circle) has direct interest in the content. One special facet of this problem is
that what is considered damaging content by the user can and often does change over time.
While this is a serious problem, especially amongst the Facebook generation, this issue is
a small data problem and thus is not the focus of this work.
Secondly, we have the Big Data problems created by others: an emerging threat to
users' online privacy comes from other users' media. What makes this threat particularly
severe is the fact that the person harmed is not involved in the uploading process and thus
cannot take any pre-emptive precautions, and the amount of data being uploaded is so vast
that it cannot be manually reviewed. Moreover, there are currently no countermeasures, apart
from after-the-fact legal ones, to prevent others from uploading potentially damaging content
about someone.
There are two requirements for this form of privacy threat to have an effect:
Firstly, to cause harm to a person, a piece of media must be linkable to that person
in some way. This link can either be non-technical, such as
being recognizable in a photo, or technical, such as a profile being (hyper-)linked to a photo.
There is also the grey area of textual references to a person near the photo or embedded
in the metadata of the photo. This metadata does not directly create a technical link to a
profile, but it opens the possibility for search engines to index the information and make it
searchable, thus creating a technical link.
Secondly, a piece of media in question must contain harmful content for the person
linked to it. This can again be non-technical such as being depicted in a compromising way.
However, more interestingly it can also be technical. In these cases metadata or associated
data causes harm. For instance time and location data can indicate that a person has been
at an embarrassing location, took part in a political event, or was not where he said he was.
Since the uploading of this type of damaging media cannot be effectively prevented,
awareness is the key issue in combating this emerging privacy problem.
4.2 PROPOSED SYSTEM
The privacy awareness concept consists of a watchdog client and a server-side watchdog
service. Using a GPS-enabled mobile device a user can activate the watchdog client to
locally track his position during times he considers relevant to his privacy. Then his device
can request the privacy watchdog to show him media that could potentially affect him
whenever the user is interested in the state of his online privacy. For this the watchdog
client sends the location traces to the watchdog server or servers which then respond with
a list of media which was taken in the vicinity (both spatially and temporally) of the user.
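The server-side matching step described above can be sketched roughly as follows. The field names, radius, and time window are illustrative assumptions, not part of any published API: a media item counts as relevant if it was taken within a chosen distance and time window of any point of the user's location trace.

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_media(trace, media, radius_km=0.5, window=timedelta(hours=1)):
    """trace: list of (lat, lon, datetime) points from the watchdog client.
    media: list of dicts with 'lat', 'lon', 'taken' keys.
    Returns media taken in the vicinity of the trace, spatially AND temporally."""
    hits = []
    for m in media:
        for lat, lon, t in trace:
            if (abs(m["taken"] - t) <= window
                    and haversine_km(m["lat"], m["lon"], lat, lon) <= radius_km):
                hits.append(m)
                break  # one matching trace point is enough
    return hits
```

The linear scan is only for clarity; a real watchdog service facing Big Data volumes would answer such queries from a spatio-temporal index.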
The watchdog service can be operated in three different ways. The first two would
be value-added services which can be offered by the media sharing sites or social networks
(SN) themselves. In both these cases the existing services would need to be extended by an
API to allow users to search for media by time and location. The first type of service would
do this via the regular user account. Thus, it would be able to see all public pictures, but also
all pictures restricted in scope yet visible to the user.
The benefit of this service type is that, through integration with the user's account,
pictures which aren't publicly available can also be searched. These private pictures are
typically from social network friends and thus the likelihood of pictures involving the user
is higher and the scope of people able to view the pictures more relevant. This type of
service also has the benefit-of-sorts that the location information is valuable to the SN, so
it has an incentive to offer this kind of value-added service. The privacy downside of
searching with the user’s account is that the SN receives the information of when and where
a user was. While there are certainly users who would not mind this information being sent
to their SN if it means they get to see and remove embarrassing photos in a timely manner,
there are also certainly users who do not wish their SN to know when and where they were,
particularly amongst the clientele who wish to protect their online privacy.
The second type of watchdog service would also be operated by the SN. However,
it does not require a user account to do the search and can be queried anonymously. This
type of service would thus have a smaller scope, since it can only access publicly available
media. A further drawback of this type of service is that there is less of an incentive for the
SN provider to implement such a service. While there are sites such as Locr that allow such
queries, most SN sites do not. Without outside pressure there is less intrinsic value for them
to include such a service compared to the first type.
The third type of service would be a stand-alone service which can be operated by a
third party. The stand-alone service operates like an indexing search machine, which crawls
publicly available media and its metadata and allows this database to be queried. Possible
incentive models for this approach include pay-per-use, subscription, ad-based or
community services. The visibility scope would be the same as for the second type of
service.
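The indexing idea behind the stand-alone service can be sketched as follows, with hypothetical names and bucket sizes: crawled media are grouped into coarse (location, hour) buckets so that a time-and-location query only needs to touch a small part of the database. A real service would also probe neighbouring cells and hours.

```python
from collections import defaultdict
from datetime import datetime

def bucket(lat, lon, t, cell=0.01):
    """Coarse (location, hour) bucket key; 0.01 deg is roughly 1 km of latitude."""
    return (round(lat / cell), round(lon / cell),
            t.replace(minute=0, second=0, microsecond=0))

def build_index(media):
    """media: iterable of dicts with 'lat', 'lon', 'taken' keys,
    as harvested by a crawler from publicly available metadata."""
    index = defaultdict(list)
    for m in media:
        index[bucket(m["lat"], m["lon"], m["taken"])].append(m)
    return index

def lookup(index, lat, lon, t):
    """Return media crawled into the same coarse cell and hour as the query."""
    return index[bucket(lat, lon, t)]
```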
All three types of service are mainly focused on detecting relevant media events
and breaking the Big Data problem down to humanly manageable sizes. The concept aims
at bringing possibly relevant media to the attention of the user without overburdening him.
The system does not explicitly protect against malicious uploads in which the uploader
intentionally tries to harm another person while concealing the activity at the same time,
even though the watchdog service proposed here could make such subterfuge harder for the
malicious uploader. But even without full protection from malicious activity
we believe that such a watchdog would improve the current state of the art by enabling
users to gain better awareness of the relevant part of the social media Big Dataset.
CHAPTER 5
THREAT ANALYSIS
5.1 Awareness of Damaging Media in Big Datasets
Most popular social networks and media sharing sites allow users to tag objects and
people in their uploaded media. Additionally, some services also extract embedded
metadata and use this information for indexing and linking. Media is annotated with names,
comments, or is directly linked to users’ profiles. In particular the direct linking of profiles
to pictures was initially met with an outcry of privacy concerns since it greatly facilitated
finding information about people. For this reason, social networks quickly introduced the
option to prevent people from linking them in media. However, there is also a positive side
to this form of direct linking, since the linked person is usually made aware of the linked
media and can then take action to have unwanted content removed or to restrict the visibility
of the link.
While the privacy mechanisms of current services are still limited, hidden and often
confusing, once the link is made the affected people can take action. A more critical case
is the non-linked tagging of photos. In this case a free text tag contains identifying
information and/or malicious comments. However, there is no automated mechanism to
inform a user that he was named in or near a piece of media. The named person might not
even be a member of the service where the media was uploaded. The threat of this kind of
linking is significantly different from the one described above. While the immediate damage
can be smaller, since no automated notification is sent to friends of the user, the threat can
remain hidden far longer: the person can remain unaware of this media while others can
stumble upon it or be informed by peers.
The final case of damaging media does not contain any technical link. Without any
link to the person in question this kind of media can only cause harm to the person if
someone having some contact to that person stumbles across it and makes the connection.
While the likelihood of causing noticeable harm is smaller, it is still possible. The viral
spreading of media has caused serious embarrassment and harm in real world cases. The
critical issue here is that there is currently no way for a person to pro-actively search for
this kind of media in the Big Data deluge to mitigate this threat.
5.2 Analysis of Service Privacy
The different types of watchdog services discussed in the proposed system can help
to reduce the number of relevant pieces of media a user needs to keep an eye on if users
don’t want uncontrolled media of themselves to be online. However, the devil is in the
detail since this form of service can also have serious privacy implications itself if designed
in the wrong way. Care must be taken to facilitate the different usage and privacy
requirements of different users. The critical issue is the fact that to request the relevant
media a user must send location information to the watchdog service.
When using the first type of service, there is little which can be done to protect the
location privacy of the user since the correlation between the location query and the user
account is direct. One option for protecting privacy to a degree is an obfuscation
approach. For every true query a number of fake queries could also be sent, making it
harder (but far from impossible) for the SN provider to ascertain the true location. However,
this approach does not scale well, for two reasons. Firstly, it creates a higher load on the
SN. More critically, the likelihood of deducing the true location rises as more
queries are sent, unless great care is taken in creating the fake paths and masking the source
IP addresses.
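The fake-query idea can be illustrated with a naive sketch (function and parameter names are hypothetical). Note that, as argued above, independently scattered decoys are exactly the kind of fakes a provider could learn to filter out; plausible fake paths would be needed in practice:

```python
import random

def obfuscated_queries(true_lat, true_lon, n_fakes=9, spread=0.05, rng=None):
    """Return the true query hidden among n_fakes decoys, shuffled so the
    service cannot tell which of the n_fakes + 1 locations is real.
    Deliberately naive: decoys are drawn independently around the true point."""
    rng = rng or random.Random()
    queries = [(true_lat + rng.uniform(-spread, spread),
                true_lon + rng.uniform(-spread, spread))
               for _ in range(n_fakes)]
    queries.append((true_lat, true_lon))
    rng.shuffle(queries)
    return queries
```

The scaling problem mentioned above is visible here: each real request multiplies the load on the SN by n_fakes + 1.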
Protecting the user's privacy in the second case is simpler. Since the queries do not
require an account, the only way a user can be tracked directly is via his IP address. Using an
anonymizing service such as Tor, sequential queries cannot be linked together and creating
a tracking profile becomes significantly harder. The anonymous trace data can of course
still be used by the SN, but the missing link to the user makes it less critical for the user
himself. The third type of service is probably the most interesting privacy-wise, since the
economic model behind the service will significantly impact the privacy techniques which
can be applied to this model. Most payment models would require user credentials to log
in and thus would allow the watchdog service provider to track the user. In this case it
would have to be the reputation of the service provider which the user would have to trust,
similar to the case of commercial anonymizing proxies. In an ad-based approach no user
credentials are needed, so it would be possible to use the service anonymously via Tor.
If so, the watchdog client would need to be open source to ensure that no secret tracking
information is stored there. In community-based approaches a privacy model like that in
Tor can be used to ensure that none of the participating nodes gets enough information to
track a single user. Each of these proposed service types has different privacy benefits and
disadvantages.
The following section overviews our privacy analysis of different media hosting
sites. It includes media access control as well as metadata handling.
Flickr provides the most fine-grained privacy/access control settings of all analyzed
services. Privacy settings can be defined on metadata as well as the image itself. One
particularly interesting feature of Flickr is the geo-fence. The geo-fence feature enables
users to define privacy regions on a map by putting a pin on it and setting a radius. Access
to GPS data of the user’s photos inside these regions is only allowed for a restricted set of
users (friends, family, and contacts). Flickr allows its users to tag and add people to images.
If a user revokes a person tag of himself in an image, no one can add the person to that
image again.
Facebook extracts the title and description of an image from metadata during the
upload process of some clients. All photos are resized after upload and metadata is stripped
off. Facebook uses face recognition for friend tagging suggestions based on already tagged
friends. Access to images is restricted by Facebook’s ever changing, complex and
sometimes abstruse privacy settings.
Picasa Web & Google+ store the complete EXIF metadata of all images. It is
accessible by everyone who can access the image. Access to images is regulated on a
per-album basis. It can be set to public, restricted to people who know the secret URL to
the album, or to the owner only. A noteworthy feature is that geo-location data can be
protected separately. Google+ and Picasa Web allow the tagging of people in images.
Locr is a geo-tagging focused photo-sharing site. As such, location information is
included in most images. By default all metadata is retained in all images. Access control
is set on a per image basis. Anybody who can see an image can also see the metadata. There
are also extensive location-based search options. Geo-data is extracted from uploaded files
or set by people on the Locr website. Locr uses reverse geocoding to add textual location
information to images in its database.
Instagram and PicPlz are services/mobile apps that allow posting images in a
Twitter like way. Resized images stripped of metadata but with optional location data are
stored by the services. Additionally they allow the posting of photos to different services
like Flickr, Facebook, Dropbox, Foursquare and more. Depending on the service used
metadata is stored or discarded. For instance, when uploading a photo to Flickr metadata is
stripped from the actual file, but title, description as well as geo location are extracted from
the image and can be set by the user. In contrast, the Hipstamatic mobile app preserves
in-file metadata when uploading images to Flickr.
Fig.: Logos of different social media apps.
CHAPTER 6
SURVEY OF METADATA IN SOCIAL MEDIA
To underpin the growing prevalence of privacy-relevant metadata and location data
in particular, and to judge potential dangers and benefits based on real-world data, we
analyzed a set of 20,000 publicly available Flickr images and their metadata.
Flickr was chosen as the premier photo-sharing website because it can be legally
crawled, offers the full extent of privacy mechanisms and does not remove metadata in
general. One photo each was crawled from 20,000 random Flickr users. Of these, 68.8% were Pro
users where the original file could be accessed as well. For the others only the metadata
available via the Flickr API was accessed. This includes data automatically extracted from
EXIF data during upload and data manually added via website or semi-automatically set
by client applications. 23% of the 20k users denied access to their extracted EXIF data in
the Flickr database.
A set of 3,000 images taken with camera phones from 3,000 random mobile
Flickr users was also collected. 46.8% of the mobile users were Pro users and only 2% denied
access to EXIF data in the Flickr database. GPS location data was present in 19% of the 20k
dataset and in 34% of the 3k mobile phone dataset. While Flickr hosts many semi-professional DSLR
photos, mobile phones are becoming the dominant photo generation tool with the iPhone 4
currently being the most common camera on Flickr. Textual location information like street
or city names are currently not used much on Flickr.
However, as reverse geocoding becomes more common in client applications this
will change (cf. Locr in Figure 5). To evaluate the potential privacy impact, we manually
checked which photos contained people and a geo-reference, but no user profile tags – i.e.
images which could contain people who are unaware of the photo. In the set of 20k images
we found 16% and in the set of 3k mobile photos we found 28% meeting these criteria.
Fig. 4. Public privacy-related metadata in 5.7k random and 1k mobile user original Flickr
photos
Further, the subset of images available from Pro users was analyzed,
since these can contain the unaltered metadata from the camera. From the 20k dataset, 5761
images contained in-file metadata. From the 3k dataset, 1050 images contained in-file
metadata.
Figure 4 shows the percentage values for the different types of metadata contained
in the files. For the rest of the images, the metadata was either manually removed by the
uploader or the image never had any in the first place. In the 20k dataset only 3% of
the in-file metadata contained GPS data, compared to 32% in the mobile 3k dataset. This
shows a clear dominance of mobile devices when it comes to publishing GPS metadata.
This is unsurprising since most compact and DSLR cameras currently do not have
GPS receivers and only few photographers add external GPS devices to these cameras –
but this will likely change with future cameras. However, combined with the fact that
mobile phones are becoming the dominant type of camera when it comes to the number of
published pictures, it is to be expected that the amount of GPS data available for scrutiny,
use and abuse will rise further.
Fig. 5. Public privacy-related metadata of 5k random photos from Locr service
A 5k dataset of random photos from Locr was collected, and the metadata in the
images as well as the images' HTML pages built from the Locr database was analyzed.
Figure 5 shows the results from this dataset. Particularly interesting is the high rate of
non-GPS based location information.
This is a trend to watch since most location stripping mechanisms only strip GPS
information and leave other text-based tags intact. Furthermore, the amount of camera ID
meta-data is notable since these IDs can be used to link different pictures and infer meta-
data even if data has been stripped from some of the photos.
To summarize, one third of the pictures taken by the dominant camera devices contain
GPS information. About one third of these images depict people. Thus, about 10% of
all photos could harm other people's privacy without them knowing about it.
CHAPTER 7
CONCLUSION AND FUTURE WORK
This report discussed the threat to an individual's privacy that is created by other
people's social media, and gave a brief overview of the privacy capabilities of common
social media services with regard to protecting users from other people's activities.
Based on this survey, the Big Data privacy implications and potential of the emerging
trend of geo-tagged social media were analyzed. Three concepts for how location
information can actually help users to stay in control of the flood of potentially harmful
or interesting social media uploaded by others were then presented.
Applications can be built or modified to use pseudonyms rather than true user
identities, a route toward greater location privacy. Because different applications can
collude and share information about user sightings, users should adopt different
pseudonyms for different applications.
Furthermore, to resist more sophisticated attacks, users should change
pseudonyms frequently, even while being tracked. Drawing on methods developed for
anonymous communication, the conceptual tools of mix zones and anonymity sets can be
used to analyze location privacy.
Although anonymity sets provide a first quantitative measure of location privacy,
the measure is only an upper-bound estimate. A better, more accurate metric, developed
using an information-theoretic approach, uses entropy, taking into account the a priori
knowledge that an observer can derive from historical data. The entropy measurement
shows a more pessimistic picture, in which the user has less privacy, because it models
a more powerful adversary who might use historical data to de-anonymize pseudonyms
more accurately.
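The difference between the two measures can be made concrete with a small numeric sketch (helper names are hypothetical). The observer's knowledge is modelled as a probability distribution over candidate users: the anonymity-set size ignores the skew of that distribution, while the entropy captures it.

```python
import math

def anonymity_set_size(probs):
    """Upper-bound measure: the number of candidate users the observer
    considers possible at all."""
    return sum(1 for p in probs if p > 0)

def entropy_bits(probs):
    """Shannon entropy of the observer's distribution over candidate users;
    lower entropy means less location privacy."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform ignorance over 4 users: set size 4, entropy 2 bits.
uniform = [0.25, 0.25, 0.25, 0.25]
# Historical data skews the distribution: same set size, far less privacy.
skewed = [0.97, 0.01, 0.01, 0.01]
```

Both distributions give an anonymity set of 4, but the skewed one has well under 1 bit of entropy, matching the more pessimistic picture described above.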
Two services worth mentioning which collect the type of information needed for a
privacy watchdog are Social Camera and Locaccino. Social Camera is a mobile app that
detects faces in a picture and tries to recognize them with the help of the Facebook profile
pictures of persons in the user's friends list. Recognized people can be automatically
tagged and pictures instantly uploaded to Facebook. Locaccino is a Foursquare-type
application which allows users to upload location-based information into Facebook.
These two apps show the willingness of users to share this kind of information in the social
web.
Ahern et al. analyze in their work the privacy decisions of mobile users in the photo
sharing process. They identify relationships between the location of the photo
capture and the corresponding privacy settings, and recommend the use of context
information to help users set privacy preferences and to increase users' awareness of
information aggregation.
Work by Fang and LeFevre focuses on helping the user to find appropriate
privacy settings in social networks. Fang and LeFevre present a system where the user
initially only needs to set up a few rules. Through the use of active machine learning
algorithms the system helps the user to protect private information based on
individual behavior and taste.
Mannan et al. address the problem that private user data is not only shared within
social networks, but also through personal web pages. In their work they focus on
privacy-enabled web content sharing and utilize existing instant messaging friendship
relations to create and enforce access policies.
The three works shown above all focus on protecting a user's privacy against
dangers created by the user himself while sharing media. They do not discuss how users
can be protected from other people's media; the same holds for most of the research work
done in this area.
Besmer et al. present work which allows users that are tagged in photos to send a
request to the owner to hide the linked photo from certain people. This approach also
follows the idea that forewarned is forearmed and that creating awareness of critical
content is the first step towards a solution of the problem. However, the work relies on
direct technical tags and as such does not cover the same scope as the privacy watchdog
suggested in the paper.
Work that also takes into account other users' media is presented by Squicciarini et
al., who postulate that most of the shared data does not belong to only a single user.
Therefore Squicciarini et al. propose a system to share the ownership of media items and
thereby strive to establish collaborative privacy management for shared content.
Squicciarini et al.'s prototype is implemented as a Facebook app and is based on game
theory, rewarding users that promote co-ownership of media items. While the work does
take into account other users' media, unlike the approach presented here it does not cope
with previously unknown and unrelated users.
REFERENCES
[1] S. Ahern, D. Eckles, N. Good, S. King, M. Naaman, and R. Nair. Overexposed?:
privacy patterns and considerations in online and mobile photo sharing. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pages 357–366, 2007.
[2] C. A. Ardagna, M. Cremonini, S. De Capitani di Vimercati, and P. Samarati. An
Obfuscation-Based Approach for Protecting Location Privacy. IEEE Transactions on
Dependable and Secure Computing, 8(1):13–27, June 2009.
[3] A. Beresford and F. Stajano. Location privacy in pervasive computing. IEEE Pervasive
Computing, 2(1):46–55, Jan. 2003.
[4] A. Besmer and H. Richter Lipford. Moving beyond untagging. In Proceedings of the
28th International Conference on Human Factors in Computing Systems - CHI '10, page
1563, Apr. 2010.
[5] D. Boyd. Facebook's privacy trainwreck: Exposure, invasion, and social convergence.
Convergence: The International Journal of Research into New Media Technologies,
14(1):13–20, 2008.
[6] d. Boyd and K. Crawford. Six Provocations for Big Data. SSRN eLibrary, 2011.
[7] D. Boyd and E. Hargittai. Facebook privacy settings: Who cares? First Monday, 15(8),
2010.
[8] Carnegie Mellon's Mobile Commerce Laboratory. Locaccino - a user-controllable
location-sharing tool. http://locaccino.org/, 2011.
[9] E. Eldon. New Facebook Statistics Show Big Increase in Content Sharing, Local
Business Pages. http://goo.gl/ebGQH, February 2010.
[10] Facebook. Statistics. http://www.facebook.com/press/info.php?statistics, 2011.
[11] L. Fang and K. LeFevre. Privacy wizards for social networking sites. In Proceedings
of the 19th International Conference on World Wide Web - WWW '10, page 351. ACM
Press, Apr. 2010.
[12] Flickr. Camera Finder. http://www.flickr.com/cameras, October 2011.
[13] B. Gedik and L. Liu. Protecting Location Privacy with Personalized k-Anonymity:
Architecture and Algorithms. IEEE Transactions on Mobile Computing, 7(1):1–18, Jan.
2008.
[14] M. Mannan and P. C. van Oorschot. Privacy-enhanced sharing of personal content on
the web. In Proceedings of the 17th International Conference on World Wide Web - WWW
'08, page 487. ACM Press, April 2008.
[15] Metadata Working Group. Gartner special report – pattern-based strategy: Getting
value from big data. www.gartner.com/patternbasedstrategy, 2012.
[16] A. C. Squicciarini, M. Shehab, and F. Paci. Collective privacy management in social
networks. In Proceedings of the 18th International Conference on World Wide Web - WWW
'09, page 521. ACM Press, Apr. 2009.
[17] Viewdle. SocialCamera. http://www.viewdle.com/products/mobile/index.html.