VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Belagavi, Karnataka
A Seminar Report on
“BIG DATA PRIVACY ISSUES IN PUBLIC SOCIAL MEDIA”
Submitted in partial fulfillment of the requirements for the award of the Degree of
Master of Technology
In
Computer Science and Engineering
Submitted
by
SUPRIYA R
1VI14SCS11
Guided By
Mrs. MARY VIDYA JOHN
Assistant Professor, Dept. of CSE
Vemana IT, Bengaluru – 34.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
VEMANA INSTITUTE OF TECHNOLOGY
Koramangala, Bengaluru – 560 034.
2014-2015
Karnataka Reddyjana Sangha®
VEMANA INSTITUTE OF TECHNOLOGY Koramangala, Bengaluru – 34
(Affiliated to Visvesvaraya Technological University, Belagavi)
Department of Computer Science & Engineering
Certificate
This is to certify that SUPRIYA R, bearing U.S.N 1VI14SCS11, a student of I semester,
Computer Science & Engineering has completed the seminar entitled “BIG DATA
PRIVACY ISSUES IN PUBLIC SOCIAL MEDIA” in partial fulfillment of the
requirements for the award of the degree of Master of Technology in Computer Science and Engineering
of the Visvesvaraya Technological University, Belagavi during the academic year 2014-2015. The
seminar report has been approved as it satisfies the academic requirements in respect of seminar
work prescribed for the said degree.
________________ ______________
Signature of the Guide Signature of HOD
MRS. MARY VIDYA JOHN DR.NARASIMHA MURTHY K.N
Name of the Examiner Signature with date
1.____________________ ________________
2.____________________ ________________
ACKNOWLEDGEMENT
I am indeed very happy to greatly acknowledge the numerous personalities involved in
lending their help to make my seminar “Big Data Privacy Issues in Public Social Media” a
successful one.
Salutations to my beloved and esteemed institute “Vemana Institute of Technology” for
having well-qualified staff and labs furnished with the necessary equipment.
I express my gratitude to DR. P. GIRIDHARA REDDY, Principal, Vemana IT, for
providing the best facilities; without his constructive criticism and suggestions I would not
have been able to complete this work.
I express my sincere gratitude to DR.NARASIMHA MURTHY K.N, Professor and
H.O.D (PG), Dept of Computer Science and Engineering, a great motivator and think-tank. I am
very much grateful to him for providing an opportunity to present my seminar.
I also express my gratitude to my internal guide Mrs. MARY VIDYA JOHN, Assistant
Professor, for her encouragement, spirit, and useful guidance in preparing the report and
presenting the seminar.
I also thank my parents and friends, whose moral support and constant guidance made
my efforts fruitful.
Date: SUPRIYA R
Place: Bangalore 1VI14SCS11
ABSTRACT
Big Data is a new label given to a diverse field of data-intensive information in
which the datasets are so large that they are difficult to work with effectively. The term has
been mainly used in two contexts, firstly as a technological challenge when dealing with
data-intensive domains such as high energy physics, astronomy or internet search, and
secondly as a sociological problem when data about us is collected and mined by companies
such as Facebook and Google, by mobile phone operators, and by governments.
The focus is on the second issue from a new perspective, namely how the
user can gain awareness of the personally relevant part of the Big Data that is publicly available
in the social web. The amount of user-generated media uploaded to the web is expanding
rapidly and it is beyond the capabilities of any human to sift through it all to see which
media impacts our privacy. Based on an analysis of social media in Flickr, Locr, Facebook
and Google+, privacy implications and potential of the emerging trend of geo-tagged social
media are discussed. Finally, a concept is presented with which users can stay informed
about which parts of this social Big Data are relevant to them.
CONTENTS
CHAPTER NO. TITLE PAGE NO
ABSTRACT iii
CONTENTS iv
LIST OF FIGURES v
1
1 INTRODUCTION
1.1 Overview 1
2 LITERATURE SURVEY 3
2.1 Location privacy in pervasive computing 3
2.2 Facebook’s trainwreck: exposure and invasion 6
2.3 Facebook’s privacy settings: Who cares? 8
2.4 Protecting location privacy with K-Anonymity 12
3
PROBLEM STATEMENT 14
4 SYSTEM ANALYSIS 16
4.1 Existing system 16
4.2 Proposed system 17
5 THREAT ANALYSIS 19
5.1 Awareness of damaging media in big datasets 19
5.2 Analysis of Service Privacy 20
6 SURVEY OF METADATA IN SOCIAL MEDIA 24
7 CONCLUSION AND FUTURE WORK 27
REFERENCES 30
LIST OF FIGURES
FIGURE NO.
TITLE
PAGE. NO
Figure 1 Mixed zone arrangement 05
Figure 2 Facebook message in Dec 2009 10
Figure 3 Facebook’s simplified setting 11
Figure 4 Public Privacy metadata from Flickr 22
Figure 5 Public Privacy metadata from Locr 23
Big Data Privacy Issues in Public Social Media
M.Tech, 1st sem, Dept.of CSE, VIT 2014-15 1
CHAPTER 1
INTRODUCTION
In this electronic age, an increasing number of organizations face an explosion of
data, and the size of the databases used in today’s enterprises has been growing at an
exponential rate. Data is generated through many sources such as business processes,
transactions, social networking sites, web servers, etc., and exists in structured as well as
unstructured form. Today’s business applications have enterprise features: they are large-
scale, data-intensive, web-oriented, and accessed from diverse devices, including mobile
devices. Processing and analyzing this huge amount of data to extract meaningful
information is a challenging task.
1.1 Overview
Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process them using traditional data processing
applications. It is becoming a hot topic in many areas where datasets are so large that they
can no longer be handled effectively or even completely. Or put differently, any task that
is comparatively easy to execute when operating on a small set of data, but becomes
unmanageable when dealing with the same problem with a large dataset can be classified
as a Big Data problem. Data sets grow in size in part because they are increasingly being
gathered by ubiquitous information-sensing mobile devices, remote sensing, software logs,
cameras, microphones and wireless sensor networks.
Typical problems that are encountered when dealing with Big Data include capture,
storage, dissemination, search, analytics and visualization. The traditional data-intensive
sciences such as astronomy, high energy physics, meteorology, and genomics, biological
and environmental research in which peta- and exabytes of data are generated are common
domain examples. Here even the capture and storage of the data is a challenge.
But there are also new domains emerging on the Big Data paradigm such as data
warehousing, Internet and social web search, finance and business informatics. Here
datasets can be small compared to the previous domains, however the complexity of the
data can still lead to the classification as a Big Data problem.
The traditional Big Data applications such as astronomy and other e-sciences usually
operate on non-personal information and as such usually do not have significant privacy
issues. The privacy critical Big Data applications lie in the new domains of the social web,
consumer and business analytics and governmental surveillance. In these domains Big Data
research is being used to create and analyze profiles of users, for example for market
research, targeted advertisement, workflow improvement or national security.
These are very contentious issues, since it is entirely up to the controller of the Big Data
sets whether or not the information obtained from various sources is used for criminal purposes.
In particular in the context of the social web there is an increasing awareness of the value,
potential and risk of the personal data which users voluntarily upload to the web.
However, the Big Data issue in this area has focused almost entirely on what the
controlling companies do with this information. These concerns are being addressed by
calls for regulatory intervention, i.e. regulating what companies are allowed to do with the
data users give them or what data they are allowed to gather about users.
This seminar examines the social media Big Data issue. The growing proliferation and
capabilities of mobile devices are creating a deluge of social media that affects our privacy.
Due to the vast amounts of data being uploaded every day, it is next to impossible
to be aware of everything that affects us. A concept is then discussed which can be used to
regain control over some of the Big Data deluge created by other social web users.
CHAPTER 2
LITERATURE SURVEY
2.1 Location Privacy in Pervasive Computing
Mix zones
Most theoretical models of anonymity and pseudonymity originate from the work
of David Chaum, whose pioneering contributions include two constructions for anonymous
communications (the mix network and the dining cryptographers algorithm) and the
notion of anonymity sets. This work introduces a new entity for location systems, the mix
zone, which is analogous to a mix node in communication systems. Using the mix zone to
model the spatiotemporal problem, useful techniques from the anonymous communications
field can be adopted.
Mix networks and mix nodes
A mix network is a store-and-forward network that offers anonymous
communication facilities. The network contains normal message-routing nodes alongside
special mix nodes. Even hostile observers who can monitor all the links in the network
cannot trace a message from its source to its destination without the collusion of the mix
nodes. In its simplest form, a mix node collects n equal-length packets as input and reorders
them by some metric (for example, lexicographically or randomly) before forwarding them,
thus providing unlinkability between incoming and outgoing messages.
For brevity, some essential details about the layered encryption of packets are
omitted; interested readers should consult Chaum’s original work.
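The batch-and-reorder behaviour of a mix node can be sketched in a few lines of Python. The packet contents are hypothetical, and the layered encryption is omitted here, as in the text:

```python
import secrets

def mix_node(batch):
    """Toy mix node: accepts a batch of equal-length packets and outputs
    them in a randomized order, breaking the link between arrival order
    and departure order (layered decryption omitted)."""
    if len({len(p) for p in batch}) > 1:
        raise ValueError("packets must be of equal length")
    shuffled = list(batch)
    # Fisher-Yates shuffle driven by a cryptographic RNG
    for i in range(len(shuffled) - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

packets = [b"msg-a", b"msg-b", b"msg-c", b"msg-d"]
out = mix_node(packets)  # same packets, unlinkable order
```

An observer who sees the batch enter and leave cannot tell which output position corresponds to which input, which is exactly the unlinkability property the text describes.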
The number of distinct senders in the batch provides a measure of the unlinkability
between the messages coming in and going out of the mix. Chaum later abstracted this last
observation into the concept of an anonymity set. As paraphrased by Andreas Pfitzmann
and Marit Köhntopp, whose recent survey generalizes and formalizes the terminology on
anonymity and related topics, the anonymity set is “the set of all possible subjects who
might cause an action.”
In anonymous communications, the action is usually that of sending or receiving a
message. The larger the anonymity set’s size, the greater the anonymity offered.
Conversely, when the anonymity set reduces to a singleton—the anonymity set cannot
become empty for an action that was actually performed—the subject is completely
exposed and loses all anonymity.
Mix zones
A mix zone for a group of users is defined as a connected spatial region of maximum size in
which none of these users has registered any application callback; for
a given group of users there might be several distinct mix zones. In contrast, an application
zone is an area where a user has registered for a callback. The middleware system can
define the mix zones and calculate them separately for each group of users as the spatial
areas currently not in any application zone. Because applications do not receive any
location information when users are in a mix zone, the identities are “mixed.”
Assuming users change to a new, unused pseudonym whenever they enter a mix
zone, applications that see a user emerging from the mix zone cannot distinguish that user
from any other who was in the mix zone at the same time and cannot link people going into
the mix zone with those coming out of it. If a mix zone has a diameter much larger than the
distance the user can cover during one location update period, it might not mix users
adequately.
Figure 1. A sample mix zone arrangement with three application zones. The airline agency
(A) is much closer to the bank (B) than the coffee shop (C). Users leaving A and C at the
same time might be distinguishable on arrival at B.
Figure 1 provides a plan view of a single mix zone with three application zones around the
edge: an airline agency (A), a bank (B), and a coffee shop (C). Zone A is much closer to B
than C, so if two users leave A and C at the same time and a user reaches B at the next
update period, an observer will know the user emerging from the mix zone at B is not the
one who entered the mix zone at C. Furthermore, if nobody else was in the mix zone at the
time, the user can only be the one from A. If the maximum size of the mix zone exceeds
the distance a user covers in one period, mixing will be incomplete. The degree to which
the mix zone anonymizes users is therefore smaller than one might believe from looking at
the anonymity set size.
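The timing attack illustrated by Figure 1 can be sketched as follows. The zones, pseudonyms, times, and cross-zone travel times are all hypothetical:

```python
# Entry/exit events observed at the mix-zone border: (pseudonym, time, zone).
entries = [("p1", 0.0, "A"), ("p2", 0.0, "C")]
exits = [("q1", 10.0, "B")]

# Hypothetical minimum travel times across the mix zone.
travel_time = {("A", "B"): 10.0, ("C", "B"): 40.0}

def feasible_predecessors(exit_event, entries, travel_time, tolerance=5.0):
    """Return the entry pseudonyms that could plausibly correspond to an
    exit event, given minimum travel times across the zone. The exit's
    effective anonymity set is this list, not all users in the zone."""
    _, t_out, z_out = exit_event
    result = []
    for name, t_in, z_in in entries:
        needed = travel_time[(z_in, z_out)]
        if t_out - t_in >= needed - tolerance:
            result.append(name)
    return result

# q1 appears at B only 10 time units after the entries: too soon to have
# come from C, so an observer links q1 to the user who entered at A.
candidates = feasible_predecessors(exits[0], entries, travel_time)
```

Even though two users were inside the zone, the feasible set for the exit at B shrinks to one, which is the reduced effective anonymity the text warns about.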
2.2 Facebook’s Privacy Trainwreck: Exposure, Invasion, and Social Convergence
The tech world has a tendency to view the concept of ‘private’ as a single bit that is
either 0 or 1. Data are either exposed or not. When companies make a decision to make
data visible in a more ‘efficient’ manner, it is often startling, prompting users to speak of a
disruption of ‘privacy’.
Facebook did not make anything public that was not already public, but the News Feed
disrupted the social dynamics. The reason for this is that privacy is not simply about the
state of an inanimate object or set of bytes; it is about the sense of vulnerability that an
individual experiences when negotiating data. On Facebook, people were wandering
around accepting others as Friends, commenting on others’ pages, checking out what others
posted, and otherwise participating in the networked environment. Now, imagine that
everyone involved notices all this because it is displayed when they login.
By aggregating all this information and projecting it to everyone’s News Feed,
Facebook made what was previously obscure difficult to miss (and even harder to forget).
Those data were all there before but were not efficiently accessible; they were not
aggregated. The acoustics changed and the social faux pas was suddenly very visible.
In the absence of finer privacy controls, participants had to reconsider each action they took
because they knew it would be broadcast to all their Friends.
With Facebook, participants have to consider how others might interpret their
actions, knowing that any action will be broadcast to everyone with whom they consented
to digital Friendship. For many, it is hard to even remember whom they listed as Friends,
let alone assess the different ways in which they may interpret information. Without being
able to see their Friends’ reactions, they are not even aware of when their posts have been
misinterpreted.
Facebook implies that people can maintain infinite numbers of friends provided they
have digital management tools. While having hundreds of Friends on social network sites
is not uncommon, users are not actually keeping up with the lives of all of those people.
These digital Friends are not necessarily close friends and friendship management tools are
not good enough for building and maintaining close ties. While Facebook assumes that all
Friends are friends, participants have varied reasons for maintaining Friendship ties on the
site that have nothing to do with daily upkeep.
The News Feed does not distinguish between these: all Friends are treated equally and
updates come from all Friends, not just those whom an individual deems to be close friends.
To complicate matters, when data are there, people want to pay attention, even if it doesn’t
help them. People relish personal information because it is the currency of social hierarchy
and connectivity. Cognitive addiction to social information is great for Facebook because
News Feeds makes Facebook sticky.
Privacy is not simply about zero’s and ones, it is about how people experience their
relationship with others and with information. Privacy is a sense of control over
information, the context where sharing takes place, and the audience who can gain access.
Information is not private because no one knows it; it is private because the knowing is
limited and controlled. In most scenarios, the limitations are often more social than
structural. Secrets are considered the most private of social information because keeping
knowledge private is far more difficult than spreading it. There is an immense gray area
between secrets and information intended to be broadcast as publicly as possible. By and
large, people treated Facebook as being in that gray zone. Participants were not likely to
post secrets, but they often posted information that was only relevant in certain contexts.
The assumption was that if you were visiting someone’s page, you could access information
in context. When snippets and actions were broadcast to the News Feed, they were taken
out of context and made far more visible than seemed reasonable. In other words, with
News Feeds, Facebook obliterated the gray zone. Social convergence occurs when
disparate social contexts are collapsed into one. Even in public settings, people are
accustomed to maintaining discrete social contexts separated by space. How one behaves
is typically dependent on the norms in a given social context.
How one behaves in a pub differs from how one behaves in a family park, even
though both are ostensibly public. Social convergence requires people to handle disparate
audiences simultaneously without a social script. While social convergence allows
information to be spread more efficiently, this is not always what people desire. As with
other forms of convergence, control is lost with social convergence. Can we celebrate
people’s inability to control private information in the same breath as we celebrate
mainstream media’s inability to control what information is broadcast to us?
Privacy is not an inalienable right – it is a privilege that must be protected socially
and structurally in order to exist. The question remains as to whether or not privacy is
something that society wishes to support. When Facebook launched News Feeds, they
altered the architecture of information flow. This was their goal, and with a few tweaks they
were able to convince their users that the advantages of the News Feed outweighed security
through obscurity. Users quickly adjusted to the new architecture; they began taking actions
solely so that they would be broadcast to Friends’ News Feeds.
2.3 Facebook privacy settings: Who cares?
Facebook’s approach to privacy was initially network–centric. By default, students’
content was visible to all other students on the same campus, but no one else. Through a
series of redesigns, Facebook provided users with controls for determining what could be
shared with whom, initially allowing them to share with “No One”, “Friends”, “Friends–
of–Friends”, or a specific “Network”. When Facebook became a platform upon which other
companies could create applications, users’ content could then be shared with third–party
developers who used Facebook data as part of the “Apps” that they provided. The company
introduced privacy settings to allow users to determine which third parties could access
what content; users encountered a message to this effect whenever they chose to add an application.
Over time, Facebook introduced the ability to share content with “Everyone” (inside
Facebook or not). Increasingly, the controls got more complex and media reports suggest
that users found themselves uncertain about what they meant (Bilton, 2010). Recognizing
the validity of this point, Facebook eventually simplified its privacy settings page
(Zuckerberg, 2010).
At each point when Facebook introduced new options for sharing content, the
default was to share broadly. Following a series of redesigns in December 2009, Facebook
added a prompt when users logged on asking them to reconsider their privacy settings for
various types of content on the site, including “Posts I Create: Status Updates, Likes,
Photos, Videos, and Notes“ (see Figure 2). For each item, users were given two options:
“Everyone” or “Old Settings.” The toggle buttons defaulted to “Everyone.”
This message appeared when users logged into their account and it was impossible to go to
the rest of the site without addressing the prompt. Faced with this obligatory prompt, many
users may well have just clicked through, accepting the defaults that Facebook had chosen.
In doing so, these users made much of their content more accessible than was previously
the case. As part of these changes, everyone’s basic profile and Friends list became
available to the public, regardless of whether or not they had previously chosen to restrict
access; this decision was later reversed.
When Facebook was challenged by the Federal Trade Commission to defend its
decision about this approach, the company representative noted that 35 percent of users
who had never before edited their settings did so when prompted. Facebook used these data
to highlight that more people engaged with Facebook privacy settings than the industry
average of 5–10 percent. While 35 percent may be significantly more than the industry
average (and Facebook did not specify what percentage of users had never adjusted their
settings), a sizeable majority still likely accepted the site’s defaults each time
changes were implemented.
Figure 2: The message Facebook users saw in December 2009.
As privacy advocates and regulators investigated what these changes meant,
Facebook moved toward another series of modifications that depended on users’
willingness to share information.
In April 2010, at the f8 conference, Facebook announced Instant Personalization and
Social Plugins, two services that allowed partners to leverage the social graph — the
information about one’s relationships on the site that the user makes available to the system
— and provide a channel for sharing information between Facebook and third parties. For
example, Web sites could implement a Like button on their own pages that enables users
to share content from that site with the user’s connections on Facebook. Sites could also
implement an Activity Feed that publicizes what a person’s Friends do on that site. These
tools were built for developers and likely few users, journalists, or regulators had any sense
of what data from their accounts and actions on Facebook were being shared with whom
under what circumstances. As the discussions became more vociferous, Facebook was
increasingly pressured to respond.
On 26 May 2010, Zuckerberg announced that Facebook heard the concerns and
believed that the major issue on the table was Facebook’s confusing privacy settings page.
Facebook unveiled a new privacy settings page that, while simpler, also removed many of
the fine-grained controls that had allowed users to restrict specific content (see Figure 3). This
quelled much of the news coverage, but it was unlikely to end the controversy: in July 2010,
a Canadian law firm filed a class-action suit against Facebook over privacy issues.
Figure 3: Facebook’s “simplified” privacy settings, July 2010.
2.4 Protecting Location Privacy with Personalized k-Anonymity: Architecture and
Algorithms
There are two popular approaches to protect location privacy in the context of
Location Based Services (LBS) usage: policy-based and anonymity-based approaches.
In policy-based approaches, mobile clients specify their location privacy
preferences as policies and completely trust that the third-party LBS providers adhere to
these policies.
In the anonymity-based approaches, the LBS providers are assumed to be semi-honest
instead of completely trusted. The paper presents a k-anonymity-preserving management
of location information, developing efficient and scalable system-level facilities for
protecting location privacy by ensuring location k-anonymity. It is assumed that anonymous
location-based applications do not require user identities for providing service.
The concept of k-anonymity was originally introduced in the context of relational
data privacy. It addresses the question of “how a data holder can release its private data
with guarantees that the individual subjects of the data cannot be identified whereas the
data remain practically useful”. For instance, a medical institution may want to release a
table of medical records with the names of the individuals replaced with dummy identifiers.
However, some set of attributes can still lead to identity breaches. These attributes are
referred to as the quasi-identifier. For instance, the combination of birth date, zip code, and
gender attributes in the disclosed table can uniquely determine an individual. By joining
such a medical record table with some publicly available information source like a voters
list table, the medical information can be easily linked to individuals. K-anonymity
prevents such a privacy breach by ensuring that each individual record can only be released
if there are at least k - 1 distinct individuals whose associated records are indistinguishable
from the former in terms of their quasi-identifier values.
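The k-anonymity condition on a released table can be sketched in a few lines. The table, attribute names, and quasi-identifier below are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Check whether every combination of quasi-identifier values in the
    table is shared by at least k records, i.e. each individual is
    indistinguishable from at least k - 1 others on those attributes."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical released medical table (names already replaced).
table = [
    {"birth_year": 1980, "zip": "560034", "gender": "F", "diagnosis": "flu"},
    {"birth_year": 1980, "zip": "560034", "gender": "F", "diagnosis": "cold"},
    {"birth_year": 1975, "zip": "560001", "gender": "M", "diagnosis": "flu"},
]
qi = ("birth_year", "zip", "gender")
result = is_k_anonymous(table, qi, 2)  # False: the 1975/M record is unique
```

The third record is the only one with its quasi-identifier combination, so it could be re-identified by joining with a public source such as a voters list, which is exactly the linkage attack the text describes.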
In the context of LBSs and mobile clients, location k-anonymity refers to the k-
anonymity usage of location information. A subject is considered location k-anonymous if
and only if the location information sent from a mobile client to an LBS is indistinguishable
from the location information of at least k - 1 other mobile clients. The paper proposes that
location perturbation is an effective technique for supporting location k-anonymity and
dealing with location privacy breaches exemplified by the location inference attack
scenarios. If the location information sent by each mobile client is perturbed by replacing
the position of the mobile client with a coarser-grained spatial range such that there are
k - 1 other mobile clients within that range (k > 1), then the adversary will have uncertainty in
matching the mobile client to a known location-identity association or an external
observation of the location-identity binding. This uncertainty increases with the
value of k, providing a higher degree of privacy for mobile clients. Even though the LBS
provider knows that person A was reported to visit location L, it cannot match a message
to this user with certainty, because there are at least k different mobile clients within
the same location L. As a result, it cannot perform further tracking without ambiguity.
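A minimal sketch of this perturbation idea grows a square cloaking region around a client until it covers at least k - 1 other clients. The positions and parameters are hypothetical, and the actual paper uses a more sophisticated cloaking algorithm:

```python
def cloak(client_pos, other_positions, k, step=1.0, max_radius=100.0):
    """Replace a client's exact position with the smallest square region
    (grown in `step` increments) that also contains at least k - 1 other
    clients, so the LBS cannot single the client out within the region."""
    x, y = client_pos
    r = step
    while r <= max_radius:
        inside = sum(1 for (ox, oy) in other_positions
                     if abs(ox - x) <= r and abs(oy - y) <= r)
        if inside >= k - 1:
            return (x - r, y - r, x + r, y + r)
        r += step
    return None  # cannot satisfy k-anonymity within max_radius

others = [(2, 0), (0, 3), (8, 8)]
region = cloak((0, 0), others, k=3)  # grows until two others are inside
```

The LBS receives only the region, inside which at least k clients are indistinguishable; if too few clients are nearby, the request is suppressed rather than sent with a weaker guarantee.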
CHAPTER 3
PROBLEM STATEMENT
The amount of social media being uploaded into the web is growing rapidly and
there is still no end to this trend in sight. The ease-of-use of modern smartphones and the
proliferation of high-speed mobile networks is facilitating a culture of spontaneous and
carefree uploading of user-generated content. To give an idea of the scale of this
phenomenon: Just in the last two years the number of photos uploaded to Facebook per
month has risen from 2 billion to over 6 billion. From a personal perspective an
overwhelming majority of these photos have no privacy relevance for oneself. Finding the
few that are relevant is a daunting task.
While one’s own media is uploaded consciously, the flood of media uploaded by
others is so huge that it is almost impossible to stay aware of all media in which one might
be depicted. This can be classified as a Big Data problem on the user’s side, however not
on the provider side. Current social networks and photo-sharing sites mainly focus on the
privacy of users’ own media in terms of access control, but do little to deal with the privacy
implications created by other users’ media. There are ever more complex settings allowing
a user to decide who is allowed to see what content, but only for content owned by the user.
However, the issue of staying on top of what others are uploading (mostly in good faith)
that might also be relevant to the user is still very much outside that user’s control.
Social networks which allow tagging of users usually inform affected users when they are
tagged. However, if no such tagging is done by the uploader or a third party, there are
currently no mechanisms to inform users of relevant media.
A second important emerging trend is the capability of many modern devices to
embed geo-data and other metadata into the created content. While the privacy issues of
location-based services such as Foursquare or Qype have been discussed at great length,
the privacy issues of location information embedded into uploaded media have not yet
received much attention. There is one very significant difference between these two
categories. In the first, users reveal their current location to access online services such as
Google Maps, Yelp or Qype, or the user actually publishes his location on a social network
site like Foursquare, Google Latitude or Facebook Places. In this category, the user mainly
affects his own privacy. There is a large body of work examining privacy preserving
techniques to protect a user’s own privacy, ranging from solutions which are installed
locally on the user’s mobile device, to solutions which use online services relying on group-
based anonymisation algorithms, as for instance mix zones or k-anonymity.
The second category is created by media which contain location information. This
can have all the same privacy implications for the creator of the media; however, a critical
and hitherto often overlooked issue is the fact that the location and other metadata
contained in pictures and videos can also affect people other than the uploader himself.
This is a critical oversight and an issue which will gain importance as the mobile smart
device boom continues.
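To illustrate how precisely embedded metadata can locate a person, the sketch below decodes an EXIF-style GPS tag, where degrees, minutes, and seconds are stored as (numerator, denominator) rationals, into decimal coordinates. The sample values are hypothetical:

```python
def gps_to_decimal(dms, ref):
    """Convert an EXIF-style GPS coordinate, stored as three
    (numerator, denominator) rationals for degrees, minutes and
    seconds, into signed decimal degrees."""
    deg, minutes, seconds = (n / d for n, d in dms)
    value = deg + minutes / 60.0 + seconds / 3600.0
    # South and West hemispheres are negative by convention.
    return -value if ref in ("S", "W") else value

# Hypothetical geotag as it might appear in a photo's EXIF GPSInfo block.
lat = gps_to_decimal(((12, 1), (56, 1), (2400, 100)), "N")
lon = gps_to_decimal(((77, 1), (37, 1), (1200, 100)), "E")
```

Any bystander captured in such a photo is placed at these coordinates at the capture time recorded alongside, whether or not they ever used the uploading service.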
CHAPTER 4
SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
Privacy issues are categorized into two classes. Firstly, homegrown problems:
Someone uploads a piece of compromising media of himself with insufficient protection or
forethought which causes damage to his own privacy. A prime example of this category is
someone uploading compromising pictures of himself into a public album instead of a
private one or onto his Timeline instead of a message. The damage done in these cases is
very obvious since the link between the content and the user is direct and the audience
(often the peer circle) has direct interest in the content. One special facet of this problem is
that what is considered damaging content by the user can and often does change over time.
While this is a serious problem, especially amongst the Facebook generation, this issue is
a small data problem and thus is not the focus of this work.
Secondly, we have the Big Data problems created by others: an emerging threat to
users' online privacy comes from other users' media. What makes this threat particularly
severe is the fact that the person harmed is not involved in the uploading process and thus
cannot take any pre-emptive precautions, and the amount of data being uploaded is so vast
that it cannot be manually reviewed. Moreover, there are currently no countermeasures, apart
from after-the-fact legal ones, to prevent others from uploading potentially damaging content
about someone.
There are two requirements for this form of privacy threat to have an effect:
Firstly, to cause harm to a person, a piece of media must be linkable to that person
in some way. This link can either be non-technical, such as
being recognizable in a photo, or technical, such as a profile being (hyper-)linked to a photo.
There is also the grey area of textual references to a person near the photo or embedded
in the metadata of the photo. This metadata does not directly create a technical link to a
profile, but it opens the possibility for search engines to index the information and make it
searchable, thus creating a technical link.
Secondly, a piece of media in question must contain harmful content for the person
linked to it. This can again be non-technical such as being depicted in a compromising way.
However, more interestingly it can also be technical. In these cases metadata or associated
data causes harm. For instance time and location data can indicate that a person has been
at an embarrassing location, took part in a political event, or was not where he said he was.
Since the uploading of this type of damaging media cannot be effectively prevented,
awareness is the key issue in combating this emerging privacy problem.
4.2 PROPOSED SYSTEM
The privacy awareness concept consists of a watchdog client and a server-side watchdog
service. Using a GPS-enabled mobile device a user can activate the watchdog client to
locally track his position during times he considers relevant to his privacy. Then his device
can request the privacy watchdog to show him media that could potentially affect him
whenever the user is interested in the state of his online privacy. For this the watchdog
client sends the location traces to the watchdog server or servers which then respond with
a list of media which was taken in the vicinity (both spatially and temporally) of the user.
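The server-side matching step described above can be sketched roughly as follows. The field names, radius, and time window are illustrative assumptions, not part of any published API: a media item counts as relevant if it was taken within a chosen distance and time window of any point of the user's location trace.

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_media(trace, media, radius_km=0.5, window=timedelta(hours=1)):
    """trace: list of (lat, lon, datetime) points from the watchdog client.
    media: list of dicts with 'lat', 'lon', 'taken' keys.
    Returns media taken in the vicinity of the trace, spatially AND temporally."""
    hits = []
    for m in media:
        for lat, lon, t in trace:
            if (abs(m["taken"] - t) <= window
                    and haversine_km(m["lat"], m["lon"], lat, lon) <= radius_km):
                hits.append(m)
                break  # one matching trace point is enough
    return hits
```

The linear scan is only for clarity; a real watchdog service facing Big Data volumes would answer such queries from a spatio-temporal index.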
The watchdog service can be operated in three different ways. The first two would
be value-added services which can be offered by the media sharing sites or social networks
(SN) themselves. In both these cases the existing services would need to be extended by an
API to allow users to search for media by time and location. The first type of service would
do this via the regular user account. Thus, it would be able to see all public pictures, but also
all pictures restricted in scope yet visible to the user.
The benefit of this service type is that, through integration with the user's account,
pictures which aren't publicly available can also be searched. These private pictures are
typically from social network friends and thus the likelihood of pictures involving the user
is higher and the scope of people able to view the pictures more relevant. This type of
service also has the benefit-of-sorts that the location information is valuable to the SN, so
it has an incentive to offer this kind of value-added service. The privacy downside of
searching with the user’s account is that the SN receives the information of when and where
a user was. While there are certainly users who would not mind this information being sent
to their SN if it means they get to see and remove embarrassing photos in a timely manner,
there are also certainly users who do not wish their SN to know when and where they were,
particularly amongst the clientele who wish to protect their online privacy.
The second type of watchdog service would also be operated by the SN. However,
it does not require a user account to do the search and can be queried anonymously. This
type of service would thus have a smaller scope, since it can only access publicly available
media. A further drawback of this type of service is that there is less of an incentive for the
SN provider to implement such a service. While there are sites such as Locr that allow such
queries, most SN sites do not. Without outside pressure there is less intrinsic value for them
to include such a service compared to the first type.
The third type of service would be a stand-alone service which can be operated by a
third party. The stand-alone service operates like an indexing search machine, which crawls
publicly available media and its metadata and allows this database to be queried. Possible
incentive models for this approach include pay-per-use, subscription, ad-based or
community services. The visibility scope would be the same as for the second type of
service.
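The indexing idea behind the stand-alone service can be sketched as follows, with hypothetical names and bucket sizes: crawled media are grouped into coarse (location, hour) buckets so that a time-and-location query only needs to touch a small part of the database. A real service would also probe neighbouring cells and hours.

```python
from collections import defaultdict
from datetime import datetime

def bucket(lat, lon, t, cell=0.01):
    """Coarse (location, hour) bucket key; 0.01 deg is roughly 1 km of latitude."""
    return (round(lat / cell), round(lon / cell),
            t.replace(minute=0, second=0, microsecond=0))

def build_index(media):
    """media: iterable of dicts with 'lat', 'lon', 'taken' keys,
    as harvested by a crawler from publicly available metadata."""
    index = defaultdict(list)
    for m in media:
        index[bucket(m["lat"], m["lon"], m["taken"])].append(m)
    return index

def lookup(index, lat, lon, t):
    """Return media crawled into the same coarse cell and hour as the query."""
    return index[bucket(lat, lon, t)]
```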
All three types of service are mainly focused on detecting relevant media events
and breaking the Big Data problem down to humanly manageable sizes. The concept aims
at bringing possibly relevant media to the attention of the user without overburdening him.
The system does not explicitly protect against malicious uploads in which the uploader
intentionally tries to harm another person while concealing the activity at the same time,
even though the watchdog service proposed here could make such subterfuge harder for the
malicious uploader. But even without full protection from malicious activity
we believe that such a watchdog would improve the current state of the art by enabling
users to gain better awareness of the relevant part of the social media Big Dataset.
CHAPTER 5
THREAT ANALYSIS
5.1 Awareness of Damaging Media in Big Datasets
Most popular social networks and media sharing sites allow users to tag objects and
people in their uploaded media. Additionally, some services also extract embedded
metadata and use this information for indexing and linking. Media is annotated with names,
comments, or is directly linked to users’ profiles. In particular the direct linking of profiles
to pictures was initially met with an outcry of privacy concerns since it greatly facilitated
finding information about people. For this reason, social networks quickly introduced the
option to prevent people from linking them in media. However, there is also a positive side
to this form of direct linking, since the linked person is usually made aware of the linked
media and can then take action to have unwanted content removed or to restrict the visibility
of the link.
While the privacy mechanisms of current services are still limited, hidden and often
confusing, once the link is made the affected people can take action. A more critical case
is the non-linked tagging of photos. In this case a free text tag contains identifying
information and/or malicious comments. However, there is no automated mechanism to
inform a user that he was named in or near a piece of media. The named person might not
even be a member of the service where the media was uploaded. The threat of this kind of
linking is significantly different from the one described above. While the immediate damage
can be smaller, since no automated notification is sent to friends of the user, the threat can
remain hidden far longer: the person can remain unaware of this media while others can
stumble upon it or be informed by peers.
The final case of damaging media does not contain any technical link. Without any
link to the person in question this kind of media can only cause harm to the person if
someone having some contact to that person stumbles across it and makes the connection.
While the likelihood of causing noticeable harm is smaller, it is still possible. The viral
spreading of media has caused serious embarrassment and harm in real world cases. The
critical issue here is that there is currently no way for a person to pro-actively search for
this kind of media in the Big Data deluge to mitigate this threat.
5.2 Analysis of Service Privacy
The different types of watchdog services discussed in the proposed system can help
to reduce the number of relevant pieces of media a user needs to keep an eye on if users
don’t want uncontrolled media of themselves to be online. However, the devil is in the
detail since this form of service can also have serious privacy implications itself if designed
in the wrong way. Care must be taken to facilitate the different usage and privacy
requirements of different users. The critical issue is the fact that to request the relevant
media a user must send location information to the watchdog service.
When using the first type of service, there is little which can be done to protect the
location privacy of the user since the correlation between the location query and the user
account is direct. One option for protecting privacy to a degree is an obfuscation
approach. For every true query a number of fake queries could also be sent, making it
harder (but far from impossible) for the SN provider to ascertain the true location. However,
this approach does not scale well, for two reasons. Firstly, it creates a higher load on the
SN. More critically, the likelihood of deducing the true location rises as more
queries are sent, unless great care is taken in creating the fake paths and masking the source
IP addresses.
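The fake-query idea can be illustrated with a naive sketch (function and parameter names are hypothetical). Note that, as argued above, independently scattered decoys are exactly the kind of fakes a provider could learn to filter out; plausible fake paths would be needed in practice:

```python
import random

def obfuscated_queries(true_lat, true_lon, n_fakes=9, spread=0.05, rng=None):
    """Return the true query hidden among n_fakes decoys, shuffled so the
    service cannot tell which of the n_fakes + 1 locations is real.
    Deliberately naive: decoys are drawn independently around the true point."""
    rng = rng or random.Random()
    queries = [(true_lat + rng.uniform(-spread, spread),
                true_lon + rng.uniform(-spread, spread))
               for _ in range(n_fakes)]
    queries.append((true_lat, true_lon))
    rng.shuffle(queries)
    return queries
```

The scaling problem mentioned above is visible here: each real request multiplies the load on the SN by n_fakes + 1.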
Protecting the user's privacy in the second case is simpler. Since the queries do not
require an account, the only way a user can be tracked directly is via his IP address. Using an
anonymizing service such as Tor, sequential queries cannot be linked together and creating
a tracking profile becomes significantly harder. The anonymous trace data can of course
still be used by the SN, but the missing link to the user makes it less critical for the user
himself. The third type of service is probably the most interesting privacy-wise, since the
economic model behind the service will significantly impact the privacy techniques which
can be applied to this model. Most payment models would require user credentials to log
in and thus would allow the watchdog service provider to track the user. In this case it
would have to be the reputation of the service provider which the user would have to trust,
similar to the case of commercial anonymizing proxies. In an ad-based approach no user
credentials are needed, so it would be possible to use the service anonymously via Tor.
If so, the watchdog client would need to be open source to ensure that no secret tracking
information is stored there. In community-based approaches a privacy model like that in
Tor can be used to ensure that none of the participating nodes gets enough information to
track a single user. Each of these proposed service types has different privacy benefits and
disadvantages.
The following section overviews our privacy analysis of different media hosting
sites. It includes media access control as well as metadata handling.
Flickr provides the most fine-grained privacy/access control settings of all analyzed
services. Privacy settings can be defined on metadata as well as the image itself. One
particularly interesting feature of Flickr is the geo-fence. The geo-fence feature enables
users to define privacy regions on a map by putting a pin on it and setting a radius. Access
to GPS data of the user’s photos inside these regions is only allowed for a restricted set of
users (friends, family, and contacts). Flickr allows its users to tag and add people to images.
If a user revokes a person tag of himself in an image, no one can add the person to that
image again.
Facebook extracts the title and description of an image from metadata during the
upload process of some clients. All photos are resized after upload and metadata is stripped
off. Facebook uses face recognition for friend tagging suggestions based on already tagged
friends. Access to images is restricted by Facebook’s ever changing, complex and
sometimes abstruse privacy settings.
Picasa Web & Google+ store the complete EXIF metadata of all images. It is
accessible by everyone who can access the image. Access to images is regulated on a
per-album basis. It can be set to public, restricted to people who know the secret URL to
the album, or to the owner only. A noteworthy feature is that geo-location data can be
protected separately. Google+ and Picasa Web allow the tagging of people in images.
Locr is a geo-tagging focused photo-sharing site. As such, location information is
included in most images. By default all metadata is retained in all images. Access control
is set on a per image basis. Anybody who can see an image can also see the metadata. There
are also extensive location-based search options. Geo-data is extracted from uploaded files
or set by people on the Locr website. Locr uses reverse geocoding to add textual location
information to images in its database.
Instagram and PicPlz are services/mobile apps that allow posting images in a
Twitter like way. Resized images stripped of metadata but with optional location data are
stored by the services. Additionally they allow the posting of photos to different services
like Flickr, Facebook, Dropbox, Foursquare and more. Depending on the service used
metadata is stored or discarded. For instance, when uploading a photo to Flickr metadata is
stripped from the actual file, but title, description as well as geo location are extracted from
the image and can be set by the user. In contrast, the Hipstamatic mobile app preserves
in-file metadata when uploading images to Flickr.
Fig.: Logos of different social media apps.
CHAPTER 6
SURVEY OF METADATA IN SOCIAL MEDIA
To underpin the growing prevalence of privacy-relevant metadata and location data
in particular, and to judge potential dangers and benefits based on real-world data, we
analyzed a set of 20,000 publicly available Flickr images and their metadata.
Flickr was chosen as the premier photo-sharing website because it can be legally
crawled, offers the full extent of privacy mechanisms and does not remove metadata in
general. One photo each was crawled from 20,000 random Flickr users. Of these, 68.8% were Pro
users where the original file could be accessed as well. For the others only the metadata
available via the Flickr API was accessed. This includes data automatically extracted from
EXIF data during upload and data manually added via website or semi-automatically set
by client applications. 23% of the 20k users denied access to their extracted EXIF data in
the Flickr database.
A set of 3,000 images taken with camera phones from 3,000 random mobile
Flickr users was also collected. 46.8% of the mobile users were Pro users and only 2% denied
access to EXIF data in the Flickr database. GPS location data was present in 19% of the 20k
dataset and in 34% of the 3k mobile phone dataset. While Flickr hosts many semi-professional DSLR
photos, mobile phones are becoming the dominant photo generation tool with the iPhone 4
currently being the most common camera on Flickr. Textual location information like street
or city names are currently not used much on Flickr.
However, as reverse geocoding becomes more common in client applications this
will change (cf. Locr in Figure 5). To evaluate the potential privacy impact, we manually
checked which photos contained people and a geo-reference, but no user profile tags – i.e.
images which could contain people who are unaware of the photo. In the set of 20k images
we found 16% and in the set of 3k mobile photos we found 28% meeting these criteria.
Fig. 4. Public privacy-related metadata in 5.7k random and 1k mobile user original Flickr
photos
Further, the subset of images available from Pro users was analyzed,
since these can contain the unaltered metadata from the camera. From the 20k dataset, 5761
images contained in-file metadata. From the 3k dataset, 1050 images contained in-file
metadata.
Figure 4 shows the percentage values for the different types of metadata contained
in the files. For the rest of the images, the metadata was either manually removed by the
uploader or the image never had any in the first place. In the 20k dataset only 3% of
the in-file metadata contained GPS data, compared to 32% in the mobile 3k dataset. This
shows a clear dominance of mobile devices when it comes to publishing GPS metadata.
This is unsurprising since most compact and DSLR cameras currently do not have
GPS receivers and only few photographers add external GPS devices to these cameras –
but this will likely change with future cameras. However, combined with the fact that
mobile phones are becoming the dominant type of camera when it comes to the number of
published pictures, it is to be expected that the amount of GPS data available for scrutiny,
use and abuse will rise further.
Fig. 5. Public privacy-related metadata of 5k random photos from Locr service
A 5k dataset of random photos from Locr was collected, and the metadata in the
images as well as the images' HTML pages built from the Locr database was analyzed.
Figure 5 shows the results from this dataset. Particularly interesting is the high rate of
non-GPS based location information.
This is a trend to watch since most location stripping mechanisms only strip GPS
information and leave other text-based tags intact. Furthermore, the amount of camera ID
meta-data is notable since these IDs can be used to link different pictures and infer meta-
data even if data has been stripped from some of the photos.
To summarize, one third of the pictures taken by the dominant camera devices contain
GPS information. About one third of these images depict people. Thus, about 10% of
all photos could harm other people's privacy without them knowing about it.
CHAPTER 7
CONCLUSION AND FUTURE WORK
This report discussed the threat to an individual's privacy that is created by other
people's social media, and gave a brief overview of the privacy capabilities of common
social media services with regard to protecting users from other people's activities.
Based on this survey, the Big Data privacy implications and potential of the emerging
trend of geo-tagged social media were analyzed. Three concepts for how location
information can actually help users to stay in control of the flood of potentially harmful
or interesting social media uploaded by others were then presented.
Applications can be built or modified to use pseudonyms rather than true user
identities, a route toward greater location privacy. Because different applications can
collude and share information about user sightings, users should adopt different
pseudonyms for different applications.
Furthermore, to resist more sophisticated attacks, users should change
pseudonyms frequently, even while being tracked. Drawing on methods developed for
anonymous communication, the conceptual tools of mix zones and anonymity sets can be
used to analyze location privacy.
Although anonymity sets provide a first quantitative measure of location privacy,
the measure is only an upper-bound estimate. A better, more accurate metric, developed
using an information-theoretic approach, uses entropy, taking into account the a priori
knowledge that an observer can derive from historical data. The entropy measurement
shows a more pessimistic picture, in which the user has less privacy, because it models
a more powerful adversary who might use historical data to de-anonymize pseudonyms
more accurately.
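The difference between the two measures can be made concrete with a small numeric sketch (helper names are hypothetical). The observer's knowledge is modelled as a probability distribution over candidate users: the anonymity-set size ignores the skew of that distribution, while the entropy captures it.

```python
import math

def anonymity_set_size(probs):
    """Upper-bound measure: the number of candidate users the observer
    considers possible at all."""
    return sum(1 for p in probs if p > 0)

def entropy_bits(probs):
    """Shannon entropy of the observer's distribution over candidate users;
    lower entropy means less location privacy."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform ignorance over 4 users: set size 4, entropy 2 bits.
uniform = [0.25, 0.25, 0.25, 0.25]
# Historical data skews the distribution: same set size, far less privacy.
skewed = [0.97, 0.01, 0.01, 0.01]
```

Both distributions give an anonymity set of 4, but the skewed one has well under 1 bit of entropy, matching the more pessimistic picture described above.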
Two services worth mentioning which collect the type of information needed for a
privacy watchdog are Social Camera and Locaccino. Social Camera is a mobile app that
detects faces in a picture and tries to recognize them with the help of the Facebook profile
pictures of persons in the user's friends list. Recognized people can be automatically
tagged and pictures instantly uploaded to Facebook. Locaccino is a Foursquare-type
application which allows users to upload location-based information into Facebook.
These two apps show the willingness of users to share this kind of information in the social
web.
Ahern et al. analyze in their work the privacy decisions of mobile users in the photo
sharing process. They identify relationships between the location of the photo
capture and the corresponding privacy settings, and recommend the use of context
information to help users set privacy preferences and to increase users' awareness of
information aggregation.
Work by Fang and LeFevre focuses on helping the user to find appropriate
privacy settings in social networks. Fang and LeFevre present a system where the user
initially only needs to set up a few rules. Through the use of active machine learning
algorithms the system helps the user to protect private information based on
individual behavior and taste.
Mannan et al. address the problem that private user data is not only shared within
social networks, but also through personal web pages. In their work they focus on
privacy-enabled web content sharing and utilize existing instant messaging friendship
relations to create and enforce access policies.
The three works shown above all focus on protecting a user's privacy against
dangers created by the user himself while sharing media. They do not discuss how users
can be protected from other people's media; the same holds for most of the research work
done in this area.
Besmer et al. present work which allows users that are tagged in photos to send a
request to the owner to hide the linked photo from certain people. This approach also
follows the idea that forewarned is forearmed and that creating awareness of critical
content is the first step towards a solution of the problem. However, the work relies on
direct technical tags and as such does not cover the same scope as the privacy watchdog
suggested in the paper.
Work that also takes into account other users' media is presented by Squicciarini et
al., who postulate that most of the shared data does not belong to only a single user.
Therefore Squicciarini et al. propose a system to share the ownership of media items and
thereby strive to establish collaborative privacy management for shared content.
Squicciarini et al.'s prototype is implemented as a Facebook app and is based on game
theory, rewarding users that promote co-ownership of media items. While the work does
take into account other users' media, unlike the approach presented here it does not cope
with previously unknown and unrelated users.
REFERENCES
[1] S. Ahern, D. Eckles, N. Good, S. King, M. Naaman, and R. Nair. Overexposed?:
privacy patterns and considerations in online and mobile photo sharing. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pages 357–366, 2007.
[2] C. A. Ardagna, M. Cremonini, S. De Capitani di Vimercati, and P. Samarati. An
Obfuscation-Based Approach for Protecting Location Privacy. IEEE Transactions on
Dependable and Secure Computing, 8(1):13–27, June 2009.
[3] A. Beresford and F. Stajano. Location privacy in pervasive computing. IEEE Pervasive
Computing, 2(1):46–55, Jan. 2003.
[4] A. Besmer and H. Richter Lipford. Moving beyond untagging. In Proceedings of the
28th International Conference on Human Factors in Computing Systems - CHI '10, page
1563, Apr. 2010.
[5] D. Boyd. Facebook's privacy trainwreck: Exposure, invasion, and social convergence.
Convergence: The International Journal of Research into New Media Technologies,
14(1):13–20, 2008.
[6] d. Boyd and K. Crawford. Six Provocations for Big Data. SSRN eLibrary, 2011.
[7] D. Boyd and E. Hargittai. Facebook privacy settings: Who cares? First Monday, 15(8),
2010.
[8] Carnegie Mellon's Mobile Commerce Laboratory. Locaccino - a user-controllable
location-sharing tool. http://locaccino.org/, 2011.
[9] E. Eldon. New Facebook Statistics Show Big Increase in Content Sharing, Local
Business Pages. http://goo.gl/ebGQH, February 2010.
[10] Facebook. Statistics. http://www.facebook.com/press/info.php?statistics, 2011.
[11] L. Fang and K. LeFevre. Privacy wizards for social networking sites. In Proceedings
of the 19th International Conference on World Wide Web - WWW '10, page 351. ACM
Press, Apr. 2010.
[12] Flickr. Camera Finder. http://www.flickr.com/cameras, October 2011.
[13] B. Gedik and L. Liu. Protecting Location Privacy with Personalized k-Anonymity:
Architecture and Algorithms. IEEE Transactions on Mobile Computing, 7(1):1–18, Jan.
2008.
[14] M. Mannan and P. C. van Oorschot. Privacy-enhanced sharing of personal content on
the web. In Proceedings of the 17th International Conference on World Wide Web - WWW
'08, page 487. ACM Press, April 2008.
[15] Metadata Working Group. Gartner special report – pattern-based strategy: Getting
value from big data. www.gartner.com/patternbasedstrategy, 2012.
[16] A. C. Squicciarini, M. Shehab, and F. Paci. Collective privacy management in social
networks. In Proceedings of the 18th International Conference on World Wide Web - WWW
'09, page 521. ACM Press, Apr. 2009.
[17] Viewdle. SocialCamera. http://www.viewdle.com/products/mobile/index.html.