Introduction to Privacy
Why We Care? New Information Technologies:
A) Digital storage, retrieval, and distribution: enormous cost reductions
B) Data sharing and processing, e.g., data mining
C) Ubiquitous networking and communication (sensors, RFID, smartphones,…)
An emergent and fundamental change
Why We Care:
[Figure: "Privacy" books per year, 1901–1991, from a university library database; y-axis: number of books, 0–60]
Privacy Concerns
Important Points
Privacy bounds vary between cultures
Laws, rules, and conventions vary as well
The focus was originally on only one relationship:
Government ↔ citizen (citizens have little control over the information they provide…)
Going Digital
Starting around 1970:
Commercial databases
Open data exchange standards
Data exchange mechanisms (networks)
Exponentially increasing amounts of usable data
Focus shifted to:
Government, private sector ↔ citizen
History (1)
History (2)
History (3)
History (4)
Popular Arguments against Privacy
“If you care so much about your privacy, it’s because you have something to hide”
“Surveillance is good and privacy is bad for national security. We need a tradeoff between privacy and security”
“People don’t care about privacy”
The “nothing-to-hide” argument
This information is not necessarily secret, but do you want to broadcast it?
Identity attributes: name, age, gender, race, IQ, marital status, place of birth, address, phone number, ID number…
Location: where you are at a certain point in time, movement patterns
Interests / preferences: books you read, music you listen to, films you like, sports you practice
Political affiliation, religious beliefs, sexual orientation
Behavior: personality type, what you eat, what you buy, how you behave and interact with others
Health data: medical issues, treatments you follow, DNA, health risk factors
Social network: who your friends are, who you meet and when, your different social circles
Financial data: how much you earn, how you spend your money, credit card number,…
The danger
Surveillance: we are moving into a surveillance society; companies and governments gather huge amounts of information about users
Discrimination: profiling may reveal that a user suffers from a certain disease; an insurer might then deny coverage
Personalization: filter bubbles
Information leakage
We need privacy-preserving systems…
PRIVACY DEFINITIONS
What is Privacy?
Abstract and subjective concept, hard to define
Dependent on cultural issues
A couple of popular definitions:
“The right to be let alone”: focus on freedom from intrusion
“Informational self-determination”: focus on control
How do we formalize privacy properties in computer systems?
Solove's Taxonomy
A Taxonomy of Privacy by Daniel Solove
What is Privacy?
How do we formalize privacy properties in computer systems?
Privacy properties from a technical point of view: Anonymity
Hiding the link between an identity and an action or piece of information. Examples:
Reader of a web page, person accessing a service
Sender of an email, writer of a text
Person to whom an entry in a database relates
Pfitzmann-Hansen terminology:
“Anonymity is the state of being not identifiable within a set of subjects, the anonymity set”
“The anonymity set is the set of all possible subjects who might cause an action”
“Anonymity is the stronger, the larger the respective anonymity set is and the more evenly distributed the sending or receiving, respectively, of the subjects within that set is.”
A probabilistic definition.
Source: Anonymity, Unobservability, Pseudonymity, and Identity Management – A Proposal for Terminology
Privacy properties from a technical point of view: Unlinkability
Hiding the link between two or more actions / identities / pieces of information. Examples:
Two anonymous letters written by the same person
Two web page visits by the same user
Entries in two databases related to the same person
Two people related by a friendship link
Same person spotted in two locations at different points in time
Pfitzmann-Hansen terminology:
“Unlinkability of two or more items means that within a system, these items are no more and no less related than they are related concerning the a-priori knowledge”
Focus on the information leakage of a system
Privacy properties from a technical point of view: Unobservability
Hiding user activity. Examples:
Impossible to see whether someone is accessing a web page
Impossible to know whether an entry in a database corresponds to a real person
Impossible to distinguish whether someone or no one is in a given location
Pfitzmann-Hansen terminology:
“Unobservability is the state of items of interest being indistinguishable from any item of interest at all”
“Sender unobservability then means that it is not noticeable whether any sender within the unobservability set sends.”
PRIVACY METRICS
Can we “measure” privacy? We need to specify:
Privacy properties we want to achieve
Adversary model: goals and capabilities
Typically, adversaries are able to obtain probabilistic information.
Examples:
Probability of a person being the anonymous subject we want to identify (limited # of people in the world)
Probability of two information items being related to each other (e.g., two web page requests coming from the same user)
Many proposals; this is an open research field
Example: an information-theoretic approach (sketched below)
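As a rough illustration of such an information-theoretic metric (a sketch in the spirit of entropy-based anonymity measures; the function name and example distributions are ours, not from any specific paper cited here), we can take the entropy of the attacker's probability distribution over the anonymity set:

import math

def anonymity_entropy(probs):
    # Shannon entropy, in bits, of P[subject i caused the action]
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4            # 4 equally likely senders
skewed = [0.7, 0.1, 0.1, 0.1]   # the attacker strongly suspects one sender

print(anonymity_entropy(uniform))  # 2.0 bits: the maximum for 4 subjects
print(anonymity_entropy(skewed))   # ~1.36 bits: anonymity has degraded

The larger and the more evenly distributed the anonymity set, the higher the entropy, matching the Pfitzmann-Hansen intuition above.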
A Primer on Info. Theory & Privacy
There are around 7 billion humans on the planet, so the identity of a random, unknown person contains just under 33 bits of entropy (2^33 ≈ 8 billion).
When we learn a new fact about a person, that fact reduces the entropy of their identity by a certain amount. There is a formula that says how much:
ΔS = − log2 Pr(X = x)
where ΔS is the reduction in entropy, measured in bits, and Pr(X = x) is simply the probability that the fact would be true of a random person.
A Primer on Info. Theory & Privacy
For example:
Star sign: ΔS = − log2 Pr(STARSIGN = capricorn) = − log2 (1/12) = 3.58 bits of information
Birthday: ΔS = − log2 Pr(DOB = 2nd of January) = − log2 (1/365) = 8.51 bits of information
Note that if you combine several facts together, you might not learn anything new; for instance, telling me someone's star sign doesn't tell me anything new if I already knew their birthday.
How much entropy is needed to identify someone?
If we know someone's birthday, and we know their ZIP code is 40203, we have 8.51 + 23.81 = 32.32 bits; that's almost, but perhaps not quite, enough to know who they are: there might be a couple of people who share those characteristics. Add in their gender, and that's 33.32 bits; we can probably say exactly who the person is!
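A quick sketch of this arithmetic in Python (the 23.81-bit figure for ZIP 40203 is taken from the example above; the helper name is ours):

import math

def bits(p):
    # Entropy reduction, in bits, from a fact true with probability p
    return -math.log2(p)

birthday = bits(1 / 365)   # ≈ 8.51 bits
zip_code = 23.81           # given above for ZIP code 40203
gender = bits(1 / 2)       # 1 bit

total = birthday + zip_code + gender
print(round(total, 2), "bits vs. the ~33 needed to single out one person")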
An Application To Web Browsers
How would this paradigm apply to web browsers?
In addition to the commonly discussed “identifying” characteristics of web browsers, like IP addresses and tracking cookies, there are more subtle differences between browsers that can be used to tell them apart.
One significant example is the User-Agent string, which contains the name, operating system, and precise version number of the browser, and which is sent to every web server you visit.
A typical User-Agent string looks something like this:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
An Application To Web Browsers (2)
It turns out that this User-Agent string is quite useful for telling different people apart on the net.
User-Agent strings contain about 10.5 bits of identifying information: if you pick a random person's browser, only about one in 1,500 other Internet users will share their User-Agent string.
So even if someone uses Tor… the server can still get information about them!
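A quick sanity check of those figures (using the ~33-bit identification threshold from the primer above):

print(2 ** 10.5)   # ≈ 1448 users expected to share a typical User-Agent string
print(33 - 10.5)   # ≈ 22.5 bits still needed to fully identify the user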
Next: Protective Solutions
Network-layer privacy
Web privacy (DNT, plugins,…)
Data sanitization
PETs
Network-layer privacy (slides from Stanford)
Goals: Hide user’s IP address from target web site
Hide browsing destinations from network
1st attempt: anonymizing proxy
https://anonymizer.com?URL=target
[Figure: User1, User2, User3 connect to anonymizer.com, which fetches Web1, Web2, Web3 on their behalf]
Anonymizing proxy: security
Monitoring ONE link: the eavesdropper gets nothing
Monitoring TWO links: the eavesdropper can do traffic analysis
More difficult if lots of traffic goes through the proxy
Trust: the proxy is a single point of failure; it can be corrupt or subpoenaed
Protocol issues: long-lived cookies make connections to a site linkable
How the proxy works
The proxy rewrites all links in the response from the web site
Updated links point to anonymizer.com, ensuring all subsequent clicks are anonymized
The proxy rewrites/removes cookies and some HTTP headers
Proxy IP address: a single address could be blocked by a site or ISP
anonymizer.com consists of >20,000 addresses, globally distributed and registered to multiple domains
Note: the Chinese firewall blocks ALL anonymizer.com addresses
2nd Attempt: MIX nets
Goal: no single point of failure
MIX nets [C’81]
Every router has a public/private key pair; the sender knows all public keys
To send a packet, pick a random route: R2 → R3 → R6 → srvr
Prepare an onion packet:
packet = E_pk2( R3, E_pk3( R6, E_pk6( srvr, msg ) ) )
[Figure: routers R1–R6; the packet travels R2 → R3 → R6 → srvr]
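A minimal runnable sketch of this layered construction (symmetric Fernet keys from the Python cryptography package stand in for the routers' public keys pk2, pk3, pk6, so this illustrates the layering, not the public-key crypto itself):

from cryptography.fernet import Fernet

# One key per router; a real MIX net would use each router's public key.
keys = {name: Fernet.generate_key() for name in ("R2", "R3", "R6")}

def wrap(route, msg):
    # Encrypt in layers, innermost (exit hop) first, so that
    # wrap(["R2","R3","R6"], m) = E_k2(R3 | E_k3(R6 | E_k6(srvr | m)))
    packet = msg
    for hop, next_hop in reversed(list(zip(route, route[1:] + ["srvr"]))):
        packet = Fernet(keys[hop]).encrypt(next_hop.encode() + b"|" + packet)
    return packet

packet = wrap(["R2", "R3", "R6"], b"msg")

# Each router peels exactly one layer and learns only the next hop:
for hop in ("R2", "R3", "R6"):
    next_hop, _, packet = Fernet(keys[hop]).decrypt(packet).partition(b"|")
    print(hop, "forwards to", next_hop.decode())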
Eavesdropper’s view at a single MIX
• Eavesdropper observes incoming and outgoing traffic
• Crypto prevents linking input/output pairs
• Assuming enough packets in the incoming batch
• If packets have variable length, all must be padded to the maximum length
[Figure: user1, user2, user3 send packets into MIX Ri, which collects them into a batch before forwarding]
Performance
Main benefit: privacy as long as there is at least one honest router on the path
Problems:
High latency (lots of public-key operations): inappropriate for interactive sessions, may be OK for email (e.g., the Babel system)
No forward security
(Perfect forward secrecy, or PFS, is the property that a session key derived from a set of long-term public and private keys will not be compromised if one of the long-term private keys is compromised in the future.)
3rd Attempt: Tor, a MIX circuit-based method
Goals: privacy as long as there is one honest router on the path, and reasonable performance
The Tor design
A trusted directory contains the list of Tor routers
The user's machine preemptively creates a circuit
Used for many TCP streams
A new circuit is created once a minute
[Figure: a circuit through routers R1 → R2 → R3 → R4 carries stream1 and stream2 to srvr1 and srvr2; one minute later, a new circuit through different routers (e.g., R5, R6) is used]
Creating circuits
[Figure: circuit setup between the user, R1, and R2; each link is TLS encrypted]
1. User → R1: Create C1, with a D-H key exchange; the user and R1 now share key K1
2. User → R1: Relay C1 {Extend R2}; R1 → R2: Create, and the D-H key exchange runs through R1; the user and R2 now share key K2
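A minimal sketch of one such per-hop key exchange (modern X25519 from the Python cryptography package stands in for the D-H variant Tor actually uses; the HKDF label is ours):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

user_priv = X25519PrivateKey.generate()
router_priv = X25519PrivateKey.generate()

# Each side combines its own private key with the peer's public key:
shared_user = user_priv.exchange(router_priv.public_key())
shared_router = router_priv.exchange(user_priv.public_key())
assert shared_user == shared_router   # both ends now hold the same secret

# Derive the shared circuit key (K1) from the raw shared secret:
K1 = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
          info=b"circuit-key").derive(shared_user)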
Once the circuit is created
The user has a shared key with each router in the circuit
Routers only know the IDs of their successor and predecessor
[Figure: circuit R1 → R2 → R3 → R4; the user holds K1, K2, K3, K4, while each router Ri holds only its own Ki]
Sending data
[Figure: user → R1: Relay C1 {Begin site:80}; R1 → R2: Relay C2 {Begin site:80}; R2 ↔ site: TCP handshake; user → R1: Relay C1 {data: HTTP GET}; R1 → R2: Relay C2 {data: HTTP GET}; R2 → site: HTTP GET; the response travels back, each hop re-wrapping it: site → R2: resp; R2 → R1: Relay C2 {data: resp}; R1 → user: Relay C1 {data: resp}; keys K1, K2 protect the user–R1 and user–R2 layers]
Properties
Performance: fast connection time, since the circuit is pre-established
Traffic encrypted with AES: no public-key operations on traffic
Tor crypto provides end-to-end integrity for traffic
Forward secrecy via TLS
Downside: routers must maintain state per circuit
Each router can link multiple streams via the CircuitID: all streams in one one-minute interval share the same CircuitID
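A minimal sketch of how a data cell is layered with the per-hop keys K1..K4 established above (Fernet stands in for Tor's actual AES-CTR relay crypto; note that only symmetric operations touch the traffic):

from cryptography.fernet import Fernet

hop_keys = [Fernet.generate_key() for _ in range(4)]   # K1..K4

cell = b"HTTP GET /"
for k in reversed(hop_keys):    # the user wraps the innermost layer first
    cell = Fernet(k).encrypt(cell)

for k in hop_keys:              # each router Ri peels exactly one layer
    cell = Fernet(k).decrypt(cell)
print(cell)                     # the exit router recovers the request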
Privoxy
Tor only provides network-level privacy, not application-level privacy
e.g., mail programs add “From: email-addr” to outgoing mail
Privoxy: a web proxy for browser-level privacy
Removes/modifies cookies, plus other web page filtering
Web Tracking & Privacy
Privacy considerations of online behavioural tracking, C. Castelluccia and A. Narayanan
Web Tracking
Why are we tracked?
Personalized services tailored to the user and location
Targeted advertisement: most content is “free”; you are paying with your data
Often, you are not the customer but the product!
Online Advertising: Simplified Model
3 main entities:
Advertiser: an entity that wants to advertise services/products (e.g., hotels, car manufacturers,…)
Publisher: an entity that hosts the advertisements (e.g., online news sites such as lemonde.fr,…)
Ad-Network: an entity that places advertisements on publisher sites (e.g., Google,…)
Online Advertising: Illustration
[Figure: the publisher (lemonde.fr) embeds ads served by the ad-network (doubleclick.com) on behalf of the advertiser (maier.com)]
Online Advertising: Money Flow
[Figure: the advertiser (maier.com) pays the ad-network (doubleclick.com) $$$; the ad-network pays the publisher (lemonde.fr) $]
Browsing Profiling: How?
[Figure sequence: the user visits cnn.com, wsj.com, and lemonde.fr; each page embeds content from doubleclick.com. Each site sets its own first-party cookie (Cookie_cnn, Cookie_wsj, Cookie_lemonde), but every visit also sends the same third-party cookie Cookie_dc to doubleclick.com, which can therefore link all three visits into one browsing profile]
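A minimal sketch of what the tracker sees (the request data is hypothetical; the Referer header reveals the embedding page, and the shared cookie links the visits together):

# Requests arriving at doubleclick.com from one browser:
requests_seen = [
    {"cookie": "dc_id=42", "referer": "http://cnn.com"},
    {"cookie": "dc_id=42", "referer": "http://wsj.com"},
    {"cookie": "dc_id=42", "referer": "http://lemonde.fr"},
]

profile = {}
for r in requests_seen:
    profile.setdefault(r["cookie"], []).append(r["referer"])
print(profile)   # one cookie maps to the user's whole cross-site history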
http://www.google.com/ads/preferences
Online Tracking: Tracking on the Internet…
• Disable cookies (at least third-party cookies)
• Browser's private mode
• Use DNT (Do Not Track)
• But almost dead
• Use plugins
• To see who is tracking you
• To block ads
• Use Tor
• Disconnect!!!
Some Solutions
• Collusion
• Ghostery
• about:Trackers
• Adblock
PLEASE BLOCK ADS!
Some Nice Plugins
Mobile and Privacy
Online Tracking: Smart Phones
Marketers are tracking smartphone users through “apps”: games and other software on their phones.
Some apps collect information including location, unique serial-number-like identifiers for the phone, and personal details such as age and sex.
Apps routinely send this information to marketing companies that use it to compile dossiers on phone users.
Source: http://online.wsj.com/
Paper Toss App (iPhone)
[Figure: the app sends the phone ID and location both to third parties (Google, Flurry,…) and to its app server]
http://blogs.wsj.com/wtk-mobile/2010/12/17/angry-birds/
Some Recommendations to App Developers
• Enforce privacy-by-design
• Be transparent: tell users who you are and what you collect; ask for the user's active consent; inform,…
• Be minimalist: only collect minimal information
• Be user-friendly: help users manage their privacy; give users easy-to-understand choices and mechanisms for managing their privacy
Some Recommendations to App Developers (2)
• Keep data secure
• Keep data in a portable format
• Set data retention and deletion periods
• Ensure default settings are privacy-protective
• Take measures to protect children from endangering themselves
• Create appropriate tools to deactivate and delete data from applications and accounts
• Target ads only with legitimately collected data
Source: « Privacy Design Guidelines for Mobile Application Development », GSM Assoc.
Data Sanitization
Goals: how can we prevent data leakage from public datasets?
• Predict flu outbreaks
• Improve transportation, logistics,…: improve knowledge and efficiency
• Data is power…. But…
BIG DATA is Useful
Possible Privacy Breach Examples: AOL, Netflix,…
In 2006, AOL released 20 million search queries from 650,000 users
« Anonymized » by removing the AOL id and IP address
Easily de-anonymized in a couple of days by looking at the queries
BIG DATA and PRIVACY
Source of Problem
The data contains:
Attribute values which can uniquely identify an individual: {zip code, nationality, age} or/and {name} or/and {SSN}
Sensitive information corresponding to individuals: {medical condition, salary, location}

  (Non-Sensitive Data)                       (Sensitive Data)
# | Zip   | Age | Nationality | Name  | Condition
1 | 13053 | 28  | Indian      | Kumar | Heart Disease
2 | 13067 | 29  | American    | Bob   | Heart Disease
3 | 13053 | 35  | Canadian    | Ivan  | Viral Infection
4 | 13067 | 36  | Japanese    | Umeko | Cancer
Source of Problem
Even if we remove the directly and uniquely identifying attributes, some fields may still uniquely identify some individuals!
The attacker can join them with other sources and identify individuals.

# | Zip | Age | Nationality | Condition (sensitive)
… | …   | …   | …           | …
{Zip, Age, Nationality} form a quasi-identifier
Source of Problem

Published Data:
# | Zip   | Age | Nationality | Condition
1 | 13053 | 28  | Indian      | Heart Disease
2 | 13067 | 29  | American    | Heart Disease
3 | 13053 | 35  | Canadian    | Viral Infection
4 | 13067 | 36  | Japanese    | Cancer

Voter List:
# | Name  | Zip   | Age | Nationality
1 | John  | 13053 | 28  | American
2 | Bob   | 13067 | 29  | American
3 | Chris | 13053 | 23  | American

Data leak!
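A minimal sketch of this linkage attack with pandas (the tables mirror the example above):

import pandas as pd

published = pd.DataFrame({
    "Zip": [13053, 13067, 13053, 13067],
    "Age": [28, 29, 35, 36],
    "Nationality": ["Indian", "American", "Canadian", "Japanese"],
    "Condition": ["Heart Disease", "Heart Disease", "Viral Infection", "Cancer"],
})
voters = pd.DataFrame({
    "Name": ["John", "Bob", "Chris"],
    "Zip": [13053, 13067, 13053],
    "Age": [28, 29, 23],
    "Nationality": ["American", "American", "American"],
})

# Join on the quasi-identifiers {Zip, Age, Nationality}:
leak = voters.merge(published, on=["Zip", "Age", "Nationality"])
print(leak)   # Bob is re-identified and linked to Heart Disease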
Sanitization…
Several data anonymization methods…
Random perturbation: input perturbation, output perturbation
Generalization: the data domain has a natural hierarchical structure
Suppression
Permutation: destroying the link between identifying and sensitive attributes that could lead to a privacy leakage
Randomization Methods
K-anonymity
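Since the k-anonymity slide itself is an image, here is a minimal sketch of k-anonymity via generalization (the data reuses the earlier example table; the coarsening rules are ours):

import pandas as pd

df = pd.DataFrame({
    "Zip": [13053, 13067, 13053, 13067],
    "Age": [28, 29, 35, 36],
    "Condition": ["Heart Disease", "Heart Disease", "Viral Infection", "Cancer"],
})

anon = df.copy()
anon["Zip"] = anon["Zip"].astype(str).str[:3] + "**"      # 13053 -> "130**"
anon["Age"] = (anon["Age"] // 10 * 10).astype(str) + "s"  # 28 -> "20s"

# Every (Zip, Age) combination now occurs at least twice: 2-anonymous.
print(anon.groupby(["Zip", "Age"]).size())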
Some Other Sanitization Schemes
The failure of Anonymization paper here…
Why Data Anonymization is Hard: External Information
Sweeney’s Original Attack
Netflix Data Release [Narayanan, Shmatikov 2008]
Netflix Data Release
Other Attacks
A Simple Exercise
Existing privacy models such as k-anonymity and l-diversity seem weak/broken
Differential Privacy: relatively recent [Dwork 2006]; provides strong and measurable guarantees; secure even with external sources of data
Toward « Secure » Anonymization
Differential Privacy
For any two databases D, D′ that differ in one record, and any output D*:
Pr(M(D) = D*) ≤ e^ε · Pr(M(D′) = D*)
Intuition: changes to my data are not noticeable; the output is “independent” of my data
Differential Privacy
Histogram Release with Laplace Mechanism
Add random Laplace noise to each bin before publishing!
• Global sensitivity: ΔH = Σ_i |H_i − H′_i| over neighboring histograms H, H′
• For histograms: ΔH = 1
• If λ = ΔH / ε, we have ε-differential privacy!
∏_i Pr(H_i + Laplace(λ) = H*_i) / ∏_i Pr(H′_i + Laplace(λ) = H*_i) ≤ exp( Σ_i |H_i − H′_i| / λ ) = e^{1/λ}
[Figure: neighboring histograms H = (H1, H2, H3, H4, H5) and H′ = (H1, H2+1, H3, H4, H5), differing by one count in a single bin]
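A minimal sketch of the mechanism with NumPy (the bin counts are made up; with sensitivity ΔH = 1 and scale λ = 1/ε this gives ε-differential privacy, as stated above):

import numpy as np

def release_histogram(hist, epsilon):
    # Laplace mechanism: noise scale lambda = sensitivity / epsilon
    sensitivity = 1.0                 # one user changes one bin by 1
    lam = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=lam, size=len(hist))
    return np.asarray(hist, dtype=float) + noise

true_hist = [12, 30, 7, 45, 6]        # H1..H5 (made-up counts)
print(release_histogram(true_hist, epsilon=0.5))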
Utility / privacy trade-off: strong privacy means large noise… which reduces utility
Provides good performance when values are much larger than the noise
The noise depends on the sensitivity, not on the data values!
Most algorithms use aggregation to increase count values
Not very efficient for high-dimensional data where aggregation is not easy, such as sequential data
Differential Privacy
Some PETS
Crypto. Can also help…
• Anonymous credentials: an anonymous variant of credentials that can be used to prove a property linked to their owner, or the right of access to some resources, without having to reveal the owner's identity.
• Group signature: a method to prove that someone belongs to a group by signing a message anonymously on behalf of the group.
• Zero-knowledge proof: a cryptographic protocol by which a prover can convince a verifier of the validity of a statement (for which the prover knows a proof) without revealing any information other than the veracity of this statement.
Some useful tools
• Private information retrieval: a cryptographic primitive by which a client can learn an element of a database stored on a server without the server learning which element was retrieved (to protect the privacy of the query).
• Homomorphic encryption: a cryptosystem in which it is possible to perform operations on encrypted data (additions/multiplications) without any knowledge of the secret key (a minimal sketch follows below).
• Private Set Intersection, Secret Handshakes,…
• …and many more
Some useful tools
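As a concrete illustration of additively homomorphic encryption, a minimal sketch using the third-party phe package (a Paillier implementation; this library is our choice, not one named in these slides):

from phe import paillier   # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

a, b = 5, 12
enc_a = public_key.encrypt(a)
enc_b = public_key.encrypt(b)

# The addition happens on ciphertexts; the secret key is never used:
enc_sum = enc_a + enc_b

print(private_key.decrypt(enc_sum))   # 17, computed without exposing a or b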
Conclusion
Active research areas:
Data anonymization of database records and other data structures (e.g., network graphs)
Private communication (prevention of traffic analysis)
Anonymous and covert communication
Crypto protocols
Privacy-enhanced authentication and identity management
Operations in the encrypted domain
Anonymous search and retrieval of information
Privacy-preserving biometric authentication
Location privacy
Ubiquitous environments
Constrained devices
Securing the physical link
Social networks
Problems not quite solved yet…
Privacy is important, and will become even more important with ubiquitous networking
Some important topics:
Big Data privacy
Genomic privacy
Reality/physical mining: inferring human relationships and behaviour from information collected by smartphones
Augmented reality: the convergence of face recognition, social networks, and data mining
TV and mobile advertising
Scary!