Introduction to Privacy
Why We Care? New Information Technologies:
A) Digital storage, retrieval, and distribution: enormous cost reductions
B) Data sharing and processing, e.g., data mining
C) Ubiquitous networking and communication (sensors, RFID, smartphones,…)
An emergent and fundamental change
Why We Care:
[Figure: "Privacy" books per year, 1901–1991, from a university library database; y-axis: number of books, 0–60]
Privacy Concerns
Important Points
Privacy bounds vary between cultures
Laws, rules, and conventions vary as well
The focus was originally on only one relationship:
Government ↔ citizen (citizens have little control over the information they provide…)
Going Digital
Starting around 1970:
Commercial databases
Open data exchange standards
Data exchange mechanisms (networks)
Exponentially increasing amounts of usable data
Focus shifted to:
Government, private sector ↔ citizen
History (1)
History (2)
History (3)
History (4)
Popular Arguments against Privacy
“If you care so much about your privacy, it’s because you have something to hide”
“Surveillance is good and privacy is bad for national security. We need a tradeoff between privacy and security”
“People don’t care about privacy”
The “nothing-to-hide” argument
This information is not necessarily secret, but do you want to broadcast it?
Identity attributes: name, age, gender, race, IQ, marital status, place of birth, address, phone number, ID number…
Location: where you are at a certain point in time, movement patterns
Interests / preferences: books you read, music you listen to, films you like, sports you practice
Political affiliation, religious beliefs, sexual orientation
Behavior: personality type, what you eat, what you buy, how you behave and interact with others
Health data: medical issues, treatments you follow, DNA, health risk factors
Social network: who your friends are, who you meet and when, your different social circles
Financial data: how much you earn, how you spend your money, credit card number,…
The danger
Surveillance: we are moving into a surveillance society; companies and governments gather huge amounts of information about users
Discrimination: profiling may reveal that a user suffers from a certain disease; an insurer might then deny coverage
Personalization: filter bubbles
Information leakage
We need privacy-preserving systems…
PRIVACY DEFINITIONS
What is Privacy?
Abstract and subjective concept, hard to define
Dependent on cultural issues
A couple of popular definitions:
“The right to be let alone”: focus on freedom from intrusion
“Informational self-determination”: focus on control
How do we formalize privacy properties in computer systems?
Solove's Taxonomy
A Taxonomy of Privacy by Daniel Solove
What is Privacy?
How do we formalize privacy properties in computer systems?
Privacy properties from a technical point of view: Anonymity
Hiding the link between an identity and an action or piece of information. Examples:
Reader of a web page, person accessing a service
Sender of an email, writer of a text
Person to whom an entry in a database relates
Pfitzmann-Hansen terminology:
“Anonymity is the state of being not identifiable within a set of subjects, the anonymity set”
“The anonymity set is the set of all possible subjects who might cause an action”
“Anonymity is the stronger, the larger the respective anonymity set is and the more evenly distributed the sending or receiving, respectively, of the subjects within that set is.”
A probabilistic definition.
Source: Anonymity, Unobservability, Pseudonymity, and Identity Management – A Proposal for Terminology
Privacy properties from a technical point of view: Unlinkability
Hiding the link between two or more actions / identities / pieces of information. Examples:
Two anonymous letters written by the same person
Two web page visits by the same user
Entries in two databases related to the same person
Two people related by a friendship link
Same person spotted in two locations at different points in time
Pfitzmann-Hansen terminology:
“Unlinkability of two or more items means that within a system, these items are no more and no less related than they are related concerning the a-priori knowledge”
Focus on the information leakage of a system
Privacy properties from a technical point of view: Unobservability
Hiding user activity. Examples:
Impossible to see whether someone is accessing a web page
Impossible to know whether an entry in a database corresponds to a real person
Impossible to distinguish whether someone or no one is in a given location
Pfitzmann-Hansen terminology:
“Unobservability is the state of items of interest being indistinguishable from any item of interest at all”
“Sender unobservability then means that it is not noticeable whether any sender within the unobservability set sends.”
PRIVACY METRICS
Can we “measure” privacy? We need to specify:
Privacy properties we want to achieve
Adversary model: goals and capabilities
Typically, adversaries are able to obtain probabilistic information.
Examples:
Probability of a person being the anonymous subject we want to identify (limited # of people in the world)
Probability of two information items being related to each other (e.g., two web page requests coming from the same user)
Many proposals; this is an open research field
Example: an information-theoretic approach (sketched below)
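As a rough illustration of such an information-theoretic metric (a sketch in the spirit of entropy-based anonymity measures; the function name and example distributions are ours, not from any specific paper cited here), we can take the entropy of the attacker's probability distribution over the anonymity set:

import math

def anonymity_entropy(probs):
    # Shannon entropy, in bits, of P[subject i caused the action]
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4            # 4 equally likely senders
skewed = [0.7, 0.1, 0.1, 0.1]   # the attacker strongly suspects one sender

print(anonymity_entropy(uniform))  # 2.0 bits: the maximum for 4 subjects
print(anonymity_entropy(skewed))   # ~1.36 bits: anonymity has degraded

The larger and the more evenly distributed the anonymity set, the higher the entropy, matching the Pfitzmann-Hansen intuition above.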
A Primer on Info. Theory & Privacy
There are around 7 billion humans on the planet, so the identity of a random, unknown person contains just under 33 bits of entropy (2^33 ≈ 8 billion).
When we learn a new fact about a person, that fact reduces the entropy of their identity by a certain amount. There is a formula that says how much:
ΔS = − log2 Pr(X = x)
where ΔS is the reduction in entropy, measured in bits, and Pr(X = x) is simply the probability that the fact would be true of a random person.
A Primer on Info. Theory & Privacy
For example:
Star sign: ΔS = − log2 Pr(STARSIGN = capricorn) = − log2 (1/12) = 3.58 bits of information
Birthday: ΔS = − log2 Pr(DOB = 2nd of January) = − log2 (1/365) = 8.51 bits of information
Note that if you combine several facts together, you might not learn anything new; for instance, telling me someone's star sign doesn't tell me anything new if I already knew their birthday.
How much entropy is needed to identify someone?
If we know someone's birthday, and we know their ZIP code is 40203, we have 8.51 + 23.81 = 32.32 bits; that's almost, but perhaps not quite, enough to know who they are: there might be a couple of people who share those characteristics. Add in their gender, and that's 33.32 bits; we can probably say exactly who the person is!
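A quick sketch of this arithmetic in Python (the 23.81-bit figure for ZIP 40203 is taken from the example above; the helper name is ours):

import math

def bits(p):
    # Entropy reduction, in bits, from a fact true with probability p
    return -math.log2(p)

birthday = bits(1 / 365)   # ≈ 8.51 bits
zip_code = 23.81           # given above for ZIP code 40203
gender = bits(1 / 2)       # 1 bit

total = birthday + zip_code + gender
print(round(total, 2), "bits vs. the ~33 needed to single out one person")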
An Application To Web Browsers
How would this paradigm apply to web browsers?
In addition to the commonly discussed “identifying” characteristics of web browsers, like IP addresses and tracking cookies, there are more subtle differences between browsers that can be used to tell them apart.
One significant example is the User-Agent string, which contains the name, operating system, and precise version number of the browser, and which is sent to every web server you visit.
A typical User-Agent string looks something like this:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
An Application To Web Browsers (2)
It turns out that this User-Agent string is quite useful for telling different people apart on the net.
User-Agent strings contain about 10.5 bits of identifying information: if you pick a random person's browser, only about one in 1,500 other Internet users will share their User-Agent string.
So even if someone uses Tor… the server can still get information about them!
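A quick sanity check of those figures (using the ~33-bit identification threshold from the primer above):

print(2 ** 10.5)   # ≈ 1448 users expected to share a typical User-Agent string
print(33 - 10.5)   # ≈ 22.5 bits still needed to fully identify the user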
Next: Protective Solutions
Network-layer privacy
Web privacy (DNT, plugins,…)
Data sanitization
PETs
Network-layer privacy (slides from Stanford)
Goals: Hide user’s IP address from target web site
Hide browsing destinations from network
1st attempt: anonymizing proxy
https://anonymizer.com?URL=target
[Figure: User1, User2, User3 connect to anonymizer.com, which fetches Web1, Web2, Web3 on their behalf]
Anonymizing proxy: security
Monitoring ONE link: the eavesdropper gets nothing
Monitoring TWO links: the eavesdropper can do traffic analysis
More difficult if lots of traffic goes through the proxy
Trust: the proxy is a single point of failure; it can be corrupt or subpoenaed
Protocol issues: long-lived cookies make connections to a site linkable
How the proxy works
The proxy rewrites all links in the response from the web site
Updated links point to anonymizer.com, ensuring all subsequent clicks are anonymized
The proxy rewrites/removes cookies and some HTTP headers
Proxy IP address: a single address could be blocked by a site or ISP
anonymizer.com consists of >20,000 addresses, globally distributed and registered to multiple domains
Note: the Chinese firewall blocks ALL anonymizer.com addresses
2nd Attempt: MIX nets
Goal: no single point of failure
MIX nets [C’81]
Every router has a public/private key pair; the sender knows all public keys
To send a packet, pick a random route: R2 → R3 → R6 → srvr
Prepare an onion packet:
packet = E_pk2( R3, E_pk3( R6, E_pk6( srvr, msg ) ) )
[Figure: routers R1–R6; the packet travels R2 → R3 → R6 → srvr]
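A minimal runnable sketch of this layered construction (symmetric Fernet keys from the Python cryptography package stand in for the routers' public keys pk2, pk3, pk6, so this illustrates the layering, not the public-key crypto itself):

from cryptography.fernet import Fernet

# One key per router; a real MIX net would use each router's public key.
keys = {name: Fernet.generate_key() for name in ("R2", "R3", "R6")}

def wrap(route, msg):
    # Encrypt in layers, innermost (exit hop) first, so that
    # wrap(["R2","R3","R6"], m) = E_k2(R3 | E_k3(R6 | E_k6(srvr | m)))
    packet = msg
    for hop, next_hop in reversed(list(zip(route, route[1:] + ["srvr"]))):
        packet = Fernet(keys[hop]).encrypt(next_hop.encode() + b"|" + packet)
    return packet

packet = wrap(["R2", "R3", "R6"], b"msg")

# Each router peels exactly one layer and learns only the next hop:
for hop in ("R2", "R3", "R6"):
    next_hop, _, packet = Fernet(keys[hop]).decrypt(packet).partition(b"|")
    print(hop, "forwards to", next_hop.decode())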
Eavesdropper’s view at a single MIX
• Eavesdropper observes incoming and outgoing traffic
• Crypto prevents linking input/output pairs
• Assuming enough packets in the incoming batch
• If packets have variable length, all must be padded to the maximum length
[Figure: user1, user2, user3 send packets into MIX Ri, which collects them into a batch before forwarding]
Performance
Main benefit: privacy as long as there is at least one honest router on the path
Problems:
High latency (lots of public-key operations): inappropriate for interactive sessions, may be OK for email (e.g., the Babel system)
No forward security
(Perfect forward secrecy, or PFS, is the property that a session key derived from a set of long-term public and private keys will not be compromised if one of the long-term private keys is compromised in the future.)
3rd Attempt: Tor, a MIX circuit-based method
Goals: privacy as long as there is one honest router on the path, and reasonable performance
The Tor design
A trusted directory contains the list of Tor routers
The user's machine preemptively creates a circuit
Used for many TCP streams
A new circuit is created once a minute
[Figure: a circuit through routers R1 → R2 → R3 → R4 carries stream1 and stream2 to srvr1 and srvr2; one minute later, a new circuit through different routers (e.g., R5, R6) is used]
Creating circuits
[Figure: circuit setup between the user, R1, and R2; each link is TLS encrypted]
1. User → R1: Create C1, with a D-H key exchange; the user and R1 now share key K1
2. User → R1: Relay C1 {Extend R2}; R1 → R2: Create, and the D-H key exchange runs through R1; the user and R2 now share key K2
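A minimal sketch of one such per-hop key exchange (modern X25519 from the Python cryptography package stands in for the D-H variant Tor actually uses; the HKDF label is ours):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

user_priv = X25519PrivateKey.generate()
router_priv = X25519PrivateKey.generate()

# Each side combines its own private key with the peer's public key:
shared_user = user_priv.exchange(router_priv.public_key())
shared_router = router_priv.exchange(user_priv.public_key())
assert shared_user == shared_router   # both ends now hold the same secret

# Derive the shared circuit key (K1) from the raw shared secret:
K1 = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
          info=b"circuit-key").derive(shared_user)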
Once the circuit is created
The user has a shared key with each router in the circuit
Routers only know the IDs of their successor and predecessor
[Figure: circuit R1 → R2 → R3 → R4; the user holds K1, K2, K3, K4, while each router Ri holds only its own Ki]
Sending data
[Figure: user → R1: Relay C1 {Begin site:80}; R1 → R2: Relay C2 {Begin site:80}; R2 ↔ site: TCP handshake; user → R1: Relay C1 {data: HTTP GET}; R1 → R2: Relay C2 {data: HTTP GET}; R2 → site: HTTP GET; the response travels back, each hop re-wrapping it: site → R2: resp; R2 → R1: Relay C2 {data: resp}; R1 → user: Relay C1 {data: resp}; keys K1, K2 protect the user–R1 and user–R2 layers]
Properties
Performance: fast connection time, since the circuit is pre-established
Traffic encrypted with AES: no public-key operations on traffic
Tor crypto provides end-to-end integrity for traffic
Forward secrecy via TLS
Downside: routers must maintain state per circuit
Each router can link multiple streams via the CircuitID: all streams in one one-minute interval share the same CircuitID
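A minimal sketch of how a data cell is layered with the per-hop keys K1..K4 established above (Fernet stands in for Tor's actual AES-CTR relay crypto; note that only symmetric operations touch the traffic):

from cryptography.fernet import Fernet

hop_keys = [Fernet.generate_key() for _ in range(4)]   # K1..K4

cell = b"HTTP GET /"
for k in reversed(hop_keys):    # the user wraps the innermost layer first
    cell = Fernet(k).encrypt(cell)

for k in hop_keys:              # each router Ri peels exactly one layer
    cell = Fernet(k).decrypt(cell)
print(cell)                     # the exit router recovers the request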
Privoxy
Tor only provides network-level privacy, not application-level privacy
e.g., mail programs add “From: email-addr” to outgoing mail
Privoxy: a web proxy for browser-level privacy
Removes/modifies cookies, plus other web page filtering
Web Tracking & Privacy
Privacy considerations of online behavioural tracking, C. Castelluccia and A. Narayanan
Web Tracking
Why are we tracked?
Personalized services tailored to the user and location
Targeted advertisement: most content is “free”; you are paying with your data
Often, you are not the customer but the product!
Online Advertising: Simplified Model
3 main entities:
Advertiser: an entity that wants to advertise services/products (e.g., hotels, car manufacturers,…)
Publisher: an entity that hosts the advertisements (e.g., online news sites such as lemonde.fr,…)
Ad-Network: an entity that places advertisements on publisher sites (e.g., Google,…)
Online Advertising: Illustration
[Figure: the publisher (lemonde.fr) embeds ads served by the ad-network (doubleclick.com) on behalf of the advertiser (maier.com)]
Online Advertising: Money Flow
[Figure: the advertiser (maier.com) pays the ad-network (doubleclick.com) $$$; the ad-network pays the publisher (lemonde.fr) $]
Browsing Profiling: How?
[Figure sequence: the user visits cnn.com, wsj.com, and lemonde.fr; each page embeds content from doubleclick.com. Each site sets its own first-party cookie (Cookie_cnn, Cookie_wsj, Cookie_lemonde), but every visit also sends the same third-party cookie Cookie_dc to doubleclick.com, which can therefore link all three visits into one browsing profile]
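A minimal sketch of what the tracker sees (the request data is hypothetical; the Referer header reveals the embedding page, and the shared cookie links the visits together):

# Requests arriving at doubleclick.com from one browser:
requests_seen = [
    {"cookie": "dc_id=42", "referer": "http://cnn.com"},
    {"cookie": "dc_id=42", "referer": "http://wsj.com"},
    {"cookie": "dc_id=42", "referer": "http://lemonde.fr"},
]

profile = {}
for r in requests_seen:
    profile.setdefault(r["cookie"], []).append(r["referer"])
print(profile)   # one cookie maps to the user's whole cross-site history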
http://www.google.com/ads/preferences
Online Tracking: Tracking on the Internet…
• Disable cookies (at least third-party cookies)
• Browser's private mode
• Use DNT (Do Not Track)
• But almost dead
• Use plugins
• To see who is tracking you
• To block ads
• Use Tor
• Disconnect!!!
Some Solutions
• Collusion
• Ghostery
• about:Trackers
• Adblock
PLEASE BLOCK ADS!
Some Nice Plugins
Mobile and Privacy
Online Tracking: Smart Phones
Marketers are tracking smartphone users through “apps”: games and other software on their phones.
Some apps collect information including location, unique serial-number-like identifiers for the phone, and personal details such as age and sex.
Apps routinely send this information to marketing companies that use it to compile dossiers on phone users.
Source: http://online.wsj.com/
Paper Toss App (iPhone)
[Figure: the app sends the phone ID and location both to third parties (Google, Flurry,…) and to its app server]
http://blogs.wsj.com/wtk-mobile/2010/12/17/angry-birds/
Some Recommendations to App Developers
• Enforce privacy-by-design
• Be transparent: tell users who you are and what you collect; ask for the user's active consent; inform,…
• Be minimalist: only collect minimal information
• Be user-friendly: help users manage their privacy; give users easy-to-understand choices and mechanisms for managing their privacy
Some Recommendations to App Developers (2)
• Keep data secure
• Keep data in a portable format
• Set data retention and deletion periods
• Ensure default settings are privacy-protective
• Take measures to protect children from endangering themselves
• Create appropriate tools to deactivate and delete data from applications and accounts
• Target ads only with legitimately collected data
Source: « Privacy Design Guidelines for Mobile Application Development », GSM Assoc.
Data Sanitization
Goals: how can we prevent data leakage from public datasets?
• Predict flu outbreaks
• Improve transportation, logistics,…: improve knowledge and efficiency
• Data is power…. But…
BIG DATA is Useful
Possible Privacy Breach Examples: AOL, Netflix,…
In 2006, AOL released 20 million search queries from 650,000 users
« Anonymized » by removing the AOL id and IP address
Easily de-anonymized in a couple of days by looking at the queries
BIG DATA and PRIVACY
Source of Problem
The data contains:
Attribute values which can uniquely identify an individual: {zip code, nationality, age} or/and {name} or/and {SSN}
Sensitive information corresponding to individuals: {medical condition, salary, location}

  (Non-Sensitive Data)                       (Sensitive Data)
# | Zip   | Age | Nationality | Name  | Condition
1 | 13053 | 28  | Indian      | Kumar | Heart Disease
2 | 13067 | 29  | American    | Bob   | Heart Disease
3 | 13053 | 35  | Canadian    | Ivan  | Viral Infection
4 | 13067 | 36  | Japanese    | Umeko | Cancer
Source of Problem
Even if we remove the directly and uniquely identifying attributes, some fields may still uniquely identify some individuals!
The attacker can join them with other sources and identify individuals.

# | Zip | Age | Nationality | Condition (sensitive)
… | …   | …   | …           | …
{Zip, Age, Nationality} form a quasi-identifier
Source of Problem

Published Data:
# | Zip   | Age | Nationality | Condition
1 | 13053 | 28  | Indian      | Heart Disease
2 | 13067 | 29  | American    | Heart Disease
3 | 13053 | 35  | Canadian    | Viral Infection
4 | 13067 | 36  | Japanese    | Cancer

Voter List:
# | Name  | Zip   | Age | Nationality
1 | John  | 13053 | 28  | American
2 | Bob   | 13067 | 29  | American
3 | Chris | 13053 | 23  | American

Data leak!
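A minimal sketch of this linkage attack with pandas (the tables mirror the example above):

import pandas as pd

published = pd.DataFrame({
    "Zip": [13053, 13067, 13053, 13067],
    "Age": [28, 29, 35, 36],
    "Nationality": ["Indian", "American", "Canadian", "Japanese"],
    "Condition": ["Heart Disease", "Heart Disease", "Viral Infection", "Cancer"],
})
voters = pd.DataFrame({
    "Name": ["John", "Bob", "Chris"],
    "Zip": [13053, 13067, 13053],
    "Age": [28, 29, 23],
    "Nationality": ["American", "American", "American"],
})

# Join on the quasi-identifiers {Zip, Age, Nationality}:
leak = voters.merge(published, on=["Zip", "Age", "Nationality"])
print(leak)   # Bob is re-identified and linked to Heart Disease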
Sanitization…
Several data anonymization methods…
Random perturbation: input perturbation, output perturbation
Generalization: the data domain has a natural hierarchical structure
Suppression
Permutation: destroying the link between identifying and sensitive attributes that could lead to a privacy leakage
Randomization Methods
K-anonymity
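Since the k-anonymity slide itself is an image, here is a minimal sketch of k-anonymity via generalization (the data reuses the earlier example table; the coarsening rules are ours):

import pandas as pd

df = pd.DataFrame({
    "Zip": [13053, 13067, 13053, 13067],
    "Age": [28, 29, 35, 36],
    "Condition": ["Heart Disease", "Heart Disease", "Viral Infection", "Cancer"],
})

anon = df.copy()
anon["Zip"] = anon["Zip"].astype(str).str[:3] + "**"      # 13053 -> "130**"
anon["Age"] = (anon["Age"] // 10 * 10).astype(str) + "s"  # 28 -> "20s"

# Every (Zip, Age) combination now occurs at least twice: 2-anonymous.
print(anon.groupby(["Zip", "Age"]).size())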
Some Other Sanitization Schemes
The failure of Anonymization paper here…
Why Data Anonymization is Hard: External Information
Sweeney’s Original Attack
Netflix Data Release [Narayanan, Shmatikov 2008]
Netflix Data Release
Other Attacks
A Simple Exercise
Existing privacy models such as k-anonymity and l-diversity seem weak/broken
Differential Privacy: relatively recent [Dwork 2006]; provides strong and measurable guarantees; secure even with external sources of data
Toward « Secure » Anonymization
Differential Privacy
For any two databases D, D′ that differ in one record, and any output D*:
Pr(M(D) = D*) ≤ e^ε · Pr(M(D′) = D*)
Intuition: changes to my data are not noticeable; the output is “independent” of my data
Differential Privacy
Histogram Release with Laplace Mechanism
Add random Laplace noise to each bin before publishing!
• Global sensitivity: ΔH = Σ_i |H_i − H′_i| over neighboring histograms H, H′
• For histograms: ΔH = 1
• If λ = ΔH / ε, we have ε-differential privacy!
∏_i Pr(H_i + Laplace(λ) = H*_i) / ∏_i Pr(H′_i + Laplace(λ) = H*_i) ≤ exp( Σ_i |H_i − H′_i| / λ ) = e^{1/λ}
[Figure: neighboring histograms H = (H1, H2, H3, H4, H5) and H′ = (H1, H2+1, H3, H4, H5), differing by one count in a single bin]
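A minimal sketch of the mechanism with NumPy (the bin counts are made up; with sensitivity ΔH = 1 and scale λ = 1/ε this gives ε-differential privacy, as stated above):

import numpy as np

def release_histogram(hist, epsilon):
    # Laplace mechanism: noise scale lambda = sensitivity / epsilon
    sensitivity = 1.0                 # one user changes one bin by 1
    lam = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=lam, size=len(hist))
    return np.asarray(hist, dtype=float) + noise

true_hist = [12, 30, 7, 45, 6]        # H1..H5 (made-up counts)
print(release_histogram(true_hist, epsilon=0.5))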
Utility / privacy trade-off: strong privacy means large noise… which reduces utility
Provides good performance when values are much larger than the noise
The noise depends on the sensitivity, not on the data values!
Most algorithms use aggregation to increase count values
Not very efficient for high-dimensional data where aggregation is not easy, such as sequential data
Differential Privacy
Some PETS
Crypto. Can also help…
• Anonymous credentials: an anonymous variant of credentials that can be used to prove a property linked to their owner, or the right of access to some resources, without having to reveal the owner's identity.
• Group signature: a method to prove that someone belongs to a group by signing a message anonymously on behalf of the group.
• Zero-knowledge proof: a cryptographic protocol by which a prover can convince a verifier of the validity of a statement (for which the prover knows a proof) without revealing any information other than the veracity of this statement.
Some useful tools
• Private information retrieval: a cryptographic primitive by which a client can learn an element of a database stored on a server without the server learning which element was retrieved (to protect the privacy of the query).
• Homomorphic encryption: a cryptosystem in which it is possible to perform operations on encrypted data (additions/multiplications) without any knowledge of the secret key (a minimal sketch follows below).
• Private Set Intersection, Secret Handshakes,…
• …and many more
Some useful tools
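As a concrete illustration of additively homomorphic encryption, a minimal sketch using the third-party phe package (a Paillier implementation; this library is our choice, not one named in these slides):

from phe import paillier   # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

a, b = 5, 12
enc_a = public_key.encrypt(a)
enc_b = public_key.encrypt(b)

# The addition happens on ciphertexts; the secret key is never used:
enc_sum = enc_a + enc_b

print(private_key.decrypt(enc_sum))   # 17, computed without exposing a or b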
Conclusion
Active research areas:
Data anonymization of database records and other data structures (e.g., network graphs)
Private communication (prevention of traffic analysis)
Anonymous and covert communication
Crypto protocols
Privacy-enhanced authentication and identity management
Operations in the encrypted domain
Anonymous search and retrieval of information
Privacy-preserving biometric authentication
Location privacy
Ubiquitous environments
Constrained devices
Securing the physical link
Social networks
Problems not quite solved yet…
Privacy is important, and will become even more important with ubiquitous networking
Some important topics:
Big Data privacy
Genomic privacy
Reality/physical mining: inferring human relationships and behaviour from information collected by smartphones
Augmented reality: the convergence of face recognition, social networks, and data mining
TV and mobile advertising
Scary!