
Privacy Preserving Indexing of Documents on the Network

Mayank Bawa, Roberto J. Bayardo Jr., Rakesh Agrawal

Sharing Private Content

• Rapid growth in Private & Semi-Private information on the network
  – Experimental results of drug tests
  – Drafts of research papers, patents, …
  – Architectural CAD documents

• Mechanisms to search information have failed to keep pace
  – Public Information: Google, Yahoo!
  – Private Information: ???

Talk Overview

1. Content Privacy issues in sharing access-controlled content

2. Data structure for search on access-controlled content

3. Algorithm for building such a data structure

Privacy issues in sharing access-controlled content

Provider

• Shares documents

• Enforces access policy

[Figure: provider P1 shares documents labeled “Alzheimer’s Disease” (Alice, Bob), “AIDS” (Alice), and “Small-Pox” (Alice, Bob, Lisa, …); other providers shown: P2, P3, P32, P2026]

Searcher

• Wants documents that match her keyword query Q

• Has an identity

[Figure: searcher Alice with query Q = “Amyloid Peptide”; providers P1, P2, P3, P32, P2026]

Retrieve a Document

[Figure: Alice sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026; the diagram is annotated “? Alice”]

Retrieve a Document

[Figure: George sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026; the diagram is annotated “? George”]

Search Process (Today)

[Figure: Alice sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026]

Automating Search

A searcher s issues a query q expecting a set of documents d such that

1. d is shared by some provider p

2. d matches the query q

3. d is accessible to s as dictated by p’s access policy
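The three conditions translate directly into a filter. The sketch below is illustrative only: the `Document` class, the all-keywords matching rule, and the sample document texts are assumptions, not part of the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    provider: str                                           # provider p sharing the document
    text: str                                               # document content
    readers: frozenset = field(default_factory=frozenset)   # p's access list for d


def matches(doc, query):
    """Condition 2: every query keyword occurs in the document (assumed semantics)."""
    words = set(doc.text.lower().split())
    return all(term in words for term in query.lower().split())


def search(docs, searcher, query):
    """Return documents that are shared, match the query, and are accessible."""
    return [
        d for d in docs                  # condition 1: d is shared by some provider
        if matches(d, query)             # condition 2: d matches q
        and searcher in d.readers        # condition 3: p's access policy admits s
    ]


# Hypothetical documents; texts are invented for the example.
docs = [
    Document("P1", "amyloid peptide in alzheimer's disease", frozenset({"Alice", "Bob"})),
    Document("P1", "aids trial results", frozenset({"Alice"})),
]
alice_hits = search(docs, "Alice", "Amyloid Peptide")
george_hits = search(docs, "George", "Amyloid Peptide")
```

This centralizes everything in one trusted function; the rest of the talk is about getting the same result set without any single party seeing all documents and policies.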

Automating Search

[Figure: George sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026; the diagram is annotated “(Alzheimer’s Disease, Alice) ???”]

Content Privacy

An adversary A should not be able to deduce, using the search mechanism, that provider P is sharing document d with keywords q unless A has been granted access to d by P

An access-controlled search mechanism with content privacy

Soln #1: Document Index

[Figure: an Inverted Index holds the Documents and Access Policy of provider P1; Alice sends Q = “Amyloid Peptide” to the index, annotated “? Alice”; providers shown: P1, P2, P3, P32, P2026]

Soln #1: Document Index

[Figure: George sends Q = “Amyloid Peptide” to the Inverted Index, annotated “? George”; providers shown: P1, P2, P3, P32, P2026]

Soln #1: Document Index

[Figure: the index over providers P1, P2, P3, P32, P2026 is labeled “Knows Everything”]

Soln #2: Keyword Index

[Figure: a Keyword Index holds the Keywords of provider P1; Alice/George send Q = “Amyloid Peptide” to the index; providers shown: P1, P2, P3, P32, P2026]

Soln #2: Keyword Index

[Figure: the Keyword Index answers Alice/George: “P1 has a document with words ‘Amyloid Peptide’”; providers shown: P1, P2, P3, P32, P2026]

Keyword Index

$t_i \rightarrow \{p : t_i \in d,\ \mathrm{provider}(d) = p\}$

Example: Amyloid → {…, P1, …}; Peptide → {…, P1, …}

Problem Cause: Every term is mapped precisely
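The precise mapping can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the `shared` dictionary, the whitespace tokenizer, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_keyword_index(shared):
    """Map each term t_i to the exact set of providers sharing a document with it."""
    index = defaultdict(set)
    for provider, documents in shared.items():
        for text in documents:
            for term in text.lower().split():
                index[term].add(provider)
    return index


def lookup(index, query):
    """Providers listed under every query term (terms may come from different documents)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical content, for illustration only.
shared = {
    "P1": ["amyloid peptide study"],
    "P2": ["smallpox vaccine notes"],
}
index = build_keyword_index(shared)
result = lookup(index, "Amyloid Peptide")
```

Because every posting set is exact, anyone who can query the index learns that P1 holds these terms, whether or not P1 would grant them access.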

Soln #2: Keyword Index

Intuition

Add “false positives”

Example

Amyloid → {…, P1, P2, …}

Peptide → {…, P1, P2, …}
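One way to picture the intuition is to pad every posting set with providers that do not hold the term. The sketch below is an assumption-laden illustration: `add_false_positives`, the uniform sampling, and the "as many false as true" target are choices made for the example, not the construction the talk goes on to describe.

```python
import random


def add_false_positives(index, all_providers, rng):
    """Pad each term's provider set with as many non-matching providers as matching ones."""
    padded = {}
    for term, true_providers in index.items():
        candidates = sorted(set(all_providers) - set(true_providers))
        k = min(len(true_providers), len(candidates))
        padded[term] = set(true_providers) | set(rng.sample(candidates, k))
    return padded


# Hypothetical precise index and provider list.
precise = {"amyloid": {"P1"}, "peptide": {"P1"}}
providers = ["P1", "P2", "P3", "P32", "P2026"]
noisy = add_false_positives(precise, providers, random.Random(0))
```

Note that noise drawn independently per term, as here, could be filtered out by intersecting answers across related queries, so this sketch shows the idea only.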

Soln #3: Privacy Preserving Index

Soln #3: Privacy Preserving Index (PPI)

[Figure: Alice/George send Q = “Amyloid Peptide” to a Privacy Preserving Index, which lists P1 and P2; providers shown: P1, P2, P3, P32, P2026]

Soln #3: Privacy Preserving Index (PPI)

[Figure: the Privacy Preserving Index answers Alice/George: “P1 or P2 may have a document with words ‘Amyloid Peptide’”; providers shown: P1, P2, P3, P32, P2026]

Soln #3: Privacy Preserving Index

Privacy Preserving Index

$t_i \rightarrow M \subseteq P$

[A] $M = \emptyset$ only if $\nexists\, d_j : t_i \in d_j$

[B] $M = P_{true} \cup P_{false}$, $|P_{false}| \geq |P_{true}|$

[C] $M = P$

Completeness, Quantifiable Privacy on Reiter-Rubin scale, Loss in Selectivity
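The three allowed shapes of an answer can be checked mechanically. The function below is a sketch under the assumption that the set of true positives is known to the checker (which only holds in a test setting); the name `ppi_case` and its return values are hypothetical.

```python
def ppi_case(M, P, P_true):
    """Classify a result set M for one term against the three allowed forms.

    M: providers returned; P: all providers; P_true: providers that really match.
    Returns "A", "B", "C", or None if M is not a valid answer.
    """
    M, P, P_true = set(M), set(P), set(P_true)
    if not M <= P or not P_true <= M:
        return None                 # unknown provider, or a true positive was dropped
    if not M:
        return "A"                  # empty answer; P_true is necessarily empty here
    if M == P:
        return "C"                  # every provider is returned
    P_false = M - P_true
    if len(P_false) >= len(P_true):
        return "B"                  # at least as many false as true positives
    return None


P = {"P1", "P2", "P3", "P32", "P2026"}
case = ppi_case({"P1", "P2"}, P, {"P1"})
```

An answer of exactly `{"P1"}` is rejected: it has no false positives, which is the precise mapping the keyword index suffered from.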

Consistency of Behavior

1. Results for “Peptide” should tally with results from searches earlier

2. Results for “Amyloid Peptide”, “Amyloid”, and “Peptide” should tally

3. …

Filtering of “noise” impossible

A mechanism for constructing a Privacy Preserving Index (PPI)

Step 1: Content Vectors

[Figure: a content vector drawn as a bit vector with entries 0 and 1]

Step 2: Privacy Groups

[Figure: providers partitioned into privacy groups, labeled Group A, Group F, Group Z]

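Step 3: Group (OR) Vector

[Equation garbled in extraction: a bound on the number of rounds r, written as a max(…) of two terms, one of them a logarithm, involving c and an error parameter between 0 and 1]

Theorem: After r rounds, the Group Vector $V_G$ subsumes $\bigvee_i V_i$ with prob. $1 - \epsilon$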

Step 4: Global Index

[Figure: the Group Vectors of Group A, Group F, and Group S are combined into the Keyword Index (PPI); providers shown: P1, P2, P3, P32, P2026]

Searches

[Figure: Alice/George send Q = “Amyloid Peptide” to the Keyword Index (PPI), which answers “Group F”; groups shown: Group A, Group F, Group S; providers shown: P1, P2, P3, P32, P2026]
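Steps 1 to 4 and the search flow can be sketched end to end. This is a simplified illustration, not the deck's implementation: the group vector is computed directly as a plain OR (the talk builds it with a randomized protocol), and the vector length, the single-hash term-to-bit mapping `bit_for`, the provider contents, and the group memberships are all assumptions made for the example.

```python
import hashlib

VECTOR_BITS = 64  # assumed vector length


def bit_for(term):
    """Hypothetical term-to-bit mapping: one hash position per term."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % VECTOR_BITS


def content_vector(terms):
    """Step 1: a provider summarizes its terms as a bit vector (stored as an int)."""
    vec = 0
    for term in terms:
        vec |= 1 << bit_for(term)
    return vec


def group_vector(vectors):
    """Step 3 (target result only): logical OR of the members' content vectors."""
    out = 0
    for v in vectors:
        out |= v
    return out


def build_index(groups, provider_terms):
    """Step 4: map each bit position to the groups whose vector has it set."""
    index = {}
    for name, members in groups.items():
        gv = group_vector(content_vector(provider_terms[p]) for p in members)
        for bit in range(VECTOR_BITS):
            if (gv >> bit) & 1:
                index.setdefault(bit, set()).add(name)
    return index


def candidate_providers(index, groups, query):
    """Search: groups matching every query term; all their members are candidates."""
    hits = None
    for term in query.split():
        g = index.get(bit_for(term), set())
        hits = g if hits is None else hits & g
    return sorted(p for name in (hits or set()) for p in groups[name])


# Hypothetical contents and group assignment (Step 2).
provider_terms = {"P1": ["amyloid", "peptide"], "P2": ["smallpox"], "P3": ["vaccine"],
                  "P32": ["cad"], "P2026": ["patent"]}
groups = {"A": ["P3"], "F": ["P1", "P2"], "S": ["P32", "P2026"]}
index = build_index(groups, provider_terms)
result = candidate_providers(index, groups, "Amyloid Peptide")
```

The searcher gets back every member of a matching group and must contact each one, where the provider's own access control decides what is actually returned. Hash collisions can add further groups to the answer, which only increases the number of false positives.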

Intuition: 3. Group Vector

Group Vector is a logical OR => Members are indistinguishable

Intuition: 3. Group Vector

Group Vector is a logical OR => Members are indistinguishable

Privacy ∝ size of group

Intuition: 3. Group Vector

Group Vector is a logical OR => Members are indistinguishable

Privacy ∝ size of group

Search Cost ∝ size of group

Privacy vs Performance Tradeoff

Empirical Evaluation

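Number of Rounds (Step 3)

[Equation garbled in extraction: the Step 3 bound on the number of rounds r, written as a max(…) of two terms, one of them a logarithm, involving c]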

Evaluation Procedure

• YouServ: Personal web-server deployed within IBM corporate intranet since 2001

• Content from 324 YouServ web-servers

• Partitioned into privacy groups of size C

• Query Set consisting of 100 queries chosen randomly from YouServ query logs

Loss in Recall

Summary

• Searches on access-controlled data
  – Privacy Preserving Indexes
  – Randomized Construction

• Project Home
  – Google: Stanford Peers
  – Google: IBM YouServ

The End

Growing Privacy Concerns

• Popular Press
  – Economist: The End of Privacy (’99)
  – Time: The Death of Privacy (’97)

• Govt. Directives/Commissions
  – European Union Directive on Privacy Protection (’98)
  – Canadian Personal Information Protection Act (’01)

Context

“The misuse of subpoena process by an adult entertainment company emphasizes the potential for abuse with insufficient privacy protections in the law.”

--- Cindy Cohn (Legal Director, Electronic Frontier Foundation)

Context

“Better support for anonymity and privacy is sorely needed […] amid the RIAA’s campaign to subpoena information about customers.”

--- Wendy Seltzer (Staff Attorney, Electronic Frontier Foundation)

Growing Privacy Concerns

In 07/2003, the RIAA began filing DMCA Section 512(h) subpoenas, at the rate of 75 or more per day, to force ISPs to identify file sharers.

DMCA 512(h) subpoenas are issued without prior judicial review […and so…] may be used to obtain identity information in cases where there is no copyright infringement.

Growing Privacy Concerns

• Unfair: Walmart/KMart against a customer who posted their prices at a comparison-shopping site

• Errors: RIAA against Prof. Usher at Penn State Dept. of Astronomy & Astrophysics [+ dozen other cases]

• Vested: A person against ISPs to erase record of his past messages

• Others: Against Internet Archive, …

Automating Search

[Figure: Alice sends Q = “Amyloid Peptide” to providers P1, P2, P3, P32, P2026 and obtains “(Alzheimer’s Disease, Alice)”]

Adversary

Passive (observes sent messages: queries, responses, indexes)

Active (acts deliberately: searcher, provider, indexer)

Global/Local view

Collude/Independent actions

Quantifying Privacy

[Figure: Probabilistic Scale [Rei98] marked 0, 1/2, 1, with end points labeled Absolute Privacy and Provable Exposure]

Search Methodology

Privacy Preserving Index

$t_i \rightarrow M \subseteq P$

[A] $M = \emptyset$ only if $\nexists\, d_j : t_i \in d_j$

[B] $M = P_{true} \cup P_{false}$, $|P_{false}| \geq |P_{true}|$

[C] $M = P$

Loss in Selectivity: $|P_{false}| / |P_{true}|$ for [B]; at most 2 for [C]

Search Methodology

Privacy Preserving Index

$t_i \rightarrow M \subseteq P$

[A] $M = \emptyset$ only if $\nexists\, d_j : t_i \in d_j$

[B] $M = P_{true} \cup P_{false}$, $|P_{false}| \geq |P_{true}|$

[C] $M = P$

Correctness: No true positives excluded; provider enforces access control

Search Methodology

Privacy Preserving Index

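$t_i \rightarrow M \subseteq P$

[A] $M = \emptyset$ only if $\nexists\, d_j : t_i \in d_j$

[B] $M = P_{true} \cup P_{false}$, $|P_{false}| \geq |P_{true}|$

[C] $M = P$

Privacy: All providers equivalent in [A,C]

[Figure: probabilistic scale marked 0, 1/2, 1, with [B] placed on it]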

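3. Constructing OR Vector

Group F

[Pseudocode garbled in extraction. Recoverable outline, for bits $b_i$ and $B_i$:
  if $b_i = B_i$: nop
  else if (0, 1 case): $B_i \leftarrow 1$ with prob. $P_{in}$
  else if (1, 0 case): $B_i \leftarrow 0$ with prob. $P_{out}$
Start: $P_{in}$ and $P_{out}$ are initialized; the exact expressions, which involve the constants 1 and 2, are not recoverable]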

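3. Constructing OR Vector

Group F

[Pseudocode garbled in extraction. Recoverable outline, for bits $b_i$ and $B_i$:
  if $b_i = B_i$: nop
  else if (0, 1 case): $B_i \leftarrow 1$ with prob. $P_{in}$
  else if (1, 0 case): $B_i \leftarrow 0$ with prob. $P_{out}$
Every Round: $P_{in}$ and $P_{out}$ are updated from $P_{in}$; the exact expressions, which involve the constants 1 and 2, are not recoverable]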
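The bit-update rule on this slide can be sketched as a small simulation. Several parts are assumptions, because the slide does not survive extraction intact: the probability schedule ($P_{out} = 1/2^r$ in round $r$ and $P_{in} = 1 - P_{out}$), the all-zero starting vector, and the fixed order in which members handle the vector. Only the three-way rule (no-op when bits agree, set with $P_{in}$, clear with $P_{out}$) is taken from the slide.

```python
import random


def randomized_group_vector(member_vectors, rounds, rng):
    """Pass a group vector B around the members for `rounds` rounds.

    Assumed schedule (not recoverable from the slide): P_out = 1 / 2**r in
    round r, and P_in = 1 - P_out.
    """
    width = len(member_vectors[0])
    B = [0] * width                          # group vector, assumed to start at zero
    for r in range(1, rounds + 1):
        p_out = 1.0 / 2 ** r
        p_in = 1.0 - p_out
        for b in member_vectors:             # each member handles B once per round
            for i in range(width):
                if b[i] == B[i]:
                    continue                 # nop when the bits agree
                if b[i] == 1:                # member has the bit, B does not
                    if rng.random() < p_in:
                        B[i] = 1
                else:                        # B has the bit, member does not
                    if rng.random() < p_out:
                        B[i] = 0
    return B


# Hypothetical content vectors for a three-member group.
members = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1]]
target = [max(col) for col in zip(*members)]  # plain logical OR
final = randomized_group_vector(members, 40, random.Random(7))
```

Under this assumed schedule, early rounds erase and restore bits often, so an observer of an intermediate vector cannot tell which member set a bit; late rounds almost never erase, so the vector converges to the plain OR with overwhelming probability.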

Construction Properties

Completeness: For any query q, the result set Mq contains all providers that share documents matching q

Correctness: The mapping Mq is expected to be a Privacy Preserving Index

Construction Properties

Privacy: Within a privacy group G, an active adversary can only breach its neighbor’s privacy with probability < 0.71 (Possible Innocence)

[Figure: probabilistic scale marked 0, 1/2, 1]

Data Characteristics

Selectivity of a Term

Related Work

• Private Information Retrieval
  – Information theoretic privacy
  – Inefficient for keyword searching

• Secure Databases
  – Single trusted data host

• Anonymity Channels
  – Source of message to be anonymous

• Secure Multi-Party/Coprocessors