A Utility-Theoretic Approach to Privacy and Personalization
Andreas Krause, Carnegie Mellon University (work performed during an internship at Microsoft Research)
Joint work with Eric Horvitz, Microsoft Research
23rd Conference on Artificial Intelligence | July 16, 2008


Page 1:

A Utility-Theoretic Approach to Privacy and Personalization

Andreas Krause, Carnegie Mellon University

work performed during an internship at Microsoft Research

Joint work with Eric Horvitz, Microsoft Research

23rd Conference on Artificial Intelligence | July 16, 2008

Page 2:

The value of private information for enhancing search

Personalized web search is a prediction problem: "Which page is user X most likely interested in for query Q?"

The more information a service has about a user, the better it can personalize results.

Users are reluctant to share private information (or don't want search engines to log their data).

We apply utility-theoretic methods to optimize the tradeoff:

Getting the biggest “bang” for the “personal data buck”

Page 3:

Utility-theoretic approach

Sharing personal information (topic interests, search history, IP address, etc.)

Net benefit to user = Utility of knowing - Sensitivity of sharing

Page 4:

Utility-theoretic approach

Sharing more information might decrease net benefit

Net benefit to user = Utility of knowing - Sensitivity of sharing

Page 5:

Maximizing the net benefit

How can we find the optimal tradeoff maximizing the net benefit?

[Plot: net benefit as a function of how much information is shared, from "share no information" to "share much information"; the optimum lies somewhere in between.]

Page 6:

Trading off utility and privacy

Set V of 29 possible attributes (each ≤ 2 bits):

Demographic data (location)

Query details (working hours / week day?)

Topic interests (ever visited business / science / … website)

Search history (same query / click before / searches per day?)

User behavior (ever changed Zip, City, Country?)

For each A ⊆ V compute utility U(A) and cost C(A)

Find A maximizing U(A) while minimizing C(A)

Page 7:

Estimating utility U(A) of sharing data

Ideally: how does knowing A help increase the relevance of displayed results?

Very hard to estimate from data

Proxy [Mei and Church '06, Dou et al. '07]: Click entropy!

Learn a probabilistic model for P(C | Q, A) = P(click | query, attributes)

U(A) = H(C | Q) - H(C | Q, A)

(H(C | Q): entropy before revealing attributes; H(C | Q, A): entropy after revealing attributes)

E.g.: A = {X1, X3}, U(A) = 1.3

[Diagram: graphical model with search goal C, query Q, and user attributes X1 = Age, X2 = Gender, X3 = Country.]
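The entropy-reduction utility above can be sketched in a few lines. The toy click log below is invented purely for illustration (queries, attribute tuples, and page names are hypothetical, not the paper's data):

```python
from collections import Counter
from math import log2

def cond_entropy(pairs):
    """H(C | K): conditional entropy of clicks C given a context key K,
    estimated from a list of (key, click) observations."""
    joint = Counter(pairs)
    key_counts = Counter(k for k, _ in pairs)
    n = len(pairs)
    # H(C | K) = -sum over (k, c) of P(k, c) * log2 P(c | k)
    return -sum(cnt / n * log2(cnt / key_counts[k])
                for (k, c), cnt in joint.items())

# Hypothetical click log: (query, shared attribute values, clicked page)
log = [("sports", ("US",), "page1"), ("sports", ("US",), "page1"),
       ("sports", ("US",), "page2"), ("sports", ("UK",), "page3"),
       ("sports", ("UK",), "page3"), ("sports", ("UK",), "page1")]

H_q  = cond_entropy([(q, c) for q, a, c in log])       # H(C | Q)
H_qa = cond_entropy([((q, a), c) for q, a, c in log])  # H(C | Q, A)
U = H_q - H_qa  # utility of sharing attribute set A (entropy reduction)
```

On this toy log, conditioning on the country attribute splits the clicks into two more predictable groups, so U is positive.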

Page 8:

Click entropy example

U(A) = expected click entropy reduction knowing A

Query: "sports"

[Figure: click-frequency histograms over result pages 1-6 for the query "sports". Without attributes, click entropy H = 2.6; conditioning on Country = USA, H = 1.7. Entropy reduction: 0.9.]

[Diagram: graphical model with search goal C, query Q, and user attributes X1 = Age, X2 = Gender, X3 = Country.]

Page 9:

Study of Value of Personal Data

Estimate click entropy from volunteer search log data:

~15,000 users

Only frequent queries (≥ 30 users)

Total ~250,000 queries during 2006

Example: Consider topics of prior visits, V = {topic_arts, topic_kids}

Query: "cars", prior entropy: 4.55

U({topic_arts}) = 0.40

U({topic_kids}) = 0.41

How does U(A) increase as we pick more attributes A?

Page 10:

Diminishing returns for click entropy

The more attributes we add, the less we gain in utility

Theorem: Click entropy U(A) is submodular!*

[Bar chart: utility (entropy reduction) vs. private attributes, greedily chosen from none up to 20 attributes; gains flatten as more attributes are added. Attribute codes: A* = search activity, T* = topic interests.]

*See store for details
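Submodularity is exactly the diminishing-returns property above, and supermodularity (used later for the identifiability cost) is the same inequality reversed. On small ground sets it can be checked by brute force. The functions below are illustrative stand-ins, not the paper's click entropy or identifiability cost:

```python
from itertools import combinations

def is_submodular(V, F):
    """Brute-force check of diminishing returns:
    F(A + x) - F(A) >= F(B + x) - F(B) for all A subset of B, x not in B."""
    subsets = [frozenset(s) for r in range(len(V) + 1)
               for s in combinations(V, r)]
    return all(F(A | {x}) - F(A) >= F(B | {x}) - F(B) - 1e-9
               for A in subsets for B in subsets if A <= B
               for x in V if x not in B)

# Toy coverage utility (submodular, like entropy reduction)
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1}}
U = lambda A: len(set().union(*(cover[x] for x in A))) if A else 0

# Toy accelerating cost (supermodular, like identifiability)
C = lambda A: len(A) ** 2

is_submodular(set(cover), U)                # coverage: gains shrink
is_submodular(set(cover), lambda A: -C(A))  # C supermodular, so -C is submodular
```

The check is exponential in |V|, so it is only a sanity test on toy instances, not something to run on the 29-attribute set.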

Page 11:

Trading off utility and privacy

Set V of 29 possible attributes (each ≤ 2 bits):

Demographic data (location)

Query details (working hours / week day?)

Topic interests (ever visited business / science / … website)

Search history (same query / click before / searches per day?)

User behavior (ever changed Zip, City, Country?)

For each A ⊆ V compute utility U(A) and cost C(A)

Find A maximizing U(A) while minimizing C(A)

Page 12:

Getting a handle on cost

Identifiability: "Will they know it's me?"

Sensitivity: "I don't feel comfortable sharing this!"

Page 13:

Identifiability cost

Intuition: the more attributes we already know, the more identifying it is to add another

Goal: Avoid identifiability. For example: k-anonymity [Sweeney '02], and others

[Diagram: overlapping Age, Gender, and Occupation attributes jointly narrow down an individual.]

Page 14:

Identifiability cost

Predict person Y from attributes A. Example: P(Y | gender = female, country = US)

Define a "loss" function [cf. Lebanon et al.]: the worst-case probability of detection

[Figure: two user-frequency histograms. A flat distribution over users is good (predicting the user is hard); a peaked one is bad (predicting the user is easy).]
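A minimal sketch of a worst-case detection probability, assuming hypothetical user records (this is a simplified stand-in for the paper's loss-based cost, not its exact definition):

```python
from collections import Counter

def identifiability_cost(users, attrs):
    """Worst-case probability of detection: after seeing the shared
    attribute values, an adversary's best guess of the user succeeds
    with probability 1 / (number of users sharing those values)."""
    buckets = Counter(tuple(u[a] for a in attrs) for u in users.values())
    return max(1 / size for size in buckets.values())

# Hypothetical user records
users = {
    "u1": {"gender": "f", "country": "US"},
    "u2": {"gender": "f", "country": "US"},
    "u3": {"gender": "m", "country": "US"},
    "u4": {"gender": "m", "country": "US"},
}

identifiability_cost(users, ["country"])            # everyone shares "US"
identifiability_cost(users, ["country", "gender"])  # each pair narrows to 2 users
```

Note how adding the second attribute doubles the worst-case detection probability; this accelerating behavior is the supermodularity claimed on the next slide.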

Page 15:

Identifiability cost

The more attributes we add, the larger the increase in cost: accelerating cost

Theorem: Identifiability cost C(A) is supermodular!*

[Bar chart: identifiability cost vs. private attributes (greedily chosen); cost accelerates as more attributes are added.]

*See store for details

Page 16:

Trading off utility and privacy

Set V of 29 possible attributes (each ≤ 2 bits):

Demographic data (location)

Query details (working hours / week day?)

Topic interests (ever visited business / science / … website)

Search history (same query / click before / searches per day?)

User behavior (ever changed Zip, City, Country?)

For each A ⊆ V compute utility U(A) and cost C(A)

Find A maximizing U(A) while minimizing C(A)

Page 17:

Trading off utility and cost

Want: A* = argmax F(A), where

F(A) = U(A) - λ C(A)

(final objective = utility - λ × cost, with λ the trade-off parameter; U(A) is submodular and C(A) is supermodular, so F(A) is submodular but non-monotonic)

The problem is NP-hard (and large: 2^29 subsets)

Optimizing the value of private information is a submodular problem! Can use algorithms for optimizing submodular functions:

Goldengorin et al. (branch and bound), Feige et al. (approximation algorithm), …

Can efficiently get a provably near-optimal tradeoff!

[Figures: three bar charts over greedily chosen attributes: greedy forward selection for utility (reduction in click entropy), privacy cost (p log(1-p)), and (lazy) greedy forward selection on utility - cost.]
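The (lazy) greedy forward selection used above can be sketched as follows. Because marginal gains of a submodular function only shrink, a stale gain stored in a priority queue is still an upper bound, so most gains never need recomputing. The toy coverage objective and all names here are illustrative stand-ins, not the paper's implementation:

```python
import heapq

def lazy_greedy(V, F, k):
    """Lazy greedy forward selection of up to k elements maximizing a
    submodular set function F, using stale gains as upper bounds."""
    A = []
    base = F(A)
    # Max-heap (negated gains) of initial marginal gains F({x}) - F({})
    heap = [(-(F([x]) - base), x) for x in V]
    heapq.heapify(heap)
    while heap and len(A) < k:
        _, x = heapq.heappop(heap)
        gain = F(A + [x]) - F(A)  # recompute possibly-stale gain
        if not heap or gain >= -heap[0][0]:
            if gain <= 0:  # non-monotone objective: stop when nothing helps
                break
            A.append(x)
        else:
            heapq.heappush(heap, (-gain, x))  # reinsert with updated gain
    return A

# Toy submodular objective standing in for F(A) = U(A) - lambda * C(A)
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1}, "d": {5}}
F = lambda A: len(set().union(*(cover[x] for x in A))) if A else 0

selected = lazy_greedy(list(cover), F, k=3)
```

On this toy instance the high-gain attributes are picked first, and "c" (which adds nothing new) is skipped in favor of "d".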

Page 18:

Finding the "sweet spot"

Which λ should we choose?

Tradeoff curve purely based on log data. What do users prefer?

Want: A* = argmax U(A) - λ C(A)

[Plot: tradeoff curve of utility U(A) vs. cost C(A), traced by varying λ; λ = 0 ignores cost, large λ (e.g. 10) ignores utility. Sweet spot: maximal utility at maximal privacy.]

Page 19:

Survey for eliciting cost

Microsoft internal online survey, distributed internationally

N = 1451 responses from 35 countries (80% US)

Incentive: 1 Zune™ digital music player

Page 20:

Identifiability vs sensitivity

Page 21:

Sensitivity vs utility

Page 22:

Seeking a common currency

Sensitivity acts as a common currency for estimating the utility-privacy tradeoff

[Figures: (1) location granularity shared (region, country, state, city, Zip, address) as a function of sensitivity rating 1-5; (2) search speedup required (1.25x, 1.5x, 2x, 4x, never) before users would share an attribute, as a function of its sensitivity.]

Page 23:

Calibrating the tradeoff

Can use survey data to calibrate the utility-privacy tradeoff!

[Figures: entropy reduction required by survey respondents (median) vs. identifiability cost from search logs, for region, country, state, city, and Zip; and the utility (entropy reduction) vs. cost (max probability) tradeoff curve with λ = 1, 10, 100 marked.]

User preferences map into the sweet spot!

Best fit for λ = 5.12 in F(A) = U(A) - λ C(A)

Page 24:

Understanding Sensitivities: "I don't feel comfortable sharing this!"

Page 25:

Attribute sensitivities

[Bar chart: sensitivity ratings (1-5) for attributes FRQ, NEWS, SCI, ART, GMS, BUS, HEA, SOC, GDR, QRY, CLK, TCY, OCC, MTL, CHD, WHR.]

Significant differences between topics!

We incorporate sensitivity in our cost function by calibration

Page 26:

Comparison with heuristics

Optimized solution: Repeated visit / query, workday / working hour, top-level domain, avg. queries per day, topic: sports, topic: games

[Bar chart: utility U(A), cost C(A), and net benefit F(A) (bits of information) for the optimized tradeoff vs. naïve heuristics: search statistics (ATLV, AWDY, AWHR, AFRQ), all topic interests, IP address bytes 1 & 2, and full IP address. Net benefits range from 0.899 (optimized) and 0.573 down to -1.73, -1.81, and -3.3 for the heuristics.]

Optimized solution outperforms naïve selection heuristics!

Page 27:

Summary

Use of private information by online services as an optimization problem (with user permission / awareness)

Utility (click entropy) is submodular

Privacy (identifiability) is supermodular

Can use theoretical and algorithmic tools to efficiently find provably near-optimal tradeoff

Can calibrate tradeoff using user preferences

Promising results on search logs and survey data!