
Personalization with user’s local data

Personalizing Search via Automated Analysis of Interests and Activities

Sungjick Lee
Department of Electrical and Computer Engineering

2008-11-05

Introduction: Current Web Search Engines

Query ‘IR’ on Google

We are studying ‘information retrieval’, so we need pages about that!

[Screenshot of the search results, annotated: “NO!”, “NO!”, “NO!”, “And what is this?”]

Introduction: We want results like this

Query ‘IR’ on Google
• For the information-retrieval researcher
  • the SIGIR homepage
• For the financial analyst
  • stock quotes for Ingersoll-Rand
• For the chemist
  • pages about infrared light

Two methodologies (1/2)

Two methodologies for a Web search engine to incorporate information about a user
1. A user profile is communicated to the server
2. The top-ranked pages are downloaded and re-ranked with the user profile

Two methodologies (2/2)

Focusing on the 2nd method for several reasons
1. It ensures privacy
2. It makes computationally intensive procedures feasible
3. Re-ranking methods facilitate straightforward evaluation
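The second methodology can be pictured as a small client-side step. The sketch below is only an illustration under stated assumptions: the names `rerank` and `toy_score`, the `(url, text)` result shape, and the example URLs are hypothetical, and the toy scoring function stands in for whatever personalized score the client actually computes.

```python
from typing import Callable, List, Tuple

Result = Tuple[str, str]  # (url, text): hypothetical shape of a downloaded result


def rerank(results: List[Result], score: Callable[[str], float]) -> List[Result]:
    """Re-rank downloaded results by a locally computed personalized score.

    sorted() is stable, so results with equal scores keep the engine's order.
    """
    return sorted(results, key=lambda r: score(r[1]), reverse=True)


def toy_score(text: str, interests=("retrieval", "sigir")) -> float:
    """Toy profile (assumption): count occurrences of the user's interest terms."""
    words = text.lower().split()
    return float(sum(words.count(term) for term in interests))


# Results in the order returned by the search engine (made-up data).
results = [
    ("a.example", "Ingersoll-Rand stock quotes"),
    ("b.example", "infrared light basics"),
    ("c.example", "SIGIR information retrieval conference"),
]
reranked = rerank(results, toy_score)
```

Because the profile never leaves the client, the scoring function can be as expensive as needed without sending personal data to the server.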

Traditional FB vs. Personal Profile FB

• N: # of documents in the corpus
• ni: # of documents in the corpus that contain the term i
• R: # of documents for which relevance feedback has been provided
• ri: # of documents for which relevance feedback has been provided that contain the term i

Traditional FB: relevance information (R, ri) comes from the corpus
Personal Profile FB: profiles are derived from a personal store

BM25 (Traditional FB)

A well-known probabilistic weighting scheme
• Essentially sums over query terms the log odds of the query terms occurring in relevant and non-relevant documents
  • relevance: $\sum_i tf_i \cdot w_i$
• Without relevance information
  • term weight: $w_i = \log \frac{N}{n_i}$
• With relevance information
  • term weight: $w_i = \log \frac{(r_i + 0.5)(N - n_i - R + r_i + 0.5)}{(n_i - r_i + 0.5)(R - r_i + 0.5)}$

tfi: the frequency with which term i appears in the document
N: the number of documents in the corpus
ni: the number of documents in the corpus that contain term i
R: the number of documents for which relevance feedback has been provided
ri: the number of documents for which relevance feedback has been provided that contain term i
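A minimal sketch of the weighting, assuming the standard Robertson/Sparck Jones form of the term weight with 0.5 smoothing and a document score that sums tfi times the weight over the query terms. The function names and the counts are illustrative assumptions, not values from the presentation.

```python
import math
from typing import Dict


def weight_no_feedback(N: int, n_i: int) -> float:
    """Term weight without relevance information: log(N / n_i)."""
    return math.log(N / n_i)


def weight_with_feedback(N: int, n_i: int, R: int, r_i: int) -> float:
    """Term weight with relevance information (0.5 terms smooth zero counts)."""
    numerator = (r_i + 0.5) * (N - n_i - R + r_i + 0.5)
    denominator = (n_i - r_i + 0.5) * (R - r_i + 0.5)
    return math.log(numerator / denominator)


def score(tf: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum tf_i * w_i over the query terms (the keys of `weights`)."""
    return sum(tf.get(term, 0) * w for term, w in weights.items())


# Illustrative counts only, not measurements.
w_plain = weight_no_feedback(N=1000, n_i=10)
w_fb = weight_with_feedback(N=1000, n_i=10, R=5, r_i=4)
doc_score = score({"ir": 3, "stock": 1}, {"ir": w_fb})
```

With these made-up counts the term appears in 4 of the 5 judged documents but in only 10 of 1000 corpus documents, so the feedback weight comes out larger than the plain weight.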

Personal Profile FB

Using information outside of the Web corpus
• pulling the relevant documents outside the document space
• Extending the notion of corpus
  • Substituting N’ = (N + R), ni’ = (ni + ri) for N and ni
• Relevance: $w_i = \log \frac{(r_i + 0.5)(N - n_i + 0.5)}{(n_i + 0.5)(R - r_i + 0.5)}$
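The substitution can be worked through step by step. This derivation assumes the standard relevance-feedback form of the term weight (with 0.5 smoothing) as the starting point, and replaces N and ni by N' and ni':

```latex
\begin{align*}
w_i &= \log \frac{(r_i + 0.5)\,(N' - n_i' - R + r_i + 0.5)}
                 {(n_i' - r_i + 0.5)\,(R - r_i + 0.5)},
      \qquad N' = N + R,\quad n_i' = n_i + r_i \\[4pt]
N' - n_i' - R + r_i &= (N + R) - (n_i + r_i) - R + r_i = N - n_i \\
n_i' - r_i &= (n_i + r_i) - r_i = n_i \\[4pt]
w_i &= \log \frac{(r_i + 0.5)\,(N - n_i + 0.5)}
                 {(n_i + 0.5)\,(R - r_i + 0.5)}
\end{align*}
```

The R and ri terms cancel inside the corpus factors, so the corpus statistics (N, ni) and the personal statistics (R, ri) can come from disjoint collections.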

Representation: Corpus (N, ni) (1/2)

Estimating…
• N: the number of documents on the Web
  • Using the most frequent word in English, “the”, as the query
  • The number of results approximates N
• ni: the number of documents on the Web that contain term i
  • Probing the Web by issuing one-word queries
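The probing idea can be sketched as follows. `hit_count` is a hypothetical callable standing in for "issue a one-word query and read the reported number of results"; no real search API is assumed, and the counts in the example are made up.

```python
from typing import Callable, Dict, Iterable, Tuple


def estimate_corpus_stats(
    terms: Iterable[str], hit_count: Callable[[str], int]
) -> Tuple[int, Dict[str, int]]:
    """Estimate N from the query "the" and each n_i from a one-word query."""
    N = hit_count("the")
    n = {term: hit_count(term) for term in terms}
    return N, n


# Stand-in for a search engine: made-up result counts.
fake_counts = {"the": 1000, "ir": 40, "retrieval": 12}
N, n = estimate_corpus_stats(["ir", "retrieval"], lambda q: fake_counts.get(q, 0))
```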

Corpus (N, ni) (2/2)

Focusing the corpus representation
• Corpus statistics can be gathered either from
  • all of the documents on the Web
  • or only the subset of documents that are relevant to the query (referred to as a query focus)
• For example, if the query is “IR”
  • a query-focused corpus consists only of documents that contain the term “IR”
• When the corpus representation is limited to a query focus, the user representation is correspondingly query focused
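A small sketch of the difference between the full and the query-focused corpus statistics. Documents are modeled as sets of terms, which is a simplifying assumption, and `corpus_stats` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple


def corpus_stats(
    docs: List[Set[str]], query_term: Optional[str] = None
) -> Tuple[int, Counter]:
    """Return (N, n) where n[t] is the number of documents containing t.

    If query_term is given, only documents containing it are counted
    (the query-focused corpus).
    """
    if query_term is not None:
        docs = [d for d in docs if query_term in d]
    N = len(docs)
    n = Counter(term for d in docs for term in d)
    return N, n


docs = [
    {"ir", "retrieval", "sigir"},
    {"ir", "infrared", "light"},
    {"stock", "quotes"},
]
N_all, n_all = corpus_stats(docs)
N_ir, n_ir = corpus_stats(docs, query_term="ir")
```

In the query-focused case every counted document contains the query term, so the statistics describe how terms are distributed among pages about “IR” rather than across the whole collection.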

User (R, ri) (1/2)

A rich index of personal content that captures a user’s interests and computational activities
• could be obtained from desktop indices such as Google Desktop, Mac Tiger, Windows Desktop Search

In this paper, the index covers all of the information created, copied, or viewed by a user
• Web pages, email messages, calendar items, documents stored on the client machine

The most straightforward way to use this index
• Treating every document in it as a source of the user’s interests
  • R: the number of documents in the index
  • ri: the number of documents in the index that contain term i

User (R, ri) (2/2)

Considering limiting the documents
• Restricting the document type to Web pages
• Limiting the documents to the most recent ones
  • In this paper, considering documents indexed in the last month versus the full index of documents

Two lighter-weight representations
• Using the query terms that the user had issued in the past
• Boosting the search results with URLs from domains that the user had visited in the past
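The user statistics R and ri, and the restrictions by document type and recency, might be computed along these lines. The `Item` record, its `kind` labels, and the 30-day stand-in for "the last month" are assumptions made for this sketch, not details of the original index.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set, Tuple


@dataclass
class Item:
    kind: str             # hypothetical labels: "web", "email", "calendar", "file"
    indexed_at: datetime  # when the item entered the personal index
    terms: Set[str]       # distinct terms in the item


def user_stats(
    items: List[Item],
    kind: Optional[str] = None,
    since: Optional[datetime] = None,
) -> Tuple[int, Counter]:
    """Return (R, r) over the items that pass the optional filters."""
    kept = [
        it for it in items
        if (kind is None or it.kind == kind)
        and (since is None or it.indexed_at >= since)
    ]
    R = len(kept)
    r = Counter(term for it in kept for term in it.terms)
    return R, r


now = datetime(2008, 11, 5)
items = [
    Item("web", now - timedelta(days=3), {"ir", "retrieval"}),
    Item("email", now - timedelta(days=10), {"ir", "meeting"}),
    Item("web", now - timedelta(days=90), {"retrieval", "bm25"}),
]
R_full, r_full = user_stats(items)                                  # full index
R_web, r_web = user_stats(items, kind="web")                        # Web pages only
R_recent, r_recent = user_stats(items, since=now - timedelta(days=30))  # recent only
```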

Document (tfi) and Query

Using the full text of documents in the result set
• Accessing the full text of each document takes considerable time

Also experimented with using only the title and the snippet of the document returned by the Web search engine
• the snippet is inherently query focused

Query Expansion
• The inclusion of all of the terms occurring in the relevant documents
  • a kind of blind or pseudo-relevance feedback in which the top-k documents are considered relevant
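Query expansion by blind feedback can be sketched as below: the top-k documents are assumed relevant and all of their terms are added to the query. `expand_query` and the tokenized example documents are hypothetical.

```python
from typing import List, Sequence


def expand_query(
    query_terms: Sequence[str], ranked_docs: Sequence[Sequence[str]], k: int
) -> List[str]:
    """Add every term of the top-k documents to the query, without duplicates."""
    expanded = list(dict.fromkeys(query_terms))  # keep order, drop repeats
    seen = set(expanded)
    for doc in ranked_docs[:k]:
        for term in doc:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


# Documents in ranked order, already tokenized (made-up data).
ranked = [
    ["ir", "retrieval", "sigir"],
    ["ir", "evaluation"],
    ["infrared", "light"],
]
expanded = expand_query(["ir"], ranked, k=2)
```

When documents are represented only by title and snippet, the same function applies with each `doc` holding the title and snippet terms instead of the full text.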

Evaluation Framework (1/4)

An evaluation collection
• 15 participants
• Each evaluated the top 50 Web search results for approximately 10 self-selected queries
  • For each search result, participants were asked to determine whether it was highly relevant, relevant, or not relevant to the query
• Web search results came from MSN Search

Selecting the queries to be evaluated
1. Users choose a query to mimic a search they had performed earlier that day
2. Users select a query from a list formulated to be of general interest (e.g., “cancer”, “Bush”, “Web search”)

• A total of 131 queries
  • 53 were pre-selected
  • 78 were self-generated


Each participant provided an index of the information on their personal computer
• The indices ranged in size from 10,000 to 100,000 items
• They were used to compute the personalized term weights

All participants were employees of Microsoft
• software engineers, researchers, program managers, and administrators


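Discounted Cumulative Gain (DCG)

[9] IR evaluation methods for retrieving highly relevant documents
• Cumulative Gain (CG)
  $$CG(i) = \begin{cases} G(1) & \text{if } i = 1 \\ CG(i-1) + G(i) & \text{otherwise} \end{cases}$$
  • Example) G = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, … >, CG = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, … >
• Discounted Cumulative Gain (DCG)
  $$DCG(i) = \begin{cases} G(1) & \text{if } i = 1 \\ DCG(i-1) + G(i)/\log(i) & \text{otherwise} \end{cases}$$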
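The two recursive measures can be checked against the slide's example vector with a short sketch. The CG recursion is the running sum of gains; for DCG this sketch assumes the common form that divides the gain at rank i by log(i), with base 2 chosen here as an assumption because the base is not stated.

```python
import math
from typing import List, Sequence


def cumulative_gain(gains: Sequence[int]) -> List[int]:
    """CG(1) = G(1); CG(i) = CG(i-1) + G(i) otherwise."""
    cg: List[int] = []
    for i, g in enumerate(gains, start=1):
        cg.append(g if i == 1 else cg[-1] + g)
    return cg


def discounted_cumulative_gain(gains: Sequence[int], base: float = 2.0) -> List[float]:
    """DCG(1) = G(1); DCG(i) = DCG(i-1) + G(i) / log_base(i) otherwise."""
    dcg: List[float] = []
    for i, g in enumerate(gains, start=1):
        dcg.append(float(g) if i == 1 else dcg[-1] + g / math.log(i, base))
    return dcg


G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]  # first ten gains of the example
cg = cumulative_gain(G)
dcg = discounted_cumulative_gain(G)
```

With base 2 the gain at rank 2 is not discounted (log 2 = 1), and lower ranks contribute progressively less, which rewards rankings that place highly relevant results early.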

Results: Alternative Representations (1/2)

[Figure labels: “Richer model”, “Poorer model”]

When using only documents related to the query to represent the corpus, the term weights represent how different the user is from the average person who submits the query

A rich user profile is more important than a rich document representation

The best combination of 67 different combinations
• Corpus: approximated by the result set’s titles and snippets, which is inherently query focused
• User: built from the user’s entire personal index, query focused
• Document and Query: documents represented by the title and snippet returned by the search engine, with query expansion based on words that occur near the query term
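To make the combination concrete, here is a rough end-to-end sketch under several simplifying assumptions: whitespace tokenization, every title and snippet term scored (a crude stand-in for expansion with words near the query term), no query focusing of the personal index, and the personal-profile term weight with 0.5 smoothing. The function names and the data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def personalized_scores(
    results: List[Tuple[str, str]], personal_docs: List[str]
) -> List[float]:
    """Score each (title, snippet) result against a personal index."""
    docs = [tokens(title + " " + snippet) for title, snippet in results]
    N = len(docs)                                   # corpus = the result set
    n = Counter(t for d in docs for t in set(d))    # n_i from titles and snippets
    profile = [set(tokens(p)) for p in personal_docs]
    R = len(profile)                                # user = the personal index
    r = Counter(t for d in profile for t in d)      # r_i from the personal index

    def weight(t: str) -> float:
        return math.log(
            ((r[t] + 0.5) * (N - n[t] + 0.5)) / ((n[t] + 0.5) * (R - r[t] + 0.5))
        )

    return [sum(tf * weight(t) for t, tf in Counter(d).items()) for d in docs]


results = [
    ("Ingersoll-Rand", "stock quotes ir"),
    ("SIGIR", "information retrieval ir conference"),
]
personal_docs = ["information retrieval papers", "retrieval evaluation notes", "lunch menu"]
scores = personalized_scores(results, personal_docs)
```

For this made-up profile, the SIGIR result scores higher than the stock-quote result because its terms overlap with the personal documents.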

Results: Alternative Representations (2/2)

Results: Baseline Comparisons

Thank you.