Personalization with user’s local data
Personalizing Search via Automated Analysis
of Interests and Activities
Sungjick Lee, Department of Electrical and Computer Engineering
2008-11-05

Introduction: Current Web Search Engines
Query ‘IR’ on Google
We are studying ‘information retrieval’, so we need pages about that.
[Screenshot of Google results for ‘IR’, annotated “NO!”, “NO!”, “NO!” and “And what is this?”]

Introduction: We want results like this
Query ‘IR’ on Google
• For the information-retrieval researcher: the SIGIR homepage
• For the financial analyst: stock quotes for Ingersoll-Rand
• For the chemist: pages about infrared light

Two methodologies (1/2)
Two methodologies for a Web search engine to incorporate information about a user:
1. A user profile is communicated to the server
2. The results are downloaded and re-ranked

Two methodologies (2/2)
[Diagram labels: user profile; top-ranked pages; re-ranked]
Focusing on the 2nd method for several reasons:
1. It ensures privacy
2. It is feasible to include computationally intensive procedures
3. Re-ranking methods facilitate straightforward evaluation
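The second methodology can be sketched as a small re-ranking step. This is a minimal illustration, assuming the re-ranking runs on the user's own machine; the `rerank` function, the toy `PROFILE` term set, the `profile_score` function, and the example URLs are all hypothetical and are not taken from the slides.

```python
from typing import Callable, List, Tuple

Result = Tuple[str, str]  # (url, title-plus-snippet text)

def rerank(results: List[Result], score: Callable[[str], float]) -> List[Result]:
    """Return results ordered by descending client-side score.

    sorted() is stable, so ties keep the search engine's original order.
    """
    return sorted(results, key=lambda r: score(r[1]), reverse=True)

# Hypothetical profile: terms standing in for an IR researcher's interests.
PROFILE = {"information", "retrieval", "sigir"}

def profile_score(text: str) -> float:
    """Toy score: number of tokens that appear in the profile."""
    return float(sum(1 for tok in text.lower().split() if tok in PROFILE))

# Made-up result list for the query 'IR', in the engine's original order.
results = [
    ("https://finance.example/ir", "Ingersoll-Rand stock quotes"),
    ("https://chem.example/ir", "infrared light spectroscopy"),
    ("https://sigir.example/", "SIGIR information retrieval conference"),
]
reranked = rerank(results, profile_score)
```

Because the profile never leaves the client, only the ordinary result list crosses the network, which is the privacy argument made in reason 1.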

Traditional FB vs. Personal Profile FB
[Diagram labels]
• N: # of documents in the corpus
• n_i: # of documents in the corpus that contain term i
• R: # of documents for which relevance feedback has been provided
• r_i: # of documents for which relevance feedback has been provided that contain term i
Traditional FB: relevance information (R, r_i) comes from the corpus
Personal Profile FB: profiles are derived from a personal store

BM25 (Traditional FB)
A well-known probabilistic weighting scheme
• Essentially sums, over the query terms, the log odds of each term occurring in relevant versus non-relevant documents
• Without relevance information
  • relevance: [equation image not recovered]
• With relevance information
  • relevance: [equation image not recovered]
Notation
• tf_i: the frequency with which term i appears in the document
• N: the number of documents in the corpus
• n_i: the number of documents in the corpus that contain term i
• R: the number of documents for which relevance feedback has been provided
• r_i: the number of documents for which relevance feedback has been provided that contain term i
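The two weight equations on this slide did not survive extraction. The sketch below uses the forms commonly given for this scheme: log(N / n_i) when no relevance information is available, and the Robertson/Sparck Jones relevance weight with 0.5 smoothing when it is. That these are the exact forms shown on the slide is an assumption, and the function names are illustrative only. A document's score is taken as the sum of tf_i times the term weight.

```python
import math
from typing import Dict

def weight_no_feedback(N: int, n_i: int) -> float:
    """Term weight without relevance information: log(N / n_i)."""
    return math.log(N / n_i)

def weight_with_feedback(N: int, n_i: int, R: int, r_i: int) -> float:
    """Term weight with relevance information (0.5 smoothing, assumed form)."""
    numerator = (r_i + 0.5) * (N - n_i - R + r_i + 0.5)
    denominator = (n_i - r_i + 0.5) * (R - r_i + 0.5)
    return math.log(numerator / denominator)

def score(tf: Dict[str, int], weights: Dict[str, float]) -> float:
    """Document score: sum over terms of tf_i * w_i (unknown terms weigh 0)."""
    return sum(tf_i * weights.get(term, 0.0) for term, tf_i in tf.items())

# Made-up counts, purely to exercise the functions.
w_plain = weight_no_feedback(N=1000, n_i=10)
w_fb = weight_with_feedback(N=1000, n_i=10, R=20, r_i=5)
doc_score = score({"ir": 2, "web": 1}, {"ir": 1.5})
```

A term that appears in many of the R feedback documents but in few of the N corpus documents receives a large weight, which is what pushes matching results upward.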

Personal Profile FB
Using information outside of the Web corpus
• Pulling the relevant documents outside the document space
• Extending the notion of corpus
• Relevance: [equation image not recovered]
  • Substituting N’ = (N + R), n_i’ = (n_i + r_i)
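The substitution can be worked through term by term. This derivation assumes the relevance-feedback weight has the standard Robertson/Sparck Jones form with 0.5 smoothing (the slide's own equation image was not recovered), and replaces N with N’ = N + R and n_i with n_i’ = n_i + r_i.

```latex
\begin{align*}
w_i &= \log \frac{(r_i + 0.5)\,(N' - n_i' - R + r_i + 0.5)}
                 {(n_i' - r_i + 0.5)\,(R - r_i + 0.5)} \\
    &= \log \frac{(r_i + 0.5)\,\bigl((N + R) - (n_i + r_i) - R + r_i + 0.5\bigr)}
                 {\bigl((n_i + r_i) - r_i + 0.5\bigr)\,(R - r_i + 0.5)} \\
    &= \log \frac{(r_i + 0.5)\,(N - n_i + 0.5)}
                 {(n_i + 0.5)\,(R - r_i + 0.5)}
\end{align*}
```

Under this assumption, R and r_i cancel out of the corpus-side factors, so the corpus statistics (N, n_i) and the personal-store statistics (R, r_i) can be gathered independently.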

Representation: Corpus (N, n_i) (1/2)
Estimating…
• N: the number of documents on the Web
  • Using the most frequent word in English, “the”, as the query
  • The result: [not recovered]
• n_i: the number of documents on the Web that contain term i
  • Probing the Web by issuing one-word queries
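The probing procedure can be sketched as follows. This is only an outline under stated assumptions: `hit_count` is a hypothetical callable that returns the number of results a search engine reports for a one-word query (no real search API is implied), and the counts in the usage example are invented.

```python
from typing import Callable, Dict, Iterable, Tuple

def estimate_corpus_stats(
    terms: Iterable[str],
    hit_count: Callable[[str], int],
    probe_word: str = "the",
) -> Tuple[int, Dict[str, int]]:
    """Estimate N from a very frequent word and n_i from one-word queries."""
    N = hit_count(probe_word)
    # Clamp so that an estimated n_i never exceeds the estimate of N.
    n = {term: min(hit_count(term), N) for term in terms}
    return N, n

# Invented counts standing in for a search engine's reported hit counts.
fake_counts = {"the": 8_000_000, "ir": 120_000, "sigir": 900}
N_est, n_est = estimate_corpus_stats(["ir", "sigir"], lambda q: fake_counts.get(q, 0))
```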

Representation: Corpus (N, n_i) (2/2)
Focusing the corpus representation
• Corpus statistics can be gathered either from
  • all of the documents on the Web
  • or only the subset of documents that are relevant to the query (referred to as query focused)
• Example: the query is “IR”
  • a query-focused corpus consists only of documents that contain the term “IR”
• When the corpus representation is limited to a query focus, the user representation is correspondingly query focused
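The difference between the full and the query-focused corpus can be illustrated on a toy collection. This is a minimal sketch, assuming whitespace tokenization and lowercase matching; the `corpus_stats` name and the sample documents are made up for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple

def corpus_stats(docs: List[str], query_term: Optional[str] = None) -> Tuple[int, Counter]:
    """Return (N, n) where n[t] is the number of documents containing term t.

    If query_term is given, only documents containing it are counted,
    which yields the query-focused statistics.
    """
    token_sets = [set(doc.lower().split()) for doc in docs]
    if query_term is not None:
        token_sets = [s for s in token_sets if query_term.lower() in s]
    N = len(token_sets)
    n = Counter(term for s in token_sets for term in s)
    return N, n

docs = [
    "IR systems rank documents",
    "infrared IR light",
    "stock quotes today",
]
N_all, n_all = corpus_stats(docs)            # whole collection
N_focus, n_focus = corpus_stats(docs, "IR")  # only documents containing "IR"
```

In the focused case every counted document contains the query term, so the remaining terms are weighted by how unusual they are among documents about that query rather than on the Web at large.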

Representation: User (R, r_i) (1/2)
A rich index of personal content that captures a user’s interests and computational activities
• could be obtained from desktop indices such as Google Desktop, Mac Tiger, Windows Desktop Search
In this paper, the index covers all of the information created, copied, or viewed by a user
• Web pages, email messages, calendar items, documents stored on the client machine
The most straightforward way to use this index
• Treating every document in it as a source of the user’s interests
  • R: the number of documents in the index
  • r_i: the number of documents in the index that contain term i

Representation: User (R, r_i) (2/2)
Considering limiting documents
• Restricting the document type to Web pages
• Limiting documents to the most recent ones
  • In this paper, comparing documents indexed in the last month with the full index of documents
Two lighter-weight representations
• Using the query terms that the user had issued in the past
• Boosting the search results with URLs from domains that the user had visited in the past
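The restrictions by document type and by recency can be sketched as filters applied before counting R and r_i. The `IndexedItem` record, its field names, the `user_stats` function, and the sample items are hypothetical; the slides do not describe how the personal index is stored.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set, Tuple

@dataclass
class IndexedItem:
    text: str
    kind: str             # e.g. "web", "email", "calendar", "file"
    indexed_at: datetime  # when the item entered the personal index

def user_stats(
    items: List[IndexedItem],
    kinds: Optional[Set[str]] = None,
    since: Optional[datetime] = None,
) -> Tuple[int, Counter]:
    """Return (R, r) over the items that pass the optional filters."""
    selected = [
        item for item in items
        if (kinds is None or item.kind in kinds)
        and (since is None or item.indexed_at >= since)
    ]
    R = len(selected)
    # r[t] counts documents containing t, so tokens are de-duplicated per item.
    r = Counter(t for item in selected for t in set(item.text.lower().split()))
    return R, r

now = datetime(2008, 11, 5)
items = [
    IndexedItem("SIGIR paper on IR", "web", now - timedelta(days=3)),
    IndexedItem("IR seminar schedule", "email", now - timedelta(days=10)),
    IndexedItem("old IR notes", "web", now - timedelta(days=90)),
]
R_full, r_full = user_stats(items)
R_web, r_web = user_stats(items, kinds={"web"})
R_month, r_month = user_stats(items, since=now - timedelta(days=30))
```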

Representation: Document (tf_i) and Query
Using the full text of documents in the result set
• Accessing the full text of each document takes considerable time
Also experimented with using only the title and the snippet of the document returned by the Web search engine
• The snippet is inherently query focused
Query expansion
• The inclusion of all of the terms occurring in the relevant documents
  • a kind of blind or pseudo-relevance feedback in which the top-k documents are considered relevant
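The query-expansion idea can be sketched as collecting every term from the top-k results and treating those results as relevant. This is a simplified illustration, assuming whitespace tokenization; `expansion_terms` and the sample snippets are hypothetical names and data.

```python
from collections import Counter
from typing import Iterable, List

def expansion_terms(ranked_docs: List[str], k: int, query_terms: Iterable[str]) -> Counter:
    """Count the non-query terms occurring in the top-k documents.

    The top-k documents are assumed relevant (blind / pseudo-relevance
    feedback), so all of their terms become candidate expansion terms.
    """
    query = {q.lower() for q in query_terms}
    counts: Counter = Counter()
    for doc in ranked_docs[:k]:
        counts.update(t for t in doc.lower().split() if t not in query)
    return counts

snippets = [
    "IR evaluation methods",
    "IR ranking methods",
    "infrared light",
]
terms = expansion_terms(snippets, k=2, query_terms=["IR"])
```

Passing titles and snippets instead of full page text keeps this step cheap, at the cost of seeing fewer terms per document.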

Evaluation Framework (1/4)
An evaluation collection
• 15 participants
• Each evaluated the top 50 Web search results for approximately 10 self-selected queries
  • For each search result, asked to determine whether it was highly relevant, relevant, or not relevant to the query
• Web search results from MSN Search

Evaluation Framework (2/4)
Selecting the queries to be evaluated
1. Users choose a query to mimic a search they had performed earlier that day
2. Users select a query from a list formulated to be of general interest (e.g., “cancer”, “Bush”, “Web search”)
• A total of 131 queries
  • 53 were pre-selected
  • 78 were self-generated

Evaluation Framework (3/4)
Each participant provided us with an index of the information on their personal computer
• ranging in size from 10,000 to 100,000 items
• used to compute their personalized term weights
All participants were employees of Microsoft
• software engineers, researchers, program managers, and administrators
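
Evaluation Framework (4/4)
Discounted Cumulative Gain (DCG) [9]
• Cumulative Gain (CG)
  • Example: G = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …>, CG = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …>
\[ CG[i] = \begin{cases} G[1] & \text{if } i = 1 \\ CG[i-1] + G[i] & \text{otherwise} \end{cases} \]
\[ DCG[i] = \begin{cases} G[1] & \text{if } i = 1 \\ DCG[i-1] + G[i] / \log(i) & \text{otherwise} \end{cases} \]
[9] IR evaluation methods for retrieving highly relevant documents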
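The cumulative-gain example can be reproduced with a few lines of code. The sketch below assumes the usual recursive definitions (the gain at rank 1 is taken as is, later gains are added, and for DCG each later gain is divided by the logarithm of its rank). The logarithm base is not recoverable from the slide, so base 2 is an assumption here, and the function names are illustrative.

```python
import math
from typing import List

def cumulative_gain(gains: List[float]) -> List[float]:
    """CG[i] = G[1] if i = 1, else CG[i-1] + G[i] (ranks are 1-based)."""
    cg: List[float] = []
    total = 0.0
    for g in gains:
        total += g
        cg.append(total)
    return cg

def discounted_cumulative_gain(gains: List[float], base: float = 2.0) -> List[float]:
    """DCG[i] = G[1] if i = 1, else DCG[i-1] + G[i] / log_base(i).

    base=2 is an assumption; the slide does not state the base.
    """
    dcg: List[float] = []
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank == 1 else g / math.log(rank, base)
        dcg.append(total)
    return dcg

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]  # gain vector from the slide's example
cg = cumulative_gain(G)
dcg = discounted_cumulative_gain(G)
```

With this gain vector, `cg` matches the CG sequence in the example, while `dcg` grows more slowly because relevant documents found at lower ranks contribute less.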

Results: Alternative Representations (1/2)
[Figure: axis labels “Richer model” and “Poorer model”]
When using only documents related to the query to represent the corpus, the term weights represent how different the user is from the average person who submits the query
A rich user profile is more important than a rich document representation

Results: Alternative Representations (2/2)
The best of 67 different combinations
• Corpus: approximated by the result set titles and snippets, which is inherently query focused
• User: built from the user’s entire personal index, query focused
• Document and Query: documents represented by the title and snippet returned by the search engine, with query expansion based on words that occur near the query term

Results: Baseline Comparisons

Thank you.