Contextual Ranking of Keywords Using Click Data
Utku Irmak, Vadim von Brzeski, Reiner Kraft
Yahoo! Inc
ICDE '09, Data Mining session
2010. 04. 09.
Summarized by Park, Sung Eun, IDS Lab., Seoul National University
Presented by Park, Sung Eun, IDS Lab., Seoul National University
Copyright 2008 by CEBT
Contents
Introduction
Contextual Shortcuts
Concept Ranking Method
Feature Space
Interestingness and Relevance of a Concept
Evaluation
Cross Validation Approach, Editorial Evaluation, Real World Results
Conclusion
Introduction
Determining and ranking the key concepts in a document
Goal
Given the candidate set of entities, learn a ranking function which orders the entities by their interestingness and relevance
Applications
Contextual advertising systems
Text summarization
User centric entity detection systems
– Detect entities and concepts within text
– Transform the detected entities into actionable items such as “intelligent hyperlinks”
Contextual Shortcut
A concept vector
Concept: a piece of text that refers to an abstract thought or idea, e.g., car insurance, justice
Generating concept vector
– Term vector : TF/IDF from documents in Yahoo! Search
– Unit vector : all units found in the document
Units are constructed from query logs in an iterative statistical approach using the frequencies of the distinct queries
– Concept vector : the term vector and the unit vector are merged
\[ I(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)} \]
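The quantity I(x, y) on this slide has the form of pointwise mutual information. As a minimal sketch, it can be estimated from raw counts, assuming probabilities are taken as relative frequencies. The function name, the count arguments, and the toy numbers are illustrative and do not come from the paper.

```python
import math

def mutual_information(count_xy, count_x, count_y, total):
    """I(x, y) = log( p(x, y) / (p(x) * p(y)) ).

    Probabilities are estimated as relative frequencies over `total`
    observations (an assumption made for this sketch).
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts, made up for illustration: x and y co-occur far more often
# than independence would predict, so I(x, y) is positive.
score = mutual_information(count_xy=50, count_x=100, count_y=200, total=10_000)
```

A value near zero means x and y co-occur about as often as chance would predict.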
Previous Concept Ranking Method
AG(TF,Unit)
1. A term appears in the term vector, but not in the unit vector
– punish its term vector weight
2. A term appears in the unit vector, but not in the term vector
– add this term to the concept vector with its unit weight
3. A term appears in both the term vector and the unit vector
– sum its term vector and unit vector weights
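The merge rules for AG(TF,Unit) can be sketched roughly as follows, assuming the third rule covers terms present in both vectors. The slides do not say how a weight is "punished", so the multiplicative `penalty` factor here is purely hypothetical, as are the function name and the dictionary representation of the vectors.

```python
def aggregate(term_vec, unit_vec, penalty=0.5):
    """Merge a term vector and a unit vector into a concept vector.

    Both inputs map term -> weight. `penalty` is a hypothetical damping
    factor; the slides only say the term-vector weight is "punished".
    """
    concept_vec = {}
    for term in set(term_vec) | set(unit_vec):
        if term in term_vec and term in unit_vec:
            # In both vectors: sum the two weights.
            concept_vec[term] = term_vec[term] + unit_vec[term]
        elif term in term_vec:
            # Only in the term vector: damp the term-vector weight.
            concept_vec[term] = penalty * term_vec[term]
        else:
            # Only in the unit vector: keep the unit weight.
            concept_vec[term] = unit_vec[term]
    return concept_vec

merged = aggregate({"a": 1.0, "b": 0.4}, {"b": 0.6, "c": 0.3})
```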
Concept | AG(TF,Unit) Score | Ranking
President Bush | 1.1549 | 1
Iraq war | 1.1833 | 2
Political parties | 0.6147 | 3
…
Proposed Concept Ranking Method
Ranking Function : SVM(Support Vector Machine)
SVMlight : an open source library for ranking SVM
Interestingness : 9 Features of a concept
Relevance: pre-mined terms of the concept
[Figure: for each concept, its interestingness features (I) and the relevance score from its pre-mined terms (R) are fed to SVMlight, which outputs a ranking: Concept1 (I1, R1) → 1, Concept2 (I2, R2) → 2, Concept3 (I3, R3) → 3, …]
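To make the training input concrete, here is a hedged sketch that writes one example in the ranking input format read by SVMlight: a target value, a `qid` grouping field, then `index:value` feature pairs. Treating each document as one `qid` group and using the concept's click-through rate as the target are assumptions made for this illustration, and the helper name is hypothetical.

```python
def to_svmlight_rank_line(target, query_id, features, comment=""):
    """Format one example as '<target> qid:<id> <index>:<value> ...'.

    Feature indices are 1-based and written in increasing order.
    """
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    line = f"{target:g} qid:{query_id} {pairs}"
    return f"{line} # {comment}" if comment else line

# One concept from document 1, with a made-up CTR target and two features.
example = to_svmlight_rank_line(0.12, 1, [3.0, 0.5], "iraq war")
```

Only examples that share a `qid` are compared with each other, which matches ranking the concepts within a single document.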
Interestingness of a concept
Category | Feature | Details
---------|---------|--------
Search Engine Query Logs | Freq exact | # of queries received that are exactly the same as the concept
Search Engine Query Logs | Freq phrase contained | # of queries that contain the concept as a phrase
Search Engine Query Logs | Unit score | The score in the unit vector
Search Engine Result Pages | Search engine phrase | The number of pages returned when the concept is issued as a query
Text Based Features | Concept size | # of terms in the concept
Text Based Features | Number of characters | # of characters in the concept
Text Based Features | Subconcepts | # of subconcepts contained in the concept
Taxonomy | High level type | If the concept exists in one of the editorially maintained lists, use it as a feature
Others | Wiki word count | The length of the Wikipedia article
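A few of these features can be computed directly from the concept string and a query log. The sketch below is only illustrative: the function names, the whitespace tokenization, the `known_concepts` set, and the `query_counts` mapping are all assumptions, and the phrase-contained count here includes exact matches.

```python
def text_features(concept, known_concepts):
    """Text-based features: concept size, character count, subconcepts."""
    terms = concept.lower().split()
    n = len(terms)
    subconcepts = 0
    # Count proper contiguous sub-phrases that are themselves known concepts.
    for start in range(n):
        for end in range(start + 1, n + 1):
            if end - start == n:
                continue  # skip the concept itself
            if " ".join(terms[start:end]) in known_concepts:
                subconcepts += 1
    return {"concept_size": n, "num_chars": len(concept), "subconcepts": subconcepts}

def query_log_features(concept, query_counts):
    """Query-log features from a mapping of query string -> frequency."""
    c = concept.lower()
    padded = f" {c} "
    freq_exact = query_counts.get(c, 0)
    # Padding with spaces forces matches on whole-term boundaries.
    freq_phrase = sum(n for q, n in query_counts.items() if padded in f" {q} ")
    return {"freq_exact": freq_exact, "freq_phrase_contained": freq_phrase}

log = {"iraq war": 10, "iraq war news": 4, "iraq": 7, "iraq warning": 1}
q_feats = query_log_features("Iraq war", log)
t_feats = text_features("Iraq war news", {"iraq", "iraq war", "news"})
```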
Relevance of a Concept in a Context
A mining approach to obtain a good relevance scoring mechanism
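Use pre-mined keywords for each concept
Concepts: \( C = \{c_1, c_2, \ldots, c_n\} \)
Relevant terms of concept \( c_i \): \( \mathit{relevantTerms}_i = \{(t_1, s_1), (t_2, s_2), \ldots, (t_k, s_k)\} \), where each \( t_j \) is a pre-mined term and \( s_j \) is its score
Relevance of the concept can be computed based on the co-occurrence of the pre-mined keywords in the context.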
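One simple way to turn co-occurrence into a number is to add up the scores of the pre-mined terms that appear in the context. The slides do not give the exact combination rule, so the summation below, the whitespace tokenization, and the function name are assumptions made only to illustrate the idea.

```python
def relevance(relevant_terms, context_text):
    """Sum the scores of pre-mined terms that occur in the context.

    relevant_terms: list of (term, score) pairs for one concept.
    The plain sum is an assumption; the slides only say that relevance
    is based on co-occurrence of the pre-mined keywords.
    """
    context_tokens = set(context_text.lower().split())
    return sum(score for term, score in relevant_terms if term in context_tokens)

terms = [("baghdad", 2.0), ("troops", 1.5), ("election", 1.0)]
rel = relevance(terms, "US troops leave Baghdad")
```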
Relevance of a Concept in a Context
Relevant term scoring
1. Search engine snippets
– Using Yahoo! Developer Network API
– Treat the returned snippets as a single document and score each term by tf*idf
– Keep the top m=100 terms by score
2. Prisma query refinement tool
– Prisma is a tool that helps users augment or replace their queries by providing feedback terms. It considers the top 50 documents in a large collection and weighs factors such as the count and position of the terms, document rank, and occurrence of query terms within the input phrase.
– Construct a single document from the concepts returned by Prisma for concept ci and compute the score based on the tf*idf values
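Both sources above reduce to the same step: build one document and keep its highest-scoring tf*idf terms. A minimal sketch follows, assuming whitespace tokenization, raw counts as tf, and an `idf` table supplied from some background collection. The `idf` mapping, the `default_idf` fallback, and the function name are hypothetical.

```python
from collections import Counter

def top_terms(texts, idf, m=100, default_idf=1.0):
    """Treat `texts` as one document and return the top-m (term, score) pairs.

    score = tf * idf, with tf the raw count in the merged document and
    idf looked up in a caller-supplied table (an assumption).
    """
    tokens = " ".join(texts).lower().split()
    tf = Counter(tokens)
    scored = {t: count * idf.get(t, default_idf) for t, count in tf.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:m]

snippets = ["iraq war news", "war in iraq"]
idf_table = {"iraq": 2.0, "war": 1.5, "news": 3.0, "in": 0.1}
best = top_terms(snippets, idf_table, m=2)
```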
Relevance of a Concept in a Context
Relevant term scoring
3. Related query suggestions
– Using Yahoo! Developer Network API
– 300 suggestions and the query frequencies of the suggestions
– Let k be the number of suggestions in which the term appears; the term is scored as
\[ \mathrm{Score}(term) = \sum_{i=1}^{k} \ln(\mathit{query\_freq}_i) \cdot \mathrm{idf}(term) \]
[Figure: Snippet, Prisma, Query Suggestions]
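The scoring of terms from query suggestions can be sketched as below, reading query_freq_i as the frequency of the i-th suggestion that contains the term. The input layout, the `idf` table, the `default_idf` fallback, and the function name are assumptions for illustration. Frequencies must be positive for the logarithm to be defined.

```python
import math

def suggestion_term_scores(suggestions, idf, default_idf=1.0):
    """Score(term) = sum over suggestions containing the term of
    ln(query_freq) * idf(term).

    suggestions: list of (suggestion_text, query_frequency) pairs.
    """
    scores = {}
    for text, freq in suggestions:
        # Each suggestion contributes once per distinct term it contains.
        for term in set(text.lower().split()):
            weight = math.log(freq) * idf.get(term, default_idf)
            scores[term] = scores.get(term, 0.0) + weight
    return scores

sugg = [("iraq war news", 100), ("iraq map", 10)]
s = suggestion_term_scores(sugg, {"iraq": 2.0})
```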
Intuition of Query Suggestion and Prisma
Evaluation
Cross Validation Approach
Data
– Randomly sampled news stories that were annotated by Contextual Shortcuts
– The number of times these stories were viewed and the number of clicks received by each concept that was detected in the stories
– 870 stories, 6420 concepts, 16549 sample clicks
Error Rate
\[ \text{Error Rate} = \frac{|\text{Mistakenly Predicted Pairs}|}{|\text{All Pairs}|} \]
Weighted Error Rate
\[ \text{Weighted Error Rate} = \frac{\sum_{i=1}^{|\text{mistakes}|} \text{CTR difference}_i}{\sum_{i=1}^{|\text{all pairs}|} \text{CTR difference}_i} \]
where click-through rate (CTR) = (the number of clicks) / (the number of views)
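Both error measures can be computed in one pass over concept pairs. In this sketch, pairs are formed only within a story, pairs with equal CTR are skipped, and a tie in the predicted scores counts as a mistake. Those three choices, along with the function name and input layout, are assumptions not stated on the slide.

```python
from itertools import combinations

def error_rates(stories):
    """Return (error_rate, weighted_error_rate).

    stories: list of stories, each a list of (ctr, predicted_score) pairs,
    one pair per concept detected in that story.
    """
    all_pairs = mistakes = 0
    all_weight = mistake_weight = 0.0
    for concepts in stories:
        for (ctr_a, pred_a), (ctr_b, pred_b) in combinations(concepts, 2):
            ctr_diff = ctr_a - ctr_b
            if ctr_diff == 0:
                continue  # no preference to learn from equal CTRs
            all_pairs += 1
            all_weight += abs(ctr_diff)
            # Mistake when the predicted order disagrees with the CTR order.
            if ctr_diff * (pred_a - pred_b) <= 0:
                mistakes += 1
                mistake_weight += abs(ctr_diff)
    if all_pairs == 0:
        return 0.0, 0.0
    return mistakes / all_pairs, mistake_weight / all_weight

rate, weighted = error_rates([[(0.10, 3.0), (0.05, 2.0), (0.01, 2.5)]])
```

In the toy story only the pair with the smallest CTR gap is misordered, so the weighted rate comes out lower than the plain rate.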
Evaluation
NDCG (normalized discounted cumulative gain)
– A valuable metric for applications that require high precision at top ranks
– Score for a sorted list of k concepts on document i:
\[ NDCG_{document_i} = N_i \sum_{j=1}^{k} \frac{2^{score(j)} - 1}{\log(1 + j)} \]
– where N_i is a normalization constant, score(j)=bucketNo(CTR(j)/100), and bucketNo() returns a bucket number between 0 and 1000, considering all the CTR values observed in the system in increasing order.
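A minimal sketch of the NDCG computation follows, taking the per-concept score(j) values as given. It assumes the usual normalization, where N_i is chosen so that the ideal ordering scores 1, and it uses the natural logarithm. Neither choice is stated on the slide, and the function names are illustrative.

```python
import math

def dcg(scores):
    """Sum of (2^score(j) - 1) / log(1 + j) over positions j = 1..k."""
    return sum((2 ** s - 1) / math.log(1 + j) for j, s in enumerate(scores, start=1))

def ndcg(scores_in_predicted_order):
    """DCG of the predicted order divided by the DCG of the ideal order.

    Dividing by the ideal DCG plays the role of N_i (an assumption).
    """
    ideal = dcg(sorted(scores_in_predicted_order, reverse=True))
    return dcg(scores_in_predicted_order) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```

Because of the normalization, the base of the logarithm cancels out and does not affect the result.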
Evaluation
Interestingness features
Evaluation
Relevance score
Evaluation
Interestingness Features and Relevance Score
Evaluation
Editorial Evaluation
1. A processed set of documents is presented to the judges.
2. Each judge is asked to select a document from the pool.
3. The judge is asked to read the document and rate each entity or concept highlighted in it in terms of its interestingness and relevance.
Contributions
We propose to use implicit user feedback in the form of click data to determine the most interesting and relevant concepts in a context via a machine learning approach.
We describe a feature space pertinent to the interestingness of a concept, and present algorithms to identify relevance of a concept in a given context.
We evaluate the proposed techniques extensively using click data, an editorial study, and an analysis of a production system. The results show significant improvements.
We provide a detailed description of a framework that enables efficient implementation of the proposed techniques in a production system.
Discussion
No theoretical basis for their feature selection assumptions.
No references or underlying theory at all
Depends on technology already developed in previous studies.
Having such a valuable dataset is a huge advantage.
Q&A
Thank you