Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session 2010. 04. 09. Summarized

Contextual Ranking of Keywords Using Click Data

Utku Irmak, Vadim von Brzeski, Reiner Kraft

Yahoo! Inc

ICDE '09, Data Mining session

2010. 04. 09.

Summarized by Park, Sung Eun, IDS Lab., Seoul National University

Presented by Park, Sung Eun, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Contents

Introduction

Contextual Shortcuts

Concept Ranking Method

Feature Space

Interestingness and Relevance of a Concept

Evaluation

Cross Validation Approach, Editorial Evaluation, Real World Results

Conclusion


Introduction

Determining and ranking the key concepts in a document

Goal

Given the candidate set of entities, learn a ranking function that orders the entities by their interestingness and relevance

Applications

Contextual advertising systems

Text summarization

User centric entity detection systems

– Detect entities and concepts within text

– Transform the detected entities into actionable items such as “intelligent hyperlinks”


Contextual Shortcut


A concept vector

Concept: a piece of text that refers to an abstract thought or idea, e.g. car insurance, justice

Generating the concept vector

– Term vector: TF-IDF, computed from documents in Yahoo! Search

– Unit vector: all units found in the document

Units are constructed from query logs by an iterative statistical approach that uses the frequencies of the distinct queries

– Concept vector: the term vector and the unit vector are merged

Mutual information of x and y:

$$I(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
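
The formula on this slide, I(x, y) = log(p(x, y) / (p(x) p(y))), can be illustrated with a minimal sketch. It assumes the probabilities are estimated from raw query-log counts; the function name and the counts are hypothetical and not taken from the paper.

```python
import math


def mutual_information(count_xy, count_x, count_y, total):
    """I(x, y) = log(p(x, y) / (p(x) * p(y))), with probabilities estimated from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


# Hypothetical counts: "car" and "insurance" appear together far more often than chance.
score = mutual_information(count_xy=50, count_x=100, count_y=200, total=10_000)
print(round(score, 3))
```

A score well above zero suggests the two terms form a unit; a score near zero means they co-occur about as often as independent terms would.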


Previous Concept Ranking Method

AG(TF,Unit)

1. A term appears in the term vector, but not in the unit vector

– punish its term vector weight

2. A term appears in the unit vector, but not in the term vector

– add this term to the concept vector with its unit weight

3. A term appears in both vectors

– sum its term vector and unit vector weights

Document

Concept | AG(TF,Unit) Score | Ranking
President bush | 1.1549 | 1
Iraq war | 1.1833 | 2
Political parties | 0.6147 | 3
…


Proposed Concept Ranking Method

Ranking function: SVM (Support Vector Machine)

SVMlight: an open source library for ranking SVM

Interestingness: 9 features of a concept

Relevance: pre-mined terms of the concept

[Figure: Features and Terms are fed to SVMlight, which produces the ranking]

Concept | Interestingness | Relevance | Ranking
Concept1 | I1 | R1 | 1
Concept2 | I2 | R2 | 2
Concept3 | I3 | R3 | 3
… | … | … | …
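
To make the training input concrete, here is a minimal sketch that writes examples in the SVMlight ranking format, `<target> qid:<qid> <index>:<value> ...`. It assumes the click-through rate of a concept serves as the ranking target and that each document is one `qid` group; the feature numbering and all values are hypothetical.

```python
def to_svmlight_line(target, qid, features):
    """Format one example as '<target> qid:<qid> <index>:<value> ...'.

    Feature indices are written in increasing order, as SVMlight expects.
    """
    feats = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()))
    return f"{target:g} qid:{qid} {feats}"


# Hypothetical data: one document (qid=1) with three detected concepts.
# target = observed CTR; features 1-2 = interestingness features, 3 = relevance score.
examples = [
    (0.030, 1, {1: 120.0, 2: 2.0, 3: 0.8}),
    (0.012, 1, {1: 45.0, 2: 3.0, 3: 0.3}),
    (0.004, 1, {1: 10.0, 2: 1.0, 3: 0.1}),
]
lines = [to_svmlight_line(*ex) for ex in examples]
print("\n".join(lines))
```

Only the relative order of targets within the same `qid` matters for ranking, so concepts from different documents are never compared with each other.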


Interestingness of a concept

Category | Feature | Details
Search Engine Query Logs | Freq exact | # of queries received that are exactly the same as the concept
Search Engine Query Logs | Freq phrase contained | # of queries that contain the concept as a phrase
Search Engine Query Logs | Unit score | The score in the unit vector
Search Engine Result Pages | Search engine phrase | The number of pages returned for the concept as a query
Text Based Features | Concept size | # of terms in the concept
Text Based Features | Number of characters | # of characters in the concept
Text Based Features | Subconcepts | # of subconcepts contained in the concept
Taxonomy | High level type | If the concept exists in one of the editorially maintained lists, use it as a feature
Others | Wiki word count | The length of the Wikipedia article
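
The three text-based features are simple enough to sketch directly. This is an illustration only: it assumes terms are whitespace-separated, that the character count includes spaces, and that a subconcept is any shorter contiguous phrase found in a known-concept set (`known_concepts` is a hypothetical stand-in for the system's dictionary).

```python
def text_features(concept, known_concepts):
    """Compute concept size, number of characters, and number of subconcepts."""
    terms = concept.split()
    n = len(terms)
    subconcepts = 0
    # Enumerate every contiguous sub-phrase that is shorter than the concept itself.
    for start in range(n):
        for end in range(start + 1, n + 1):
            if end - start == n:
                continue  # skip the full concept
            if " ".join(terms[start:end]) in known_concepts:
                subconcepts += 1
    return {"concept_size": n, "num_chars": len(concept), "subconcepts": subconcepts}


known = {"new york", "york city", "city", "new york city"}
features = text_features("new york city", known)
print(features)
```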


Relevance of a Concept in a Context

A mining approach to obtain a good relevance scoring mechanism

Use pre-mined keywords for each concept

Relevant terms of concept $c_i$, where $C = \{c_1, c_2, \ldots, c_n\}$, as (term, score) pairs:

$$relevantTerms_i = \{(t_1, s_1), (t_2, s_2), \ldots, (t_k, s_k)\}$$

The relevance of a concept can be computed based on the co-occurrence of its pre-mined keywords $t_i$ in the context.
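
The slide does not give the exact relevance formula, so the following is only a sketch of one plausible reading: sum the scores of the pre-mined terms that also occur in the context. The tokenizer, the single-word terms, and the example values are assumptions.

```python
import re


def relevance_score(relevant_terms, context):
    """Sum the scores s_j of the pre-mined terms t_j that occur in the context."""
    tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    return sum(score for term, score in relevant_terms if term in tokens)


# Hypothetical pre-mined (term, score) pairs for the concept "iraq war".
relevant_terms = [("baghdad", 2.0), ("troops", 1.5), ("oil", 0.5)]
context = "US troops entered Baghdad on Monday."
relevance = relevance_score(relevant_terms, context)
print(relevance)
```

A concept whose pre-mined terms are absent from the surrounding text gets a score of zero, which matches the intuition that it is off-topic in that context.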



Relevant term scoring

1. Search engine snippets

– Using Yahoo! Developer Network API

– Treat the returned snippets as a single document and compute score = tf*idf

– Keep the top m = 100 terms by score

2. Prisma query refinement tool

– Prisma is a tool that helps users augment or replace their queries by providing feedback terms. It considers the top 50 documents in a large collection, based on factors such as the count and position of the terms, document rank, and occurrence of query terms within the input phrase.

– Construct a single document from the concepts returned by Prisma for concept c_i and compute the score based on the tf*idf values
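
The snippet-based scoring (treat the snippets as one document, score each term by tf*idf, keep the top m) can be sketched as follows. The idf variant `log(N / (1 + df))`, the tokenizer, and the document-frequency table are assumptions; the slides do not specify them.

```python
import math
import re
from collections import Counter


def top_terms(snippets, doc_freq, num_docs, m=100):
    """Treat all snippets as one document and return the top-m (term, tf*idf) pairs."""
    tokens = re.findall(r"[a-z0-9]+", " ".join(snippets).lower())
    tf = Counter(tokens)
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]


# Hypothetical snippets and corpus statistics.
snippets = ["Iraq war news", "war in Iraq: troops and news"]
doc_freq = {"war": 100, "iraq": 50, "news": 5000, "in": 9000, "and": 9500, "troops": 80}
best = top_terms(snippets, doc_freq, num_docs=10_000, m=3)
print([term for term, _ in best])
```

Frequent but uninformative words such as "in" and "news" receive low idf and drop out of the top terms without an explicit stop-word list.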



Relevant term scoring

3. Related query suggestions

– Using Yahoo! Developer Network API

– 300 suggestions and the query frequencies of the suggestions

– Let k be the number of suggestions in which the term appears

$$termScore(term) = \sum_{i=1}^{k} \ln(query\_freq_i) \cdot idf(term)$$

[Figure: the three sources of relevant terms: Snippet, Prisma, Query Suggestions]
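
A minimal sketch of the query-suggestion score, termScore(term) = sum over the k suggestions containing the term of ln(query_freq_i) * idf(term). It assumes whitespace tokenization and that a term counts once per suggestion; the suggestions, frequencies, and idf values are hypothetical.

```python
import math


def suggestion_term_scores(suggestions, idf):
    """suggestions: list of (suggestion_text, query_frequency) pairs."""
    scores = {}
    for text, freq in suggestions:
        # A term contributes once for each suggestion that contains it.
        for term in set(text.lower().split()):
            scores[term] = scores.get(term, 0.0) + math.log(freq) * idf.get(term, 0.0)
    return scores


suggestions = [("iraq war timeline", 1000), ("iraq war casualties", 500)]
idf = {"iraq": 2.0, "war": 1.0, "timeline": 3.0, "casualties": 4.0}
scores = suggestion_term_scores(suggestions, idf)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Taking the logarithm of the query frequency keeps a single very popular suggestion from dominating the score.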


Intuition of Query Suggestion and Prisma


Evaluation

Cross Validation Approach

Data

– Randomly sampled news stories that were annotated by Contextual Shortcuts

– The number of times these stories were viewed and the number of clicks received by each concept that was detected in the stories

– 870 stories, 6420 concepts, 16549 sampled clicks

Error Rate

$$ErrorRate = \frac{|MistakenlyPredictedPairs|}{|AllPairs|}$$

Weighted Error Rate

$$WeightedErrorRate = \frac{\sum_{i=1}^{|mistakes|} CTRdifference_i}{\sum_{i=1}^{|all\ pairs|} CTRdifference_i}$$

where click-through rate (CTR) = (the number of clicks) / (the number of views)
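
Both metrics can be computed in one pass over the concept pairs of a document: the error rate is the fraction of mispredicted pairs, and the weighted error rate weights each pair by its CTR difference. The sketch below is an illustration under stated assumptions: pairs with equal CTR are skipped, and a tie in predicted score counts as a mistake. The example values are hypothetical.

```python
from itertools import combinations


def error_rates(concepts):
    """concepts: list of (predicted_score, ctr). Returns (error_rate, weighted_error_rate)."""
    all_pairs = mistakes = 0
    all_weight = mistake_weight = 0.0
    for (pred_a, ctr_a), (pred_b, ctr_b) in combinations(concepts, 2):
        if ctr_a == ctr_b:
            continue  # no click preference between these two concepts
        diff = abs(ctr_a - ctr_b)
        all_pairs += 1
        all_weight += diff
        # Mistake: the predicted order disagrees with the order given by CTR.
        if (pred_a - pred_b) * (ctr_a - ctr_b) <= 0:
            mistakes += 1
            mistake_weight += diff
    if all_pairs == 0:
        return 0.0, 0.0
    return mistakes / all_pairs, mistake_weight / all_weight


# Hypothetical document: the second and third concepts are ranked in the wrong order.
rates = error_rates([(0.9, 0.05), (0.7, 0.01), (0.4, 0.04)])
print(rates)
```

Here one of three pairs is wrong, but that pair has a large CTR difference, so the weighted error rate is higher than the plain error rate.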


Evaluation

NDCG (Normalized Discounted Cumulative Gain)

– A valuable metric for applications that require high precision at the top ranks

– Score for a sorted list of k concepts on document_i:

$$NDCG_{document_i} = N_i \sum_{j=1}^{k} \frac{2^{score(j)} - 1}{\log(1 + j)}$$

– where score(j) = bucketNo(CTR(j)/100), and bucketNo() returns a bucket number between 0 and 1000 considering all the CTR values observed in the system in increasing order.
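
A minimal sketch of the NDCG computation, with gain 2^score(j) - 1 and discount log(1 + j). It assumes the normalization constant N_i is the reciprocal of the DCG of the ideal ordering, which is the usual convention but is not stated on the slide; the scores are hypothetical.

```python
import math


def dcg(scores):
    """Discounted cumulative gain of scores listed in ranked order (position j starts at 1)."""
    return sum((2 ** s - 1) / math.log(1 + j) for j, s in enumerate(scores, start=1))


def ndcg(scores_in_predicted_order):
    """DCG of the predicted order divided by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(scores_in_predicted_order, reverse=True))
    return dcg(scores_in_predicted_order) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # predicted order matches the ideal order
print(ndcg([1, 2, 3]))  # worst possible order for these scores
```

Because the result is a ratio of two DCG values, the base of the logarithm cancels out and does not affect the score.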


Evaluation

Interestingness features


Evaluation

Relevance score


Evaluation

Interestingness Features and Relevance Score


Evaluation

Editorial Evaluation

1. A processed set of documents is presented to the judges.

2. A judge is asked to select a document from the pool.

3. The judge is asked to read the document and rate each entity or concept highlighted in the document in terms of its interestingness and relevance.


Contributions

We propose to use implicit user feedback in the form of click data to determine the most interesting and relevant concepts in a context via a machine learning approach.

We describe a feature space pertinent to the interestingness of a concept, and present algorithms to identify the relevance of a concept in a given context.

We evaluate the proposed techniques extensively using click data, an editorial study, and an analysis of a production system. The results show significant improvements.

We provide a detailed description of a framework that enables efficient implementation of the proposed techniques in a production system.


Discussion

No theoretical basis for their feature selection assumptions.

No references or underlying theory at all.

Depends on technology already developed in previous studies.

Huge advantage from having a valuable dataset.

Q&A

Thank you
