Contextual Ranking of Keywords Using Click Data

Utku Irmak, Vadim von Brzeski, Reiner Kraft

Yahoo! Inc.

ICDE '09, Data Mining session

2010. 04. 09.

Summarized by Park, Sung Eun, IDS Lab., Seoul National University

Presented by Park, Sung Eun, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Contents

Introduction

Contextual Shortcuts

Concept Ranking Method

Feature Space

Interestingness and Relevance of a Concept

Evaluation

Cross Validation Approach, Editorial Evaluation, Real World Results

Conclusion

Introduction

Determining and ranking the key concepts in a document

Goal

Given the candidate set of entities, learn a ranking function which orders the entities by their interestingness and relevance

Applications

Contextual advertising systems

Text summarization

User centric entity detection systems

– Detect entities and concepts within text

– Transform the detected entities into actionable items such as “intelligent hyperlinks”

Contextual Shortcut

A concept vector

Concepts : A piece of text that refers to an abstract thought or idea. Ex) car insurance, justice

Generating concept vector

– Term vector : TF/IDF from documents in Yahoo! Search

– Unit vector : all units found in the document

Units are constructed from query logs in an iterative statistical approach using the frequencies of the distinct queries

Mutual information of terms x and y:

$I(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$

– Concept vector : the term vector and the unit vector are merged
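The mutual information formula on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probabilities are estimated as fractions of a toy query list, p(x, y) counts queries in which x is immediately followed by y, and the function names are hypothetical. The slides do not describe how the actual system estimates these probabilities.

```python
import math


def has_adjacent(tokens, x, y):
    """True if term x is immediately followed by term y in the token list."""
    return any(a == x and b == y for a, b in zip(tokens, tokens[1:]))


def mutual_information(queries, x, y):
    """Estimate I(x, y) = log(p(x, y) / (p(x) p(y))) from a list of queries.

    Assumption: each probability is the fraction of queries containing the
    term (or the adjacent pair). Returns None when the value is undefined.
    """
    token_lists = [q.lower().split() for q in queries]
    total = len(token_lists)
    n_x = sum(1 for t in token_lists if x in t)
    n_y = sum(1 for t in token_lists if y in t)
    n_xy = sum(1 for t in token_lists if has_adjacent(t, x, y))
    if total == 0 or n_x == 0 or n_y == 0 or n_xy == 0:
        return None
    return math.log((n_xy / total) / ((n_x / total) * (n_y / total)))
```

A larger value suggests that the two terms occur together more often than independence would predict, which is the kind of evidence one would want before treating "car insurance" as a single unit.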

Previous Concept Ranking Method

AG(TF, Unit)

1. A term appears in the term vector, but not in the unit vector

– punish its term vector weight

2. A term appears in the unit vector, but not in the term vector

– add this term to the concept vector with its unit weight

3. A term appears in both vectors

– sum its term vector and unit vector weights

Example document

Concept | AG(TF, Unit) score | Ranking
President Bush | 1.1549 | 1
Iraq war | 1.1833 | 2
Political parties | 0.6147 | 3
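The merge rules of AG(TF, Unit) can be sketched as follows. This is a rough illustration, not the authors' implementation: the slide does not say how a weight is "punished", so the multiplicative `punish` factor of 0.5 is a hypothetical choice, and the handling of terms found only in the unit vector (kept with their unit weight) is an assumption read off the partly garbled slide text.

```python
def merge_vectors(term_vec, unit_vec, punish=0.5):
    """Merge a term vector and a unit vector into a concept vector.

    term_vec, unit_vec: dicts mapping a term to its weight.
    punish: hypothetical damping factor for terms seen only in the term vector.
    """
    concept_vec = {}
    for term in set(term_vec) | set(unit_vec):
        if term in term_vec and term in unit_vec:
            # Case 3: present in both vectors, sum the two weights.
            concept_vec[term] = term_vec[term] + unit_vec[term]
        elif term in term_vec:
            # Case 1: only in the term vector, punish the term weight.
            concept_vec[term] = punish * term_vec[term]
        else:
            # Case 2: only in the unit vector, keep the unit weight (assumed).
            concept_vec[term] = unit_vec[term]
    return concept_vec


def rank_concepts(concept_vec):
    """Return the concepts ordered by decreasing merged weight."""
    return sorted(concept_vec, key=concept_vec.get, reverse=True)
```

Ranking by this merged weight uses only the document text and query-log statistics; click data plays no role, which is what the proposed method changes.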

Proposed Concept Ranking Method

Ranking function: SVM (Support Vector Machine)

SVMlight: an open source library that supports ranking SVM

Interestingness: 9 features of a concept

Relevance: pre-mined terms of the concept

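[Figure: the interestingness features and the pre-mined terms (Term 1, Term 2, ...) of each concept are passed to SVMlight, which outputs a ranking]

Concept | Interestingness | Relevance | Ranking
Concept1 | I1 | R1 | 1
Concept2 | I2 | R2 | 2
Concept3 | I3 | R3 | 3
… | … | … | …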
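As a sketch of how such training data could be handed to SVMlight, the snippet below writes examples in SVMlight's ranking input format, `<target> qid:<id> <feature>:<value> ...`, where examples sharing a `qid` are compared with each other. Using one `qid` per document and the observed click-through rate as the target value are assumptions made for illustration; the slides do not show the actual training files, and the function name is hypothetical.

```python
def to_svmlight_lines(documents):
    """Format ranking examples in SVMlight's input format.

    documents: a list of documents; each document is a list of
    (ctr, feature_values) tuples, one per concept detected in it.
    Feature numbers start at 1 and increase, as SVMlight requires.
    """
    lines = []
    for qid, concepts in enumerate(documents, start=1):
        for ctr, features in concepts:
            feats = " ".join(
                f"{index}:{value:g}"
                for index, value in enumerate(features, start=1)
            )
            # Only the relative order of targets within a qid matters
            # in ranking mode; the CTR as target is an assumption.
            lines.append(f"{ctr:g} qid:{qid} {feats}")
    return lines
```

A file made of these lines can then be used with `svm_learn -z p`, the option that selects preference ranking in SVMlight.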

Interestingness of a concept

Category | Feature | Details
Search engine query logs | Freq exact | # of queries received that are exactly the same as the concept
Search engine query logs | Freq phrase contained | # of queries that contain the concept as a phrase
Search engine query logs | Unit score | The score in the unit vector
Search engine result pages | Search engine phrase | The number of pages returned when the concept is issued as a query
Text-based features | Concept size | # of terms in the concept
Text-based features | Number of characters | # of characters in the concept
Text-based features | Subconcepts | # of subconcepts contained in the concept
Taxonomy | High level type | If the concept exists in one of the editorially maintained lists, use it as a feature
Others | Wiki word count | The length of the Wikipedia article for the concept
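A few of these features are simple enough to sketch directly. The example below computes the two query-log frequencies and the two simplest text-based features from a toy query log. The dictionary of query counts, the whitespace tokenization, and the choice to count an exact match as also containing the phrase are assumptions for illustration only.

```python
def contains_phrase(tokens, phrase):
    """True if the token list contains the phrase as a contiguous run."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def interestingness_features(concept, query_counts):
    """Compute a subset of the interestingness features for one concept.

    query_counts: dict mapping a query string to how often it was issued
    (a hypothetical stand-in for the search engine query log).
    """
    phrase = concept.lower().split()
    freq_exact = 0
    freq_phrase_contained = 0
    for query, count in query_counts.items():
        tokens = query.lower().split()
        if tokens == phrase:
            freq_exact += count
        if contains_phrase(tokens, phrase):
            freq_phrase_contained += count
    return {
        "freq_exact": freq_exact,
        "freq_phrase_contained": freq_phrase_contained,
        "concept_size": len(phrase),
        "num_characters": len(concept),
    }
```

The remaining features (unit score, result page count, subconcepts, high level type, Wiki word count) depend on resources the slides only name, so they are left out of this sketch.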

Relevance of a Concept in a Context

A mining approach to obtain a good relevance scoring mechanism

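Use pre-mined keywords for each concept in $C = \{c_1, c_2, \ldots, c_n\}$

Relevant terms of concept $c_i$, where each term $t_j$ is paired with a score $s_j$:

$relevantTerms_i = \{(t_1, s_1), (t_2, s_2), \ldots, (t_k, s_k)\}$

Relevance of the concept can be computed based on the co-occurrence of its pre-mined keywords in the context.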
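One simple way to turn the pre-mined (term, score) pairs into a relevance score is to add up the scores of the terms that also occur in the context. The slide only states that relevance is based on co-occurrence, so the summation, the whitespace tokenization, and the function name below are assumptions made for the sake of a concrete sketch.

```python
def relevance(context, relevant_terms):
    """Score a concept against a context using its pre-mined terms.

    context: the text surrounding the concept (for example the document).
    relevant_terms: list of (term, score) pairs mined for the concept.
    Assumption: the relevance is the sum of the scores of the pre-mined
    terms that appear in the context.
    """
    present = set(context.lower().split())
    return sum(score for term, score in relevant_terms if term in present)
```

For a concept such as "car insurance" with mined terms like "policy" and "driver", a context that mentions both would score higher than one that mentions neither, which matches the intuition that the concept is on topic there.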

Relevance of a Concept in a Context

Relevant term scoring

1. Search engine snippets

– Using Yahoo! Developer Network API

– Treat the returned snippets as a single document and compute score = tf*idf

– Keep the top m = 100 terms based on the score

2. Prisma query refinement tool

– Prisma is a tool that helps users augment or replace their queries by providing feedback terms. It considers the top 50 documents in a large collection and selects terms based on factors such as the count and position of the terms, document rank, and the occurrence of query terms within the input phrase.

– Construct a single document from the concepts returned by Prisma for concept c_i and compute the score based on the tf*idf values
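The snippet-based step can be sketched as below: the snippets are concatenated into one document, each term is scored by tf*idf, and the top m terms are kept. The document-frequency table `doc_freq`, the collection size `num_docs`, and the smoothed idf formula log(N / (1 + df)) are assumptions; the slides do not state which idf variant or background collection was used.

```python
import math
from collections import Counter


def top_terms(snippets, doc_freq, num_docs, m=100):
    """Return the m highest tf*idf terms from a list of search snippets.

    snippets: snippet strings returned for the concept used as a query.
    doc_freq: dict mapping a term to its document frequency in a
              background collection (hypothetical input).
    num_docs: number of documents in that background collection.
    """
    # Treat all snippets as one document and count term frequencies.
    tf = Counter(" ".join(snippets).lower().split())
    scores = {
        term: count * math.log(num_docs / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:m]
```

The same scoring could be applied to the single document built from the concepts returned by Prisma.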

Relevance of a Concept in a Context

Relevant term scoring

3. Related query suggestions

– Using Yahoo! Developer Network API

– 300 suggestions and the query frequencies of the suggestions

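– Let k be the number of suggestions in which the term appears, and query_freq_i the query frequency of the i-th such suggestion

$Score(term) = \sum_{i=1}^{k} \ln(query\_freq_i) \cdot idf(term)$

[Figure labels: Snippet, Prisma, Query Suggestions]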
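The scoring formula for query suggestions translates almost directly into code. In this sketch the suggestions are given as (text, frequency) pairs and the idf values come from a dictionary; both input shapes and the function name are assumptions, and every frequency is expected to be positive so that the logarithm is defined.

```python
import math


def suggestion_scores(suggestions, idf):
    """Score terms by Score(term) = sum_i ln(query_freq_i) * idf(term).

    suggestions: list of (suggestion_text, query_frequency) pairs.
    idf: dict mapping a term to its idf value (hypothetical input).
    The sum runs over the suggestions in which the term appears.
    """
    scores = {}
    for text, freq in suggestions:
        # Count each term once per suggestion, even if it repeats.
        for term in set(text.lower().split()):
            contribution = math.log(freq) * idf.get(term, 0.0)
            scores[term] = scores.get(term, 0.0) + contribution
    return scores
```

Taking the logarithm of the query frequency keeps a handful of extremely popular suggestions from dominating the score.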

Intuition of Query Suggestion and Prisma

Evaluation

Cross Validation Approach

Data

– Randomly sampled news stories that were annotated by Contextual Shortcuts

– The number of times these stories were viewed and the number of clicks received by each concept that was detected in the stories

– 870 stories, 6420 concepts, 16549 sampled clicks

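Error Rate

$\text{Error Rate} = \frac{|\text{Mistakenly Predicted Pairs}|}{|\text{All Pairs}|}$

Weighted Error Rate

$\text{Weighted Error Rate} = \frac{\sum_{i=1}^{|\text{mistakes}|} \text{CTR difference}_i}{\sum_{i=1}^{|\text{all pairs}|} \text{CTR difference}_i}$

Where Click-through-rate = (the number of clicks) / (the number of views)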
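Both error measures can be computed from the concept pairs of a document, as in the sketch below. It assumes that a pair is formed from any two concepts of the same document with different CTRs, and that a pair counts as a mistake when the predicted scores do not order it the same way as the CTRs (ties in the predicted score are treated as mistakes). These details, and the function name, are assumptions; the slides only give the two formulas.

```python
from itertools import combinations


def error_rates(concepts):
    """Return (error_rate, weighted_error_rate) for one document.

    concepts: list of (ctr, predicted_score) pairs, one per concept.
    Each mistaken pair is weighted by the absolute CTR difference.
    """
    all_pairs = mistakes = 0
    all_weight = mistake_weight = 0.0
    for (ctr_a, score_a), (ctr_b, score_b) in combinations(concepts, 2):
        if ctr_a == ctr_b:
            continue  # equal CTRs express no preference
        diff = abs(ctr_a - ctr_b)
        all_pairs += 1
        all_weight += diff
        # Mistake: the predicted order disagrees with the CTR order.
        if (ctr_a - ctr_b) * (score_a - score_b) <= 0:
            mistakes += 1
            mistake_weight += diff
    if all_pairs == 0:
        return 0.0, 0.0
    return mistakes / all_pairs, mistake_weight / all_weight
```

The weighted variant penalizes swapping two concepts with very different CTRs more than swapping two concepts that users clicked almost equally often.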

Evaluation

NDCG(Normalized discounted cumulative gain measure)

– A valuable metric for those applications that require high precision at top ranks

– Score for a sorted list of k concepts on document_i:

$NDCG_{document_i} = N_i \sum_{j=1}^{k} \frac{2^{score(j)} - 1}{\log(1 + j)}$

– Where score(j)=bucketNo(CTR(j)/100), bucketNo() returns a bucket number between 0 and 1000 considering all the CTR values observed in the system in increasing order.
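The NDCG formula can be sketched as follows. The function takes the score(j) values of the concepts in predicted order; the normalization constant N_i is assumed here to be the reciprocal of the sum obtained for the ideal ordering, which is the usual convention and makes a perfect ranking score 1. How bucketNo() is computed is not covered, so the scores are simply inputs.

```python
import math


def dcg(scores):
    """Sum of (2^score(j) - 1) / log(1 + j) over positions j = 1..k."""
    return sum(
        (2 ** score - 1) / math.log(1 + position)
        for position, score in enumerate(scores, start=1)
    )


def ndcg(scores_in_predicted_order):
    """NDCG for one document, given score(j) in predicted rank order.

    Assumption: N_i = 1 / dcg(ideal ordering), so a perfect ranking
    yields 1.0. Returns 0.0 when all scores are zero.
    """
    ideal = dcg(sorted(scores_in_predicted_order, reverse=True))
    if ideal == 0:
        return 0.0
    return dcg(scores_in_predicted_order) / ideal
```

Because each gain is divided by the logarithm of its position, mistakes at the top of the list cost more than mistakes further down, which is why the metric suits applications that need high precision at top ranks.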

Evaluation

Interestingness features

Evaluation

Relevance score

Evaluation

Interestingness Features and Relevance Score

Evaluation

Editorial Evaluation

1. A processed set of documents is presented to the judges.

2. A judge is asked to select a document from the pool.

3. The judge is asked to read the document and rate each entity or concept highlighted in it in terms of its interestingness and relevance.

Contributions

We propose to use implicit user feedback in the form of click data to determine the most interesting and relevant concepts in a context via a machine learning approach.

We describe a feature space pertinent to the interestingness of a concept, and present algorithms to identify relevance of a concept in a given context.

We evaluate the proposed techniques extensively using click data, an editorial study, and an analysis of the production system. The results show significant improvements.

We provide a detailed description of a framework that enables efficient implementation of the proposed techniques in a production system.

Discussion

No theoretical basis for their feature selection assumptions.

No references or underlying theory at all.

Depends on technology already developed in previous studies.

Huge advantage in having a valuable dataset.

Q&A

Thank you
