Context-aware Query Suggestion by Mining Click-through and Session Data. Authors: H. Cao et al., KDD '08. Presented by Shize Su


Outline

Introduction

Framework of the Proposed Method

Mining Query Concepts

Concept Sequence Suffix Tree

Experimental Evaluation

Summary


Introduction

What is query suggestion in a search engine?

Guess the user’s search intent from the user query ($q_i$, or the sequence $q_1 q_2 \cdots q_i$)

Suggest queries $q_j, q_k$ that better describe the user’s information need

Why is query suggestion important?

Is it easy to issue an appropriate query? No!

It is a “bottleneck issue” for search engine usability (Google, Yahoo, Bing, Baidu, etc.)

Introduction

Major existing approaches (using search log data):

Approach I: cluster queries using clicked-URL data to find similar queries (are $q_i$ and $q_j$ similar?)

Approach II: mine pairs of queries that are adjacent or co-occur in the same query session (is the pair $q_i q_j$ frequent?)

Fig. 1: An example of search log data

Introduction

Key limitations:

None of them is context-aware: they do not consider the immediately preceding queries as context

The clustering algorithms do not scale well to very large data

An example: “apple” versus “steve jobs” followed by “apple”. What is the user’s search intent for $q_i$ given the preceding queries $q_1 q_2 \cdots q_{i-1}$?

Scale of the search log: 1.8 billion queries (151 million unique), 2.6 billion URL clicks (114 million unique URLs)

Proposed Method Framework

Key steps:

Capture the context: concept sequence

Quickly find the queries that many users ask in that context

Clustering queries


An example of click-through bipartite data from a search log:


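For each query $q_i$: an $L_2$-normalized vector,

$$\vec{q_i}[j] = \begin{cases} \mathrm{norm}(w_{ij}), & \text{if edge } e_{ij} \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$

with

$$\mathrm{norm}(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}$$

Distance between two queries:

$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(\vec{q_i}[k] - \vec{q_j}[k]\right)^2}$$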
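The vector and distance definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the edge weight $w_{ij}$ is the number of clicks on URL $u_j$ for query $q_i$ and that each query is stored as a sparse dictionary; the function names and the click counts are hypothetical.

```python
import math

def l2_normalize(click_counts):
    """Turn {url: click_count} into an L2-normalized sparse vector."""
    norm = math.sqrt(sum(w * w for w in click_counts.values()))
    return {url: w / norm for url, w in click_counts.items()} if norm else {}

def distance(qi, qj):
    """Euclidean distance between two sparse vectors (a missing URL counts as 0)."""
    urls = set(qi) | set(qj)
    return math.sqrt(sum((qi.get(u, 0.0) - qj.get(u, 0.0)) ** 2 for u in urls))

# Hypothetical click counts, not taken from the paper's data.
q1 = l2_normalize({"u1": 3, "u2": 4})
q2 = l2_normalize({"u2": 5})
d12 = distance(q1, q2)
```

Storing only the nonzero entries matters here because each query touches a tiny fraction of the URL dimensions.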

Key challenges in clustering queries:

The search-log click-through bipartite can be huge, e.g., 151 million unique queries

The number of clusters is unknown

The query vectors have extremely high dimensionality: 114 million unique URLs

Search logs grow dynamically

Existing query clustering algorithms:

Hierarchical agglomerative method

DBSCAN method (Wen, WWW’01)

K-means, etc.


Proposed clustering method:



For each query $q$:

Step 1: find the closest cluster $C$ to $q$ among the clusters obtained so far

Step 2: compute the diameter of the cluster $C \cup \{q\}$

Step 3: 1) if diameter $\le D_{max}$, $q$ is assigned to $C$,

2) otherwise, create a new cluster containing only $q$

Quite efficient:

Only one scan of the queries is needed

Can run efficiently on a PC with 2 GB of main memory

q

Tricks for improving algorithm efficiency:

A dimension array data structure is used in Step 1 (the data are sparse)

Prune edges of low weight

Distance used in Step 1:

$$\mathrm{distance}(q_i, q_j) = \sqrt{\sum_{u_k \in U} \left(\vec{q_i}[k] - \vec{q_j}[k]\right)^2}$$
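One way to picture the two tricks is sketched below, under the assumption that the dimension array is an index from each URL (dimension) to the clusters that have a member with a nonzero value in that dimension, so that Step 1 only examines clusters sharing at least one URL with the query. The class name, method names, and the `min_weight` threshold are hypothetical.

```python
from collections import defaultdict

def prune_edges(click_counts, min_weight):
    """Drop query-URL edges whose click count is below min_weight."""
    return {url: w for url, w in click_counts.items() if w >= min_weight}

class DimensionArray:
    """Maps each URL (dimension) to the ids of clusters with a member that clicked it."""

    def __init__(self):
        self.by_url = defaultdict(set)

    def add(self, cluster_id, query_vector):
        """Register a query's nonzero dimensions under its cluster id."""
        for url in query_vector:
            self.by_url[url].add(cluster_id)

    def candidates(self, query_vector):
        """Ids of clusters that share at least one nonzero dimension with the query."""
        found = set()
        for url in query_vector:
            found |= self.by_url.get(url, set())
        return found
```

Because each query has only a handful of clicked URLs, the candidate set is usually far smaller than the full list of clusters.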

Extract query session data:

Collect each individual user’s behavior (query/click) data

Segment it into sessions (a new session starts when the time interval exceeds 30 minutes)

Discard the click event data


Fig: An example of search log data
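The session extraction step can be sketched as follows. The sketch assumes one user's events arrive as a time-sorted list of `(timestamp_seconds, kind, value)` tuples, where `kind` is `"query"` or `"click"`; this record layout and the function name are assumptions for illustration, not the log format used in the paper.

```python
def segment_sessions(events, gap_seconds=30 * 60):
    """Split one user's time-sorted events into sessions of queries only."""
    sessions, current, last_time = [], [], None
    for timestamp, kind, value in events:
        # A gap of more than 30 minutes between consecutive events starts a new session.
        if last_time is not None and timestamp - last_time > gap_seconds:
            if current:
                sessions.append(current)
            current = []
        # Clicks still advance the clock but are not kept in the session.
        if kind == "query":
            current.append(value)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions
```

For example, two queries issued within a few minutes of each other land in the same session, while a query issued 31 minutes after the previous event opens a new one.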

Concept sequence suffix tree:

A structure used to efficiently find the queries that many users ask in a given context (concept sequence)


Fig: An example

Algorithm to build the concept sequence suffix tree:

1) Map each training session $qs = q_1 q_2 \cdots q_i$ to a concept sequence $cs = c_1 c_2 \cdots c_j$, $j \le i$

2) Enumerate the subsequences $c_i c_{i+1} \cdots c_l$ of $cs$, $1 \le i \le l \le j$ (distributed, map-reduce)

3) Get all frequent concept subsequences $cs$

4) Organize these $cs$ into the concept sequence suffix tree
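Steps 1 to 3 can be illustrated on a single machine (the slides describe a distributed map-reduce job, which this sketch does not attempt). It assumes that consecutive queries belonging to the same concept are merged into one concept, which is one reading of $j \le i$, and that "frequent" means a count of at least `min_count`; both the threshold and the function names are hypothetical.

```python
from collections import Counter

def to_concept_sequence(session, query_to_concept):
    """Map queries to concept ids, merging consecutive repeats of the same concept."""
    concepts = []
    for query in session:
        concept = query_to_concept[query]
        if not concepts or concepts[-1] != concept:
            concepts.append(concept)
    return tuple(concepts)

def frequent_subsequences(concept_sequences, min_count):
    """Count every contiguous subsequence and keep those seen at least min_count times."""
    counts = Counter()
    for cs in concept_sequences:
        for start in range(len(cs)):
            for end in range(start + 1, len(cs) + 1):
                counts[cs[start:end]] += 1
    return {sub: n for sub, n in counts.items() if n >= min_count}
```

Each subsequence is stored as a tuple so it can serve directly as a dictionary key when the tree is built.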

Concept Sequence Suffix Tree

Algorithm for organizing $cs$ into the concept sequence suffix tree:

Organize $cs$ into the concept sequence suffix tree:

1) Start from the root node (empty) and scan through all frequent concept subsequences $cs = c_1 c_2 \cdots c_l$

2) For each $cs$, first find the node $cr$ corresponding to $cs' = c_1 c_2 \cdots c_{l-1}$; if $cr$ does not exist, create it

3) Update the list of candidate concepts of $cs'$ if $c_l$ is among the top $K$ candidates so far ($K$ is a specified threshold, e.g., $K = 5$)

4) The representative queries of the top $K$ candidate concepts are the candidate suggestions for the sequence $cs'$
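A simplified sketch of these four steps follows. Instead of explicit tree nodes with parent pointers, it keeps a dictionary keyed by the context $cs'$, and the node for a context's suffix is found by dropping the first concept. That representation, the function name, and the choice to list single-concept sequences as candidates of the root are assumptions of this sketch rather than details given in the slides.

```python
def build_suffix_tree(frequent, k=5):
    """frequent: {concept tuple: count}. Returns {context tuple: [(concept, count), ...]}."""
    tree = {(): []}  # root node: empty context
    for cs, count in frequent.items():
        context, last = cs[:-1], cs[-1]
        # Create the node for cs' and every suffix of cs' if they are missing.
        for start in range(len(context) + 1):
            tree.setdefault(context[start:], [])
        candidates = tree[context]
        candidates.append((last, count))
        candidates.sort(key=lambda item: -item[1])
        del candidates[k:]  # keep only the top-k candidate concepts
    return tree
```

Mapping each stored concept to its representative query (step 4) is a separate lookup that the caller would supply.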

Review an example of Concept Sequence Suffix Tree:

$cs = c_1 c_2 \cdots c_l$, $cs' = c_1 c_2 \cdots c_{l-1}$

Online query suggestion algorithm:



Concept Sequence Suffix Tree

For a query sequence $q_1 q_2 \cdots q_l$:

Map it to a concept sequence $c_1 c_2 \cdots c_l$: if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \cdots q_l$

Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \cdots c_l$, $j \ge 1$

Use the candidate suggestions for $c_j c_{j+1} \cdots c_l$ as the query suggestions for $q_1 q_2 \cdots q_l$
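The online procedure can be sketched as a backward mapping followed by a longest-suffix lookup. The sketch assumes the tree is a dictionary from a context tuple to a ranked list of `(concept, count)` pairs, that `representative` maps a concept to its representative query, and that consecutive repeats of one concept are merged; these data layouts and the tiny example values are hypothetical.

```python
def suggest(queries, query_to_concept, tree, representative):
    """Return suggested queries for a query sequence, or [] if nothing matches."""
    # Map queries to concepts from the most recent one backwards; stop at an unseen query.
    concepts = []
    for query in reversed(queries):
        if query not in query_to_concept:
            break
        concept = query_to_concept[query]
        if not concepts or concepts[0] != concept:
            concepts.insert(0, concept)
    # Try the longest suffix c_j ... c_l first, then progressively shorter ones.
    for start in range(len(concepts)):
        context = tuple(concepts[start:])
        if tree.get(context):
            return [representative[c] for c, _count in tree[context]]
    return []

# Hypothetical tree and mappings for illustration only.
tree = {(): [], ("C1",): [("C2", 2)], ("C4", "C1"): [("C3", 5)]}
q2c = {"apple": "C1", "steve jobs": "C4"}
rep = {"C2": "iphone", "C3": "apple inc"}
```

With these values, `suggest(["apple"], q2c, tree, rep)` and `suggest(["steve jobs", "apple"], q2c, tree, rep)` return different suggestions for the same last query, which is the point of using context.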

Review an example of Concept Sequence Suffix Tree:

$qs = q_1 q_2 \cdots q_i$, $cs = c_1 c_2 \cdots c_j$, $j \le i$

Experimental Evaluation

Training data:

A commercial search engine search log (Bing) in the US

1.8 billion queries (151 million unique), 2.6 billion URL clicks (115 million unique), 840 million sessions

Baseline algorithms:

Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$

N-Gram: given $qs = q_1 q_2 \cdots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \cdots q_i q_j$

Test set data:

Test-0: 1000 randomly selected single-query sessions

Test-1: 1000 randomly selected multi-query sessions
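The two baselines differ only in how much of the preceding session they use as the lookup key, which a short sketch makes concrete. It assumes training sessions are lists of query strings and counts how often a query immediately follows a context; the function names, the `full_context` flag, and the toy sessions are hypothetical.

```python
from collections import Counter

def following_counts(sessions, full_context):
    """Count (context, next_query) pairs.

    With full_context=False the context is the single preceding query (Adjacency);
    with full_context=True every contiguous run ending just before the next query
    is counted as a context (N-Gram).
    """
    counts = Counter()
    for session in sessions:
        for end in range(1, len(session)):
            starts = range(end) if full_context else [end - 1]
            for start in starts:
                counts[(tuple(session[start:end]), session[end])] += 1
    return counts

def rank_next(counts, context):
    """Rank candidate next queries by how often they followed the given context."""
    followers = [(nxt, n) for (ctx, nxt), n in counts.items() if ctx == tuple(context)]
    return [nxt for nxt, _n in sorted(followers, key=lambda item: -item[1])]

# Toy sessions for illustration only.
sessions = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "d"],
            ["x", "b", "c"], ["y", "b", "c"]]
adjacency = following_counts(sessions, full_context=False)
ngram = following_counts(sessions, full_context=True)
```

On the toy sessions, Adjacency ranks "c" first after "b", while N-Gram ranks "d" first after "a", "b", because the longer context changes which follow-up is most frequent.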

Experimental Results

Coverage of suggestion:


Fig: The coverage of the three methods on (a) Test-0 and (b) Test-1

Experimental Results

Quality of suggestion (relevance grades collected from 10 judges):

Fig: The quality of the three methods on (a) Test-0 and (b) Test-1

Summary

Three things to know:

Some basics of query suggestion using search logs

The proposed efficient query clustering algorithm for search-log click-through bipartite data

The proposed efficient context-aware query suggestion method using a concept sequence suffix tree

Hint: a “concept”-level N-gram with variable length N, plus a structure for efficient search

Thank You!
