Context-aware Query Suggestion by Mining Click-through and Session Data
Authors: H. Cao et al., KDD’08
Presented by Shize Su
Outline
Introduction
Framework of the Proposed Method
Mining Query Concepts
Concept Sequence Suffix Tree
Experimental Evaluation
Summary
Introduction
What is query suggestion in a search engine? Guess the user’s search intent from the user’s query and suggest queries that better describe the user’s information need: given $q_i$, or $q_1 q_2 \cdots q_i$, suggest $q_j, q_k, \ldots$
Why is query suggestion important? Is it easy to issue an appropriate query? No! It is a “bottleneck issue” of search engine usability (Google, Yahoo, Bing, Baidu, etc.).
Introduction
Major existing approaches (with search log data):
Approach I: cluster queries using clicked-URL data to find similar queries (are $q_i$ and $q_j$ similar?).
Approach II: mine pairs of queries that are adjacent or co-occur in the same query session (is $q_i q_j$ frequent?).
Fig. 1: An example of search log data
Introduction
Key limitations:
None of them is context-aware: they do not consider the immediately preceding queries as context.
The clustering algorithms do not scale well to very large data, e.g., 1.8 billion queries (151 million unique) and 2.6 billion clicked URLs (114 million unique).
An example: “apple”, “steve jobs”, “apple”
User’s search intent? Given the preceding queries $q_1 q_2 \cdots q_{i-1}$, what does the user mean by $q_i$?
Proposed Method Framework
Key steps:
Capture the context: concept sequence (obtained by clustering queries)
Quickly find the queries that many users ask in that context: concept sequence suffix tree
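Mining Query Concepts
An example of click-through bipartite data from a search log:
For each query $q_i$: an $L_2$-normalized vector over the clicked URLs,
$$\vec{q_i}[j] = \begin{cases} norm(w_{ij}), & \text{if edge } e_{ij} \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$
with
$$norm(w_{ij}) = \frac{w_{ij}}{\sqrt{\sum_{\forall e_{ik}} w_{ik}^2}}$$
where $w_{ij}$ is the weight of the edge $e_{ij}$ between query $q_i$ and URL $u_j$.
The distance between two queries is
$$distance(q_i, q_j) = \sqrt{\sum_{u_k \in U} (\vec{q_i}[k] - \vec{q_j}[k])^2}$$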
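The vector and distance definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the edge weight is a click count, stores each query as a sparse dictionary, and uses made-up queries and URLs.

```python
import math

def l2_normalize(click_counts):
    """Turn {url: click count} into an L2-normalized sparse vector.

    Assumes click_counts is non-empty (the query has at least one click).
    """
    norm = math.sqrt(sum(w * w for w in click_counts.values()))
    return {url: w / norm for url, w in click_counts.items()}

def distance(qi, qj):
    """Euclidean distance between two sparse query vectors."""
    urls = set(qi) | set(qj)
    return math.sqrt(sum((qi.get(u, 0.0) - qj.get(u, 0.0)) ** 2 for u in urls))

# Hypothetical click-through data: query -> {clicked URL: click count}.
clicks = {
    "apple": {"apple.com": 3, "en.wikipedia.org/wiki/Apple": 4},
    "apple inc": {"apple.com": 5},
}
vectors = {q: l2_normalize(w) for q, w in clicks.items()}
d = distance(vectors["apple"], vectors["apple inc"])
```

Because every vector has unit length, the distance between two queries that share no clicked URL is always $\sqrt{2}$, which gives a natural scale for a diameter threshold.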
Mining Query Concepts
Key challenges in clustering queries:
The click-through bipartite of a search log can be huge, e.g., 151 million unique queries.
The number of clusters is unknown.
The query vectors have extremely high dimensionality: 114 million unique URLs.
Search logs grow dynamically.
Existing query clustering algorithms:
Hierarchical agglomerative method
DBSCAN method (Wen, WWW’01)
K-means, etc.
Mining Query Concepts
Proposed clustering method:
Mining Query Concepts
For each query $q$:
Step 1: find the closest cluster $C$ to $q$ among the clusters obtained so far.
Step 2: compute the diameter $D$ of the cluster $C \cup \{q\}$.
Step 3: 1) if $D \le D_{max}$, $q$ is assigned to $C$; 2) otherwise, create a new cluster containing only $q$.
Quite efficient:
Only one scan of the queries is needed.
Can run efficiently on a PC with 2 GB of main memory.
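The three steps can be sketched as a one-pass procedure. The slides do not define the cluster-to-query distance or the diameter, so this sketch makes two labeled assumptions: the closest cluster is the one whose (unnormalized mean) centroid is nearest to the query, and the diameter is the root mean squared pairwise distance of the members. The data and `d_max` value are hypothetical.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def centroid(members):
    """Mean of the member vectors (assumed cluster representative)."""
    total = {}
    for vec in members:
        for k, v in vec.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(members) for k, v in total.items()}

def diameter(members):
    """Root mean squared pairwise distance (an assumed definition)."""
    if len(members) < 2:
        return 0.0
    pairs = list(combinations(members, 2))
    return math.sqrt(sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs))

def cluster_queries(vectors, d_max):
    """Scan the queries once; return clusters as lists of query strings."""
    clusters = []  # each cluster is a list of (query, vector) pairs
    for query, vec in vectors.items():
        # Step 1: closest cluster found so far (None if there is none yet).
        closest = min(
            clusters,
            key=lambda c: dist(vec, centroid([v for _, v in c])),
            default=None,
        )
        # Steps 2 and 3: test the diameter of the cluster plus the query.
        if closest is not None and diameter([v for _, v in closest] + [vec]) <= d_max:
            closest.append((query, vec))
        else:
            clusters.append([(query, vec)])
    return [[q for q, _ in c] for c in clusters]

# Hypothetical L2-normalized query vectors.
vectors = {
    "apple": {"apple.com": 1.0},
    "apple inc": {"apple.com": 0.8, "apple.com/store": 0.6},
    "banana bread": {"recipes.example": 1.0},
}
concepts = cluster_queries(vectors, d_max=1.0)
```

This naive version recomputes centroids and compares the query against every cluster, so it only illustrates the control flow; it does not reproduce the efficiency claimed on the slide.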
Mining Query Concepts
Tricks for improving the algorithm’s efficiency:
A dimension array data structure used in Step 1 (exploits the sparsity of the data)
Prune edges of low weight
$$distance(q_i, q_j) = \sqrt{\sum_{u_k \in U} (\vec{q_i}[k] - \vec{q_j}[k])^2}$$
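One way to read the two tricks is as an inverted index plus a weight filter. The sketch below is an interpretation, not the authors' data structure: it assumes the dimension array maps each URL to the clusters that are non-zero on it, that clusters sharing no URL with a query are too far away to be worth a distance computation, and that pruning uses a simple absolute threshold (`min_weight` is a hypothetical parameter).

```python
from collections import defaultdict

def prune_edges(click_counts, min_weight):
    """Drop query-URL edges whose click count is below min_weight."""
    return {url: w for url, w in click_counts.items() if w >= min_weight}

class DimensionArray:
    """Maps each URL (dimension) to the ids of clusters that are non-zero on it."""

    def __init__(self):
        self.clusters_by_url = defaultdict(set)

    def add(self, cluster_id, query_vector):
        """Register the URLs of a query that was assigned to cluster_id."""
        for url in query_vector:
            self.clusters_by_url[url].add(cluster_id)

    def candidates(self, query_vector):
        """Clusters sharing at least one URL with the query."""
        found = set()
        for url in query_vector:
            found |= self.clusters_by_url.get(url, set())
        return found

# Hypothetical usage.
index = DimensionArray()
index.add(0, {"apple.com": 1.0})
index.add(1, {"recipes.example": 1.0})
nearby = index.candidates({"apple.com": 0.8, "apple.com/store": 0.6})
kept = prune_edges({"apple.com": 40, "spam.example": 1}, min_weight=2)
```

Since a query vector is sparse, the candidate set is usually far smaller than the full set of clusters, which is where the saving in Step 1 would come from.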
Concept Sequence Suffix Tree
Extract query session data:
Collect each individual user’s behavior (query/click) data.
Segment it into sessions (a time interval of more than 30 minutes starts a new session).
Discard the click event data.
Fig: An example of search log data
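The session extraction step can be sketched as follows. The event format `(timestamp, kind, text)` is an assumption made for illustration, as are the sample events; the sketch also assumes the 30-minute gap is measured between consecutive events of any kind.

```python
from datetime import datetime, timedelta

def segment_sessions(events, gap=timedelta(minutes=30)):
    """Split one user's time-ordered events into query sessions.

    events: list of (timestamp, kind, text), with kind "query" or "click".
    Returns a list of sessions, each a list of query strings.
    """
    sessions, current, last_time = [], [], None
    for timestamp, kind, text in events:
        # A gap longer than the threshold closes the current session.
        if last_time is not None and timestamp - last_time > gap:
            if current:
                sessions.append(current)
            current = []
        last_time = timestamp
        if kind == "query":  # click events are discarded
            current.append(text)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical events for one user.
t0 = datetime(2008, 8, 24, 9, 0)
events = [
    (t0, "query", "apple"),
    (t0 + timedelta(minutes=1), "click", "apple.com"),
    (t0 + timedelta(minutes=5), "query", "steve jobs"),
    (t0 + timedelta(minutes=50), "query", "banana bread"),
]
sessions = segment_sessions(events)
```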
Concept Sequence Suffix Tree
Concept sequence suffix tree: a structure used to efficiently find the queries that many users ask in a given context (concept sequence).
Fig: An example
Concept Sequence Suffix Tree
Algorithm to build the concept sequence suffix tree:
1) Map each training session $qs = q_1 q_2 \cdots q_i$ to a concept sequence $cs = c_1 c_2 \cdots c_j$, $j \le i$.
2) Enumerate the subsequences $c_m c_{m+1} \cdots c_l$, $1 \le m \le l \le j$, of $cs$ (distributed, MapReduce).
3) Get all frequent concept subsequences.
4) Organize these into the concept sequence suffix tree.
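Steps 1 to 3 can be sketched on a single machine (the slide describes a distributed MapReduce job, which this does not attempt). Two assumptions are made here: consecutive queries of the same concept are recorded once, which is one way the concept sequence can end up shorter than the session, and a subsequence is "frequent" when its count reaches a hypothetical `min_support` threshold. The query-to-concept mapping is made up.

```python
from collections import Counter

def to_concept_sequence(session, concept_of):
    """Map queries to concept ids, recording consecutive repeats of a concept once."""
    seq = []
    for query in session:
        concept = concept_of[query]
        if not seq or seq[-1] != concept:
            seq.append(concept)
    return tuple(seq)

def frequent_subsequences(sessions, concept_of, min_support):
    """Count every contiguous concept subsequence and keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        cs = to_concept_sequence(session, concept_of)
        for start in range(len(cs)):
            for end in range(start + 1, len(cs) + 1):
                counts[cs[start:end]] += 1
    return {sub: n for sub, n in counts.items() if n >= min_support}

# Hypothetical concepts and sessions.
concept_of = {"apple": "C1", "apple inc": "C1", "steve jobs": "C2", "iphone": "C3"}
sessions = [
    ["apple", "apple inc", "steve jobs"],
    ["apple", "steve jobs"],
    ["apple", "iphone"],
]
frequent = frequent_subsequences(sessions, concept_of, min_support=2)
```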
Concept Sequence Suffix Tree
Algorithm for organizing the frequent concept subsequences $cs$ into the concept sequence suffix tree:
Concept Sequence Suffix Tree
Organize $cs$ into the concept sequence suffix tree:
1) Start from the root node (empty) and scan through all frequent concept subsequences $cs = c_1 c_2 \cdots c_l$.
2) For each $cs$, first find the node $cr$ corresponding to $cs' = c_1 c_2 \cdots c_{l-1}$; if $cr$ does not exist, create it.
3) Update the list of candidate concepts of $cs'$ if $c_l$ is among the top K (a specified threshold, e.g., K = 5) candidates so far.
4) The representative queries of the top K candidate concepts are the candidate suggestions for sequence $cs'$.
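The organizing steps can be sketched with a flat dictionary keyed by context, which stands in for explicit tree nodes: the node for a context $c_1 c_2 \cdots c_{l-1}$ has the node for its suffix $c_2 \cdots c_{l-1}$ as parent, so creating every suffix of a context creates its ancestors. This is an illustrative simplification, not the authors' implementation; the input format is hypothetical, and the top K candidates are selected once at the end rather than updated incrementally as on the slide.

```python
def build_suffix_tree(frequent, k=5):
    """Index frequent concept sequences by context.

    frequent: {non-empty concept tuple: frequency} (hypothetical input format).
    Returns {context tuple: top-k [(candidate concept, frequency), ...]}.
    """
    nodes = {(): []}  # root node: empty context
    for cs, freq in frequent.items():
        context, last = cs[:-1], cs[-1]
        # Create the node for the context and for each of its suffixes
        # (its ancestors in the tree) if they do not exist yet.
        for start in range(len(context)):
            nodes.setdefault(context[start:], [])
        nodes[context].append((last, freq))
    # Keep only the k most frequent candidates per node.
    return {
        context: sorted(cands, key=lambda c: (-c[1], c[0]))[:k]
        for context, cands in nodes.items()
    }

# Hypothetical frequent concept sequences.
frequent = {("C1",): 3, ("C2",): 2, ("C1", "C2"): 2, ("C1", "C3"): 1}
tree = build_suffix_tree(frequent, k=1)
```

Mapping each candidate concept to its representative query (step 4) is left out; it would be a plain lookup from concept id to query string.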
Concept Sequence Suffix Tree
Review an example of the concept sequence suffix tree, with $cs = c_1 c_2 \cdots c_l$ and $cs' = c_1 c_2 \cdots c_{l-1}$:
Concept Sequence Suffix Tree
Online query suggestion algorithm:
Concept Sequence Suffix Tree
For a query sequence $q_1 q_2 \cdots q_l$:
Map it to the concept sequence $c_1 c_2 \cdots c_l$: if $q_i$ is a new query, stop mapping and return the concept sequence corresponding to $q_{i+1} q_{i+2} \cdots q_l$.
Search the tree to find the longest matched subsequence of the form $c_j c_{j+1} \cdots c_l$, $j \ge 1$.
Use the candidate suggestions for $c_j c_{j+1} \cdots c_l$ as the query suggestions for $q_1 q_2 \cdots q_l$.
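The online steps can be sketched as below. The tree is represented as a dictionary from context tuple to a ranked candidate list, which is an assumed format for illustration, as are the `concept_of` and `representative` lookups and all sample data. The mapping runs from the last query backward, so that an unseen query cuts off everything before it.

```python
def suggest(queries, concept_of, tree, representative):
    """Return query suggestions for the sequence q_1 ... q_l."""
    # Map queries to concepts from the last one backward; stop at a new query.
    concepts = []
    for query in reversed(queries):
        if query not in concept_of:
            break
        concepts.append(concept_of[query])
    concepts.reverse()
    # Find the longest suffix c_j ... c_l that has candidates in the tree.
    for j in range(len(concepts)):
        context = tuple(concepts[j:])
        if tree.get(context):
            return [representative[c] for c, _ in tree[context]]
    return []

# Hypothetical lookups and tree.
concept_of = {"apple": "C1", "steve jobs": "C2"}
representative = {"C3": "steve jobs biography", "C4": "iphone"}
tree = {
    (): [],
    ("C2",): [("C3", 4)],
    ("C1", "C2"): [("C4", 2)],
}
with_context = suggest(["apple", "steve jobs"], concept_of, tree, representative)
cut_context = suggest(["some new query", "steve jobs"], concept_of, tree, representative)
```

In this toy data the same last query yields different suggestions depending on what preceded it, which is the context-aware behavior the method aims for.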
Concept Sequence Suffix Tree
Review an example of the concept sequence suffix tree, with $qs = q_1 q_2 \cdots q_i$ and $cs = c_1 c_2 \cdots c_j$, $j \le i$:
Experimental Evaluation
Training data:
Search log of a commercial search engine (Bing) in the US
1.8 billion queries (151 million unique), 2.6 billion URL clicks (115 million unique), 840 million sessions
Baseline algorithms:
Adjacency: given $q_i$, rank $q_j$ based on the frequency of $q_i q_j$.
N-Gram: given $qs = q_1 q_2 \cdots q_i$, rank $q_j$ based on the frequency of $q_1 q_2 \cdots q_i q_j$.
Test data:
Test-0: 1000 randomly selected single-query sessions
Test-1: 1000 randomly selected multi-query sessions
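The two baselines can be sketched with one table of "what follows what". This is an illustrative reading of the slide, not the evaluated implementation: it assumes frequencies are counted over every contiguous query run in the training sessions, and the sessions are made up.

```python
from collections import Counter, defaultdict

def count_following(sessions):
    """For every contiguous run of queries, count the query that follows it."""
    following = defaultdict(Counter)
    for session in sessions:
        for end in range(1, len(session)):
            for start in range(end):
                following[tuple(session[start:end])][session[end]] += 1
    return following

def adjacency_suggest(following, queries, k=5):
    """Rank by how often a query follows the last query q_i only."""
    counts = following.get((queries[-1],), Counter())
    return [q for q, _ in counts.most_common(k)]

def ngram_suggest(following, queries, k=5):
    """Rank by how often a query follows the whole sequence q_1 ... q_i."""
    counts = following.get(tuple(queries), Counter())
    return [q for q, _ in counts.most_common(k)]

# Hypothetical training sessions.
sessions = [
    ["apple", "iphone"],
    ["apple", "iphone"],
    ["apple", "steve jobs"],
    ["steve jobs", "apple", "apple store"],
]
following = count_following(sessions)
adjacency_top = adjacency_suggest(following, ["steve jobs", "apple"], k=1)
ngram_top = ngram_suggest(following, ["steve jobs", "apple"], k=1)
```

The N-gram baseline returns nothing when the exact query sequence never occurred in training, which is one reason coverage is reported alongside quality.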
Experimental Results
Coverage of suggestion:
Fig: The coverage of the three methods on (a) Test-0 and (b) Test-1
Experimental Results
Quality of suggestion: (collect relevance grading from 10 judges)
Fig: The quality of the three methods on (a) Test-0 and (b) Test-1
Summary
Three things to know:
Some basics of query suggestion using search logs
The proposed efficient query clustering algorithm for click-through bipartite data from search logs
The proposed efficient context-aware query suggestion method using the concept sequence suffix tree
Hints: “concept”-level N-gram with varied length N, plus a structure for efficient search
Thank You!