Clustering User Queries of a Search
Engine
Ji-Rong Wen, Jian-Yun Nie & Hong-Jiang Zhang
Smarter searches
Search engines were moving beyond simple keyword matching.
The big idea was to “understand” users’ queries, then suggest similar queries.
The significance of these “similar queries”: other users have asked them, and received correct answers.
Two assumptions
1. If users click on the same documents after issuing different queries, then the queries are similar.
2. If a set of documents is often selected for a set of queries, then the terms in the documents are related to the terms in the queries.
Key point – using keywords alone, similar queries would have been scattered across multiple clusters.
The aims
The editors were seeking to improve the encyclopaedia so that users could locate information more precisely. In particular:
1. If Encarta does not provide sufficient information for a frequently asked question, then improve the entries.
2. If an FAQ is emerging as a “hot topic”, then check the result set, and provide direct links.
This paper is about helping out with issue 2.
Raw material
User logs for searches against the online Encarta encyclopaedia.
Session means query session rather than user session
session := queryText [clickedDocument]*
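The session grammar above can be read straight off a log line. A minimal sketch, assuming a hypothetical tab-separated layout (the actual Encarta log format is not given in the slides):

```python
def parse_session(line):
    """Parse one log line into (query, clicked_docs).

    Assumed (hypothetical) format: the query text, then zero or more
    clicked document ids, all tab-separated -- matching
    session := queryText [clickedDocument]*.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]

# parse_session("atomic bomb\tDoc17\tDoc42")
# -> ("atomic bomb", ["Doc17", "Doc42"])
```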
The Encarta titles were carefully crafted, so the assumption that user clicks were based on relevance is reasonable.
Clustering principles
1. Using query contents. If two queries contain the same or similar terms, they denote the same or similar information needs. More useful for longer queries.
2. Using document clicks.If two queries lead to the selection of the same documents, then they are similar.
Both principles were used.
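Using both principles means combining the two scores; a hedged sketch of a linear combination (the weights alpha and beta are illustrative, the slides do not specify how the combination is tuned):

```python
def combined_similarity(sim_keyword, sim_clicks, alpha=0.5, beta=0.5):
    """Combine the content-based and click-based similarities.

    alpha and beta are illustrative mixing weights (assumption, not
    from the paper); each input score is expected to lie in [0, 1].
    """
    return alpha * sim_keyword + beta * sim_clicks
```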
Clustering algorithm requirements
1. No manual configuration of the clusters
2. Filter out queries with low frequencies
3. Fast
4. Incremental
Selected DBSCAN & incremental DBSCAN, but provided their own similarity function.
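A minimal density-based sketch in the spirit of DBSCAN, driven by a caller-supplied similarity function rather than a distance metric (min_sim and min_pts are illustrative parameters, not values from the paper):

```python
def dbscan(items, similarity, min_sim=0.5, min_pts=2):
    """Cluster items; two items are neighbours when their similarity
    reaches min_sim.  Items reachable from no core point are noise
    and end up in no cluster."""
    def neighbours(i):
        return [j for j in range(len(items))
                if similarity(items[i], items[j]) >= min_sim]

    visited, assigned, clusters = set(), set(), []
    for i in range(len(items)):
        if i in visited:
            continue
        visited.add(i)
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            continue  # not a core point: noise unless claimed later
        cluster = {i}
        assigned.add(i)
        while seeds:  # grow the cluster from the core point i
            j = seeds.pop()
            if j not in visited:
                visited.add(j)
                nb = neighbours(j)
                if len(nb) >= min_pts:  # j is also a core point
                    seeds.extend(nb)
            if j not in assigned:
                cluster.add(j)
                assigned.add(j)
        clusters.append(sorted(cluster))
    return clusters
```

For example, clustering numbers with similarity 1 when they differ by at most 1 groups [1, 2, 3] and [10, 11] and leaves 50 as noise.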
Similarity Based on Query Contents
Simply: Similarity_keyword(p, q) = KN(p, q) / max(kn(p), kn(q)), where KN(p, q) is the number of keywords common to p and q, and kn(·) is the number of keywords in a query.
When weighted: the term weightings are provided by tf-idf.
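A sketch of the content-based measure; queries are modelled as sets of keywords, and the optional weights dict (e.g. tf-idf values) turns the counts into weight sums, a hedged reading of the weighted variant:

```python
def keyword_similarity(p, q, weights=None):
    """Similarity_keyword(p, q) = KN(p, q) / max(kn(p), kn(q)).

    p, q: queries as sets of keywords.  Without weights, KN is the
    number of common keywords and kn the keyword count per query;
    with weights (e.g. tf-idf), counts become sums of term weights
    (an assumption about the weighted formula)."""
    if weights is None:
        weights = {}
    w = lambda terms: sum(weights.get(t, 1.0) for t in terms)
    denom = max(w(p), w(q))
    return w(p & q) / denom if denom else 0.0
```

For example, {"atomic", "bomb"} vs {"atomic", "energy"} share one of two keywords, giving 0.5.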
Similarity Based on Query Contents
Plus refinements:
• If phrases can be identified, they can be treated as a single term in the calculations.
Easy in this case, as Encarta supplied a dictionary of phrases.
• There were plans to include syntactic analysis to identify noun phrases.
Similarity Based on Query Contents
• Similarity based on edit distance: the number of insertions, deletions, and/or replacements needed to transform one query into the other.
Found to be useful for long and complex queries in preliminary tests. Implemented?
• Also mentioned the possibility of using Wordnet synonyms.
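The edit-distance idea can be sketched as Levenshtein distance plus a normalisation into [0, 1]; the normalisation by the longer query's length is an assumption, the slides only name the distance itself:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    replacements needed to turn a into b (standard two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]

def edit_similarity(p, q):
    """Normalise distance into a similarity in [0, 1] (assumed
    normalisation: divide by the longer query's length)."""
    m = max(len(p), len(q))
    return 1.0 - edit_distance(p, q) / m if m else 1.0
```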
Similarity Based on User Feedback
Single documents:
Similarity_doc(p, q) = RD(p, q) / max(rd(p), rd(q)), where RD(p, q) is the number of documents clicked for both queries and rd(·) is the number of documents clicked for a query.
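With clicked documents modelled as sets, the single-document feedback measure is a one-liner; a minimal sketch:

```python
def click_similarity(clicks_p, clicks_q):
    """Similarity_doc(p, q) = RD(p, q) / max(rd(p), rd(q)).

    clicks_p, clicks_q: sets of document ids clicked for each query;
    RD is the number of documents common to both, rd the number of
    documents clicked per query."""
    denom = max(len(clicks_p), len(clicks_q))
    return len(clicks_p & clicks_q) / denom if denom else 0.0
```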
Similarity Based on User Feedback
Encarta documents are hierarchical: a concept taxonomy.
The lower the common branch, the higher the similarity.
S(d_i, d_j) = (L(F(d_i, d_j)) - 1) / L_Total, where F(d_i, d_j) is the lowest common ancestor of the two documents, L(·) its level (root = level 1), and L_Total the total number of levels.
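Representing each document by its path of category nodes from the root, the lowest common ancestor's level is just the length of the shared prefix; a sketch under that assumption:

```python
def hierarchy_similarity(path_i, path_j, total_levels):
    """S(d_i, d_j) = (L(F(d_i, d_j)) - 1) / L_Total.

    path_i, path_j: lists of taxonomy nodes from the root down to each
    document's category (assumed representation); the lowest common
    ancestor F sits at level = length of the common prefix, with the
    root at level 1.  total_levels is L_Total."""
    common = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        common += 1
    return (common - 1) / total_levels
```

For example, two documents under Science/Physics in a 4-level taxonomy share an ancestor at level 2, giving (2 - 1) / 4 = 0.25.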
Outcomes
The authors stated the need for more empirical results data, but were happy with their progress. But – no actual results.
Their approach did, however, detect similarities missed by keyword-only approaches.