Clustering User Queries of a Search
Engine
Ji-Rong Wen, Jian-Yun Nie & Hong-Jiang Zhang
Smarter searches
Search engines were moving beyond simple keyword matching.
The big idea was to “understand” users’ queries, then suggest similar queries.
The significance of these “similar queries”: other users have asked them, and received correct answers.
Two assumptions
1. If users click on the same documents after issuing different queries, then the queries are similar.
2. If a set of documents is often selected for a set of queries, then the terms in the documents are related to the terms in the queries.
Key point – using keywords alone, similar queries would have been scattered across multiple clusters.
The aims
The editors were seeking to improve the encyclopaedia so that users could locate information more precisely. In particular:
1. If Encarta does not provide sufficient information for a frequently asked question, then improve the entries.
2. If an FAQ is emerging as a “hot topic”, then check the result set, and provide direct links.
This paper is about helping out with issue 2.
Raw material
User logs for searches against the online Encarta encyclopaedia.
Session means query session rather than user session
session := queryText [clickedDocument]*
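The session grammar above can be read straight off a log line. A minimal sketch, assuming a hypothetical tab-separated layout (the actual Encarta log format is not given in the slides):

```python
def parse_session(line):
    """Parse one log line into (query, clicked_docs).

    Assumed (hypothetical) format: the query text, then zero or more
    clicked document ids, all tab-separated -- matching
    session := queryText [clickedDocument]*.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]

# parse_session("atomic bomb\tDoc17\tDoc42")
# -> ("atomic bomb", ["Doc17", "Doc42"])
```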
The Encarta titles were carefully crafted, so the assumption that user clicks were based on relevance is reasonable.
Clustering principles
1. Using query contents. If two queries contain the same or similar terms, they denote the same or similar information needs. More useful for longer queries.
2. Using document clicks.If two queries lead to the selection of the same documents, then they are similar.
Both principles were used.
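Using both principles means combining the two scores; a hedged sketch of a linear combination (the weights alpha and beta are illustrative, the slides do not specify how the combination is tuned):

```python
def combined_similarity(sim_keyword, sim_clicks, alpha=0.5, beta=0.5):
    """Combine the content-based and click-based similarities.

    alpha and beta are illustrative mixing weights (assumption, not
    from the paper); each input score is expected to lie in [0, 1].
    """
    return alpha * sim_keyword + beta * sim_clicks
```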
Clustering algorithm requirements
1. No manual configuration of the clusters
2. Filter out queries with low frequencies
3. Fast
4. Incremental
Selected DBSCAN & incremental DBSCAN, but provided their own similarity function.
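A minimal density-based sketch in the spirit of DBSCAN, driven by a caller-supplied similarity function rather than a distance metric (min_sim and min_pts are illustrative parameters, not values from the paper):

```python
def dbscan(items, similarity, min_sim=0.5, min_pts=2):
    """Cluster items; two items are neighbours when their similarity
    reaches min_sim.  Items reachable from no core point are noise
    and end up in no cluster."""
    def neighbours(i):
        return [j for j in range(len(items))
                if similarity(items[i], items[j]) >= min_sim]

    visited, assigned, clusters = set(), set(), []
    for i in range(len(items)):
        if i in visited:
            continue
        visited.add(i)
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            continue  # not a core point: noise unless claimed later
        cluster = {i}
        assigned.add(i)
        while seeds:  # grow the cluster from the core point i
            j = seeds.pop()
            if j not in visited:
                visited.add(j)
                nb = neighbours(j)
                if len(nb) >= min_pts:  # j is also a core point
                    seeds.extend(nb)
            if j not in assigned:
                cluster.add(j)
                assigned.add(j)
        clusters.append(sorted(cluster))
    return clusters
```

For example, clustering numbers with similarity 1 when they differ by at most 1 groups [1, 2, 3] and [10, 11] and leaves 50 as noise.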
Similarity Based on Query Contents
Simply: Similarity_keyword(p, q) = KN(p, q) / max(kn(p), kn(q)), where KN(p, q) is the number of keywords common to p and q, and kn(·) is the number of keywords in a query.
When weighted: the term weightings are provided by tf-idf.
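A sketch of the content-based measure; queries are modelled as sets of keywords, and the optional weights dict (e.g. tf-idf values) turns the counts into weight sums, a hedged reading of the weighted variant:

```python
def keyword_similarity(p, q, weights=None):
    """Similarity_keyword(p, q) = KN(p, q) / max(kn(p), kn(q)).

    p, q: queries as sets of keywords.  Without weights, KN is the
    number of common keywords and kn the keyword count per query;
    with weights (e.g. tf-idf), counts become sums of term weights
    (an assumption about the weighted formula)."""
    if weights is None:
        weights = {}
    w = lambda terms: sum(weights.get(t, 1.0) for t in terms)
    denom = max(w(p), w(q))
    return w(p & q) / denom if denom else 0.0
```

For example, {"atomic", "bomb"} vs {"atomic", "energy"} share one of two keywords, giving 0.5.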
Similarity Based on Query Contents
Plus refinements:
• If phrases can be identified, they can be treated as a single term in the calculations.
Easy in this case, as Encarta supplied a dictionary of phrases.
• There were plans to include syntactic analysis to identify noun phrases.
Similarity Based on Query Contents
• Similarity based on edit distance: the number of insertions, deletions, and/or replacements needed to transform one query into the other.
Found to be useful for long and complex queries in preliminary tests. Implemented?
• Also mentioned the possibility of using Wordnet synonyms.
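The edit-distance idea can be sketched as Levenshtein distance plus a normalisation into [0, 1]; the normalisation by the longer query's length is an assumption, the slides only name the distance itself:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    replacements needed to turn a into b (standard two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]

def edit_similarity(p, q):
    """Normalise distance into a similarity in [0, 1] (assumed
    normalisation: divide by the longer query's length)."""
    m = max(len(p), len(q))
    return 1.0 - edit_distance(p, q) / m if m else 1.0
```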
Similarity Based on User Feedback
Single documents:
Similarity_doc(p, q) = RD(p, q) / max(rd(p), rd(q)), where RD(p, q) is the number of documents clicked for both queries and rd(·) is the number of documents clicked for a query.
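With clicked documents modelled as sets, the single-document feedback measure is a one-liner; a minimal sketch:

```python
def click_similarity(clicks_p, clicks_q):
    """Similarity_doc(p, q) = RD(p, q) / max(rd(p), rd(q)).

    clicks_p, clicks_q: sets of document ids clicked for each query;
    RD is the number of documents common to both, rd the number of
    documents clicked per query."""
    denom = max(len(clicks_p), len(clicks_q))
    return len(clicks_p & clicks_q) / denom if denom else 0.0
```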
Similarity Based on User Feedback
Encarta documents are hierarchical: a concept taxonomy.
The lower the common branch, the higher the similarity.
S(d_i, d_j) = (L(F(d_i, d_j)) - 1) / L_Total, where F(d_i, d_j) is the lowest common ancestor of the two documents, L(·) its level (root = level 1), and L_Total the total number of levels.
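Representing each document by its path of category nodes from the root, the lowest common ancestor's level is just the length of the shared prefix; a sketch under that assumption:

```python
def hierarchy_similarity(path_i, path_j, total_levels):
    """S(d_i, d_j) = (L(F(d_i, d_j)) - 1) / L_Total.

    path_i, path_j: lists of taxonomy nodes from the root down to each
    document's category (assumed representation); the lowest common
    ancestor F sits at level = length of the common prefix, with the
    root at level 1.  total_levels is L_Total."""
    common = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        common += 1
    return (common - 1) / total_levels
```

For example, two documents under Science/Physics in a 4-level taxonomy share an ancestor at level 2, giving (2 - 1) / 4 = 0.25.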
Outcomes
The authors stated the need for more empirical results data, but were happy with their progress. But – no actual results.
Their approach did, however, detect similarities missed by keyword-only approaches.