Clustering User Queries of a Search Engine
Ji-Rong Wen, Jian-Yun Nie & Hong-Jiang Zhang

Smarter searches

Search engines were moving beyond simple keyword matching.

The big idea was to “understand” the user's queries, then suggest similar queries.

The significance of these “similar queries”: other users have asked them, and received correct answers.

Two assumptions

1. If users click on the same documents after issuing different queries, then the queries are similar.

2. If a set of documents is often selected for a set of queries, then the terms in the documents are related to the terms in the queries.

Key point: using keywords alone, such similar queries would have been split across multiple clusters.

The aims

The editors were seeking to improve the encyclopaedia so that users could locate information more precisely. In particular:

1. If Encarta does not provide sufficient information for an FAQ, then improve the entries.

2. If an FAQ is emerging as a “hot topic”, then check the result set, and provide direct links.

This paper is about helping out with issue 2.

Raw material

User logs for searches against the online Encarta encyclopaedia.

“Session” means a query session rather than a user session:

session := queryText [clickedDocument]*

The Encarta titles were carefully crafted, so the assumption is that user clicks were based on relevance.
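To make the session definition concrete, here is a minimal sketch of how one log record might be held in memory. The tab-separated line layout, the `QuerySession` class, and the `parse_session` helper are assumptions made for illustration; the slides do not describe the actual Encarta log format.

```python
from dataclasses import dataclass, field


@dataclass
class QuerySession:
    """One query session: the query text plus zero or more clicked documents."""
    query_text: str
    clicked_documents: list = field(default_factory=list)


def parse_session(line: str, sep: str = "\t") -> QuerySession:
    """Parse a hypothetical log line: query text, then clicked document ids."""
    parts = line.rstrip("\n").split(sep)
    # The first field is the query; every remaining non-empty field is a click.
    clicks = [p for p in parts[1:] if p]
    return QuerySession(query_text=parts[0], clicked_documents=clicks)
```

A query with no clicks simply yields an empty `clicked_documents` list, which matches the `[clickedDocument]*` part of the definition.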

Clustering principles

1. Using query contents. If two queries contain the same or similar terms, they denote the same or similar information needs. This is more useful for longer queries.

2. Using document clicks. If two queries lead to the selection of the same documents, then they are similar.

Both principles were used.

Clustering algorithm requirements

1. No manual configuration of the clusters

2. Filter out queries with low frequencies

3. Fast

4. Incremental

The authors selected DBSCAN and incremental DBSCAN, but provided their own similarity function.
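As a rough illustration of how a density-based clusterer can take a pluggable similarity function, here is a small textbook-style DBSCAN in plain Python. It is a sketch, not the authors' implementation: the conversion `distance = 1 - similarity`, the parameter names `eps` and `min_pts`, and the quadratic neighbour search are all assumptions, and the incremental variant is not shown.

```python
from typing import Callable, Sequence

NOISE = -1  # label for items that do not belong to any dense region


def dbscan(items: Sequence[str],
           similarity: Callable[[str, str], float],
           eps: float,
           min_pts: int) -> list:
    """Return one cluster label per item; NOISE marks unclustered items."""

    def neighbours(i: int) -> list:
        # Assumed conversion: distance = 1 - similarity. Includes item i itself.
        return [j for j in range(len(items))
                if 1.0 - similarity(items[i], items[j]) <= eps]

    labels = [None] * len(items)  # None means "not visited yet"
    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep growing
    return labels
```

The number of clusters is never specified up front, and rarely asked queries tend to end up labelled as noise, which lines up with requirements 1 and 2.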

Similarity Based on Query Contents

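Simply: $\mathrm{Similarity}_{\mathrm{keyword}}(p, q) = \dfrac{KN(p, q)}{\max(kn(p),\, kn(q))}$

where KN(p, q) is the number of keywords common to both queries and kn(.) is the number of keywords in a query.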

When the keywords are weighted, the weightings are provided by tf-idf.
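The unweighted keyword similarity can be sketched in a few lines. The tokenisation here (lower-casing and splitting on whitespace, with no stop-word removal or stemming) and the function name `keyword_similarity` are simplifying assumptions, not details taken from the slides.

```python
def keyword_similarity(p: str, q: str) -> float:
    """Shared keywords divided by the keyword count of the longer query."""
    # Assumed tokenisation: lower-case, whitespace-separated, duplicates ignored.
    keywords_p = set(p.lower().split())
    keywords_q = set(q.lower().split())
    if not keywords_p or not keywords_q:
        return 0.0  # an empty query shares nothing with any other query
    common = len(keywords_p & keywords_q)  # KN(p, q)
    return common / max(len(keywords_p), len(keywords_q))  # max(kn(p), kn(q))
```

With this tokenisation, "history of china" and "china history" share two keywords out of a maximum of three, giving a similarity of 2/3.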

Similarity Based on Query Contents

Plus refinements:

• If phrases can be identified, they can be treated as a single term in the calculations.

This was easy in this case, as Encarta supplied a dictionary of phrases.

• There were plans to include syntactic analysis to identify noun phrases.

Similarity Based on Query Contents

• Similarity based on edit distance: the number of insertions, deletions, and/or replacements needed to unify two queries.

Found to be useful for long and complex queries in preliminary tests. Implemented?

• The authors also mentioned the possibility of using WordNet synonyms.
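The edit-distance idea above can be sketched with the standard dynamic-programming (Levenshtein) algorithm applied to words rather than characters. Working at the word level, and turning the distance into a similarity by dividing by the length of the longer query, are assumptions made for this example; the slides do not give the normalisation.

```python
def edit_distance(a: list, b: list) -> int:
    """Insertions, deletions and replacements needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i words of a yields the empty prefix
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def edit_similarity(p: str, q: str) -> float:
    """Assumed normalisation: 1 - distance / word count of the longer query."""
    words_p, words_q = p.lower().split(), q.lower().split()
    longest = max(len(words_p), len(words_q))
    if longest == 0:
        return 1.0  # two empty queries are treated as identical
    return 1.0 - edit_distance(words_p, words_q) / longest
```

For example, "where does silk come from" and "where does dew come from" differ by one replaced word out of five, so the sketch scores them at 0.8, whereas a bag-of-keywords measure ignores word order entirely.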

Similarity Based on User Feedback

Single documents:

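$\mathrm{Similarity}_{\mathrm{doc}}(p, q) = \dfrac{RD(p, q)}{\max(rd(p),\, rd(q))}$

where RD(p, q) is the number of clicked documents the two queries have in common and rd(.) is the number of clicked documents for a query.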
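The click-based measure has the same shape as the keyword measure, only over sets of clicked documents. A minimal sketch, assuming each query's clicks have already been aggregated into a set of document identifiers (the function name and the string identifiers are illustrative):

```python
def click_similarity(clicks_p: set, clicks_q: set) -> float:
    """Shared clicked documents divided by the larger click count."""
    if not clicks_p or not clicks_q:
        return 0.0  # a query with no clicks gives no feedback evidence
    common = len(clicks_p & clicks_q)  # RD(p, q)
    return common / max(len(clicks_p), len(clicks_q))  # max(rd(p), rd(q))
```

Two queries with no words in common can still score highly here, which is the point of using user feedback alongside query contents.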

Similarity Based on User Feedback

Encarta documents are hierarchical: a concept taxonomy.

The lower the common branch, the higher the similarity.

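$S(d_i, d_j) = \dfrac{L(F(d_i, d_j)) - 1}{L_{\mathrm{Total}}}$

where F(d_i, d_j) is the lowest common parent node of the two documents, L(x) is the level of node x in the hierarchy, and L_Total is the total number of levels.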
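The hierarchy-based document similarity can be sketched by representing each document as its path from the root of the taxonomy. The path representation, the convention that the root sits at level 1, and the category names used in the usage note are assumptions for illustration; the formula follows the slide as written.

```python
from typing import Sequence


def hierarchy_similarity(path_i: Sequence[str],
                         path_j: Sequence[str],
                         total_levels: int) -> float:
    """(Level of the lowest common node - 1) / total number of levels."""
    # Each path lists node names from the root (level 1) down to the document.
    common_level = 0
    for node_i, node_j in zip(path_i, path_j):
        if node_i != node_j:
            break
        common_level += 1  # still on the shared branch
    if common_level == 0:
        return 0.0  # the paths do not even share a root
    return (common_level - 1) / total_levels
```

For a hypothetical four-level taxonomy, `("Encarta", "Science", "Physics", "Atom")` and `("Encarta", "Science", "Chemistry", "Acid")` share a branch down to level 2 and score 0.25, while two documents that share only the root score 0.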

Outcomes

The authors stated the need for more empirical results, but were happy with their progress. However, no actual results were presented.

Their approach was certainly successful in detecting similarities missed by other approaches.