Topical Clustering of Search Results Scaiella et al [Originally published in – “Proceedings of the fifth ACM international conference on Web search and

1Topical Clustering., Scaiella et al

Topical Clustering of

Search ResultsScaiella et al

[Originally published in – “Proceedings of the fifth ACM international conference on Web search and data

mining, ACM, New York, NY, USA (2012)”, pp. 223-232]

4/23/2013

Presented byAnupam Kumar / CSCI 572 / Spring 2013

Topical Clustering., Scaiella et al 2

Topics covered in CSCI 572 course

• Lecture – Rich Text Snippetso Basic text snippets (Relates to tools needed extract snippets and

assign a set of topics to it)

• Lecture – Search Engine Query Processingo Query Expansion (Relates to query topic decomposition)o Google Wonderwheel (performed syntactical clustering to aid topic

discovery)

• Lecture – Intelligent Information Retrievalo AI and machine learningo Judging success – Precision/ Recallo Retrieval Models – (Vector Space/ Probabilistic /Graph Based)

4/23/2013


Clustering 101

4/23/2013

• What is clustering & why should we study it

• What is search result clustering (SRC) and how do we do it

• What are the challenges with the current SRC face & how to address them


Bag of Words vs Graph of Topics

• Bag of Words• It’s syntactical• Can never bridge the gap

between user’s intentions and query result relevance

• Not very effective for clustering

• Easier to implement, but still a computationally intensive (NP Complete problem)

4/23/2013

• Graph of Topics• It’s semantical• Associating the pages with topics

and clustering them based on topics inferred helps bridge the subjective user’s query intentions and objective query results

• Still harder to implement and introduces newer variables into clustering model

Topical Clustering., Scaiella et al 54/23/2013


Semantic Recipe to solve SRC Problem

To solve the problem using Graph of Topics paradigm Follow these steps….• #0 Get the results from your favorite search engine• #1 Annotate those results with topics using a

knowledge base (like wikipedia) and a annotation tool (like TAGME)

• #2 Represent each result’s text-snippet as a graph of topics containing two types of nodes – topic nodes & text snippet nodes

• #3 Plug in their algorithm on this graph, you create to get meaningful & intelligent clusters (More on algorithm coming up…)

4/23/2013


Topical Clustering Algorithm

• Let S be set of snippets S = { s1,s2,s3…sn}• Let T be set of Topics To = { t1, t2, t3, …. tr} identified• For each snippet s and a topic, associate a annotation score , p(s,t). p(s,t)

represents the reliability/importance of topic t w.r.t s• For each pair of topics ta, tb rel(ta,tb) is the degree of their relatedness,

measured by mutual citation and co-citation of ta in tb or vice versa in the knowledge base (here wikipedia.org)

• Let S(t) be the set of snippets that are annotated by topic t. For any set of Topics T = { t1, t2…tk} S(T) = S(t1) U S(t2) U… S(tk) is the set of snippets annotated with at least one topic of T.

• Let the subgraph of the topics in set To be called G’. For a given “m” (where m is maximum the number of topical clusters we want), perform a topical decomposition of G’, consisting of C = { T1, T2… Tm} where T1, T2 … Tm are disjoint subsets of T

• Let h(Ti) be a labelling function that labels each Ti in C that defines its general theme

• Create m clusters of topics, with each cluster contain S(Ti) snippets• These m clusters are your topical clusters.

4/23/2013


How to Run the algorithm

• What are the decomposition criteria to get the desirable properties for needed to create intelligible clusters ?

• How to achieve this ? o Create G’ o Perform topical decomposition on G’ to produce a set of topical clusters o Label each element topical cluster appropriately

4/23/2013


Creating G’

4/23/2013

• For each element in To remove the elements that represent the topics that cover more than 50% of snippets, because they are too generic to form intelligent cluster

• Let U = S = { s1, s2 … sn}• Let B = { {t1, t3, t4 } , { t2, t5}… }• This is an optimization version of set cover

problem, where the goal is to cover all elements of U by having minimum cardinality of B.

• This is NP hard, but Greedy method CAN provide a solution in polynomial time, though it is always complete, it may not always be optimal


Creating set C out of G’

4/23/2013

• This step is where the topic decomposition of results take place for the topic graph G’.

• Perform Eigen decomposition / spectral decomposition of the graph S + G’

• Iteratively bi-section or cut G’ into clusters such that:o A random walk seldom transitions across the bisection (or cut). o Of the clusters filtered above, pick the cluster that covers more

snippets than empirically derived delta-max valueo Of the clusters filtered above, pick the selected cluster for bisection has

second lowest eigen value

• Stop when m is reached or if there are no more clusters to be cut


Clustering and Labeling

4/23/2013

• Assign snippet to topic, based to annotations discovered by TAGME

• The main topic of a cluster Ti is calculated by looking at a topic t belonging to Ti that maximizes the sum of rho-score between t and its covered snippets.

• The title of the wikipedia page is the standard label of the topic

• If the topic title is same as query string then the title is appended with the most frequent anchor text used to refer to the topic page


Experimental Evaluation

• The setup – o 2 Datasets – AMBIENT (from yahoo search engine) and ODP-239 from

DMOZ.orgo AMBIENT contained 44 ambiguous queries with 100 results returned

for each query; ODP-239 contained 239 topics with 100 documents per topic spread across 25580 topics – Each document had a name and very short text snippet

o The queries used on ODP-239 was 100 query from TextRerieval Conference (TREC) 2009-2010

• The dataset and query were same across Topical Clustering, Lingo3G and CLUSTY clustering engines

4/23/2013


Experimental Evaluation (Contd)

• EXPERIMENTAL FINDINGSo COVERAGE ANALYSIS – TAGME

tags 98% of snippets & annotates 5 tags per snippet on average

o CLUSTERING EVALUATION – Topical algorithm outperformed the rest by a large margin

4/23/2013


Experimental Evaluation (Contd)

4/23/2013

• EXPERIMENTAL FINDINGSo Diversification of topics – Topical clustering worked better than all other popular algorithms

o Time EfficiencyThe algorithm runs in polynomial time mostly with the help of optimization and pruning at all the 3 stages of the algorithm.


Pros & Cons• Pros

o Attempts to look at semantic aspect of resulto Attempts to provide a promising data structure and a novel algorithmo Provides a new way to look at the problem and departs from traditional “bag of words”

paradigmo Provided extensive experiment data that showed promising results

• Conso It still uses syntactical search results to gather snippets and talks nothing about what

would happen if we used semantic search results (Would it even work if we applied this to semantic resultsets ?)

o Does not provide any information on what would happen if we used any other knowledge base other than wikipedia

o This perhaps will not cluster well if the topics are closely related. Thus it probably may not function for clustering vertical search engine results as well as it does for horizontal search engines

o The paper did not discuss or show how the system scales to increasingly larger datasets

o The paper did not talk about effective would the clustering be if we are trying results that are changing over time(eg: Result set containing snippets from RSS feed, which may change over time)

4/23/2013


Thank You

4/23/2013

Documents

Topical Clustering of Search Results Scaiella et al [Originally published in – “Proceedings of the fifth ACM international conference on Web search and