
The Cluster Hypothesis in Information Retrieval

ECIR 2014 tutorial

Oren Kurland

Technion ‐‐‐ Israel Institute of Technology

Email: [email protected]

Web: http://iew3.technion.ac.il/~kurland

Slides: http://iew3.technion.ac.il/~kurland/clustHypoECIR.pdf

Tutorial overview

• The cluster hypothesis
  • Historical view of the effect of the hypothesis on work on ad hoc information retrieval

• Testing the cluster hypothesis

• Cluster‐based document retrieval

• Using topic models for ad hoc information retrieval

• Graph‐based methods for ad hoc retrieval that utilize inter‐document similarities

• Additional tasks/applications
  • Search results visualization, query‐performance prediction, fusion, federated search, query expansion, microblog retrieval, relevance feedback, adversarial search

• Concluding notes


The ad hoc retrieval task

• Ranking the documents in a corpus by their relevance to the information need expressed by a query

• Vector space model

• Probabilistic approaches

• Language modeling framework

• Divergence from randomness framework

• Learning to rank


The cluster hypothesis

Closely associated documents tend to be relevant to the same requests

(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)


A quick historical tour

• Mid‐to‐late 60’s
  • Using document clusters to improve search efficiency (Salton ’68)

• 70’s‐80’s
  • Using document clusters to improve search effectiveness (Jardine&van Rijsbergen ’71)

• ~2004‐today
  • Using document clusters to improve search effectiveness (Azzopardi et al. ’04, Kurland&Lee ’04, Liu&Croft ’04)

• 90’s‐00’s
  • Using document clusters to improve results browsing (Preece ’73)

• 90’s‐today
  • Using topic models to improve search effectiveness (Deerwester et al. ’90)

• ~00’s‐today
  • Using graph‐based approaches that utilize inter‐document similarities (Salton&Buckley ’88)


Ph.D. dissertations

• Ivie, E. L. Search procedures based on measures of relatedness between documents. PhD thesis, Massachusetts Institute of Technology, 1966.

• Marcia Davis Kerchner. Dynamic document processing in clustered collections. PhD thesis, Cornell University, 1971.

• Daniel McClure Murray. Document retrieval based on clustered files. PhD thesis, Cornell University, 1972.

• Ellen Voorhees. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University, 1986.

• Anton Leuski. Interactive information organization: Techniques and evaluation. PhD thesis, University of Massachusetts Amherst, 2001.

• Anastasios Tombros. The effectiveness of hierarchic query‐based clustering of documents for information retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2002.

• Leif Azzopardi. Incorporating context within the language modeling approach for ad hoc information retrieval. PhD thesis, University of Paisley, 2005.

• Oren Kurland. Inter‐document similarities, language models, and ad hoc retrieval. PhD thesis, Cornell University, 2006.

• Xiaoyong Liu. Cluster‐based retrieval from a language modeling perspective. PhD thesis, University of Massachusetts Amherst, 2006.

• Xing Wei. Topic models in information retrieval. PhD thesis, University of Massachusetts Amherst, 2007.

• Fernando Diaz. Autocorrelation and regularization of query‐based retrieval scores. PhD thesis, University of Massachusetts Amherst, 2008.

• Mark Smucker. Evaluation of find‐similar with simulation and network analysis. PhD thesis, University of Massachusetts Amherst, 2008.

Improving search efficiency

• Cluster the corpus offline

• Represent each cluster by its centroid

• At query time, compare the centroids with the query and select the clusters to present


The cluster hypothesis

Closely associated documents tend to be relevant to the same requests

(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)


Does the cluster hypothesis hold?

• Depends on the inter‐document similarity used?

• Maybe we should assume that the hypothesis holds, and accordingly devise inter‐document similarity measures?

• More details later


Jardine&van Rijsbergen’s (’71) (overlap) test

• The similarity between two relevant documents vs. the similarity between a relevant and a non‐relevant document

• Measuring the overlap between the similarity distributions

Jardine&van Rijsbergen’s cluster hypothesis test

(Figure is taken from Voorhees ‘85)


Voorhees’ (’85) nearest‐neighbor test

• The percentage of relevant documents among the 5 nearest neighbors of a relevant document

• The cosine similarity between tf.idf vectors is used
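The test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are given as sparse `{term: weight}` dictionaries (for example tf.idf weights), and the per‐document fractions are averaged into a single number, which is a simplification chosen here rather than something the slide specifies. The names `cosine` and `nearest_neighbor_test` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor_test(vectors, relevant, k=5):
    """Average fraction of relevant documents among the k nearest neighbors
    of each relevant document (averaging is an assumption of this sketch)."""
    fractions = []
    for d in relevant:
        # Rank all other documents by decreasing cosine similarity to d.
        others = [x for x in vectors if x != d]
        others.sort(key=lambda x: cosine(vectors[d], vectors[x]), reverse=True)
        top = others[:k]
        if top:
            fractions.append(sum(1 for x in top if x in relevant) / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0
```

A value close to 1 suggests that relevant documents tend to have relevant neighbors, i.e., that the hypothesis holds for the query under this similarity measure.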


Voorhees’ nearest‐neighbor test applied to the result list of the n highest ranked documents (Raiber&Kurland ’12)

• The KL divergence between language models of documents is used for the similarity measure

The density‐based cluster hypothesis test (El‐Hamdouchi and Willett ‘87)

• The test value is the ratio between the number of postings in the index and the size of the vocabulary

• There is also a weighted version

• The test was empirically shown to be more correlated than the overlap and nearest‐neighbor tests with the relative improvement posted by cluster‐based retrieval over document‐based retrieval

• Nearest‐neighbor clusters (Griffiths et al. ’85) were used

• Retrieval performance was measured by recall at some cutoff

Alternative cluster hypothesis test (Smucker&Allan ’09)

• Claim: The nearest neighbor test is insufficient for query‐biased similarities

• The nearest‐neighbor test is a good measure of local clustering

• A graph‐based normalized mean reciprocal distance measure


Query‐sensitive similarity measures (Tombros&van Rijsbergen ’01)

• Claim: the cluster hypothesis should hold for every collection; it is the inter‐document similarity measure that needs to be adjusted so that the hypothesis holds

• Heretofore, all inter‐document similarities were query‐independent

• Idea: bias the inter‐document similarity measure to emphasize relations between the documents and the query

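$sim(d_1,d_2;q) \triangleq \cos(d_1,d_2) \cdot \cos(c,q)$; $c_i \triangleq \frac{1}{2}(d_{1;i} + d_{2;i})$, where the i’th term in the vocabulary is common to $d_1$ and $d_2$, and $d_{j;i}$ is its weight in $d_j$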
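A minimal sketch of a query‐sensitive similarity of this form, in Python. It assumes sparse `{term: weight}` vectors, that the common‐term vector c averages the two documents’ weights over their shared terms, and that the final score is the product of the two cosines. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_sensitive_sim(d1, d2, q):
    """cos(d1, d2) * cos(c, q), where c holds the terms common to d1 and d2.

    Averaging the two weights of a common term is an assumption of this sketch.
    """
    common = {t: (d1[t] + d2[t]) / 2.0 for t in d1 if t in d2}
    return cosine(d1, d2) * cosine(common, q)
```

Two documents that share only terms unrelated to the query get a score of 0, however similar they are to each other.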


Query‐sensitive similarity measures (Tombros&van Rijsbergen ’01)

Nearest neighbor test with 5 nearest neighbors


The cluster hypothesis for entity retrieval (Raviv et al. ‘13)


Using the cluster hypothesis to induce clustering

• The optimum clustering framework (Fuhr et al. 2012)

• Basic principle: documents that are co‐relevant to many queries should be clustered together

• Definitions for the expected recall and precision of a clustering based on co‐relevance

• Well known clustering methods can be viewed as based on principles of the framework

• The framework was shown to provide a more effective internal clustering quality criterion than commonly used alternatives

• In terms of correlation to ground truth


The connection between the cluster hypothesis and cluster‐based retrieval effectiveness

• “The extent to which the cluster hypothesis characterized a collection seemed to have little effect on how well cluster searching performed as compared to a sequential search of the collection.” (Voorhees ’85)

• There is (high) correlation between the extent to which the nearest‐neighbor cluster hypothesis holds and the effectiveness of cluster‐based document retrieval (Na et al. ’08)

• One potential reason for the contradicting findings: different cluster‐based retrieval methods have been used


Using clusters of similar documents for document retrieval

• Visualizing results

• Using clusters to select documents

• Using cluster‐based (and topic‐based) information to enrich document representations


Types of document clusters

• Hard (partitioning (e.g., K‐Means) vs. agglomerative) vs. soft (overlapping nearest‐neighbor clusters, topic models)

• More details later

• Offline (query‐independent)
  • Created from all documents in the corpus
  • Help address recall issues with the initial search
  • Efficiency considerations
    • Large scale and dynamic corpora

• Query specific (Preece ’73, Willett ’85)
  • Created from the documents most highly ranked by an initial search
  • Used either for visualization of results or for automatic re‐ranking of the initial result list

• Drawback: dependence on the effectiveness of the initial search


Cluster‐based search results visualization

• The scatter‐gather system (Cutting et al. ’92, Hearst&Pedersen ’96)

• Browsing strategies for cluster‐based result interfaces (Leuski ’01)

• Interactive retrieval using a cluster‐based interface (Leuski&Allan ’04)

• Interactive exploration of corpora based on inter‐document similarities (Smucker ’08)


Challenges to address

• Fast online creation of clusters
  • Cutting et al. ’92, Zamir&Etzioni ’98

• Automatic labeling of clusters
  • Treeratpituk&Callan ’06 (agglomerative clusters)
  • Mei et al. ’07 (topic models)


Using document clusters for ad hoc retrieval

• The user needn’t be aware of the fact that clustering was performed

• Clusters often serve one of two roles (or both) (Kurland&Lee ’04)
  A: Document selection
  B: Enriching (“smoothing”) document representations

Part A: Cluster‐based document selection

Using offline‐created clusters for document selection (Kurland ’06)

Query = {truck, bus}

d1 = school bus, classes, teachers
d2 = school, classes, teachers, class
d3 = bus, taxi, boat, bike
d4 = taxi, boat, truck, scooter
d5 = boat, horse, taxi, bike, scooter
d6 = home, house, kids, floor

$sim(x_i, x_j) \triangleq |x_i \cap x_j|$

[Figure: the documents are grouped into clusters c1, c2 and c3. The slide contrasts two rankings of d1–d6: “Rank using documents”, based on sim(q, d), and “Rank using clusters and docs”, based on sim(q, C) and sim(d, C).]

Using clusters for document selection

• Given a query $q$ and a list $Cl$ of document clusters
  • $Cl$ can be a set of offline‐created or query‐specific clusters

• Rank the clusters in $Cl$ using a query‐cluster similarity measure, or any other approach:
  • $score(c;q) \triangleq sim(q,c)$
  • A key estimation issue which will be further discussed

• Transform the cluster ranking to document ranking

Transforming cluster ranking to document ranking

• Strategy #1 (originally known as “cluster‐based retrieval”)

• Replace each cluster with its constituent documents (omitting repeats)

• Within‐cluster document ranking is based on the initial document scores which were assigned in response to the query or on the similarity between the document and the cluster centroid

• Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Willett ’85, Liu&Croft ’04, Kurland&Lee ’06, Kurland ’08, Kurland&Domshlak ’08, Liu&Croft ’08
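Strategy #1 can be sketched as follows. The sketch uses term‐overlap similarity, represents a cluster by the union of its documents’ terms, and orders documents within a cluster by their own similarity to the query; all three are assumptions made for illustration, as is the function name `cluster_based_ranking`.

```python
def overlap(a, b):
    """Number of terms shared by two term collections."""
    return len(set(a) & set(b))


def cluster_based_ranking(query, docs, clusters):
    """Rank clusters by sim(q, c), then replace each cluster by its documents.

    docs: {doc_id: set of terms}; clusters: list of lists of doc_ids.
    A cluster is represented by the union of its documents' terms (assumption).
    """
    def cluster_score(cluster):
        terms = set().union(*(docs[d] for d in cluster))
        return overlap(query, terms)

    ranking, seen = [], set()
    for cluster in sorted(clusters, key=cluster_score, reverse=True):
        # Within a cluster, order documents by their own similarity to the query.
        for d in sorted(cluster, key=lambda d: overlap(query, docs[d]), reverse=True):
            if d not in seen:  # omit repeats when clusters overlap
                seen.add(d)
                ranking.append(d)
    return ranking
```

With overlapping (soft) clusters the `seen` set is what implements “omitting repeats”; with a hard partition it never triggers.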


The CQL method (Liu&Croft ’04)

(Example for strategy #1 using query‐specific clusters)

$score(c;q) \triangleq sim(q,c)$; similarity is measured using language models

A cluster is represented by the concatenation of its constituent documents

Transforming cluster ranking to document ranking (cont.)

• Strategy #2
  • Rank all the documents in the top‐retrieved clusters using some criterion
  • Examples to follow

• Strategy #3
  • Traverse the clustering dendrogram until finding the cluster with the best match to the query or using any other stopping criterion (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Griffiths et al. ’86)

• Mixed results with respect to whether cluster‐based retrieval is consistently more effective than document retrieval

• Cluster‐based retrieval was shown to be more effective in terms of precision (Jardine&Rijsbergen ’71, Croft ’80)

• Bottom up search was shown to be more effective than top down (Croft ’80)


Algorithmic Framework (Kurland&Lee’04, ‘09)

(Example for strategy #2)

$Cl$ is the set of nearest‐neighbor clusters created from all documents in the corpus (cf., Griffiths et al. ’86)

Given query $q$ and $N$ (the number of docs to retrieve):

1. For each document $d$,
   - Choose $Facets(q,d) \subseteq Cl$
   - Score $d$ by a weighted combination of $sim(q,d)$ and the $sim(q,c)$’s for all $c \in Facets(q,d)$

2. Set $TopDocs(N)$ to the rank‐ordered list of $N$ top‐scoring documents

3. Optional: re‐rank $d \in TopDocs(N)$ by $sim(q,d)$

4. Return $TopDocs(N)$

The Set‐Select Algorithm (Kurland&Lee ’04, ‘09)

• Instantiation of the framework:

$Facets(q,d) \triangleq \{c : d \in c\} \cap TopClusters_q(m)$

$Score(d) \triangleq sim(q,d) \cdot \delta[\,|Facets(q,d)| > 0\,]$

$TopClusters_q(m)$ is the set of $m$ clusters that are the most similar to the query

• The procedure: rank only documents in top retrieved clusters using $sim(q,d)$

The Bag‐Select Algorithm (Kurland&Lee ’04, ‘09)

• Instantiation of the framework:

$Facets(q,d) \triangleq \{c : d \in c\} \cap TopClusters_q(m)$

$Score(d) \triangleq sim(q,d) \cdot |Facets(q,d)|$

• The procedure: rank only documents in top retrieved clusters using $sim(q,d)$ × the number of top clusters $d$ belongs to
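The two selection algorithms differ only in how the facets enter the score, which a short Python sketch makes explicit. It assumes precomputed similarities: `sim_qd` maps documents to sim(q,d), `sim_qc` maps clusters to sim(q,c), and `clusters` maps each cluster to its member documents. These names and the data layout are hypothetical.

```python
def top_clusters(sim_qc, m):
    """The m clusters most similar to the query."""
    return set(sorted(sim_qc, key=sim_qc.get, reverse=True)[:m])


def facets(d, clusters, top):
    """Top-retrieved clusters that contain document d."""
    return {c for c in top if d in clusters[c]}


def set_select(sim_qd, sim_qc, clusters, m):
    """Rank by sim(q,d) only the documents that appear in some top cluster."""
    top = top_clusters(sim_qc, m)
    scores = {d: s for d, s in sim_qd.items() if facets(d, clusters, top)}
    return sorted(scores, key=scores.get, reverse=True)


def bag_select(sim_qd, sim_qc, clusters, m):
    """Rank by sim(q,d) times the number of top clusters the document belongs to."""
    top = top_clusters(sim_qc, m)
    scores = {}
    for d, s in sim_qd.items():
        n = len(facets(d, clusters, top))
        if n > 0:
            scores[d] = s * n
    return sorted(scores, key=scores.get, reverse=True)
```

Bag‐Select only makes a difference with overlapping clusters; with a hard partition every selected document belongs to exactly one top cluster and the two rankings coincide.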


Set‐Select, Bag‐Select (Kurland&Lee ’04, ‘09)

[Figure: three top‐retrieved clusters c1, c2 and c3 and the documents d1, d2 and d40. Set‐Select scores each of these documents by $sim(q,d)$ alone. Bag‐Select multiplies $sim(q,d)$ by the number of top clusters the document belongs to, which gives factors of 2 and 3 for the documents that belong to two and three of the clusters.]

Empirical Results (Kurland&Lee‘09)

[Figure: bar chart of MAP (axis range 15%–29%) on AP89, AP88+89 and LA+FR for Baseline (doc‐based), Set‐s and Bag‐s.]

‘*’ marks a statistically significant difference with the baseline

The cluster ranking challenge

• $score(c;q) \triangleq sim(q,c)$

• How do we represent cluster c?

• Binary term vector (Jardine&van Rijsbergen ’71, van Rijsbergen ’74)

• A centroid of the vectors representing c’s constituent documents (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, El‐Hamdouchi&Willett ’87, Liu&Croft ’08)

• Cosine, for example, serves for the similarity measure in the vector space

• The big document that results from concatenating c’s constituent documents (Kurland&Lee ’04, Liu&Croft ’04)

• Language‐model‐based similarity estimates

Cluster representations (Liu&Croft ’08)

• The most effective representation for a cluster, among those studied, was the geometric mean of the language models of its constituent documents
  • Seo&Croft ’10 provide arguments based on information geometry for the effectiveness of the geometric‐mean‐based representation

Some more cluster ranking methods

• Using the min/max query‐similarity score of a document in the cluster (Leuski ’01, Shanahan et al. ’03, Liu&Croft ’08)

• Document‐cluster graphs (Kurland&Lee ’06; more details later)

• Variance of document retrieval scores in the cluster (Liu&Croft ’06; more details later)

• Aggregating measures of properties of a cluster (Kurland&Domshlak ’08)

• Using the similarity between the cluster and an expanded query form (Liu&Croft ’04, Wei&Croft ’06; more details later)


The optimal cluster

• The percentage of relevant documents in the cluster c of k documents for which $score(c;q)$ is the highest is the precision@k attained by cluster‐based retrieval with strategy #1

Using nearest neighbor query‐specific clusters of 5 documents (Kurland&Domshlak ’08)


The optimal cluster (Liu&Croft’06)


The optimal cluster (cont.)

• Jardine and van Rijsbergen ’71 were the first to report the existence of an optimal (offline created) cluster

• This was later re‐asserted by, for example, Hearst&Pedersen ’96, Tombros et al. ’02, Kurland ’06 and Liu&Croft ’06

• Offline‐created optimal clusters contain a smaller percentage of relevant documents than optimal query‐specific clusters (Tombros et al. ’02)


A probabilistic graph‐based approach for ranking clusters (Kurland ’08, Kurland&Krikon ’11)

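• What is the probability that this cluster is relevant to this query? (cf., Croft ’80)

• The ClustRanker method: $score(c;q) \triangleq \lambda\, p(c)\, p(q|c) + (1-\lambda) \sum_{d \in c} p(d)\, p(q|d)\, p(c|d)$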

• p(d) and p(c) are estimated based on graphs where documents (clusters) are vertices, edge‐weights represent inter‐item similarities, and the PageRank score of item x serves as an estimate for p(x)
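A generic way to obtain such PageRank‐style estimates is power iteration over a weighted similarity graph. The sketch below is not the ClustRanker implementation; it is a plain PageRank with a uniform jump, and the damping factor, iteration count and graph layout (`weights[u][v]` is the weight of the edge from u to v, with every node present as a key) are assumptions.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank over a weighted directed graph.

    weights: {node: {neighbor: nonnegative edge weight}}; every node that
    appears as a neighbor must also be a key. Returns {node: score}, and
    the scores sum to 1.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    out_sum = {u: sum(weights[u].values()) for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                # A node without outgoing weight spreads its mass uniformly.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                for v, w in weights[u].items():
                    new_rank[v] += damping * rank[u] * w / out_sum[u]
        rank = new_rank
    return rank
```

Here the edge weights would be inter‐item similarities, so an item that many other items are similar to receives a high score.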


ClustRanker (Kurland ’08, Kurland&Krikon ’11)

(Using nearest‐neighbor query‐specific clusters for re‐ranking)

Cluster ranking using Markov Random Fields – the ClustMRF method (Raiber&Kurland ’13)

• C: a random variable for clusters; Q: a random variable for queries

• $p(C|Q) = p(C,Q)/p(Q)$
  • Estimate $p(C,Q)$ using Markov Random Fields

• G: a graph with nodes corresponding to C’s documents and Q

• L(G): cliques in G

• $f_l(l)$ is a feature function defined over the clique $l$

$p(C,Q) \stackrel{rank}{=} \sum_{l \in L(G)} \lambda_l f_l(l)$

The clique $l_{QD}$

• Contains the query Q and a single document d in cluster C

• Consider query‐similarity values of C’s documents independently

$f_{geo\text{-}qsim;l_{QD}}(l) \triangleq \log sim(Q,d)^{\frac{1}{|C|}}$ (cf., Liu&Croft ’08)

$sim(\cdot,\cdot)$ is an inter‐text similarity measure; $|C|$ is the number of documents in C

[Figure: Q is connected to each of d1, d2 and d3 by a separate $l_{QD}$ clique.]

The clique $l_{QC}$

• Contains the query Q and all C’s documents

• Induce information from relations between query‐similarity values of C’s documents

$f_{A\text{-}qsim;l_{QC}}(l) \triangleq \log A(\{sim(Q,d)\}_{d \in C})$

$A \in \{$min, max, standard deviation (stdv)$\}$

[Figure: a single $l_{QC}$ clique over Q, d1, d2 and d3.]

The clique $l_C$

• Contains only C’s documents

• Induce information based on query‐independent properties of C’s documents

$f_{A\text{-}P;l_{C}}(l) \triangleq \log A(\{P(d)\}_{d \in C})$

$A \in \{$min, max, geometric mean (geo)$\}$

P is a query‐independent document (quality) measure

[Figure: a single $l_C$ clique over d1, d2 and d3, excluding Q.]

Query‐independent measures

For document d in C:

• dsim: Similarity with all documents in C

• entropy: Entropy of the term distribution in d

• icompress: Inverse compression ratio (Fetterly et al. ’04)

• sw1: Ratio between stopwords and non‐stopwords (Bendersky et al. ’11)

• sw2: Fraction of stopwords in a stopword list that appear in d (Bendersky et al. ’11)

• pr: PageRank score (Brin&Page ’98)

• spam: Confidence level that d is not spam (Cormack et al. ’11)

Comparison with the initial ranking

■ Init: Document‐based Markov Random Field (Metzler&Croft ’05)

■ ClustMRF

♦ Statistically significant differences with ClustMRF

[Figure: bar charts of MAP on ClueWebB and GOV2 for Init and ClustMRF.]

Comparison with other cluster ranking methods

■ AMean: Arithmetic mean of query‐similarity values (Liu&Croft ’08)

■ GMean: Geometric mean of query‐similarity values (Liu&Croft ’08, Seo&Croft ’10)

■ ClustRanker: Uses measures of document and cluster biases (Kurland ’08)

■ ClustMRF

[Figure: bar charts of MAP on ClueWebB and GOV2 for AMean, GMean, ClustRanker and ClustMRF; ♦ marks statistically significant differences with ClustMRF.]

Comparison with query expansion

■ RM3: Relevance model number 3 (Abdul‐Jaleel et al. ’04)

■ ClustMRF

[Figure: bar charts of MAP on GOV2 and ClueWebB for RM3 and ClustMRF; ♦ marks statistically significant differences with ClustMRF.]

Selective cluster‐based retrieval

• Griffiths et al. (’86) observed that cluster‐based retrieval and document‐based retrieval can be of the same effectiveness

• But, different relevant documents were retrieved in the two cases

• Some previous work on selecting retrieval strategy per query

• Croft&Thompson ’84

• Amati et al. ’04

• Balasubramanian&Allan ‘10


Selective cluster‐based retrieval (Liu&Croft ’06)

• A “good” cluster is one which (1) exhibits high similarity to the query and (2) contains documents with query‐similarity values that do not deviate much from that of the cluster (Kurland et al. ’12 found some contrasting evidence with respect to the deviation)

• For queries with “good” clusters perform cluster‐based retrieval; for the others perform document‐based retrieval

Selective cluster‐based retrieval (Liu&Croft ’06)


Intermediate summary

• We ranked document clusters

• We transformed the cluster ranking to document ranking

• Some observations
  • There are clusters that contain a very high percentage of relevant documents (whether static offline‐created clusters or query‐specific clusters); the optimal cluster
  • Optimal query‐specific clusters contain a higher percentage of relevant documents than optimal static clusters (Tombros et al. ’02)
  • Using small clusters results in more effective retrieval (Griffiths et al. ’86, Tombros et al. ’02, Kurland&Lee ’04)
  • Cluster representation is highly important
    • A geometric‐mean‐based representation seems to be the most effective among those proposed (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
  • The performance of cluster‐based retrieval can be much better than that of document‐based retrieval and that of using query expansion

Part B: Cluster‐based document representations


Using clusters to enrich (smooth) document representations

• Clusters provide (corpus) context for documents

• Enrich the document representation using information induced from similar documents

• Example: use Rocchio’s method to smooth the vector representing the document with those representing similar documents (Singhal&Pereira ’99)
  • The smoothed vector combines d’s vector with a sum $\sum_i$ over the vectors of $d_1, \ldots, d_k$, which are d’s nearest neighbors in the vector space

Similar document‐expansion methods

• Lavrenko (’00) employed a nearest‐neighbor smoothing method for the query model

• Ogilvie (’00) and Kurland&Lee (’04) smoothed a document language model with language models of its nearest neighbors

• Tao et al. (’06) created pseudo counts for terms in a document that are smoothed using the counts of terms in similar documents

• Yi&Allan (’09) and Efron et al. (’12) use the (weighted) language models of nearest‐neighbors of a document to smooth the document language model
  • Efron et al. (’12) use this document model for Twitter search

A quick recap of the language modeling approach (Ponte&Croft ’98)

$q$: query, $d$: document, $C$: corpus of documents, $w$: term

A maximum likelihood estimate: $p^{MLE}(w|d) \triangleq \frac{tf(w \in d)}{\sum_{w'} tf(w' \in d)}$; $tf(w \in d)$ is the number of times $w$ appears in $d$

Jelinek‐Mercer smoothing: $p^{JM}(w|d) \triangleq (1-\lambda)\, p^{MLE}(w|d) + \lambda\, p^{MLE}(w|C)$

Dirichlet smoothing: $p^{Dir}(w|d) \triangleq \frac{tf(w \in d) + \mu\, p^{MLE}(w|C)}{|d| + \mu}$

(1) The query likelihood model (Song&Croft ’99): $score(d;q) \triangleq p(q|d) \triangleq \prod_{q_i \in q} p(q_i|d)$

(2) The KL retrieval method (Lafferty&Zhai ’0…): $score(d;q) \triangleq -\sum_{q_i} p(q_i|q) \log \frac{p(q_i|q)}{p(q_i|d)}$
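The estimates recapped on this slide can be sketched directly in Python. Documents and the corpus are represented as term‐count dictionaries (for example `collections.Counter`); the default values of `lam` and `mu` are arbitrary choices for illustration, and the query likelihood is computed in log space to avoid underflow. Every query term is assumed to occur somewhere in the corpus, otherwise the logarithm is undefined.

```python
import math


def mle(term, counts):
    """Maximum likelihood estimate: tf(term) / total number of term occurrences."""
    total = sum(counts.values())
    return counts.get(term, 0) / total if total else 0.0


def p_jm(term, doc, corpus, lam=0.1):
    """Jelinek-Mercer smoothing: interpolate document and corpus MLEs."""
    return (1.0 - lam) * mle(term, doc) + lam * mle(term, corpus)


def p_dir(term, doc, corpus, mu=1000.0):
    """Dirichlet smoothing: add mu pseudo counts distributed by the corpus MLE."""
    return (doc.get(term, 0) + mu * mle(term, corpus)) / (sum(doc.values()) + mu)


def log_query_likelihood(query, doc, corpus, smooth=p_dir):
    """log p(q|d) = sum over query terms of log p(q_i|d) under a smoothed model."""
    return sum(math.log(smooth(t, doc, corpus)) for t in query)
```

Ranking documents by `log_query_likelihood` is equivalent to ranking by the product form of the query likelihood, since the logarithm is monotone.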


Using pLSA for retrieval (Hofmann ’99)

• pLSA (probabilistic latent semantic analysis) is a “probabilistic successor” of LSA (Deerwester et al. ’90), and an implementation of the aspect model (Hofmann et al. ’97)
  • Additional topic models (LDA; Blei et al. ’03, Pachinko Allocation Model; Li&McCallum ’06)

• The generative story of pLSA:
  • Select a document $d$ with probability $P(d)$
  • Pick a latent class (topic) $z$ with probability $P(z|d)$
  • Generate a word $w$ with probability $P(w|z)$

Using pLSA for retrieval (Hofmann ’99)

• $P(d,w) = P(d)\,P(w|d)$

• $P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$

• Data likelihood: $\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d,w) \log P(d,w)$; $n(d,w)$ is the number of occurrences of term w in doc d

• Maximizing data likelihood using tempered EM
  • Note the potential metric divergence problem (Azzopardi et al. ’03)

• Using the topic model for retrieval
  • Smoothing a document language model (more details later)
  • Folding the query into the lower dimensional space
    • Vector‐based representation, cosine measure

• Retrieval performance is better than that attained by LSA and the cosine method; very small collections are used

Cluster‐based document smoothing (Liu&Croft ’04)

The CBDM model:

$p(w|d) \triangleq \lambda_1\, p^{MLE}(w|d) + \lambda_2\, p^{MLE}(w|c) + \lambda_3\, p^{MLE}(w|C)$; $\lambda_1 + \lambda_2 + \lambda_3 = 1$

Let $c$ be the single hard cluster to which $d$ belongs; $c$ is represented by the concatenation of its constituent documents

Using offline K‐MEANS clustering (K is the number of clusters)
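As a rough illustration of the three‐way interpolation in CBDM, the sketch below mixes maximum likelihood estimates from the document, its cluster and the corpus, each given as a term‐count dictionary. The cluster counts are assumed to come from concatenating the cluster’s documents; the function names and the default weights are hypothetical.

```python
def mle(term, counts):
    """Maximum likelihood estimate of a term from a {term: count} dict."""
    total = sum(counts.values())
    return counts.get(term, 0) / total if total else 0.0


def p_cbdm(term, doc, cluster, corpus, lam1=0.6, lam2=0.3):
    """Cluster-based document model: document, cluster and corpus MLEs mixed
    with weights lam1, lam2 and lam3 = 1 - lam1 - lam2."""
    lam3 = 1.0 - lam1 - lam2
    return (lam1 * mle(term, doc)
            + lam2 * mle(term, cluster)
            + lam3 * mle(term, corpus))
```

A term that is missing from the document but frequent in its cluster (here “bus” for a document about school) still receives non‐negligible probability, which is the point of the cluster‐based smoothing.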


Topic‐based document smoothing (Wei&Croft ’06)

• Apply Latent Dirichlet Allocation (LDA; Blei et al. ’03) to induce topics from the corpus

• Use the resultant topics to smooth document language models

• Using the KL divergence between a document LM and that of the query for ranking

• A generalization of the CBDM model

• Earlier work by Azzopardi et al. (’04) used LDA and pLSA to induce document prior distributions

• Lu et al. ’11 found that the retrieval performance of using LDA and pLSA was comparable

$p(w|d) \triangleq \lambda_1\, p^{MLE}(w|d) + \lambda_2\, p^{LDA}(w|d) + \lambda_3\, p^{MLE}(w|C)$; $\lambda_1 + \lambda_2 + \lambda_3 = 1$

$p^{LDA}(w|d) \triangleq \sum_{i=1}^{k} p(w|z_i,\hat{\phi})\, p(z_i|\hat{\theta},d)$

Topic‐based document smoothing (Wei&Croft ’06)


A study of using topic‐based document language models (Yi&Allan ’09)

• Using more sophisticated topic models (e.g., the Pachinko Allocation Model; Li&McCallum ’06) doesn’t yield better retrieval performance than that attained by LDA

• Using nearest‐neighbor smoothing results in performance that is as good as that of using topic models

• Pseudo‐feedback‐based query expansion is more effective than using topic models (either in an offline fashion or in a query‐specific fashion)

≜∈

T is the set of topics

| ≜ |


Score‐based smoothing (Kurland&Lee ’04 ‘10, Kurland ’09)

$score(d;q) \triangleq p(q|d) = \sum_{c \in Cl} p(q|d,c)\, p(c|d)$

Estimate $p(q|d,c)$ using $\lambda\, p(q|d) + (1-\lambda)\, p(q|c)$

The interpolation algorithm:

$score(d;q) \triangleq \lambda\, p(q|d) + (1-\lambda) \sum_{c \in Cl} p(q|c)\, p(c|d)$

Using a single term $w$ for $q$ we get a cluster/topic‐based document language model

Use $\exp(-KL(p(\cdot|x)\,||\,p(\cdot|y)))$, where $p(\cdot|z)$ is the unigram language model induced from $z$, as an estimate for $p(x|y)$
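A minimal sketch of the interpolation score, assuming the three probability estimates have already been computed and are passed in as dictionaries: `p_q_given_d[d]`, `p_q_given_c[c]` and `p_c_given_d[d][c]`. How these are estimated (for example via the exponentiated negative KL divergence) is left outside the sketch, and the function name is hypothetical.

```python
def interpolation_scores(p_q_given_d, p_q_given_c, p_c_given_d, lam=0.5):
    """score(d;q) = lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d).

    p_c_given_d maps each document to {cluster: p(c|d)}; documents without
    an entry are treated as associated with no cluster.
    """
    scores = {}
    for d, pqd in p_q_given_d.items():
        cluster_part = sum(p_q_given_c[c] * w
                           for c, w in p_c_given_d.get(d, {}).items())
        scores[d] = lam * pqd + (1.0 - lam) * cluster_part
    return scores
```

Setting `lam=1` recovers document‐based ranking, while `lam=0` ranks documents purely by the clusters they are associated with.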


Interpolation algorithm (Kurland&Lee’04, ‘09)

• Ranking the corpus using nearest‐neighbor offline‐created clusters

‘b’ and ‘I’ mark statistically significant differences with the baseline and interpolation, respectively

Interpolation algorithm (Kurland&Lee ’04, ‘09) – comparison between using nearest neighbor (NN) clusters and K‐Means clusters (created offline)

El‐Hamdouchi and Willett (’89) found that when using cluster ranking with offline‐created clusters:

(i) Using nearest neighbor clusters (of two documents) resulted in better performance than that of using various hard clustering methods (the same as the findings of Griffiths et al. (’86))

(ii) Using small agglomerative clusters yielded better performance than using larger clusters; and,

(iii) Complete‐link agglomerative clustering was more effective than single‐link and Ward’s method for cluster‐based retrieval (the same as Voorhees’ (’85) findings)

The interpolation algorithm (Kurland ’09)

• Comparison with pseudo‐feedback‐based query expansion

• Using nearest‐neighbor query‐specific clusters

‘*’ marks a statistically significant difference with the initial ranking (Init. Rank.)

The interpolation method with different query‐specific clustering algorithms (Kurland ’09)

• The findings with respect to (i) nn‐LM and nn‐VS being superior to hard clustering schemes, and (ii) agg‐comp’s relative effectiveness, are reminiscent of those of El‐Hamdouchi and Willett (’89), who used cluster ranking with clusters created offline

Comparison of cluster‐based retrieval methods (Raiber&Kurland ’12)


Cluster types and their effectiveness for smoothing document LMs (or scores)

• Nearest‐neighbor clusters (with a small number of nearest neighbors) >= topic models > hard clustering schemes

• Note: The first to suggest the use of nearest‐neighbor overlapping clusters were Griffiths et al. (’86)
  • They used offline‐created clusters, but there is much recent evidence for the effectiveness of using nearest‐neighbor query‐specific clusters (Kurland&Lee ’06, Liu&Croft ’08, Kurland ’09)

Integrating query‐specific and offline‐created clusters

• Meister et al. (’09) used the interpolation algorithm (Kurland&Lee ’04) with both offline and query‐specific clusters

• Small (but consistent) performance improvements over using only offline‐created or query‐specific clusters

• Lee et al. (’01) use a “query‐specific” view of static (offline‐created) clusters


Cluster‐based fusion


Fusion of retrieved lists

The task: given query $q$ and document lists $L_1, \ldots, L_m$ that were retrieved in response to $q$ from corpus $D$, produce a single list of results

Motivation: integrating various information sources for retrieval (Croft ’00) (e.g., document representations, query representations, retrieval models)

[Figure: three retrieved lists $L_1$, $L_2$ and $L_3$ over the documents d1–d4 are fused into a single list d1, d2, d3.]

A common fusion principle

Documents that are highly ranked in many of the lists are rewarded

(1) overlap of relevant documents in the lists is higher than that of non‐relevant documents (Lee ’97) – the chorus effect

(2) the skimming effect (Saracevic&Kantor ‘88, Vogt&Cottrell ’99)
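Two standard fusion methods that follow this principle, CombMNZ and Borda, appear later in the tutorial as the methods that ClustFuse incorporates. The sketch below shows common formulations of both. It assumes that retrieval scores were already normalized to a comparable range, and that in Borda a document absent from a list receives no points from that list; both are simplifying assumptions, and ties are broken by document id only to make the output deterministic.

```python
def comb_mnz(score_lists):
    """CombMNZ: (number of lists containing d) * (sum of d's scores).

    score_lists: list of {doc: normalized score} dicts, one per retrieved list.
    """
    fused = {}
    for doc in set().union(*score_lists):
        present = [scores[doc] for scores in score_lists if doc in scores]
        fused[doc] = len(present) * sum(present)
    return sorted(fused, key=lambda d: (-fused[d], d))


def borda(ranked_lists):
    """Borda count: a document at 0-based rank r in a list earns n - r points,
    where n is the number of distinct documents over all lists."""
    docs = set().union(*ranked_lists)
    n = len(docs)
    points = dict.fromkeys(docs, 0)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking):
            points[doc] += n - rank
    return sorted(points, key=lambda d: (-points[d], d))
```

In both methods a document retrieved by all lists at moderate ranks can overtake a document that a single list ranks first.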


But …

Different retrieved lists might contain different relevant documents

e.g., Das‐Gupta and Katzer ’83, Griffiths et al. ’86, Soboroff et al. ’01, Beitzel et al. ’03

[Figure: fusing three retrieved lists $L_1$, $L_2$ and $L_3$ that contain different relevant documents ($r_1$, $r_2$, $r_3$, …, $r_n$).]

A cluster‐based fusion approach (Kozorovitsky&Kurland ’11)

Let similar documents across the lists provide relevance‐status support to each other using ClustFuse, a variant of the Interpolation algorithm

• cluster hypothesis

• utilize information induced from clusters of similar documents that are created across the lists

[Figure: fusing the three retrieved lists and their relevant documents ($r_1$, $r_2$, $r_3$, …, $r_n$).]

ClustFuse (Kozorovitsky&Kurland ’11)

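Document d is rewarded based on its

• standard fusion score
  • reflects the extent to which d is highly ranked in many of the lists

• similarity to clusters that contain documents that are highly ranked in many of the lists

$F_{ClustFuse}(d) \triangleq (1-\lambda)\, p(d|q) + \lambda \sum_{c \in Cl(C_L)} p(c|q)\, p(d|c)$, where $Cl(C_L)$ is the set of clusters created from the documents $C_L$ in the lists

• $\lambda = 0$ amounts to the standard fusion method (F) that ClustFuse incorporates

$p(d|q) \triangleq \frac{F(d;q)}{\sum_{d' \in C_L} F(d';q)}$; $p(c|q) \triangleq \frac{\prod_{d \in c} F(d;q)}{\sum_{c' \in Cl(C_L)} \prod_{d \in c'} F(d;q)}$; $p(d|c) \triangleq \frac{\sum_{d' \in c} sim(d,d')}{\sum_{d'' \in C_L} \sum_{d' \in c} sim(d'',d')}$

$F(d;q)$: the score assigned to d by some standard fusion approach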

MAP performance of fusing TREC runs (Kozorovitsky&Kurland ’11)

[Figure: bar chart of MAP (axis range 4–10) on trec3 for run1, CombMNZ, ClustFuseCombMNZ, Borda and ClustFuseBorda.]

‘r’, ‘f’: statistically significant differences with run1 and the standard fusion method, respectively

Fusing 3 randomly selected TREC runs (run1 is the best performing among the three)

Optimal clusters in the fusion setting (Kozorovitsky&Kurland ’11)

OptCluster is the optimal cluster among all clusters created from all the documents in the 3 runs that are fused (run1, run2, run3)

OptCluster(runi) is the optimal cluster among clusters created from runi

‘a’, ‘b’ and ‘c’ mark statistically significant differences with run1, run2, and run3, respectively

Cluster‐based federated search (Khalman&Kurland ’12)

• Retrieving the lists from disjoint corpora (federated/distributed search)

• Crestani and Wu (’06) showed that in the federated search setting there exist clusters that contain a high percentage of relevant documents

                   p@5    p@10
CORI  init         46.0   42.0
CORI  ClustFuse    53.6   49.8
SSL   init         43.2   41.0
SSL   ClustFuse    50.4   49.0

Cluster‐based query expansion

• Treating clusters as pseudo queries (Kurland et al. ’05)

• Using cluster‐based (or topic‐based) smoothed document language models for both constructing an expanded query form and for ranking (Liu&Croft ’04, Tao et al. ’06, Wei&Croft ‘06)

• Constructing a query‐expanded form by rewarding top‐retrieved documents that are members of many query‐specific overlapping clusters (Lee et al. ’08)

• Using top‐retrieved clusters instead of (or in addition to) top‐retrieved documents for constructing an expanded query form (Na et al. ’07, Gelfer Kalmanovich&Kurland ’09)

• Cluster‐based query expansion for federated search (Shokouhi et al. ’09)


Cluster‐based results diversification

• e.g., Maximal Marginal Relevance (Carbonell&Goldstein ’98)

• Let R be the result list of the documents most highly ranked using $sim_1(q,d)$

• Let S be the new list we create from R (i.e., re‐ranking); the order of inserting documents to S is the induced ranking:

• $MMR \triangleq \arg\max_{d \in R \setminus S} \left[ \lambda\, sim_1(q,d) - (1-\lambda) \max_{d' \in S} sim_2(d,d') \right]$

• A cluster‐based approach: estimate $sim_1(q,d)$ using a cluster ranking method (He et al. ’11, Raiber&Kurland ’13)
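The greedy construction of S under Maximal Marginal Relevance can be sketched as below. It assumes `sim1` is a dictionary of query‐document similarities (which a cluster‐based approach would derive from a cluster ranking method) and `sim2` is a function giving the similarity between two documents; the function name, the value of `lam` and the handling of an empty S (no redundancy penalty) are choices made for this illustration.

```python
def mmr_rerank(result_list, sim1, sim2, lam=0.5, k=None):
    """Greedy MMR: repeatedly move to S the document in R \\ S maximizing
    lam * sim1[d] - (1 - lam) * max over d' in S of sim2(d, d')."""
    remaining = list(result_list)
    selected = []
    k = len(remaining) if k is None else k
    while remaining and len(selected) < k:
        def marginal_relevance(d):
            # Redundancy with respect to what was already selected (0 if S is empty).
            redundancy = max((sim2(d, s) for s in selected), default=0.0)
            return lam * sim1[d] - (1.0 - lam) * redundancy
        best = max(remaining, key=marginal_relevance)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1` the original relevance order is kept; lower values push near‐duplicates of already selected documents down the list.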


Cluster‐based results diversification (cont.)

• A second cluster‐based approach: Cluster R and pick documents from the clusters (e.g., in a round robin fashion) which are viewed as potential aspects

• e.g., Leelanupab et al. ’10
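Round‐robin selection from clusters might look like the sketch below. It assumes a hard clustering of the result list (each document in exactly one cluster) and visits clusters in the order of their best‐ranked member; both choices, and the name `round_robin_diversify`, are assumptions made for illustration rather than details of the cited work.

```python
from typing import Dict, Hashable, List, Sequence


def round_robin_diversify(
    ranked_docs: Sequence[str], cluster_of: Dict[str, Hashable]
) -> List[str]:
    """Interleave documents from clusters, one per cluster per round."""
    # Group documents by cluster, keeping the original rank order inside each cluster.
    clusters: Dict[Hashable, List[str]] = {}
    for d in ranked_docs:
        clusters.setdefault(cluster_of[d], []).append(d)
    # Dict insertion order means clusters are visited by their best-ranked member.
    queues = list(clusters.values())
    result: List[str] = []
    position = 0
    while len(result) < len(ranked_docs):
        for queue in queues:
            if position < len(queue):
                result.append(queue[position])
        position += 1
    return result


# Toy example (hypothetical cluster assignment).
docs = ["d1", "d2", "d3", "d4", "d5"]
assignment = {"d1": "A", "d2": "A", "d3": "B", "d4": "A", "d5": "C"}
diversified = round_robin_diversify(docs, assignment)
print(diversified)
```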

87

Utilizing relevance feedback using document clusters

• Shanahan et al. ’03 found that the optimal cluster is a very good basis for relevance feedback

• Re‐emphasizing claims in older literature (e.g., Jardine&van Rijsbergen ’71, Croft ’80) about the motivation to find good clusters

• Active relevance feedback (Shen&Zhai ’05)

• Diversify the feedback set by picking documents from query‐specific clusters of top‐retrieved results

• Baseline: asking for feedback for the top‐k retrieved documents

• Interactive retrieval

• Ivie ’66, Hearst&Pedersen ’95, Leuski ’01

88

Using clusters for query‐performance prediction (QPP)

• The query‐performance prediction task: estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments (Carmel&Yom‐Tov ’10)

• The clustering tendency of the results is an indicator for search effectiveness (Vinay et al. ’06)

• The extent to which the retrieval scores of documents “respect” the cluster hypothesis is an effective query‐performance predictor (Diaz ’07)

89

On the connection between cluster ranking and query‐performance prediction (Kurland et al. ’12)

• Cluster ranking: estimate the probability that a cluster (set of documents) is relevant to a query

• Query performance prediction (QPP): estimate the probability that a result list (ranked list of documents) is relevant to a query

• As it turns out, quite a few QPP and cluster ranking methods are based on the exact same principles

• The geometric mean of retrieval scores in a result list is a high quality performance predictor (Zhou&Croft ’07)

• The geometric mean of retrieval scores in a cluster is an effective criterion for ranking clusters (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
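The shared principle can be illustrated with a few lines of code: the same geometric‐mean function scores either a result list (as a performance predictor) or a cluster (as a ranking criterion). This is a sketch assuming strictly positive retrieval scores; the toy clusters and their scores are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def geometric_mean(scores: Sequence[float]) -> float:
    """Geometric mean of strictly positive retrieval scores."""
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical retrieval scores of the documents in two clusters.
clusters: Dict[str, List[float]] = {
    "c1": [0.4, 0.4],  # uniformly decent scores
    "c2": [0.9, 0.1],  # one strong document, one weak document
}

# Rank clusters by the geometric mean of their documents' scores.
ranked_clusters = sorted(clusters, key=lambda c: geometric_mean(clusters[c]), reverse=True)
print(ranked_clusters)
```

Note that the arithmetic mean would prefer `c2` (0.5 vs. 0.4); the geometric mean penalizes the low‐scoring member and prefers `c1`.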

90


Cluster‐based retrieval – intermediate summary

• Using clusters to select documents

• Using clusters (or topic models) to enrich (smooth) document representations

• Offline vs. query‐specific clustering

• Soft vs. hard clustering

• The optimal cluster

• Cluster‐based fusion and federated search

• Cluster‐based query expansion

• Cluster‐based results diversification

• Using document clusters to utilize relevance feedback

• The connection between query‐performance prediction and cluster ranking

91

Graph‐based methods utilizing inter‐document similarities

92

Graph‐based framework for re‐ranking (Kurland&Lee '05, ‘10)

Inspiration: Web Retrieval

Common approach to web retrieval:

• Re‐rank an initial retrieved list of documents by the degree of centrality (Brin&Page '98, Kleinberg '98)

• Centrality of a document is estimated using explicit hyperlink structure (PageRank, HITS)

Can we use the scoring by centrality approach for ranking non‐hypertext documents?

[Figure: example graph with nodes A, B, C, X, Y]

93

A possible strategy: Structural re‐ranking

• Use inter‐document similarities to infer links between documents in $D_{init}$ (the initially retrieved list)

• On the resultant graph (of documents and induced links) define centrality measures and use them as criteria for ranking

94

How to induce links?

One might suggest:

Vector Space Model (VSM) for information representation and cosine for similarity metric

Erkan&Radev '04: text summarization

• Cosine similarity between sentences

• See the book: Rada Mihalcea and Dragomir Radev. Graph‐based natural language processing and information retrieval. Cambridge University Press, 2011.

but …

95

Inducing links

[Figure: $d_1$ = “Dublin Portland Beijing” (“flat” distribution); $d_2$ = “Portland Portland Portland” (“spiky” distribution); documents labeled “Relevant”]

LMs: $p(d_2 \mid d_1)$ vs. $p(d_1 \mid d_2)$ (asymmetric)

VSM: $\cos(d_1, d_2) = \cos(d_2, d_1)$ (symmetric)

Zhang et al. (’05) used asymmetric cosine‐based edge weights, but these were found by Kurland&Lee (’10) to be somewhat less effective than the language‐model‐based weights

96


Generation graphs

For document $o \in D_{init}$:

$TopGen(o)$: the $k$ documents $g \in D_{init}$ that yield the highest $p(o \mid g)$

The complete graph $G = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$wt(o \to g) \triangleq p(o \mid g)$ if $g \in TopGen(o)$, and $0$ otherwise

The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$wt^{[\lambda]}(o \to g) \triangleq (1 - \lambda) \cdot \frac{1}{|D_{init}|} + \lambda \cdot \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$

$g \in TopGen(o)$: $g$ is a “generator” of $o$ ($o$ is an offspring of $g$)

97

Inducing centrality: Recursive Weighted Influx Algorithm

• Smoothed graph $G^{[\lambda]}$: ergodic Markov chain, power method converges

• The Recursive Weighted Influx algorithm is a weighted analog of PageRank

$Cen_{RWI}(d; G^{[\lambda]}) \triangleq \sum_{o \in D_{init}} wt^{[\lambda]}(o \to d) \cdot Cen_{RWI}(o; G^{[\lambda]})$

$\sum_{d \in D_{init}} Cen_{RWI}(d; G^{[\lambda]}) = 1$
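A minimal sketch of the whole pipeline (top‐generator graph, PageRank‐style smoothing, power iteration) is given below. It assumes a matrix `gen_prob` with `gen_prob[o, g]` standing in for $p(o \mid g)$, a smoothing parameter `lam`, and that a document is not counted as its own generator; these names, the toy numbers and the self‐exclusion are assumptions for illustration only.

```python
import numpy as np


def generation_graph(gen_prob: np.ndarray, k: int) -> np.ndarray:
    """wt[o, g] = gen_prob[o, g] if g is among o's top-k generators, else 0."""
    n = gen_prob.shape[0]
    wt = np.zeros((n, n), dtype=float)
    for o in range(n):
        scores = gen_prob[o].astype(float)
        scores[o] = -np.inf  # assumption: a document is not its own generator
        top = np.argsort(scores)[::-1][:k]
        wt[o, top] = gen_prob[o, top]
    return wt


def smooth(wt: np.ndarray, lam: float) -> np.ndarray:
    """Mix row-normalized edge weights with a uniform jump; rows sum to 1."""
    n = wt.shape[0]
    return (1.0 - lam) / n + lam * wt / wt.sum(axis=1, keepdims=True)


def rwi_centrality(wt_smoothed: np.ndarray, tol: float = 1e-12,
                   max_iter: int = 1000) -> np.ndarray:
    """Power method: Cen(d) = sum_o wt(o -> d) * Cen(o), normalized to sum to 1."""
    n = wt_smoothed.shape[0]
    cen = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = wt_smoothed.T @ cen
        new /= new.sum()
        if np.abs(new - cen).sum() < tol:
            return new
        cen = new
    return cen


# Toy generation probabilities (hypothetical); row o, column g.
gen_prob = np.array([
    [0.0, 0.6, 0.3, 0.1],
    [0.5, 0.0, 0.4, 0.1],
    [0.2, 0.7, 0.0, 0.1],
    [0.3, 0.6, 0.1, 0.0],
])
smoothed = smooth(generation_graph(gen_prob, k=2), lam=0.85)
centrality = rwi_centrality(smoothed)
print(np.round(centrality, 3))
```

In this toy graph document 1 is a top generator of every other document, so it receives the highest centrality, while document 3 is nobody's top generator and receives only the uniform‐jump mass.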

98

The language modeling framework

Score by: $Cen(d; G) \cdot p(q \mid d)$ (cf. $p(d) \, p(q \mid d)$)

$Cen(d; G)$ acts as the doc “prior”; $p(q \mid d)$ is the initial ranking

Lafferty&Zhai '01: “with hypertext, [a document prior] might be the distribution calculated using the ‘PageRank’ scheme”

Algorithm: Recursive Weighted Influx+LM

99

Evaluation

• $D_{init}$: 50 documents retrieved by an optimized language‐model‐based retrieval method

• Evaluation measure: precision@5

• Reference comparison (initial ranking)

• Can we push relevant documents into the top 5 ranks and move non‐relevant documents out of them?

100

LM framework with centrality scores as “priors”

[Bar chart: prec @ 5 on AP, TREC8, WSJ and AP89 for init rank vs. RW-Influx+LM]

* Statistically significant difference with init rank

101

Comparing centrality measures (e.g., Miller et al. '99; doc length as "prior")

[Bar chart: precision@5 on AP, TREC8, WSJ and AP89 with the document “prior” set to uniform, log(length), or RW-Influx]


102


Cosine vs. LM probabilities

[Bar chart: precision@5 on AP, TREC8, WSJ and AP89 for init rank, RW-In+LM(COS), and RW-In+LM(LM)]


103

Relevance score propagation (cf. Daniłowicz and Baliński ’01, Otterbacher et al. ’05)

For document $o \in D_{init}$:

$TopGen(o)$: the $k$ documents $g \in D_{init}$ that yield the highest $p(o \mid g)$

The complete graph $G = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$wt(o \to g) \triangleq p(o \mid g)$ if $g \in TopGen(o)$, and $0$ otherwise

The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$wt^{[\lambda]}(o \to g) \triangleq (1 - \lambda) \cdot \frac{sim(q, g)}{\sum_{g' \in D_{init}} sim(q, g')} + \lambda \cdot \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$

104

Label propagation (Yang et al. ’06)

• Treat the query and documents in the highest ranked (query‐specific) cluster as relevant

• Treat the documents at the tail of the result list as non‐relevant

• Apply Zhu&Ghahramani’s (’02) label propagation algorithm:

$Y$: the vector of documents' labels (relevant/not-relevant/unlabeled)

$d_{ij}$: the distance between docs $i$ and $j$

$\sigma$: regularization factor

$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right)$

$T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_{k} w_{kj}}$

The algorithm:

1. Propagate $Y \leftarrow TY$

2. Row-normalize $Y$

3. Clamp the labeled data and go to step 1 until convergence
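The three steps can be sketched as below. This is an illustrative implementation under assumptions not stated on the slide: Gaussian edge weights $w_{ij} = \exp(-d_{ij}^2/\sigma^2)$ with column‐normalized transition matrix, labels encoded as 1 (relevant), 0 (non‐relevant) and -1 (unlabeled), and a two‐column label matrix; the function name and toy data are hypothetical.

```python
import numpy as np


def label_propagation(dist: np.ndarray, labels: np.ndarray, sigma: float = 1.0,
                      max_iter: int = 1000, tol: float = 1e-9) -> np.ndarray:
    """Return, per document, the propagated probability of being relevant."""
    n = dist.shape[0]
    w = np.exp(-(dist ** 2) / sigma ** 2)
    # T[i, j] = w_ij / sum_k w_kj: probability of moving from j to i.
    T = w / w.sum(axis=0, keepdims=True)
    labeled = labels >= 0
    # Column 0: non-relevant, column 1: relevant; unlabeled rows start uniform.
    Y = np.full((n, 2), 0.5)
    Y[labeled] = np.eye(2)[labels[labeled]]
    clamp = Y[labeled].copy()
    for _ in range(max_iter):
        Y_new = T @ Y                                   # 1. propagate
        Y_new /= Y_new.sum(axis=1, keepdims=True)       # 2. row-normalize
        Y_new[labeled] = clamp                          # 3. clamp labeled data
        converged = np.abs(Y_new - Y).max() < tol
        Y = Y_new
        if converged:
            break
    return Y[:, 1]


# Toy example (hypothetical): four documents placed on a line.
positions = np.array([0.0, 1.0, 4.0, 5.0])
dist = np.abs(positions[:, None] - positions[None, :])
labels = np.array([1, -1, -1, 0])  # doc 0 pseudo-relevant, doc 3 pseudo-non-relevant
scores = label_propagation(dist, labels, sigma=1.0)
print(np.round(scores, 3))
```

The unlabeled document close to the pseudo‐relevant one ends up with a high relevance score, and the one close to the pseudo‐non‐relevant document with a low one; re‐ranking would then sort by `scores`.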

105

Label propagation (Yang et al. ’06)

Experiments with TREC8; Okapi BM25 is the initial ranking method

M: size of the initial list; K: size of the base set from which pseudo‐relevant documents are selected; N: # of pseudo‐non‐relevant documents

106

Document‐cluster graphs (Kurland&Lee ’06)

• The cluster‐document duality (for query‐specific clusters)

• Clusters that are most representative of the information need contain (are associated with) many relevant documents

• Clusters that contain (are associated with) many relevant documents are most representative of the information need
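This mutually reinforcing definition is what the HITS iteration computes on a bipartite cluster‐document graph. The sketch below assumes clusters act as hubs and documents as authorities, with binary cluster‐membership edge weights; the binary weights, the function name `hits_bipartite` and the toy matrix are assumptions made here for brevity.

```python
import numpy as np


def hits_bipartite(weights: np.ndarray, iters: int = 100):
    """HITS on a bipartite graph; weights[c, d] is the weight of edge c -> d.

    Returns (hub scores of clusters, authority scores of documents),
    each L2-normalized.
    """
    n_clusters, n_docs = weights.shape
    hub = np.ones(n_clusters)
    auth = np.ones(n_docs)
    for _ in range(iters):
        # A document is a good authority if good hubs (clusters) point to it.
        auth = weights.T @ hub
        auth /= np.linalg.norm(auth)
        # A cluster is a good hub if it points to good authorities (documents).
        hub = weights @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth


# Toy membership matrix (hypothetical): 3 overlapping clusters over 4 documents.
membership = np.array([
    [1.0, 1.0, 0.0, 0.0],  # c0 = {d0, d1}
    [0.0, 1.0, 1.0, 0.0],  # c1 = {d1, d2}
    [0.0, 1.0, 0.0, 1.0],  # c2 = {d1, d3}
])
hub, auth = hits_bipartite(membership)
print(np.round(auth, 3), np.round(hub, 3))
```

Document d1 belongs to all three clusters and therefore gets the highest authority score; documents would be re‐ranked by `auth`, and clusters could be ranked by `hub`.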

107

Hub/Authority Cluster/Document

[Figure: three graphs over documents $d_i, d_j, d_k, d_l$ and clusters $c_l, c_m, c_n$. Document as authority: edges $c \to d$. Document as hub: edges $d \to c$. Doc‐only graph: edges $d \to d$.]

108


Re‐ranking using document centrality (query‐specific clusters)

[Bar chart: prec @ 5 (axis from 30% to 65%) on AP, TREC8 and WSJ for init. rank; auth[d→d] (authority scores, doc‐only graph); PR[d→d] (PageRank scores, doc‐only graph); auth[c→d] (authority scores, doc‐as‐authority graph); using nearest‐neighbor clusters]

* significant difference between auth[c→d] and auth[d→d]

109

Cluster centrality

The percentage of relevant documents in the highest ranked cluster of 5 documents

39.2% 39.6% 44%

48.7% 48% 51.2%

49.5% 50.8% 53.6%

AP TREC8 WSJ

influx[d→c]

auth[d→c]

query likelihood (sim(q,c))

q q

q q q

'q': significant difference with query likelihood

110

Passage‐based document retrieval

• Motivation for using passage‐based information: long and/or topically heterogeneous documents that are relevant to a query might contain a single (short) passage that contains query‐pertaining information

The InterPsgDoc method (Callan ’94): $Score(d; q) \triangleq \lambda \, sim(q, d) + (1 - \lambda) \max_{g \in d} sim(q, g)$

($g \in d$ is a passage in document $d$)
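As a small worked example of interpolating document‐level and passage‐level evidence, the sketch below assumes the score has the form $\lambda \, sim(q, d) + (1 - \lambda) \max_{g \in d} sim(q, g)$; the function name, the similarity values and the choice `lam=0.5` are hypothetical.

```python
from typing import Sequence


def inter_psg_doc(doc_sim: float, passage_sims: Sequence[float], lam: float) -> float:
    """Interpolate the document's query similarity with that of its best passage."""
    return lam * doc_sim + (1.0 - lam) * max(passage_sims)


# A long, heterogeneous document with a single highly relevant passage ...
long_doc = inter_psg_doc(doc_sim=0.2, passage_sims=[0.1, 0.9, 0.05], lam=0.5)
# ... versus a short document that consists of one moderately matching passage.
short_doc = inter_psg_doc(doc_sim=0.3, passage_sims=[0.3], lam=0.5)
print(long_doc, short_doc)
```

With document‐level similarity alone the short document would be ranked first; the passage term lets the long document overtake it.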

111

Passage‐document graphs (Bendersky&Kurland ’08)

• Re‐ranking an initially retrieved list

• Documents in the list are hubs, passages of documents in the lists are authorities

• ; ≜ , ∈ g)

Centrality(g) is g’s influx (weighted in‐degree) or authority value (induced by the HITS algorithm)

[Figure: bipartite graph with documents $d_i$, $d_j$ as hubs and passages $g_l$, $g_m$ as authorities]

See Krikon et al. (’10) for using simultaneously doc‐only and passage‐only graphs

See Krikon&Kurland (’11) for integrating documents, passages and clusters

112

Score regularization (Diaz ’05, ’07)

• The idea: similar documents should be assigned similar retrieval scores

• But, maintain some “consistency” with the initial retrieval scores

• Otherwise, a flat score distribution would be the best

• Could be viewed as “iterative score smoothing”

• $f$: vector of regularized scores

• $S(f)$: score function associated with inter‐document consistency of scores; penalizes large differences between scores of similar documents

• $G(f)$: consistency with original scores (L2 distance)

• Objective function: $f^* \triangleq \arg\min_{f \in \mathbb{R}^n} S(f) + \mu \, G(f)$

113

$y = (y_1, \ldots, y_n)$ is the vector of original retrieval scores of the documents to be re-ranked (i.e., the $n$ highest ranked documents by the initial search)

$f = (f_1, \ldots, f_n)$ is the vector of regularized (new) scores

$W$: affinity matrix ($W_{ij}$ - affinity between doc $i$ and doc $j$; $W_{ii} = 0$)

$D$: diagonal matrix where $D_{ii} = \sum_{j=1}^{n} W_{ij}$

$S(f) = \sum_{i,j=1}^{n} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2$

$G(f) = \sum_{i=1}^{n} (f_i - y_i)^2$

$f^* \triangleq \arg\min_{f \in \mathbb{R}^n} S(f) + \mu \, G(f)$

There is a closed form solution
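One way to obtain such a closed form is sketched below, under assumptions that should be checked against Diaz's papers: the objective is $S(f) + \mu \, G(f)$ with a trade‐off parameter $\mu > 0$ (the symbol is an assumption), $S(f) = \sum_{i,j} W_{ij} (f_i/\sqrt{D_{ii}} - f_j/\sqrt{D_{jj}})^2$, $G(f) = \sum_i (f_i - y_i)^2$, and $W$ is symmetric. Then $S(f) = 2 f^\top L f$ with $L = I - D^{-1/2} W D^{-1/2}$, and setting the gradient to zero gives $(2L + \mu I) f^* = \mu y$.

```python
import numpy as np


def regularize_scores(W: np.ndarray, y: np.ndarray, mu: float) -> np.ndarray:
    """Solve (2L + mu*I) f = mu*y, with L the normalized graph Laplacian of W."""
    degrees = W.sum(axis=1)
    inv_sqrt_d = 1.0 / np.sqrt(degrees)
    n = len(y)
    laplacian = np.eye(n) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    return np.linalg.solve(2.0 * laplacian + mu * np.eye(n), mu * y)


def objective(W: np.ndarray, y: np.ndarray, f: np.ndarray, mu: float) -> float:
    """S(f) + mu * G(f), computed directly from the definitions."""
    g = f / np.sqrt(W.sum(axis=1))
    smoothness = (W * (g[:, None] - g[None, :]) ** 2).sum()
    fit = ((f - y) ** 2).sum()
    return float(smoothness + mu * fit)


# Toy data (hypothetical): docs 0 and 1 are very similar, doc 2 is loosely related.
W = np.array([
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0],
])
y = np.array([1.0, 0.2, 0.5])
f_star = regularize_scores(W, y, mu=1.0)
print(np.round(f_star, 3), objective(W, y, f_star, mu=1.0))
```

Larger values of `mu` keep the regularized scores closer to the original ones; smaller values let inter‐document affinities dominate.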

114


Score regularization – empirical results (Diaz ’07)

115

Additional approaches

• Spreading activation networks (Salton&Buckley ’88)

• Markov chains (Baliński&Daniłowicz ’05)

• Hyperlinks and inter‐document similarities

• Biasing PageRank using query‐similarity values (Richardson&Domingos ’04)

• Both the random jump and the jump via hyperlinks

• The bias can be based on inter‐document similarities

• Are semantically related (hyper) links more effective for retrieval? (Koolen&Kamps ’11)

• Graph‐based fusion (Kozorovitzky&Kurland ’09, ’11)

• Using random walks with absorbing states to diversify search results (Zhu et al. ’07)

• Term‐based graphs (out of the scope of this tutorial)

116

Adversarial search

• Mishne et al. ’05

• Finding spam comments in blogs by comparing the language models of the (i) comment, (ii) pages to which the comment has outgoing links, and (iii) the blog post

• Benczúr et al. ’06

• Finding nepotistic links by comparing a language model induced from the anchor text and that induced from the target page

• A similar approach was used by Martinez‐Romo&Araujo ’09

• Raiber et al. ’12

• Using inter‐document similarities to address keyword stuffing

117

Concluding notes

• The cluster hypothesis gave rise to much work in the IR field. We surveyed:

• Cluster hypothesis tests

• Cluster‐based document retrieval

• Using clusters to select documents

• Using clusters (or topic models) to enrich (smooth) document representations

• Using graph-based methods that utilize inter‐document similarities for ad hoc retrieval

• Applications/tasks for which cluster‐based or graph‐based methods have been used

• Query‐performance prediction, fusion, federated search, microblog retrieval, results diversification, query expansion, using relevance feedback, adversarial search

118

Some open challenges

• The optimal cluster

• Selective application of cluster‐based and document‐based retrieval

• Devising query‐sensitive inter‐document measures that will result in the cluster hypothesis holding to a larger extent

• A theoretical basis for using document clusters for retrieval? (cf. Fang&Zhai’s axiomatic framework for document retrieval)

119