
Viewing Term Proximity from a Different Perspective

Ruihua Song, Ji-Rong Wen, Wei-Ying Ma

Microsoft Research Asia, 49 Zhichun Road, Beijing, 100080, P.R. China

{t-rsong, jrwen, wyma}@microsoft.com

ABSTRACT

Various approaches have been explored to utilize word proximity information to improve the effectiveness of text retrieval systems. In previous work, loose phrases, composed of several query terms that may be separated by other words, are directly considered as additional independent query terms, but in fact some of them overlap. This paper revisits the term proximity scoring problem and proposes a new method to incorporate term proximity into ranking functions. The new method differs from prior methods in three respects: (1) close query terms are matched up to compose non-overlapping expanded spans, which represent contexts of query terms; (2) the contribution of a query term to relevance is determined by both its contexts and its frequency; (3) it is relatively easy to plug this method into existing ranking functions, with the inverse document frequency part preserved. Experimental results on the TREC-9, TREC-10 and TREC-11 collections showed that the proposed approach consistently improved retrieval precision.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms

Algorithms, Experimentation

Keywords

Term proximity, relevance ranking

1. INTRODUCTION

As more Web users rely on search engines to navigate or surf the web, highly relevant documents are expected to be returned at the top of the result list [17]. Besides important measurements such as term frequency and inverse document frequency, which are captured effectively by traditional relevance ranking functions, term proximity is claimed by some researchers to be especially useful for improving top precision [2,3,6,7,13]. For multi-word queries, Google [1] matches close term occurrences together and classifies the distance between term occurrences into 10 different values that range from a phrase match to “not even close”. The distance values are then used in relevance measurement.

Taking topic 554 from TREC as an example, “Home buying” is used as a query to find information on how to find and apply for mortgages and buy a first home. Both pages in Figure 1 are ranked among the top 10 results, but (a) is not relevant at all. Compared with the relevant page (b), (a) contains widely separated occurrences of the terms home and buying and bears little connection between them, whereas these two terms are close to each other in (b). This shows that the ranking could have been improved had proximity been taken into account.

Proximity operators, such as “NEAR”, are provided by many query languages to create more precise queries. In this paper, we focus only on the use of term proximity to enhance relevance ranking for unstructured queries.

(a) An Irrelevant Page

(b) A Relevant Page

Figure 1. Two Search Results for the Topic of “Home buying”

Both Clarke et al. [2] and Hawking et al. [6] experimented with relevance measures based on term proximity in TREC-4. Their approaches showed positive performance, but queries were refined before searching and the issue of partial matches needs further exploration. Brin and Page described the use of proximity in Google [1], but gave no details or performance figures. The most recent work was done by Rasolofo and Savoy [13]. They added to a basic relevance score a part that is related to the proximity of sets of query terms. Such a combination is similar to phrase finding and indexing. The underlying difficulty of this kind of approach is to estimate the contribution of a loose phrase appropriately when the contribution of each single query term has already been counted in the basic relevance score.

This paper extends previous work from another point of view. Loose phrases are not considered in ranking functions directly; instead, their contribution to relevance is distributed to the elementary query terms that the phrases contain. First, an algorithm is proposed that matches up close query terms under some constraints to compose expanded spans, which represent contexts of query terms. Second, the likelihood that a term t indicates relevance is determined by both its context and its frequency. Finally, in a traditional ranking function, term frequency is easily replaced by a term-proximity-related part named relevance contribution, while the other parts are preserved. Experimental results on 150 queries of TREC-9, TREC-10 and TREC-11 showed that our approach consistently improved average and top precision over Okapi’s ranking function [15].

In the next section, previous work on term proximity is reviewed. Section 3 describes the new approach, which builds term proximity into ranking functions. Experimental results are shown in Section 4. Section 5 concludes and discusses future work.

2. BACKGROUND

A basic principle of information retrieval is that the relevance of a document to a query is determined by the distributions of query terms in the document. Finding a good ranking function therefore amounts to finding good measurements of query term distributions and a good way to combine them.

2.1 Measurements of General Ranking Functions

tf-idf is the most classic method to model term distribution. tf stands for term frequency and idf stands for inverse document frequency. The intuitions are: a document that contains more occurrences of query terms (i.e. with higher tf) is more likely to be relevant; and a query term that occurs in few documents (i.e. with higher idf) is more important.

BM25 is a state-of-the-art ranking function for text retrieval [16]; it combines term frequency, inverse document frequency and document length to measure relevance. The BM25 formula is shown below:

$$score(d, q) = \sum_{t \in q} w^{(1)} \, \frac{(k_1 + 1)\, tf}{K + tf} \qquad (1)$$

where K is defined as

$$K = k_1 \left( (1 - b) + b \, \frac{l}{avdl} \right) \qquad (2)$$

where

l is the document length,

avdl is the average document length in the corpus,

b, k1 are constants,

w(1) is the Robertson/Sparck Jones weight [14], one simple variant of which is:

$$w^{(1)} = \log \frac{N - n + 0.5}{n + 0.5} \qquad (3)$$

where

N is the total number of documents within all collections,

n is the number of documents containing the term t within all collections.

Clearly, w(1) is a variant of idf. Thus, BM25 is a combination of three measurements: tf, idf and document length.
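As an illustration of how formulas (1) to (3) fit together, the following minimal Python sketch scores one document against a query. The function and parameter names (`rsj_weight`, `bm25_score`, `num_docs`, `df`) are hypothetical and chosen only for this example; the b and k1 values in the usage lines are those listed for TREC-9 in Table 1.

```python
import math
from typing import Dict


def rsj_weight(num_docs: int, doc_freq: int) -> float:
    """Simple Robertson/Sparck Jones weight, formula (3)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(tf: Dict[str, int], doc_len: float, avdl: float,
               num_docs: int, df: Dict[str, int],
               k1: float, b: float) -> float:
    """BM25 score of one document, formulas (1) and (2).

    tf maps each query term to its frequency in the document;
    df maps each query term to the number of documents containing it.
    """
    K = k1 * ((1 - b) + b * doc_len / avdl)  # formula (2)
    score = 0.0
    for term, freq in tf.items():
        score += rsj_weight(num_docs, df[term]) * (k1 + 1) * freq / (K + freq)
    return score


# Toy numbers, not taken from any TREC collection.
print(bm25_score({"home": 2, "buying": 1}, doc_len=120, avdl=100,
                 num_docs=1000, df={"home": 200, "buying": 50},
                 k1=0.4, b=0.3))
```

Note that nothing in this function looks at term positions, which is the limitation the next subsection addresses.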

2.2 Incorporating Term Proximity into Ranking Function

In the above ranking functions, for a query containing multiple terms, the terms are treated as mutually independent, and thus the distance between term occurrences in documents is not considered at all when measuring term distribution. However, term proximity, a measurement of the distance between term occurrences, could be an important factor affecting relevance. As early as 1958, Luhn stated [8]:

It is here proposed that the frequency of word occurrences in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.

In the 1995 TREC conference, both University of Waterloo and Australian National University adopted relevance measures based on term proximity. Their methods, known as shortest-substring ranking and Z-Mode respectively, are very similar. Both of these two approaches are based on the following two assumptions:

Assumption A: The closer appropriately chosen groups of query term occurrences in a document (spans), the more likely that the corresponding text is relevant.

Assumption B: The more spans contained in a document, the more likely that the document is relevant.

Here we briefly describe the processing flow of their methods. First, sets of similar or equivalent terms (synonyms, alternative spellings, plurals, etc.) are manually grouped from the retrieval topic to represent concepts. Besides personal knowledge, external resources such as an on-line dictionary (Webster’s), the Unix spell program and an on-line list of country, state and city names are used to construct a concept. For example, for the topic “What is the economic impact of recycling tires?”, three concepts, “economic impact”, “recycling” and “tires”, are identified. Also, “profits” could be added to the concept of “economic impact”.


Second, spans (or solution extents; there is a slight difference between the two [2, 7]) that include at least one representative of each concept are detected. For example, in the text fragment:

… reported huge profits to be made from recycling discarded automobile tires…

A span from “profits” to “tires” is found to contain representatives of all concepts.

Finally, the relevance of a document to a topic is the sum of the scores of all its spans. The two ranking functions differ slightly in how they score a span: in the shortest-substring ranking function, the score is proportional to the reciprocal of the length of the span, while in Z-mode it is proportional to the inverse square root of the length.
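The difference between the two span scores can be sketched as follows. This is only an illustration of the proportionality described above: the constant of proportionality is assumed to be 1, the function names are hypothetical, and any further details of the original systems [2, 6] are not modelled.

```python
import math
from typing import Callable, Iterable


def shortest_substring_span_score(length: int) -> float:
    """Span score proportional to the reciprocal of the span length."""
    return 1.0 / length


def z_mode_span_score(length: int) -> float:
    """Span score proportional to the inverse square root of the span length."""
    return 1.0 / math.sqrt(length)


def document_score(span_lengths: Iterable[int],
                   span_score: Callable[[int], float]) -> float:
    """Relevance of a document: the sum of the scores of all its spans."""
    return sum(span_score(length) for length in span_lengths)


# The span from "profits" to "tires" in the fragment above covers 9 words.
print(shortest_substring_span_score(9), z_mode_span_score(9))
```

The square root makes the Z-mode score fall off more slowly as a span grows longer.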

But in real applications, especially in Web search, short and unstructured queries are prevalent and there is no notion of concept. So each query term is treated as a concept, and a span is no longer required to include all query terms. Accordingly, besides Assumptions A and B, another one is added:

Assumption C: The more unique query terms a span contains and the more important these terms are, the more likely that a document is relevant.

This assumption is consistent with the experience of web users, who often expect a document containing most or all of the query terms to be ranked before a document containing fewer terms [16, 18].

The shortest-substring and Z-mode approaches were reported to achieve very good precision and recall in the TREC-4 experiments. Their main shortcoming is that several important measurements in traditional ranking functions, such as tf, idf and document length, are abandoned.

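Recently, Rasolofo and Savoy demonstrated that combining a simple kind of term proximity based on word pairs with a traditional ranking function could improve retrieval effectiveness [13]. For a query $q = \{t_1, \ldots, t_n\}$, the following set S of term pairs is obtained: $S = \{(t_i, t_j) \mid t_i, t_j \in q,\ i < j\}$. Then Okapi’s score function BM25 (formula 1) is extended by adding an analogous function for term pairs (only term pairs within a maximal distance of five are considered):

$$score(d, q) = \sum_{t \in q} w^{(1)}_{t} \, \frac{(k_1 + 1)\, tf}{K + tf} + \sum_{(t_i, t_j) \in S} \min\left(w^{(1)}_{t_i}, w^{(1)}_{t_j}\right) \frac{(k_1 + 1) \sum_{occ(t_i, t_j)} tpi(t_i, t_j)}{K + \sum_{occ(t_i, t_j)} tpi(t_i, t_j)} \qquad (4)$$

where $w^{(1)}_{t}$ is the weight w(1) of term t and the inner sums run over all occurrences of the pair $(t_i, t_j)$ in the document. They compute a term pair instance (tpi) weight as follows:

$$tpi(t_i, t_j) = \frac{1.0}{d(t_i, t_j)^2} \qquad (5)$$

where $d(t_i, t_j)$ is the distance between the two terms, expressed in number of words. Based on formula (5), the highest value is 1.0, corresponding to a distance of one (the terms are adjacent), and the lowest value is 0.04, corresponding to a distance of 5.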
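The stated extremes (1.0 at a distance of one, 0.04 at a distance of five) correspond to an inverse-square decay. A small sketch, assuming the weight is 1/distance² and that pairs beyond the maximal distance of five contribute nothing; the function name `tpi` and the `max_distance` parameter are illustrative:

```python
def tpi(distance: int, max_distance: int = 5) -> float:
    """Term pair instance weight for two query terms `distance` words apart.

    Assumes an inverse-square decay; pairs farther apart than
    max_distance are ignored.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1 (adjacent terms)")
    if distance > max_distance:
        return 0.0
    return 1.0 / distance ** 2


for dist in range(1, 7):
    print(dist, round(tpi(dist), 4))
```

The weight already drops to a quarter at a distance of two, so near-adjacent pairs dominate the accumulated pair score.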

The term pair here can be viewed as a kind of loose phrase, and the approach is similar to phrase finding and indexing. The final relevance score is a linear combination of the relevance scores of single terms and those of loose phrases. But two significant difficulties underlie this kind of approach:

1. Phrases are not independent of unigrams (single terms), and general ranking functions, such as Okapi’s BM25, have already considered unigrams. It is therefore difficult to estimate the importance of phrases [9] and their extra contribution to the relevance score. For example, let $n_{t_i}$ denote the number of documents containing term $t_i$ within all collections, and $n_{t_i,t_j}$ the number of documents containing both terms $t_i$ and $t_j$. If we do not consider the overlap between loose phrases and single query terms, $n_{t_i,t_j}$ can be no greater than either $n_{t_i}$ or $n_{t_j}$, because every document that contains both terms also contains each of them. In [9], Papineni argued that estimating the importance of a bi-gram is much more complex when the overlap is considered.

2. A linear combination of the scores of unigrams and those of loose phrases may break the non-linear property of term frequency. In most modern ranking functions, non-linear term frequency is desirable because of the statistical dependence of term occurrences: the information gained on observing a term the first time is greater than the information gained on subsequently seeing the same term. For example, the term frequency component $\frac{(k_1 + 1)\, tf}{K + tf}$ of formula (1) for a term observed 4 times is only slightly larger than that for a term observed 3 times. Suppose instead that every term pair within the maximal distance counted equally, i.e. tpi = 1: a loose phrase with a distance of 5 would increase the accumulated pair frequency by 1, while another pair with a distance of 6 would contribute nothing. However, the difference between these two cases is not as large as the scores show. Formula (5) seems more reasonable because tpi decays rapidly as the distance increases, which partially preserves the non-linear property of term frequency.
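The saturation of the term frequency component can be checked numerically. The sketch below uses k1 = K = 1.2 purely as illustrative values (they are not the values of the paper's own example) and prints the gain obtained from each additional occurrence of a term.

```python
def tf_component(tf: float, k1: float, K: float) -> float:
    """Saturating term-frequency part of BM25: (k1 + 1) * tf / (K + tf)."""
    return (k1 + 1) * tf / (K + tf)


K1 = 1.2  # illustrative value only, not taken from the paper
K = 1.2   # illustrative value only, not taken from the paper

# Gain in the component when the frequency rises from tf to tf + 1.
gains = [tf_component(tf + 1, K1, K) - tf_component(tf, K1, K)
         for tf in range(0, 4)]
for tf, gain in enumerate(gains):
    print(f"tf {tf} -> {tf + 1}: gain {gain:.3f}")
```

Each additional occurrence adds less than the previous one, which is the non-linear behaviour that a plain linear combination with phrase scores tends to disturb.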

In this study, we employ another strategy to integrate term proximity into ranking functions while avoiding the above difficulties simultaneously.

3. RANKING FUNCTIONS WITH TERM PROXIMITY BUILT IN

Here we propose a way to go beyond the simple linear combination of relevance scores of single terms and loose phrases. Although we agree that the contribution of a loose phrase is greater than that of an isolated query term, we do not treat phrases as separate objects. When a term occurs within a loose phrase, its contribution to relevance is boosted by its context, such as the density of the loose phrase and the number of unique terms that occur together with it. In other words, the contribution of a loose phrase is distributed to each term that composes the phrase. As a consequence, term frequency in existing ranking functions is replaced by the accumulated relevance contribution of the individual term occurrences. Term proximity is thus easily plugged in, and the approach also takes advantage of the well-defined inverse document frequency and document length normalization parts and preserves the non-linear property of term frequency.

In this section, we will mainly address the following two problems:

1. How to detect non-overlapping loose phrases, called expanded spans.

2. How to measure the relevance contribution of a single term, considering its context.

3.1 Expanded Span

Each document d is treated as an ordered sequence of terms. Given a query q that is represented by a set of query terms, all occurrences of query terms in d compose a chain of ordered hits

$$h_1, h_2, \ldots, h_m \quad \text{with} \quad pos(h_1) < pos(h_2) < \cdots < pos(h_m),$$

where $pos(h)$ denotes the position of hit h in d. The chain of hits is segmented into a set of non-overlapping spans by the algorithm presented in Appendix A. The chain is scanned hit by hit from head to tail. For the hit currently being scanned, if a next hit exists, one of four cases applies:

(1) The distance between the current hit and the next hit is bigger than a threshold MAX_DIS: the chain is separated between these two hits;

(2) The current hit and the next hit are identical (the same query term): the chain is separated between these two hits;

(3) The next hit is identical to an earlier hit within the current sub-chain: the distance between the current hit and the next hit is compared with the distance between the identical hit and its successor, and the chain is separated at the bigger gap;

(4) Otherwise, scanning continues with the next hit.

When the last hit has been scanned, the chain has been segmented into several spans, called expanded spans, of the form $(h_j \ldots h_k)$. The width of an expanded span $(h_j \ldots h_k)$ is defined as:

$$Width(h_j \ldots h_k) = \begin{cases} pos(h_k) - pos(h_j) + 1 & \text{if } j < k \\ MAX\_DIS & \text{otherwise} \end{cases} \qquad (6)$$

An expanded span has the following properties:

(1) No two expanded spans overlap.

(2) Each expanded span contains as many unique hits as possible, and its width is minimized.

For each hit , in case 1 and 2, no further comparison with hits in current sub-chain is needed, while in case 3, there are

comparisons at most when no identical hit is found. Therefore, For a chain composed of m hits, the worst time complexity of expanded span detection algorithm is .

To illustrate expanded spans, we will use the short document [12] shown in Figure 2, which was quoted by Clarke et al. in [3]. Superscripts indicate term positions.

Figure 2. A Short Document with Position Labels

Suppose that the query is “sea thousand years” and MAX_DIS is set to 10. In this example, the chain of ordered hits is:

sea5, thousand7, years8, thousand10, years11, sea29

According to the segmentation algorithm, scanning starts from sea5. For sea5 and thousand7, the fourth case applies. For years8, the next hit, i.e. thousand10, is identical to thousand7, so the third case applies. As thousand7 is nearer to years8 than thousand10 is, the chain is separated before thousand10, and thus (sea5 … years8) forms the first expanded span. For thousand10, the fourth case applies. When years11 is scanned, the distance between years11 and sea29 exceeds MAX_DIS. Applying the first case, (thousand10 years11) forms an expanded span. Finally, (sea29) becomes an expanded span that contains a single query term. The set of expanded spans for the document is:

{(sea5 … years8), (thousand10 years11), (sea29)}

The corresponding widths of the expanded spans are:

{4, 2, 10}

Note that the width of (sea29) is not 1 but MAX_DIS.
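The segmentation traced above can be reproduced with a short Python sketch. It follows the four cases of this section using list indices in place of linked hit nodes; the names `detect_expanded_spans` and `span_width` are hypothetical, and the width rule (last position minus first position plus one, or MAX_DIS for a single hit) is inferred from the widths listed in the example.

```python
from typing import List, Tuple

Hit = Tuple[str, int]  # (query term, position in the document)


def detect_expanded_spans(hits: List[Hit], max_dis: int) -> List[List[Hit]]:
    """Segment a position-ordered chain of hits into expanded spans."""
    spans: List[List[Hit]] = []
    start = 0  # index of the first hit of the span being built
    for cur in range(len(hits) - 1):
        nxt = cur + 1
        gap = hits[nxt][1] - hits[cur][1]
        if gap > max_dis or hits[cur][0] == hits[nxt][0]:
            # Cases 1 and 2: cut between the current and the next hit.
            spans.append(hits[start:nxt])
            start = nxt
            continue
        # Case 3: does the next term already occur earlier in this span?
        rep = next((i for i in range(start, cur)
                    if hits[i][0] == hits[nxt][0]), None)
        if rep is None:
            continue  # Case 4: keep scanning.
        rep_gap = hits[rep + 1][1] - hits[rep][1]
        if rep_gap > gap:
            spans.append(hits[start:rep + 1])  # cut after the repeated hit
            start = rep + 1
        else:
            spans.append(hits[start:nxt])      # cut before the next hit
            start = nxt
    if hits:
        spans.append(hits[start:])  # remaining hits form the last span
    return spans


def span_width(span: List[Hit], max_dis: int) -> int:
    """Width of an expanded span; a single-hit span gets width MAX_DIS."""
    if len(span) == 1:
        return max_dis
    return span[-1][1] - span[0][1] + 1


chain = [("sea", 5), ("thousand", 7), ("years", 8),
         ("thousand", 10), ("years", 11), ("sea", 29)]
spans = detect_expanded_spans(chain, max_dis=10)
print(spans)
print([span_width(s, 10) for s in spans])  # [4, 2, 10]
```

Because a span never contains the same term twice, the number of hits in a span equals the number of unique query terms it contains.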

3.2 Viewing Term Proximity from a Different Perspective

When a document is viewed as a bag of words, term frequency accumulates the same contribution to relevance for each term occurrence, no matter whether it occurs far from or near to other query terms.

When term position information is taken into account, the relevance contribution of each term occurrence can vary with its context. For instance, sea5 contributes more than sea29, as sea5 is near the other two query terms while sea29 is isolated.

Intuitively, relevance contribution is proportional to some function of the density of an expanded span, i.e. the number of unique query terms it contains divided by its width. For example, (thousand10 years11) may be more relevant than (thousand56 … years65). On the other hand, the number of query terms that an expanded span contains also boosts relevance contribution. For example, though the density of (thousand10 years11) is a little greater than that of (sea5 … thousand7 years8), the latter matches all the query terms and implies a greater relevance contribution to the query “sea thousand years”.

Here we propose a function, which is neither the only possible one nor necessarily the best, to represent the relevance contribution of one term occurrence:

$$f(t, espan_i) = \left( \frac{n_i}{Width(espan_i)} \right)^{x} \cdot (n_i)^{y} \qquad (7)$$

where

$t$ is a query term,

$espan_i$ is an expanded span that contains $t$,

$n_i$ is the number of query terms that occur in $espan_i$,

$Width(espan_i)$ is the width of $espan_i$,

$x$ is an exponent that keeps the value from decaying too rapidly as the density of an expanded span decreases,

$y$ is an exponent that rewards the case in which more unique query terms appear in one expanded span.

This function increases as $n_i$ increases and as the width of the expanded span decreases, which accords with Assumptions A and C. For example, when x and y are both set to 1, f reduces to $n_i^2 / Width(espan_i)$.

Note that f takes the same value for every query term that an expanded span contains. This does not matter, since the term weights in the ranking function distinguish the importance of different terms.

Based on Assumption B, we accumulate the relevance contributions of all occurrences of the term t:

$$rc = \sum_{i} f(t, espan_i) \qquad (8)$$

where the sum runs over the expanded spans that contain an occurrence of t.

When tf in traditional ranking functions is replaced by rc, the term proximity factor is plugged into the modified ranking functions naturally. The more expanded spans a document contains, the higher its relevance score, which accords with Assumption B. At the same time, the importance of different terms can still be distinguished easily by the inverse document frequency part, which is preserved.

In our experiments, rc is built into Okapi’s ranking function (formula (1)) [15]:

$$score(d, q) = \sum_{t \in q} w^{(1)} \, \frac{(k_1 + 1)\, rc}{K + rc} \qquad (9)$$
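To make the accumulation concrete, the sketch below computes rc for the three expanded spans of the example in Section 3.1 and plugs it into a BM25-style score. It assumes that the per-occurrence contribution is density raised to x times the number of query terms raised to y, i.e. (n / Width)^x · n^y, which is one reading of the description above rather than a confirmed formula; all function names and the weights passed to `proximity_score` are hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int]  # (query term, position in the document)


def span_contribution(n_terms: int, width: int, x: float, y: float) -> float:
    """Assumed form of f: (n / Width)^x * n^y."""
    return (n_terms / width) ** x * n_terms ** y


def relevance_contributions(spans: List[List[Hit]], max_dis: int,
                            x: float, y: float) -> Dict[str, float]:
    """Accumulate rc per term over all expanded spans that contain it."""
    rc: Dict[str, float] = {}
    for span in spans:
        n_terms = len({term for term, _ in span})
        width = max_dis if len(span) == 1 else span[-1][1] - span[0][1] + 1
        f = span_contribution(n_terms, width, x, y)
        for term, _ in span:
            rc[term] = rc.get(term, 0.0) + f
    return rc


def proximity_score(rc: Dict[str, float], weights: Dict[str, float],
                    k1: float, K: float) -> float:
    """BM25-style score with tf replaced by rc."""
    return sum(weights[t] * (k1 + 1) * v / (K + v) for t, v in rc.items())


# Expanded spans of the "sea thousand years" example, MAX_DIS = 10.
spans = [[("sea", 5), ("thousand", 7), ("years", 8)],
         [("thousand", 10), ("years", 11)],
         [("sea", 29)]]
rc = relevance_contributions(spans, max_dis=10, x=1.0, y=1.0)
print(rc)
```

With x = y = 1 the three spans contribute 2.25, 2 and 0.1 per occurrence, so the isolated occurrence of "sea" adds very little to rc for that term.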

4. EXPERIMENTS

Experiments were conducted on the 150 topics used for the ad hoc tasks of TREC-9 and TREC-10 and the web task of TREC-11. As the topic distillation task of TREC-12 differs from relevance ranking, topics and answers for TREC-12 were excluded. Only the title fields were extracted as queries, without spelling correction. TREC-9 and TREC-10 used the WT10g dataset [4], while TREC-11 used .GOV [4]. Both the data and the queries were processed with Fox’s stop list [5] and the Porter stemmer [11].

There are two main parameters in Okapi’s ranking function: b and k1. Sometimes relevance measures are sensitive to the parameters. Considering the fact that queries are unexpected and answers are unknown, the parameters were tuned only once for one dataset. TREC-9 and TREC-10 shared one pair of b and k1 while TREC-11 applied another (See Table 1). As a result, the baselines obtained via Okapi’s relevance ranking were comparable to top-flight performance in the corresponding TRECs.

Three parameters in our new ranking function, i.e. MAX_DIS, x and y, were tuned coarsely on TREC-9 (See Table 1) and then applied on TREC-10 and TREC-11.

Table 1. Parameters Applied on Each Collection

Collection   b      k1    MAX_DIS   x      y
TREC-9       0.3    0.4   45        0.25   0.3
TREC-10      0.3    0.4   45        0.25   0.3
TREC-11      0.45   2.5   45        0.25   0.3

In our approach, the interaction of two adjacent hits disappears when the distance between them is larger than MAX_DIS. The tuned value of 45 is much larger than the value of 5 used in Rasolofo and Savoy’s ranking function [13]; their approach needs more credible phrases to refine Okapi’s results. Such a MAX_DIS agrees with Papka and Allan’s conclusion: according to their experimental results [10], after any expansion step that adds the top few multiword features, queries containing proximity operators spanning bigger windows (50) had better retrieval effectiveness than similarly sized queries using operators spanning smaller windows (such as 5). The conclusion implies that words within a distance of 50 are related to each other for search.

The exponent x controls the speed at which relevance contribution decays as the density of a span decreases. The value 0.25 represents a slower decay than the square root [6,7].

In addition, performance improved when the exponent y increased from 0 to 0.3 during tuning. This means that the number of hits that a span contains implies extra relevance contribution.

Table 2 shows that the improvement is consistent in terms of both average and top precision. Performance on TREC-9 seemed harder to enhance than on the other two collections, and precision for some queries decreased with term proximity built in. Rasolofo and Savoy observed similar phenomena. This indicates that term proximity is not always infallible evidence of relevance. Nevertheless, term proximity ranking functions could enhance perceived relevance, as users expect a document containing most or all of the query terms to be ranked before a document containing fewer terms, even when the document with fewer terms is relevant and the document with more terms is not [17].

Table 2. Average Precision, Precision at 5 and 10 Documents

Collection      AvePre    P@5       P@10
TREC-9-Okapi    0.2038    0.3102    0.2612
TREC-9-newTP    0.2044    0.3143    0.2735
TREC-9-Diff     +0.3%     +1.3%     +4.7%
TREC-10-Okapi   0.2026    0.3640    0.3240
TREC-10-newTP   0.2237    0.3960    0.3480
TREC-10-Diff    +10.4%    +8.8%     +7.4%
TREC-11-Okapi   0.1776    0.2776    0.2408
TREC-11-newTP   0.1855    0.3143    0.2653
TREC-11-Diff    +4.4%     +13.2%    +10.2%
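The Diff rows are relative improvements of newTP over the Okapi baseline. A quick check of that arithmetic, using the TREC-10 and TREC-11 figures from Table 2 (the helper name `relative_gain` is illustrative):

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (Okapi, newTP) pairs copied from Table 2.
trec10 = {"AvePre": (0.2026, 0.2237), "P@5": (0.3640, 0.3960), "P@10": (0.3240, 0.3480)}
trec11 = {"AvePre": (0.1776, 0.1855), "P@5": (0.2776, 0.3143), "P@10": (0.2408, 0.2653)}

for name, rows in (("TREC-10", trec10), ("TREC-11", trec11)):
    for measure, (okapi, new_tp) in rows.items():
        print(f"{name} {measure}: {relative_gain(okapi, new_tp):+.1f}%")
```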

Given the experimental settings described above, Rasolofo and Savoy’s term proximity ranking function was also implemented. Figures 3, 4 and 5 compare three ranking functions, namely Okapi, Rasolofo and Savoy’s OkaTP, and ours, labeled newTP, in terms of average precision, precision at 5 and precision at 10 respectively. OkaTP outperformed our newTP in terms of precision at 5, whereas it fell behind ours, and even behind Okapi, in terms of precision at 10. The average precision of the two approaches was similar, and better than Okapi’s, in the experiments. The advantage of our newTP was that it consistently retained an improvement over Okapi.

5. CONCLUSION AND FUTURE WORK

We revisit previous work on term proximity and summarize the main assumptions about the relationship between proximity and relevance. The main contribution of this paper is a novel approach that embeds term proximity factors in existing ranking function frameworks without imposing significant modifications on the functions. Instead of directly integrating the scores of loose phrases, the probability that a term t indicates relevance varies with the properties of the term’s context.

As a result, when term proximity is introduced, no extra items are appended to the ranking function and the refined inverse document frequency part remains effective. Experimental results showed consistent improvement on three TREC test collections.

In the next step, we plan to refine our approach and compare it with other approaches. As our results on TREC-9 do not match those reported by Rasolofo and Savoy, we suspect that some unidentified factors underlie the difference, and we plan to conduct further experiments to identify them. In addition, term proximity is not absolutely infallible evidence of relevance, especially of objective relevance. Sometimes, term proximity is more useful for improving perceived, or subjective, relevance. Evaluating the influence of term proximity on perceived relevance is also part of our ongoing research. If users prefer documents in which the density of matched terms is high, even when those documents are in fact irrelevant, the ranking functions may be modified to increase the impact of term proximity.

[Chart: average precision (y-axis, 0.16 to 0.23) of Okapi, OkaTP and newTP on TREC-9, TREC-10 and TREC-11]

Figure 3. Average Precision

[Chart: precision at 5 (y-axis, 0.26 to 0.42) of Okapi, OkaTP and newTP on TREC-9, TREC-10 and TREC-11]

Figure 4. Precision at 5


[Chart: precision at 10 (y-axis, 0.2 to 0.36) of Okapi, OkaTP and newTP on TREC-9, TREC-10 and TREC-11]

Figure 5. Precision at 10

6. REFERENCES

[1] Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998.

[2] Clarke, C. L. A., Cormack, G. V. and Burkowski, F. J.: Shortest substring ranking (multitext experiments for TREC-4). In Proceedings of TREC-4, 1995.

[3] Clarke, C. L. A., Cormack, G. V. and Tudhope, E. A.: Relevance ranking for one to three term queries. Information Processing & Management, 36(2), 291-311, 2000.

[4] CSIRO, TREC Web Tracks home page. www.ted.cmis.csiro.au/TRECWeb/

[5] Fox, C.: A stop list for general text. SIGIR Forum, ACM Press, Vol. 24, No. 4, 19-35, December 1990.

[6] Hawking, D. and Thistlewaite, P.: Proximity operators – So near and yet so far. In Proceedings of TREC-4, 131-143, 1995.

[7] Hawking, D. and Thistlewaite, P.: Relevance weighting using distance between term occurrences. Computer Science Technical Report TR-CS-96-08, Australian National University, August 1996.

[8] Luhn, H. P.: The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-168, 1958.

[9] Papineni, K.: Why inverse document frequency? In Proceedings of the NAACL 2001, 25-32, 2001.

[10] Papka, R. and Allan, J.: Document classification using multiword features. In Proceedings of CIKM-98, ACM Press, 124-131, 1998.

[11] Porter, M.: An algorithm for suffix stripping. Program, 14(3):130-137, 1980.

[12] Pratt, E. J.: Complete poems. University of Toronto Press, 1989.

[13] Rasolofo, Y. and Savoy, J.: Term proximity scoring for keyword-based retrieval systems. 25th European Conference on IR Research (ECIR’03), Springer, 207-218, 2003.

[14] Robertson, S. E. and Sparck Jones, K.: Relevance weighting for search terms. Journal of the American Society for Information Science, 27(3), 129-146, 1976.

[15] Robertson, S. E., Walker, S. and Beaulieu, M.: Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 95-108, 2000.

[16] Rose, D. E. and Stevens, C.: V-Twin: a lightweight engine for interactive use. In Proceedings of TREC-5, 279-290, 1996.

[17] Spink, A., Wolfram, D., Jansen, B. J. and Saracevic, T.: Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234, 2001.

[18] Wilkinson, R., Zobel, J. and Sacks-Davis, R.: Similarity measures for short queries. In Proceedings of TREC-4, 277-285, 1995.


7. Appendix A: The Algorithm of Detecting Expanded Spans

// CurrentFirstNode always points to the first hit of the span being built.
CurrentFirstNode = GetFirstNode(Chain_of_All_HitNodes);

DetectExpandedSpans()
{
    CurrentNode = CurrentFirstNode;
    if (CurrentNode != NULL)
    {
        NextNode = GetNextNode(CurrentNode);
        while (NextNode != NULL)
        {
            if ((Distance(CurrentNode, NextNode) > MAX_DIS)
                or (GetTerm(CurrentNode) == GetTerm(NextNode)))
            {   // Cases 1 and 2: separate the chain between the two hits
                SaveExpandedSpan(CurrentFirstNode, CurrentNode);
                CurrentFirstNode = NextNode;
            } else {
                RepeatedNode = FindRepeatedNode(CurrentNode, NextNode);
                if (RepeatedNode != NULL)
                    ProcessCase3(RepeatedNode, CurrentNode, NextNode);
                // Case 4 (no repeated term): keep scanning
            }
            CurrentNode = NextNode;
            NextNode = GetNextNode(CurrentNode);   // advance the scan
        }
        // The remaining hits form the last expanded span
        SaveExpandedSpan(CurrentFirstNode, CurrentNode);
    }
}

// Case 3: separate the chain at the bigger of the two gaps.
ProcessCase3(RepeatedNode, CurrentNode, NextNode)
{
    RNextNode = GetNextNode(RepeatedNode);
    if (Distance(RepeatedNode, RNextNode) > Distance(CurrentNode, NextNode))
    {
        SaveExpandedSpan(CurrentFirstNode, RepeatedNode);
        CurrentFirstNode = RNextNode;
    } else {
        SaveExpandedSpan(CurrentFirstNode, CurrentNode);
        CurrentFirstNode = NextNode;
    }
}

// Returns the hit before CurrentNode in the current span that has the
// same term as NextNode, or NULL if there is none.
HitNode FindRepeatedNode(CurrentNode, NextNode)
{
    RepeatedNode = CurrentFirstNode;
    while ((RepeatedNode != CurrentNode)
           and (GetTerm(RepeatedNode) != GetTerm(NextNode)))
        RepeatedNode = GetNextNode(RepeatedNode);
    if (RepeatedNode != CurrentNode)
    {
        return RepeatedNode;
    } else {
        return NULL;
    }
}

int Distance(HitNode1, HitNode2)
{
    return GetPosition(HitNode2) - GetPosition(HitNode1);
}