
Improving Retrieval of Short Texts Through Document Expansion

Miles Efron, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-492, Champaign, IL, [email protected]

Peter Organisciak, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-492, Champaign, IL, [email protected]

Katrina Fenlon, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-492, Champaign, IL, [email protected]

ABSTRACT
Collections containing a large number of short documents are becoming increasingly common. As these collections grow in number and size, providing effective retrieval of brief texts presents a significant research problem. We propose a novel approach to improving information retrieval (IR) for short texts based on aggressive document expansion. Starting from the hypothesis that short documents tend to be about a single topic, we submit documents as pseudo-queries and analyze the results to learn about the documents themselves. Document expansion helps in this context because short documents yield little in the way of term frequency information. However, as we show, the proposed technique helps us model not only lexical properties, but also temporal properties of documents. We present experimental results using a corpus of microblog (Twitter) data and a corpus of metadata records from a federated digital library. With respect to established baselines, results of these experiments show that applying our proposed document expansion method yields significant improvements in effectiveness. Specifically, our method improves the lexical representation of documents and the ability to let time influence retrieval.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.7 [Digital Libraries]: Systems Issues

General Terms
Algorithms, Experimentation, Performance

Keywords
Information retrieval, microblogs, twitter, Dublin Core, document expansion, language models, temporal IR

1. INTRODUCTION
Collections of short documents present a host of challenges to information retrieval (IR) systems. The increasing influence of microblogging platforms such as Twitter^1 makes this challenge especially timely. Twitter search exists alongside other short-text IR problems such as ranking product reviews and advertisement placement. These short documents join more familiar brief texts such as bibliographic and other metadata records.

Faced with a corpus of millions of documents, each of which contains only a few words, traditional IR models run into difficulty. First, the vocabulary mismatch problem becomes especially worrisome. If documents are very brief, the risk of query terms failing to match words observed in relevant documents is large. Second, most IR models rely on some sort of TF-IDF dynamic, with a term's frequency in a document lending evidence to our belief about the document's relevance. In very short documents, most terms occur only once, making simple operations such as language model estimation difficult.

However, we can mitigate these problems. We argue that a massive document expansion improves retrieval effectiveness for short texts. The mechanism that we propose for this expansion is simple. Because the documents that we are concerned with are so brief, each one tends to focus on only a single topic. A topically homogeneous document is not very different from a query. Thus we propose submitting each document in a corpus as a pseudo-query. We show that analyzing the results obtained from this pseudo-query improves the models that underpin IR.

This paper proposes and tests two types of document expansion. Given a corpus of N documents C = D_1, ..., D_N and a document of interest D, we augment D's representation in the index by submitting the text of D as a query to a search engine over C. This gives a ranked list of results R_1, ..., R_k. This result set informs two expansions:

1. Lexical: We use the terms in R_1, ..., R_k to improve our estimate of the language model for D.

2. Temporal: We use the timestamps associated with R_1, ..., R_k to build a "temporal profile" for D: a probability distribution over time. This probability distribution is helpful during time-sensitive document ranking.

Our chief contribution is a novel approach to representing collections of short text.

^1 http://twitter.com



Our methods elaborate on results from the well-known language modeling approach to IR, especially the notion of relevance models. But we also show that the single-topic focus of most short documents opens avenues for consideration of information other than language, for example, time.

2. PROBLEM STATEMENT AND MOTIVATION

The growing quantity of information published in small texts such as Twitter posts argues for a sustained analysis of the treatment of brief documents in IR. While this urgency is new, the need for improved retrieval of short texts is not. For instance, many digital libraries manage repositories of terse metadata records. While these repositories (such as the DCC repository we describe in this paper) are nominally searchable, the brevity of their metadata can frustrate effective IR. Other examples of short-text IR include query-specific advertisement and product review ranking.

But retrieval from collections of short documents is difficult for several reasons. First, brief documents attenuate one of the primary signals used by modern IR systems: term frequency. Figure 1 can help us understand why this is the case. The figure shows data from two TREC collections (described in detail in Section 5). The TREC 8 data included a corpus of news articles. Tweets2011 is the collection of Twitter data used for the microblog track at TREC 2011. Figure 1 shows the distributions of the log-probabilities of query terms among the first 100 documents retrieved using a simple query likelihood model. That is, the points that comprise the distributions are the maximum likelihood estimates of log P(q|D) for each query term q.

Two problematic facts are clear from Figure 1. First, unlike the longer news documents, tweets lead to a distribution of query term probabilities that is strongly peaked. The majority of tweets that contain a query word only contain it once, and tweets are of similar length. Second, in the case of the TREC 8 data, the mean and median log-probability of a query term are higher in relevant documents than in non-relevant documents. This is not the case for the Twitter data. Median log P(q|D) for relevant TREC 8 documents is -6.92, with median -7.17 for non-relevant documents. The difference in log-probability in relevant versus non-relevant TREC 8 documents is statistically significant (Mann-Whitney p ≪ 0.001, one-sided). But log P(q|D) for relevant Twitter documents has median -7.82 and -7.80 for non-relevant (Mann-Whitney p = 1, one-sided). To the extent that TREC 8 presents a "typical" statistical picture, Figure 1 shows that Twitter data are qualitatively different than more familiar TREC collections.
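To make the relevant versus non-relevant comparison concrete, the following is a minimal sketch, not the authors' code, of how such a test can be run with SciPy once the per-document log-probabilities have been collected into two lists (the function name and data layout are our own assumptions).

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_term_logprobs(rel_logprobs, nonrel_logprobs):
    """One-sided Mann-Whitney U test: are log P(q|D) values in relevant
    documents stochastically greater than in non-relevant documents?"""
    rel = np.asarray(rel_logprobs)
    nonrel = np.asarray(nonrel_logprobs)
    stat, p_value = mannwhitneyu(rel, nonrel, alternative="greater")
    return {
        "median_rel": float(np.median(rel)),
        "median_nonrel": float(np.median(nonrel)),
        "U": float(stat),
        "p": float(p_value),
    }
```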

The impact of this difference is easy to see if we consider the language modeling approach to IR [21]. In the language modeling approach, we assume that each document in our collection was generated by a probability distribution, a language model, over terms in the vocabulary. We rank documents against a query Q by the likelihood that their corresponding language models generated the query:

$$P(D|Q) \propto P(Q|D)\,P(D) \qquad (1)$$

[Figure 1: Log-probabilities of query terms in relevant and non-relevant documents in two corpora. Panels show boxplots of log p(w|D) for relevant and non-relevant documents in TREC 8 and Tweets2011.]

If we assume uniform priors over documents and term independence we have:

$$P(Q|D) = \prod_{i=1}^{|Q|} P(w_i|D) \qquad (2)$$

where |Q| is the number of word tokens in the query. Key to language modeling is the estimation of P(w|D). Using multinomial language models, the maximum likelihood estimator is P_ml(w|D) = n(w,D)/|D|. But when confronted with a corpus of brief documents, estimating language models is difficult because the maximum likelihood estimator is not very expressive. Since |D| is small, n(w,D) cannot be large and is very often 1. We refer to this as the "headroom problem": in brief documents there is little chance for important terms to stand out via repeated usage. Coupled with a small domain of observed document lengths, the headroom problem has the effect of making many documents' estimated language models look nearly identical.

To improve our estimates, we typically smooth language models, re-allocating probability mass away from the maximum likelihood estimator by reference to a background model, such as P(w|C), the language model of the collection as a whole, as described by Zhai and Lafferty in [28]. Though several smoothing techniques are common in the IR literature, in this paper we select Bayesian updating with Dirichlet priors. The form of a Dirichlet-smoothed language model is:

$$P(w|D) = \frac{|D|}{|D| + \mu}\, P_{ml}(w|D) + \frac{\mu}{|D| + \mu}\, P(w|C) \qquad (3)$$

for µ ≥ 0. Smoothing in this way improves our estimated probabilities. But smoothing does little to alleviate the headroom problem.
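As a concrete illustration of Eqs. 2 and 3, the following is a minimal sketch assuming simple in-memory count dictionaries rather than a real index; the function and variable names are ours and do not correspond to any toolkit API.

```python
import math
from collections import Counter

def dirichlet_logprob(word, doc_counts, doc_len, coll_counts, coll_len, mu=2500.0):
    """Eq. 3: Dirichlet-smoothed P(w|D), mixing the document MLE with the
    collection language model P(w|C)."""
    p_ml = doc_counts.get(word, 0) / doc_len if doc_len > 0 else 0.0
    p_coll = coll_counts.get(word, 0) / coll_len
    p = (doc_len / (doc_len + mu)) * p_ml + (mu / (doc_len + mu)) * p_coll
    return math.log(p) if p > 0 else float("-inf")

def query_likelihood(query_tokens, doc_tokens, coll_counts, coll_len, mu=2500.0):
    """Eq. 2 in log space: log P(Q|D) under the term-independence assumption."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(
        dirichlet_logprob(w, doc_counts, doc_len, coll_counts, coll_len, mu)
        for w in query_tokens
    )
```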

Our goal is to improve the representation of short documents in IR. We argue that short text representation can be improved by an aggressive document expansion. In Section 4.1, we use this expansion to induce improved document language models. However, document expansion has a role to play in IR for short texts that goes beyond language model estimation. As an example of an alternative use of expansion information, Section 4.2 extends our results to allow temporal factors to influence retrieval.


3. RELATED WORK
The problematic nature of IR on short documents has seen little sustained research (though [22, 27] do treat the topic explicitly). However, recent interest in social media has drawn attention to this problem [22, 10]. Familiar operations such as measuring inter-document similarity, scholars have found, are difficult given the brevity of many documents in collections of user-generated content [20].

Though corpora of brief documents are increasingly common, the estimation problems that our proposed expansion methods address are not new. Most similar to our own work are results from research on cluster-based IR. Clusters have been applied at various points in the IR process, including relevance feedback [14, 9], rank fusion [12], and as a separate factor used during document ranking [13]. Clusters are appealing insofar as they afford a level of information about documents that resides at a level higher than intra-document word counts, but below the generalities of the collection at large. For instance, Liu and Croft applied clusters during document language model smoothing [18]. Ramage, Dumais and Liebling address the vocabulary mismatch problem in microblog text using latent Dirichlet allocation (LDA) [23], allowing the LDA model to supplement observed term probabilities in document representation. Other methods of document expansion have also seen sustained work (e.g. [24]), though in contexts different from those that we study here.

The most similar work to ours was given by Tao et al. [26]. In their work, Tao et al. proposed smoothing document language models by analyzing their lexical neighborhoods. That is, the model for document D was smoothed with counts obtained from its k nearest neighbors D_1, ..., D_k, with each document's influence in the smoothed model being proportional to its cosine similarity with D. Like Tao et al., we propose improving the representation of D by an analysis of similar documents. However, the approach we outline in Section 4.1 differs from Tao et al. in its basic theoretical orientation, couching the estimation problem in the generative semantics of language modeling. Tao et al. define the neighborhood of a document in geometrical terms, relying on the cosine similarity. With an explicit assumption of normality, they smooth the model of D based on this neighborhood and this metric. On the other hand, we assume that D arises from an unseen model D′, much as relevance modeling assumes the influence of an unseen model of relevance. To estimate P(w|D′) we combine evidence from other documents, where the influence that some document D_j exerts on the final model is the likelihood that D_j's language model generated D.

Our work also relies on findings from recent studies on information retrieval and microblogs [6, 19, 2]. The field of microblog retrieval is relatively new, but has seen increasing interest, most noticeably in the 2011 TREC microblog track.

Unlike the methods described in this section, our document expansion technique also invites extension to non-linguistic features. As we show in Section 4.2, aspects of relevance such as temporality fit naturally into our approach.

4. DOCUMENT EXPANSION FOR SHORT TEXTS

Let D be a document consisting of |D| word tokens d_1, ..., d_{|D|}. Also let C be a corpus of N documents. Using Eq. 2, we can calculate the likelihood of D given the language model of each document in the collection, P(D|D_1), ..., P(D|D_N). We propose analyzing these probabilities to make an augmented representation D′ that provides a better basis for estimation and prediction than the original D affords.

From a practical standpoint, this amounts to submitting each document as a pseudo-query and ranking all documents using the standard query likelihood method. As we shall show, the analysis of P(D|D_1), ..., P(D|D_N) can play a role similar to relevance feedback, though how literally this comparison holds is variable.

With this in mind, we offer two definitions related to a document D:

Definition 1. Pseudo-query of D: The pseudo-query Q_D is a representation of the text in D that we may submit to an IR system.

Definition 2. Result set of D: By submitting Q_D as a query against the corpus C, we derive a ranking R_D = R_{D1}, ..., R_{Dk}. R_D consists of the top k documents retrieved for Q_D and their retrieval status value scores. In this paper we use query likelihood for retrieval, so the scores are the probabilities P(D|D_1), ..., P(D|D_k).

The use of Q_D and R_D stems from our hypothesis that most short texts discuss only a single topic. If Q_D treats a coherent topical domain, we anticipate that analyzing the documents in R_D will yield information about the topic that the document of interest D is about.
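A minimal sketch of Definitions 1 and 2, built on the toy query-likelihood scorer sketched earlier (the helper names and in-memory corpus layout are our own assumptions, not the Indri API; a production system would use an inverted index rather than scoring every document):

```python
def pseudo_query(doc_tokens, stopwords):
    """Definition 1: Q_D is a representation of D's text; here we simply
    drop stopwords, as in our experimental setup (Section 6)."""
    return [w for w in doc_tokens if w not in stopwords]

def result_set(doc_tokens, corpus, coll_counts, coll_len, stopwords, k=50):
    """Definition 2: R_D is the top-k documents, with their query-likelihood
    scores, retrieved by submitting Q_D against the corpus C."""
    q_d = pseudo_query(doc_tokens, stopwords)
    scored = [
        (doc_id, query_likelihood(q_d, tokens, coll_counts, coll_len))
        for doc_id, tokens in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```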

4.1 Lexical Evidence
Following Lavrenko and Croft's exposition of relevance-based language models [16], we use R_D to induce D′, a language model associated with D. In typical language modeling, we estimate a document's language model by combining evidence from the document itself with information from a background distribution such as the collection. Given R_D, however, we can improve this estimate.

The language model D′ is

$$P(w|D') = P(w|d_1, \ldots, d_{|D|}) = \frac{P(w, d_1, \ldots, d_{|D|})}{P(d_1, \ldots, d_{|D|})}. \qquad (4)$$

The denominator P(d_1, ..., d_{|D|}) does not depend on w, leaving the joint probability as the quantity of interest. This joint probability is

$$P(w, d_1, \ldots, d_{|D|}) = \sum_{D_i \in C} P(D_i)\, P(w, d_1, \ldots, d_{|D|} \mid D_i). \qquad (5)$$

If we further assume that

$$P(w, d_1, \ldots, d_{|D|} \mid D_i) = P(w|D_i) \prod_{j=1}^{|D|} P(d_j|D_i) \qquad (6)$$

then we have

$$P(w, d_1, \ldots, d_{|D|}) = \sum_{D_i \in C} P(D_i)\, P(w|D_i) \prod_{j=1}^{|D|} P(d_j|D_i). \qquad (7)$$

The last factor in Eq. 7 reminds us that most of the probability mass in this joint distribution will derive from documents that are lexically similar to the document D.


Thus we can obtain a good estimate of P(w|D′) by performing the summation in Eq. 7 over only the k documents with the highest likelihood of generating D. The probability of word w under the augmented model is therefore a weighted average of the observed probabilities of w in the top k documents retrieved by submitting D as a pseudo-query, where the weights are the likelihoods of D given the retrieved documents' language models.

Having obtained our augmented representation, we may then rank documents by the likelihood that their augmented language models generated the query, P(Q|D′), which we calculate using the query likelihood model as usual, substituting P(w|D′) for the maximum likelihood estimator in Eq. 3. We refer to this method by the abbreviation LExp, for lexical expansion.

As is common when using relevance models for relevance feedback, interpolating the expanded model with the originally observed text is likely to be "safer" than relying on the expanded model alone. Thus we define another lexically expanded model:

$$P_\lambda(w|D') = (1 - \lambda)\, P_{ml}(w|D) + \lambda\, P(w|D') \qquad (8)$$

for a parameter λ in [0, 1]. We may then substitute P_λ(w|D′) for the maximum likelihood estimator in Eq. 3 for retrieval. We refer to rankings based on this estimator as LExpλ. For simplicity, when discussing LExpλ we set λ = 0.5 throughout this paper.
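The following sketch, under the same toy in-memory assumptions as the earlier snippets (our own code, not the authors' implementation), computes the expanded model P(w|D′) of Eq. 7 as a likelihood-weighted average over the result set, and the interpolated LExpλ estimate of Eq. 8.

```python
import math
from collections import Counter

def expanded_language_model(result_set, corpus):
    """Eq. 7, truncated to the top-k result set: P(w|D') is a weighted
    average of P(w|D_i), with weights proportional to P(D|D_i).

    result_set: list of (doc_id, log_likelihood) pairs (Definition 2).
    corpus: dict mapping doc_id -> list of tokens.
    """
    # Convert log-likelihoods to normalized weights, shifting by the max
    # for numerical stability.
    max_ll = max(ll for _, ll in result_set)
    weights = {doc_id: math.exp(ll - max_ll) for doc_id, ll in result_set}
    total = sum(weights.values())

    expanded = {}
    for doc_id, weight in weights.items():
        tokens = corpus[doc_id]
        counts = Counter(tokens)
        for w, n in counts.items():
            expanded[w] = expanded.get(w, 0.0) + (weight / total) * (n / len(tokens))
    return expanded  # maps w -> P(w|D')

def interpolated_model(doc_tokens, expanded, lam=0.5):
    """Eq. 8: the LExp-lambda estimate mixes the document MLE with P(w|D')."""
    counts = Counter(doc_tokens)
    vocab = set(counts) | set(expanded)
    return {
        w: (1 - lam) * (counts.get(w, 0) / len(doc_tokens)) + lam * expanded.get(w, 0.0)
        for w in vocab
    }
```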

4.2 Temporal Evidence
In addition to supplementing the lexical evidence that we store about documents, the expansion method described above can create new, extra-lexical features. For example, information in R_D yields actionable information related to temporal aspects of relevance. Indeed, incorporating temporal evidence into IR constitutes a research area in its own right (cf. [1]). We pursue temporality here as an example of extra-linguistic information that our document expansion method allows.

We assume that for each document D_i, we have a corresponding timestamp t_i, which is the time at which the document was published. We also define t_0 as the earliest timestamp in the collection. In this paper we measure time in fractions of days, such that t_i is how many days elapsed between the initial time t_0 and the publication of D_i.

The observed timestamp t_i provides useful information for IR. But our goal is to learn additional temporal information about each document D_i. We hypothesize that the empirical distribution of timestamps related to D_i via document expansion helpfully supplements direct use of t_i.

Working under a similar scenario, Jones and Diaz defined the notion of a query's "temporal profile" [11, 5]. The temporal profile of a query Q is a probability distribution over time, P(t|Q). Analyzing the empirical distribution of documents retrieved for Q gives information helpful in estimating P(t|Q), as shown in [4].

Our discussion above suggests that temporal profiles may be defined not only on queries, but also on documents. For each document D, we define a temporal profile P(t|D). This probability distribution expresses the extent to which D is associated with events that were discussed at various moments in time. Estimating P(t|D) follows the same logic that we use for query temporal profiles; we submit the text of D as a query, yielding R_D = R_{D1}, ..., R_{Dk}. Each of the documents R_i in R_D contains a timestamp t_i. From the empirical distribution of t_1, ..., t_k we may estimate the underlying distribution P(t|D).

To estimate P(t|D) we imagine the following generative process. A person interested in D samples documents according to P(D_1|D), ..., P(D_N|D). If this choice yields document D_i with timestamp t_i, we choose time t with probability P(t|t_i). We let this probability follow an exponential distribution on |t_i − t|.

The probability of choosing document D_i in our generative process is simply the likelihood of D given D_i. To derive a proper density, we normalize the k likelihoods in R_D to sum to one:

$$s_i = P(D_i|D) = \frac{P(D|D_i)\, P(D_i)}{\sum_{j \in R_D} P(D|D_j)\, P(D_j)}.$$

We then have:

$$\hat{P}(t|D) = \sum_{t_i \in R_D} s_i \cdot r \cdot e^{-r \cdot |t_i - t|}. \qquad (9)$$

The exponential rate parameter r governs the temporal influence; a large r strongly favors times near t_i. It is worth noting that Eq. 9 is similar to a weighted kernel density estimate [25]. It allocates probability mass mostly around the timestamps of those documents that have a high likelihood of having generated D.
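A minimal sketch of Eq. 9 under the same toy result-set layout used above (our own helper names): the result set's log-likelihood scores are normalized into the weights s_i, and the profile places an exponential kernel of rate r around each retrieved timestamp.

```python
import math

def temporal_profile(result_set, timestamps, r=0.01):
    """Eq. 9: return a function t -> P_hat(t|D), a likelihood-weighted
    mixture of exponential kernels centered on the result set's timestamps.

    result_set: list of (doc_id, log_likelihood) pairs.
    timestamps: dict mapping doc_id -> time in days since t_0.
    """
    max_ll = max(ll for _, ll in result_set)
    raw = [(doc_id, math.exp(ll - max_ll)) for doc_id, ll in result_set]
    total = sum(w for _, w in raw)
    kernels = [(timestamps[doc_id], w / total) for doc_id, w in raw]

    def density(t):
        return sum(s_i * r * math.exp(-r * abs(t_i - t)) for t_i, s_i in kernels)

    return density
```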

For each document and query we observe lexical evidence (i.e., their textual content) and temporal evidence. Let W_Q and W_D be the lexical representations of the query and a document, respectively, and T_Q and T_D be vectors of timestamps retrieved by submitting Q as a query or D as a pseudo-query. It is intuitive to rank on the joint probability P(W_Q, W_D, T_Q, T_D). If we assume conditional independence between lexical and temporal information and a uniform distribution over temporal profiles, we have:

$$P(W_Q, W_D, T_Q, T_D) \propto P(W_Q|W_D) \cdot P(T_Q|T_D). \qquad (10)$$

This is simply the query likelihood multiplied by the likelihood that the timestamps returned by Q were generated by the same distribution that generated the timestamps observed in the result set of the pseudo-query for D. Using our estimate of the temporal profile of D, the likelihood of T_Q given D is:

$$P(T_Q|T_D) = \prod_{i=1}^{k_Q} P(t_{Q_i}|D) \qquad (11)$$

where k_Q is the number of timestamps in the query's temporal profile; we set this to k_Q = 10. We then obtain a temporally informed retrieval by multiplying Eqs. 2 and 11. In the following discussion, we refer to this model as TExp (temporal expansion). Runs labeled TExp use documents' unexpanded text for W_D. We will compare this method to a baseline using temporal document priors described by Li and Croft [17] (which we call TPrior). TPrior uses an exponential distribution on the age of documents as the prior in Eq. 2. Both methods, TExp and TPrior, rely on a rate parameter r for the exponential. Following Li and Croft, unless otherwise specified we set r = 0.01.
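Putting the pieces together, a sketch of a TExp-style score under our assumed helpers from the earlier snippets (query_likelihood from the Section 2 sketch and temporal_profile above): the likelihood of the query's k_Q timestamps under the document's profile (Eq. 11) is combined with the lexical query likelihood of Eq. 2, added in log space.

```python
import math

def temporal_log_likelihood(query_timestamps, doc_profile):
    """Eq. 11: log P(T_Q|T_D), treating the k_Q timestamps retrieved for the
    query as independent draws from the document's temporal profile."""
    return sum(math.log(doc_profile(t)) for t in query_timestamps)

def texp_score(query_tokens, doc_tokens, coll_counts, coll_len,
               query_timestamps, doc_profile, mu=2500.0):
    """Eq. 10 in log space: lexical query likelihood plus temporal likelihood."""
    lexical = query_likelihood(query_tokens, doc_tokens, coll_counts, coll_len, mu)
    temporal = temporal_log_likelihood(query_timestamps, doc_profile)
    return lexical + temporal
```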

5. EXPERIMENTAL DATA
In the discussion that follows we rely mainly on two data sets. First, we use Tweets2011, the collection used for the TREC 2011 microblog task. Second, we use a collection of metadata records describing holdings in a large digital library. Both of these collections consist mostly of brief documents, though as we shall show, they have many differences.

5.1 Microblog Data
The Tweets2011 collection uses a corpus of posts (called "tweets") made available by the microblogging service Twitter. Instead of distributing the microblog corpus via physical media or a direct download, TREC organizers distributed approximately 16M unique tweet ID numbers. Users of the collection downloaded these IDs, along with software that allowed them to fetch the tweets directly from Twitter. We obtained our data using the supplied HTML scraping tools on May 25–26, 2011. Our corpus contained 15,653,612 indexable tweets, each containing: the screen name of the author, the time at which the tweet was posted, and the tweet text itself.

Our microblog data was preprocessed in several ways. First, in conformance with the track's guidelines, all "retweets" were removed by deleting documents containing the string RT. Second, we made a simple pass at removing non-English tweets. We deleted tweets containing more than four characters with byte values greater than 255. We also defined a list of 133 words that are common in Spanish, French and German. Tweets containing any of these were removed. After these operations, the corpus contained 8,320,421 documents.
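A sketch of the filtering just described, with our own naming assumptions (the actual 133-word list is not reproduced here; the small set below is only a stand-in):

```python
# Stand-in for the paper's 133-word Spanish/French/German list.
NON_ENGLISH_MARKERS = {"el", "la", "les", "und", "der", "que"}

def keep_tweet(text, non_english_words=NON_ENGLISH_MARKERS):
    """Return True if a tweet survives the preprocessing filters."""
    # Remove retweets, per the track guidelines.
    if "RT" in text:
        return False
    # Drop tweets with more than four characters outside the single-byte range.
    if sum(1 for ch in text if ord(ch) > 255) > 4:
        return False
    # Drop tweets containing any word from the non-English marker list.
    if any(tok in non_english_words for tok in text.lower().split()):
        return False
    return True

# Example usage: corpus = [t for t in raw_tweets if keep_tweet(t)]
```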

The 2011 microblog track involved a real-time search problem. Track organizers created 50 test topics^2, each of which contained query text Q and a timestamp t_Q. Only documents posted prior to t_Q were assessed for relevance (all others are non-relevant). Aside from returning only documents published before query time, for the sake of simplicity, in this paper we drop additional real-time strictures defined by the track organizers.

5.2 Digital Library Metadata Collection (IMLS DCC)
The Institute of Museum and Library Services Digital Collections and Content (IMLS DCC) is a large, federated digital library offering access to the aggregated holdings of a broad spectrum of digital collections at distributed cultural heritage institutions^3. In collaboration with IMLS DCC, we obtained a collection of 578,385 brief descriptions of cultural heritage resources. IMLS DCC administrators regularly harvest descriptions of several hundred participating institutions' digital resources via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [15]. Each of the 578,385 harvested documents is a Dublin Core metadata record describing a single item such as a photograph, manuscript, or sound recording. Details of the process by which this corpus was built are given in [8].

To enable IR experimentation, we sampled 53 queries taken from the DCC search engine query logs. It must be stressed that these were not chosen at random. We used care in selecting queries to assure that no query returned zero relevant documents. Because many highly specific queries returned zero or very few items (making changes in IR performance difficult to test), queries with fewer than five relevant documents (as gauged by the following process) in the collection were omitted.

^2 NIST created relevance judgments using the standard TREC pooling method. Because it lacked relevant documents, one topic was removed from the final task.
^3 http://imlsdcc.grainger.uiuc.edu/

Initial assessments of query suitability for our experiments were made by one of the authors, who is a former DCC administrator with significant experience in collection evaluation and user interactions. When choosing queries for inclusion in our test collection, this judge assessed potential relevance generously; if there seemed to be a reasonable chance that a document could satisfy a query, the document was considered potentially relevant. If a query accumulated at least five plausibly relevant documents, the query was included. Redundant queries and queries submitted by DCC administrators were also culled from the sample. Altogether we retained 53 queries.

We completed this collection by soliciting relevance judgments via Amazon's Mechanical Turk^4 (AMT) service. To accomplish this, the same author wrote relevance criteria for each query. The process of writing relevance criteria is of course subjective and entails obvious limitations. However, without explicit knowledge of the information needs of people who created the queries, the author wrote these criteria to enhance the AMT workers' understanding of the queries and results without unduly limiting or biasing their sense of relevance. For example, relevance criteria for the query USS Monitor were given as, "Highly relevant documents should link to photographs or documentation about the Civil-War-era warship, the USS Monitor. Somewhat relevant documents will describe or provide photographic evidence of US Navy ships from the same era."

While the process of choosing and describing queries introduced subjectivity into our study, we pursued this method because we felt that using queries from the DCC's search logs was more realistic than writing queries of our own. Without attendant clickthrough data, this intervention was necessary to give AMT judges sufficient information for relevance assessment.

Documents to be assessed via AMT were gathered from pools created by running several retrievals over all 53 queries. Pools contained results from five of the conditions discussed in Section 6, as well as a run using BM25 ranking. Altogether, six runs contributed to the judging pools, which were held to a depth of 100.

Judgments were reconciled with two-thirds majority voting. After the first run of AMT ratings, we found that agreement was low, with Fleiss kappa = 0.24. However, the majority AMT ratings yielded kappa = 0.724 when compared to a set of oracle ratings by the study authors, suggesting that the problem of low inter-rater agreement was not systematic but due to scattered unreliable workers. To address this issue, we identified unreliable raters based on the proportion of each worker's judgments that were in agreement with the majority. Using an admittedly arbitrary agreement threshold of 0.67, we divided our workers into "good" and "bad" categories. 28 of 131 workers fell into the "bad" category. We removed all work done by these workers and re-submitted their tasks to AMT.
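A sketch of the reconciliation and worker-screening step, using our own assumed data layout (per-item lists of (worker, label) pairs rather than the actual AMT export):

```python
from collections import Counter, defaultdict

def majority_label(judgments, threshold=2 / 3):
    """Return the label chosen by at least `threshold` of the judgments for
    one item, or None if no label reaches the threshold.
    judgments: list of (worker_id, label) pairs."""
    counts = Counter(label for _, label in judgments)
    label, n = counts.most_common(1)[0]
    return label if n / len(judgments) >= threshold else None

def flag_unreliable_workers(items, agreement_threshold=0.67):
    """Return the set of workers whose share of judgments agreeing with the
    item-level majority falls below the threshold.
    items: dict mapping item_id -> list of (worker_id, label) pairs."""
    agree, total = defaultdict(int), defaultdict(int)
    for judgments in items.values():
        maj = majority_label(judgments)
        if maj is None:
            continue
        for worker, label in judgments:
            total[worker] += 1
            agree[worker] += int(label == maj)
    return {w for w in total if agree[w] / total[w] < agreement_threshold}
```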

After the second judging round we recomputed reliability scores, this time finding only four bad workers, whose combined proportion of ratings was 0.01. Only two workers who were good in round one became bad during round two. After round two, Fleiss kappa was 0.40.

^4 http://mturk.com


Table 1: Summary Statistics for Experimental Corpora. From left to right, columns give total number of documents, observed word types, observed word tokens, median document length, mean document length, standard deviation of document length, and the number of test queries.

| Corpus | Indexed Docs | Unique Terms | Tokens | Median Doc Len | Mean Doc Len | SD Doc Len | # Queries |
| Tweets2011 | 8,320,421 | 19,449,151 | 6,506,465,256 | 20 | 21.92 | 7.30 | 49 |
| DCC | 578,385 | 2,793,371 | 114,453,512 | 45 | 197.89 | 556.90 | 53 |

While this is lower than we would like, it marks a significant improvement over the initial work quality.

5.3 Data Statistics
Table 1 summarizes length-related aspects of our two test data sets. Not surprisingly, documents in the microblog corpus are very short (median = 20, compared to median = 328 in TREC 8), with a compressed distribution of lengths. Most DCC documents are a bit longer than microblog posts (median = 45), and their lengths vary more than tweet lengths do. The corpora analyzed here are similar insofar as their documents are much shorter than documents in more standard IR collections. But the collections are also different from each other.

6. EXPERIMENTAL ASSESSMENT
All experiments were done using the Indri search engine and the Lemur toolkit^5. For efficiency, we used the standard Indri stoplist and a custom Twitter-specific stoplist when forming document pseudo-queries. But subsequent retrievals used no stoplists or stemming, except to mitigate common words' influence in the baseline relevance feedback condition: feedback models were stripped of words in the Indri stoplist. Unless otherwise specified, all retrieval (both queries and pseudo-queries) used the Indri defaults of Dirichlet smoothing with µ = 2500. For each Tweets2011 query we retrieved 100 documents. We retrieved 50 documents per DCC query. To assess the merit of our proposed expansion methods, we tested the conditions shown in Table 2.

6.1 The Effect of Lexical Expansion
Tables 3 and 4 list four effectiveness metrics for the conditions described above, on both of our test data sets^6. The clearest result from the tables is the improvement over the QL baseline offered by our lexical expansion methods. Except for one case (P10 for LExp on DCC), the lexical expansion methods always outperform the QL baseline. The positive effect of document expansion is especially strong when we interpolate the expanded model with the observed document model (i.e., the LExpλ condition).

We can see the positive effect of document expansion in Figure 2. Each panel in the figure schematizes the distribution of query terms in relevant and non-relevant documents, over a variety of data sets. Besides our two short document collections, as a point of reference we show the TREC 8 data described above and the small WT10g web collection.

^5 http://lemurproject.org
^6 Reported metrics are mean average precision (MAP), R-precision (Rprec), NDCG over all positions (NDCG), and precision at 10 (P10).

Table 2: Baseline and Experimental Retrieval Names and Parameters.

Baselines:
QL: Basic query likelihood. Dirichlet smoothing, µ = 2500.
FB: Relevance feedback using relevance models. 20 feedback docs; 15 terms. Interpolation with original query, λ = 0.5.
TPrior: Temporal priors to promote recent documents. Exponential rate parameter r = 0.01.

Experimental:
LExp: Lexical document expansion. k = 50 expansion documents.
LExpλ: Lexical document expansion with linear interpolation of the expansion model with the MLE. k = 50 expansion documents. Expansion-MLE mixing proportion λ = 0.5.
TExp: Temporal document expansion. k = 50 expansion documents.
LTExp: Both lexical and temporal document expansion, i.e., a combination of LExpλ and TExp.
TBoth: Two types of temporal evidence are used: the prior of TPrior and the expansion method of TExp. No lexical expansion.

Table 3: Observed IR Effectiveness on TREC 2011 Microblog Data. The † symbol indicates p < 0.05 on a permutation test against the baseline QL. The ‡ symbol indicates p < 0.01.

| Run | MAP | Rprec | NDCG | P10 |
| QL | 0.187 | 0.275 | 0.360 | 0.398 |
| FB | 0.189 | 0.273 | 0.361 | 0.394 |
| TPrior | 0.198† | 0.284† | 0.372 | 0.427 |
| LExp | 0.216† | 0.301† | 0.404‡ | 0.380 |
| LExpλ | 0.226‡ | 0.319‡ | 0.415‡ | 0.431 |
| TExp | 0.204† | 0.289 | 0.373 | 0.414 |
| TBoth | 0.206† | 0.289† | 0.378† | 0.427† |
| LTExp | 0.235‡ | 0.324‡ | 0.428‡ | 0.451‡ |


[Figure 2: log-probabilities of query terms in relevant and non-relevant documents in four corpora. Panels show boxplots of log p(w|D) for TREC 8, WT10g, Microblog, and DCC, and of log p(w|D′) for the expanded Microblog (Exp) and DCC (Exp) models.]

Table 4: Observed IR Effectiveness on DCC Data. The † symbol indicates p < 0.05 on a permutation test against the baseline QL. The ‡ symbol indicates p < 0.01.

| Run | MAP | Rprec | NDCG | P10 |
| QL | 0.215 | 0.287 | 0.398 | 0.329 |
| FB | 0.205 | 0.256 | 0.366 | 0.300 |
| LExp | 0.227 | 0.292 | 0.414 | 0.304 |
| LExpλ | 0.302‡ | 0.359‡ | 0.502‡ | 0.402‡ |

As we discussed earlier, the figure shows that query terms tend to have a higher probability in relevant documents than in non-relevant documents. The difference is especially stark for the standard (TREC 8 and WT10g) corpora. Observed query term probabilities in the unexpanded microblog corpus give no consistent evidence of relevance (among retrieved documents). The value of query term probabilities in the unexpanded DCC data is higher than in the Twitter data, but less so than in TREC 8 and WT10g. However, the panels labeled Microblog (Exp) and DCC (Exp) show query term probabilities in expanded document models. In these cases the estimated probability of query terms in relevant documents is significantly greater than the corresponding probability in non-relevant documents (Mann-Whitney p ≪ 0.001, one-sided).

A surprising result of our experiments is the poor performance of feedback using relevance models. Inspection of the expanded queries that led to these results showed that the induced query models contained very idiosyncratic terms such as user names (Twitter) and administrative vocabulary (DCC). We include the feedback results because it is worth noting that the expanded document models are capturing semantics that expanded query models cannot, given the conditions encountered in these data. We hypothesize that this effect arises because, in addition to alleviating the vocabulary mismatch problem, our expansion method yields more data for estimating language models. Expanded queries alleviate vocabulary mismatch and improve estimates of the query model, but expanded documents improve our estimates of the document models. In the context of short document retrieval, this effect seems to be crucial.

Figure 3 shows performance over a range of values (from 5 to 50) for k, the number of expansion documents used to fit an augmented language model D′ (cf. Eq. 7).

[Figure 3: Mean average precision observed over varied values of k. The x-axis is the number of documents (5–50) used to fit the expansion model D′; curves show Twitter LExp, Twitter LExpλ, DCC LExp, and DCC LExpλ.]

Contrary to our expectations, there appears to be little risk in adding many documents to an expanded model. While a small decline in performance is visible for the LExp condition on the DCC data after k = 35, when we interpolate with the original model (LExpλ), the performance does not decrease over the range from 5–50 documents. We suspect that this is due to the very small probabilities associated with documents deep in the retrieved set for D.

Our reliance on the retrieved sets of documents' pseudo-queries raises the issue of language model smoothing. How aggressively should we smooth when performing IR for document pseudo-queries? Table 5 shows that during document pseudo-query retrieval, previous studies of smoothing's critical role apply [28]. The table shows retrieval effectiveness statistics at three settings of the Dirichlet smoothing parameter µ during retrieval with document pseudo-queries. Our experiments use µ = 2500. This setting is the default in the Indri search engine (we chose it because it obviated the need for training data) and it has been shown to be effective. Table 5 shows that in this choice we were lucky.


Table 5: Comparison of Retrieval Effectiveness with Different Selections of µ, the Dirichlet Smoothing Parameter, Applied During Retrieval Based on Document Pseudo-Queries.

| Metric | Tweets2011, µ = 250 | Tweets2011, µ = 2500 | Tweets2011, µ = 5000 | DCC, µ = 250 | DCC, µ = 2500 | DCC, µ = 5000 |
| REL RET | 858 | 1032 | 796 | 441 | 661 | 633 |
| MAP | 0.1810 | 0.2155 | 0.1792 | 0.1702 | 0.2269 | 0.1990 |
| Rprec | 0.2764 | 0.3008 | 0.2584 | 0.2255 | 0.2919 | 0.2526 |
| NDCG | 0.3452 | 0.4034 | 0.3454 | 0.3310 | 0.4136 | 0.3787 |
| P10 | 0.3735 | 0.3796 | 0.3694 | 0.2264 | 0.3038 | 0.2906 |
| P30 | 0.3014 | 0.3190 | 0.2871 | 0.1987 | 0.2994 | 0.2780 |

It is clear that µ = 250 leads to under-smoothing, while µ = 5000 is too high. All declines in MAP, Rprec, and NDCG from µ = 2500 are statistically significant. This suggests that document expansion carries some risk. We hypothesize that much of this risk comes from the conjunctive and disjunctive semantics of aggressive smoothing. Weak smoothing pushes retrieval towards a Boolean AND'ing of query terms, while strong smoothing allows the predominance of only a few query terms to promote a document. Balancing this economy is clearly important. However, it is also the case that Table 5 shows very extreme values for µ. Because computational constraints kept us from performing a full parameter sweep, we chose to test very high and very low smoothing parameters. In future work we plan to examine the "safe" region for smoothing in more depth.

6.2 The Effect of Temporal Expansion
Of our two data sets, only Tweets2011 has a temporal component; the DCC documents lack any measurable chronology. Thus we tested the temporal expansion described in Section 4.2 only on the Twitter data. As a baseline we compared our method against the use of temporal priors introduced by Li and Croft. Effectiveness results for each condition are shown in Table 3. Both the baseline method (TPrior) and our temporal expansion (TExp) gave statistically significant improvements over the non-temporally informed QL baseline. However, the difference in effect between TPrior and TExp is small in Table 3, with both methods scoring nearly identically (the differences between them are not statistically significant).

Combining lexical and temporal expansion improves effectiveness further. In Table 3 the method using both expansion sources, LTExp, outperforms all other runs. For completeness, we compared LTExp against a run using the LExpλ condition modified with a temporal prior. LTExp improved on this baseline for all metrics shown in Table 3, with p < 0.05 for MAP and P10.

An interesting result is visible in Figure 4. The bar plot shows the difference in mean average precision (between TExp and TPrior) on a query-by-query basis. Though a few queries have near-identical scores under both TPrior and TExp, the majority of queries fare differently under each method. Both the baseline and proposed methods improve aggregate effectiveness on these data. But they appear to do so quite differently, suggesting that they are not interchangeable. The run in Table 3 labeled TBoth uses both temporal priors and temporal expansion. Though its performance is similar to temporal expansion alone, TBoth does see consistent improvement over TExp (p is 0.83, 0.12, 0.01, 0.15 for MAP, Rprec, NDCG and P10, respectively), suggesting that the prior introduces information distinct from temporal expansion.

[Figure 4: Comparison of two methods of temporal influence. Each bar is the difference in mean average precision between TExp and TPrior (MAP_TExp − MAP_TPrior) on one of the 49 Tweets2011 queries.]

In earlier work [7] it was shown that choosing the rate parameter r for the exponential distribution when applying temporal priors has a strong effect on retrieval. However, we found that a broad range of exponential parameterizations yielded little difference in performance from the data reported in Table 3. The same was true for the parameter in our temporal expansion; a broad sweep yielded no statistically significant change. We hypothesize that this robustness is due to the short temporal span of the Tweets2011 corpus (two weeks). If we repeated our experiments on collections spanning months or years, we anticipate seeing more sensitivity to setting r in both the baseline approach and our own.

7. DISCUSSION
In our previous exposition we treated temporal and lexical expansion as two distinct operations. But the same rationale and the same mechanism underpin both operations. In both cases, information gleaned from a document's result set is used to derive new information about that document. The influence of each member of the result set is proportional to P(D|D_i). Thus in addition to lexical and temporal expansion, we could apply this operation on arbitrary features of documents. For instance, if each document had geolocation information, we could use the methods presented here to make an expanded location profile.


Our main contribution is a way to create, for each document, an empirical distribution of features of whatever type. Our results suggest that for the feature types explored here (lexical and temporal), these distributions convey useful information for IR.

A notable advantage of the temporal expansion method proposed here is its flexibility with respect to the semantics of a query. For instance, the temporal prior method can account only for "recency queries," where the user seeks new information. However, our TExp approach assesses the similarity between the timestamps of documents retrieved by Q and those obtained from the pseudo-query of a document D. If Q retrieves documents clustered around a window of time in the past, TExp will reward documents whose pseudo-queries' results occupy the same window.

One issue that we have not addressed in this paper is the transformation of a document into a pseudo-query. In the interest of simplicity, we omitted any transformation other than removal of stopwords to improve retrieval speed. However, our results suggest room for improvement by a more thoughtful approach. In particular, we consider the difference between LExp and LExpλ on the DCC data evocative on this matter. Interpolation with the original document model helped retrieval to a great extent in this case, a fact emphasized by Figure 3. We suspect that this is due to the length of some DCC documents. Many of these documents are more verbose than, for example, tweets. These results suggest that longer documents do not lead to suitable pseudo-queries without some improvement.

8. CONCLUSION
We have proposed expanding document representations to improve retrieval effectiveness on corpora of very brief texts. Our contribution starts with the idea that short documents often make productive pseudo-queries. Because brief documents tend to discuss at most one topic, the retrieved set for a document D is informative with respect to making predictions about the relevance of D. We proposed two methods of document expansion based on analysis of document pseudo-queries' result sets. First, we augment the language model of D, obtaining an expanded language model D′. Second, we induce a model of the temporal affinity of D by analyzing the temporal profile of D's result set. We find that the likelihood that this density generated the timestamps retrieved by a query Q yields an effective mechanism for letting time inform ranking.

In future work we plan to address several issues raised by this study:

• Self-retrievability and document priors. We hypothesize that the extent to which D is distinguishable from other documents in its result set is an indicator of topical coherence. This intuition could inform a prior over documents.

• Use of longer documents. Though short documents stand to gain significantly from expansion, there is no reason the gains we have seen in this study would not carry over to more standard collections.

• Query expansion vs. document expansion. Studying more conventional corpora will help us assess the differences between query and document expansion more thoroughly. We also plan to compare our approach to other document expansion methods such as Berger and Lafferty's translation model [3].

• Other settings. Short documents arise in many IR applications. Searching for relevant advertisements and ranking product reviews are both domains that we believe could benefit from the approaches outlined here.

• Other expansion data. In future work we hope to examine the value of expansion methods for information such as geolocation data or user-created content.

• Scalable implementation. An efficient way to store and use document expansion data is crucial for our methods' viability in a production setting. In particular, future work will address the matter of changing documents' expansion data as the collection grows.

Despite these unexplored ideas, the results of this study show that document expansion is a compelling method for improving retrieval in corpora of short texts. As people's engagement with social media increases the pervasiveness of abbreviated documents, this problem will grow in importance. Considering short documents as pseudo-queries provides needed traction on this difficult task.

9. ACKNOWLEDGEMENTS
This work was supported by a Google academic research award and Institute of Museum and Library Services grant LG-06-07-0020. Expressed opinions are those of the authors, not the funding agencies. The project is hosted by the Center for Informatics Research in Science and Scholarship (CIRSS).

10. REFERENCES
[1] Omar Alonso, Michael Gertz, and Ricardo Baeza-Yates. On the value of temporal information in information retrieval. SIGIR Forum, 41:35–41, December 2007.
[2] Gianni Amati, Alessandro Celi, Cesidio Di Nicola, Michele Flammini, and Daniela Pavone. Improved stable retrieval in noisy collections. In Giambattista Amati and Fabio Crestani, editors, Advances in Information Retrieval Theory, volume 6931 of Lecture Notes in Computer Science, pages 342–345. Springer, Berlin, Heidelberg, 2011.
[3] Adam Berger and John Lafferty. Information retrieval as statistical translation. In SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 222–229, New York, NY, USA, 1999. ACM.
[4] Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis. Answering general time-sensitive queries. IEEE Transactions on Knowledge and Data Engineering, 24(2):220–235, 2012.
[5] Fernando Diaz and Rosie Jones. Using temporal profiles of queries for precision prediction. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 18–24, New York, NY, USA, 2004. ACM.
[6] Miles Efron. Information search and retrieval in microblogs. Journal of the American Society for Information Science and Technology, 62(6):996–1008, 2011.
[7] Miles Efron and Gene Golovchinsky. Estimation methods for ranking recent information. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR '11, pages 495–504, New York, NY, USA, 2011. ACM.
[8] Miles Efron, Peter Organisciak, and Katrina Fenlon. Building topic models in a federated digital library through selective document exclusion. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology, 2011.
[9] Inna Gelfer Kalmanovich and Oren Kurland. Cluster-based query expansion. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 646–647, New York, NY, USA, 2009. ACM.
[10] Giacomo Inches, Mark Carman, and Fabio Crestani. Investigating the statistical properties of user-generated documents. In Henning Christiansen, Guy De Tré, Adnan Yazici, Slawomir Zadrozny, Troels Andreasen, and Henrik Larsen, editors, Flexible Query Answering Systems, volume 7022 of Lecture Notes in Computer Science, pages 198–209. Springer, Berlin / Heidelberg, 2011.
[11] Rosie Jones and Fernando Diaz. Temporal profiles of queries. ACM Transactions on Information Systems, 25(3):14, 2007.
[12] Anna Khudyak Kozorovitsky and Oren Kurland. Cluster-based fusion of retrieved lists. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR '11, pages 893–902, New York, NY, USA, 2011. ACM.
[13] Oren Kurland and Lillian Lee. Clusters, language models, and ad hoc information retrieval. ACM Transactions on Information Systems, 27:13:1–13:39, May 2009.
[14] Oren Kurland, Lillian Lee, and Carmel Domshlak. Better than the real thing?: iterative pseudo-query processing using cluster-based language models. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pages 19–26, New York, NY, USA, 2005. ACM.
[15] Carl Lagoze and Herbert Van de Sompel. The Open Archives Initiative: building a low-barrier interoperability framework. In JCDL '01: Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries, pages 54–62, New York, NY, USA, 2001. ACM.
[16] Victor Lavrenko and W. Bruce Croft. Relevance based language models. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 120–127, New York, NY, USA, 2001. ACM.
[17] Xiaoyan Li and W. Bruce Croft. Time-based language models. In CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management, pages 469–475, New York, NY, USA, 2003. ACM.
[18] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '04, pages 186–193, New York, NY, USA, 2004. ACM.
[19] Kamran Massoudi, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. In Proceedings of the 33rd European conference on Advances in information retrieval, ECIR '11, pages 362–367, Berlin, Heidelberg, 2011. Springer-Verlag.
[20] Donald Metzler, Susan Dumais, and Christopher Meek. Similarity measures for short segments of text. In Proceedings of the 29th European conference on IR research, ECIR '07, pages 16–27, Berlin, Heidelberg, 2007. Springer-Verlag.
[21] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. Research and Development in Information Retrieval, pages 275–281, 1998.
[22] Haoliang Qi, Mu Li, Jianfeng Gao, and Sheng Li. Information retrieval for short documents. Journal of Electronics (China), 23:933–936, 2006. 10.1007/s11767-006-0044-2.
[23] Daniel Ramage, Susan Dumais, and Dan Liebling. Characterizing microblogs with topic models. In ICWSM, 2010.
[24] Nico Schlaefer, Jennifer Chu-Carroll, Eric Nyberg, James Fan, Wlodek Zadrozny, and David Ferrucci. Statistical source expansion for question answering. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pages 345–354, New York, NY, USA, 2011. ACM.
[25] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, Boca Raton, 1996.
[26] Tao Tao, Xuanhui Wang, Qiaozhu Mei, and Chengxiang Zhai. Language model information retrieval with document expansion. In Human Language Technology Conference of the North American Chapter of the ACL, pages 407–414, New York, 2006.
[27] Fulai Wang and Jim E. Greer. Retrieval of short documents from discussion forums. In Proceedings of the 15th Conference of the Canadian Society for Computational Studies of Intelligence on Advances in Artificial Intelligence, AI '02, pages 339–343, London, UK, 2002. Springer-Verlag.
[28] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.
