
What Makes a Query Difficult?
David Carmel, Elad Yom-Tov, Adam Darlow, Dan Pelleg
IBM Haifa Research Labs
SIGIR 2006



Outline
– Introduction
– A model for topic difficulty
– Validating the model
– Uses of the model
– Conclusion


Introduction
– Typical TREC topics for comparison between systems are defined by:
  – A textual description
  – A set of documents relevant to the information need
– Experimental results of TREC participants show a wide diversity in effectiveness among topics as well as among systems
– Robust track


Introduction
The goals of the TREC Robust track:
a. Encourage systems to decrease variance by focusing on poorly performing topics
b. Estimate the relative difficulty of each topic
c. Study whether an old and difficult topic is still difficult for current state-of-the-art IR systems
d. Study whether topics that are difficult in one collection are still difficult in another collection
Why are some topics more difficult than others?


Related Work
– Clarity measure for queries
– Linguistic features of the query
– Number of topic aspects
– Features of the entire collection
– Reliable Information Access (RIA) workshop
  – Ten failure categories were identified
  – Most failures are related to failing to identify all aspects of the topic


Outline
– Introduction
– A model for topic difficulty
– Validating the model
– Uses of the model
– Conclusion


Topic Difficulty Model
Components of a topic: the queries Q and the relevant documents R, dependent on the collection C:

  Topic = (Q, R | C)

1. d(Q, C) – the distance between the queries Q and the collection C
2. d(Q, Q) – the distance among the queries
3. d(R, C) – the distance between the relevant documents R and the collection C
4. d(R, R) – the distance among the relevant documents
5. d(Q, R) – the distance between the queries Q and the relevant documents R

[Figure 1: a general model for topic difficulty]
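A minimal sketch of how these five distances could be assembled into a feature vector, in Python. The distance function d (the JSD defined on the next slide) is passed in; averaging over pairs is an assumption, since the slide only names the five distances:

```python
from itertools import combinations
from statistics import mean

def topic_features(Q, R, C, d):
    """The five model-induced distances for a topic (Q, R | C).

    Q: term distributions of the queries, R: of the relevant documents,
    C: of the whole collection; d: a distance between two distributions
    (e.g. the JSD of the next slide).
    """
    def spread(xs):
        # Average pairwise distance within a set; 0 for a single element.
        pairs = list(combinations(xs, 2))
        return mean(d(a, b) for a, b in pairs) if pairs else 0.0

    return {
        "d(Q,C)": mean(d(q, C) for q in Q),             # queries vs. collection
        "d(Q,Q)": spread(Q),                            # distance among queries
        "d(R,C)": mean(d(r, C) for r in R),             # relevant docs vs. collection
        "d(R,R)": spread(R),                            # distance among relevant docs
        "d(Q,R)": mean(d(q, r) for q in Q for r in R),  # queries vs. relevant docs
    }
```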


Distance Measure
– Jensen-Shannon divergence (JSD)
  – A symmetric version of the Kullback-Leibler divergence (KLD)
  – Applied to d(Q, C), d(R, C) and d(Q, R)
  – For distributions P(w) and Q(w) over the words w ∈ W in the collection, the JSD is:

    M(w) = \tfrac{1}{2}\,(P(w) + Q(w))

    D_{KL}(P \,\|\, M) = \sum_{w \in W} P(w) \log \frac{P(w)}{M(w)}

    D_{JS}(P \,\|\, Q) = \tfrac{1}{2}\left( D_{KL}(P \,\|\, M) + D_{KL}(Q \,\|\, M) \right)
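A minimal Python sketch of this computation, assuming distributions are plain word→probability dicts (the slide does not fix a log base; natural log is used here):

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two term distributions.

    p, q: dicts mapping word -> probability, each summing to 1;
    a word absent from a dict has probability 0.
    """
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}

    def kld(a):
        # Kullback-Leibler divergence D_KL(a || m); 0 * log(...) terms drop out.
        return sum(pw * math.log(pw / m[w]) for w, pw in a.items() if pw > 0)

    return 0.5 * (kld(p) + kld(q))
```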


Distribution of Terms
– The probability distribution of a word w within the document or query x:

    P(w \mid x) = \lambda \cdot \frac{n_w}{\sum_{w'} n_{w'}} + (1 - \lambda)\, P_c(w)

  where n_w is the number of occurrences of w in x and P_c(w) is the probability of w in the collection
– λ = 0.9 for d(Q, Q), d(Q, R) and d(R, R)
– λ = 0.99 for d(Q, C) and d(R, C)
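A sketch of this Jelinek-Mercer-style linear smoothing, assuming the collection model P_c(w) is precomputed:

```python
from collections import Counter

def smoothed_distribution(text_tokens, collection_prob, lam):
    """Smoothed term distribution P(w | x) for a document or query x.

    text_tokens: list of tokens in x
    collection_prob: dict word -> P_c(w), the collection language model
    lam: 0.9 for d(Q,Q), d(Q,R), d(R,R); 0.99 for d(Q,C), d(R,C)
    """
    counts = Counter(text_tokens)
    total = sum(counts.values())
    words = set(counts) | set(collection_prob)
    return {
        w: lam * counts.get(w, 0) / total + (1 - lam) * collection_prob.get(w, 0.0)
        for w in words
    }
```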


Topic Aspects and Topic Broadness
– The aspect coverage problem is to find documents that cover as many different aspects as possible
  – Providing more information to the user
– In the model, topic broadness (difficulty) is measured by the distance d(R, R)
– Plain JSD suffers from the drawback that identical documents are very close together
– Instead, topic aspects are used to measure d(R, R):
  – The number of clusters of the relevant documents
  – With the square root of the JSD as the distance measure between documents (see the sketch below)
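One way to realize this cluster count, sketched with SciPy's agglomerative clustering; the linkage method and the cut threshold are assumptions, since the slide does not specify a clustering algorithm:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def count_aspect_clusters(doc_dists, jsd, threshold=0.5):
    """Estimate d(R, R) as the number of clusters among relevant docs.

    doc_dists: list of term distributions (dicts) of the relevant documents
    jsd: the Jensen-Shannon divergence from the earlier sketch
    threshold: hypothetical cut distance; the slide does not fix one
    """
    n = len(doc_dists)
    # Pairwise sqrt(JSD) distances (sqrt(JSD) is a proper metric).
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = np.sqrt(jsd(doc_dists[i], doc_dists[j]))
    # Agglomerative clustering on the condensed distance matrix.
    Z = linkage(squareform(dist), method="average")
    labels = fcluster(Z, t=threshold, criterion="distance")
    return len(set(labels))
```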


Document Coverage and Query Coverage
– Rarely does the information pertaining to both facets of the model exist
– When only Q or R is available, the missing part is approximated by minimizing the JSD:
  – Document coverage (DC):  DC(Q) = \arg\min_{R'} D_{JS}(Q \,\|\, R')
  – Query coverage (QC):  QC(R) = \arg\min_{Q'} D_{JS}(Q' \,\|\, R)
  where R' ranges over sets of documents and Q' over sets of query terms


Practical Considerations for Document Coverage
– Computing document coverage for a given query is NP-hard, hence an approximation:
  – Only the top 100 documents retrieved for the query are considered
  – A greedy algorithm (sketched below):
    – The document closest to the query is found
    – Documents are then added iteratively, each time choosing the one that causes the largest decrease in the JSD between the query and the selected documents
    – Once a minimum is reached, the JSD value is measured and the set of accumulated documents is used as an approximation of the true DC set

[Figure 2: A typical JSD curve obtained by the greedy algorithm for document coverage detection]
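A sketch of this greedy procedure; uniform_mixture is an assumed helper for treating the selected document set as a single distribution, which the slide leaves implicit:

```python
def uniform_mixture(dists):
    """Average a list of term distributions (an assumed representation
    of a document set as one distribution)."""
    words = set().union(*dists)
    return {w: sum(d.get(w, 0.0) for d in dists) / len(dists) for w in words}

def greedy_document_coverage(query_dist, candidate_docs, jsd):
    """Greedy approximation of the document-coverage set DC(Q).

    query_dist: the query's term distribution
    candidate_docs: term distributions of the top-100 retrieved documents
    """
    remaining = list(candidate_docs)
    # Start from the single document closest to the query.
    first = min(remaining, key=lambda d: jsd(query_dist, d))
    selected, best = [first], jsd(query_dist, first)
    remaining.remove(first)

    while remaining:
        # Add the document giving the largest decrease in JSD.
        cand = min(remaining,
                   key=lambda d: jsd(query_dist, uniform_mixture(selected + [d])))
        score = jsd(query_dist, uniform_mixture(selected + [cand]))
        if score >= best:
            break            # the JSD curve reached its minimum: stop
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected, best    # approximate DC set and its JSD to the query
```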


Practical Considerations for Query Coverage
– Query coverage for given relevant documents:
  – Only the terms belonging to R are considered by the greedy algorithm
  – The iterative process results in a list of ranked words: the most representative words (see the sketch below)
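A sketch of the analogous greedy term selection; make_query_dist is a hypothetical helper (e.g. the smoothed distribution above), since the slide leaves the query representation implicit:

```python
def greedy_query_coverage(rel_dist, vocabulary, jsd, make_query_dist, k=10):
    """Greedy approximation of QC(R), yielding a ranked word list.

    rel_dist: term distribution of the relevant documents R
    vocabulary: candidate terms (only terms occurring in R are considered)
    make_query_dist: hypothetical helper turning a term list into a query
                     term distribution
    """
    chosen, remaining = [], set(vocabulary)
    while remaining and len(chosen) < k:
        # Add the term that brings the query distribution closest to R.
        best = min(remaining,
                   key=lambda t: jsd(make_query_dist(chosen + [t]), rel_dist))
        chosen.append(best)
        remaining.remove(best)
    return chosen  # ranked from most to least representative
```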


Outline
– Introduction
– A model for topic difficulty
– Validating the model
– Uses of the model
– Conclusion


Experiment Environment
– Search engine: Juru
– Topics: the 100 topics of the TREC 2004 and 2005 Terabyte tracks
– Document collection: .GOV2 (25 million docs)


Model-Induced Distances vs. Average Precision

Table 1: Comparison of Pearson and Spearman correlation coefficients between the different distances induced by the topic difficulty model and the AP of the 100 topics

                Juru's AP                 TREC median AP
  Distance    Pearson  Spearman's ρ    Pearson  Spearman's ρ
  d(Q, C)      0.167      0.170         0.298      0.292
  d(R, C)      0.322      0.290         0.331      0.323
  d(Q, R)     -0.065     -0.134        -0.019      0.004
  d(R, R)     +0.150      0.141         0.119      0.155
  Combined     0.447                    0.476
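For reference, both coefficients are available in SciPy; the arrays below are toy stand-ins for the 100 per-topic values:

```python
from scipy.stats import pearsonr, spearmanr

# Toy stand-ins: per-topic model distance and per-topic average precision.
d_qc = [0.41, 0.52, 0.38, 0.61, 0.47]
ap   = [0.18, 0.33, 0.12, 0.40, 0.25]

r, _ = pearsonr(d_qc, ap)      # linear correlation
rho, _ = spearmanr(d_qc, ap)   # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```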


Model-Induced Distances vs. Topic Aspect Coverage
– Topic aspect coverage: the average precision of the top-ranked document for each aspect

Table 2: Correlations between the different distances and the aspect coverage

                Juru's AP
  Distance    Pearson  Spearman's ρ
  d(Q, C)      0.047      0.047
  d(R, C)      0.143      0.194
  d(Q, R)     -0.271     -0.285
  d(R, R)     -0.364     -0.418
  Combined     0.482


Outline
– Introduction
– A model for topic difficulty
– Validating the model
– Uses of the model
– Conclusion


Uses of the Model
– Estimating query average precision
– Estimating topic aspect coverage
– Estimating topic findability
  – The likelihood of documents in the domain (topic) being returned as answers to queries related to the domain


Estimating Average Precision
– R' is an approximation of the set of relevant documents, obtained by approximating document coverage
– d(Q, C), d(Q, R') and d(R', C) are used as features for a Support Vector Machine (SVM)
  – Leave-one-out cross-validation
– The Pearson correlation between the actual average precision and the predicted average precision is 0.362 (see the sketch below)


Estimating Aspect Coverage
– The same approach as for estimating average precision
– The Pearson correlation between the actual aspect coverage and the predicted one is 0.397
– The same features are also used to train an estimator that detects low-coverage (<10%) queries, as sketched below

[Figure 3: Receiver operating characteristic (ROC) curve for distinguishing queries with low aspect coverage from other queries; x-axis: P(decide low coverage | high coverage), y-axis: P(decide low coverage | low coverage). The area under the curve is 0.88]
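A sketch of the low-coverage detector; the classifier settings are assumptions, as the slide reports only the resulting ROC curve (AUC 0.88):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def low_coverage_auc(features, aspect_coverage, threshold=0.10):
    """Detect low aspect-coverage queries from the same distance features.

    Queries with coverage below `threshold` are the positive class; a
    leave-one-out SVM scores each query and ROC AUC summarizes quality.
    """
    X = np.asarray(features)
    y = (np.asarray(aspect_coverage) < threshold).astype(int)
    scores = np.empty(len(y), dtype=float)
    for train, test in LeaveOneOut().split(X):
        clf = SVC(kernel="rbf", probability=True).fit(X[train], y[train])
        scores[test] = clf.predict_proba(X[test])[:, 1]
    return roc_auc_score(y, scores)
```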


Estimating Topic Findability
– Given a set of documents of a domain, findability represents how easy it is for a user to find these documents
  – Related to the field of search engine optimization
– For each topic, the 10 best words are selected from the result of the query-coverage approximation
  – A sequence of queries is built: the best one word, the best two words, and so on
  – For each topic, the resulting values of AP against the number of terms are its features for K-means clustering (see the sketch below)


Results of Estimating Topic Findability

[Figure 4: Cluster centers of the AP curves versus the number of best words. The curves represent three typical findability behaviors]


Outline
– Introduction
– A model for topic difficulty
– Validating the model
– Uses of the model
– Conclusion


Conclusion
– A novel topic difficulty model is proposed; it captures the main components of a topic and relates those components to topic difficulty
– The larger the distance of the queries and the Qrels from the entire collection, the better the topic can be answered
– The applicability of the difficulty model is demonstrated
– Further features affecting topic difficulty are left for future research, e.g. ambiguity of the query terms, or topics with missing content