Predicting Question Quality
Bruce Croft and Stephen Cronen-Townsend
University of Massachusetts Amherst
Topics
- Clarity applied to TREC QA questions
- Clarity applied to Web questions
- Clarity used to predict query expansion
- Actually predicting quality of retrieved passages (or documents)
Basic result: We can predict retrieval performance (with some qualifications)
- Works well on TREC ad-hoc queries
- Can set thresholds automatically
- Works with most TREC QA question classes
For example:
- “Where was Tesla born?” (clarity score 3.57)
- “What is sake?” (clarity score 1.28)
Predicting Question Quality
[Figure: log P of terms under the question language model (estimated from the top 5 passages) vs. the TREC-9 collection model; log probabilities range from -8 to 0]
Clarity score computation

[Diagram: question Q (text) → retrieve → passages A ranked by P(A|Q) → estimate question-related language model → compute divergence from the collection model → clarity score]
[Figure: log P vs. terms for “Where was Tesla born?”; highlighted terms include “nikola”, “tesla”, “born”, “yugoslavia”, “unit”, and “film”]
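In code, the pipeline above reduces to a few steps: estimate a question model from the top-ranked passages, smooth each passage model with the collection, and take the KL divergence from the collection model. A minimal Python sketch; all names are hypothetical, the retrieval weights are assumed normalized, and the smoothing weight lam is an illustrative choice, not a value from the talk:

```python
import math
from collections import Counter

def clarity_score(ranked_passages, p_passage_given_q, collection_model, lam=0.6):
    """Clarity of a question given its ranked passages.

    ranked_passages   : list of token lists, ranked by P(A|Q)
    p_passage_given_q : P(A|Q) for each passage (assumed to sum to 1)
    collection_model  : term -> P(term | collection)
    lam               : Jelinek-Mercer smoothing weight (illustrative)
    """
    # Question-related language model: P(w|Q) = sum_A P(w|A) * P(A|Q),
    # where each passage model P(w|A) is smoothed with the collection model.
    question_model = Counter()
    for passage, p_a in zip(ranked_passages, p_passage_given_q):
        counts, length = Counter(passage), len(passage)
        for w in collection_model:
            p_w_a = lam * counts[w] / length + (1 - lam) * collection_model[w]
            question_model[w] += p_w_a * p_a

    # Clarity = KL divergence (in bits) between question and collection models.
    return sum(p * math.log2(p / collection_model[w])
               for w, p in question_model.items() if p > 0)

# Toy usage with a three-term vocabulary (illustrative numbers only).
coll = {"tesla": 0.01, "born": 0.02, "film": 0.97}
passages = [["tesla", "born"], ["tesla", "tesla", "born"]]
print(clarity_score(passages, [0.7, 0.3], coll))
```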
Predicting Ad-Hoc Performance
Correlations with Av. Precision for TREC Queries
[Figure: Av. Precision vs. Clarity for 100 TREC title queries; optimal and automatic threshold values shown]
Collection  Queries  Num.  R      P-Value
AP88+89     101-200  100   0.368  1.2 · 10^-4
TREC-4      201-250   50   0.490  3.0 · 10^-4
TREC-5      251-300   50   0.459  6.5 · 10^-4
TREC-7      351-400   50   0.577  2.7 · 10^-5
TREC-8      401-450   50   0.494  2.7 · 10^-4
TREC-7+8    351-450  100   0.536  4.8 · 10^-8
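Given per-query clarity scores and average precision values, numbers like the R and P-Value columns can be reproduced with a standard rank-correlation test (the per-type table later labels R a rank correlation; Spearman's R is assumed here). The arrays below are hypothetical stand-ins for the per-query data:

```python
from scipy.stats import spearmanr

# Hypothetical per-query data: (clarity score, average precision) pairs.
clarity  = [3.57, 1.28, 2.10, 0.95, 2.80]
avg_prec = [0.62, 0.10, 0.35, 0.05, 0.48]

r, p_value = spearmanr(clarity, avg_prec)  # rank correlation and its P-value
print(f"R = {r:.3f}, P = {p_value:.3g}")
```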
Passage-Based Clarity
Passages:
- Whole sentence based, 250 character maximum (sketched below)
- From top retrieved docs
- Passage models smoothed with all of TREC-9
Measuring performance:
- Average precision (rather than MRR)
- Top ranked passages used to estimate clarity scores
- Top 100 gives 99% of max correlation
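As a concrete reading of the passage definition above, here is a minimal sketch; the greedy sentence-packing rule is an assumption, since the talk specifies only whole sentences and a 250-character maximum:

```python
def make_passages(sentences, max_chars=250):
    # Pack whole sentences greedily into passages of at most max_chars
    # characters; a single sentence longer than max_chars becomes its
    # own (over-long) passage.
    passages, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            passages.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        passages.append(current)
    return passages
```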
Correlation by Question Type

Question Type  # of Qs  Rank Correlation (R)  P-Value
Amount            35    0.171                 0.16
Famous            76    0.148                 0.10
Location         100    0.308                 0.0011
Person            90    0.245                 0.010
Time              48    0.350                 0.0082
Miscellaneous    139    0.266                 0.00090
- Strong on average (R = 0.255, P = 10^-8)
- Allows prediction of question performance
- Challenging cases: Amount and Famous
- General comments on difficulty:
  - Questions have been preselected to be good questions for the TREC QA track
  - Questions are less ambiguous in general than short queries
Correlation Analysis

[Figure: Average Precision (0 to 1) vs. Clarity Score (1 to 4) for Location questions; labeled points include “Where was Tesla born?”, “Where is Venezuela?”, “What is the location of Rider College?”, and “What was Poe’s birthplace?”]
Predictive Mistakes

High clarity, low ave. prec.: answerless, coherent context
- “What was Poe’s birthplace?”: “birthplace” and “Poe” do not co-occur, so candidate passages are bad
- The variant “Where was Poe born?” performs well and predicts well

Low clarity, high ave. prec.: very rare, often few correct passages
- “What is the location of Rider College?”: one passage contains the correct answer, which cannot increase language coherence among the retrieved passages
- That passage is ranked first, so average precision is 1
[Figure: Ave. Precision vs. Clarity Score for the mistake cases above]
Challenging Types: Famous

- “Who is Zebulon Pike?”: many correct answers decrease the clarity of a good ranked list
- “Define thalassemia.”: passages using the term are highly coherent, but often do not define it

[Figure: Average Precision (0 to 1) vs. Clarity Score (0 to 6); labeled points: “Who is Zebulon Pike?”, “Define thalassemia.”]
Web Experiments
- 445 well-formed questions randomly chosen from the Excite log
- WT10g test collection
- Human predicted values of quality
- “Where can I purchase an inexpensive computer?”: clarity 0.89, human predicted ineffective
- “Where can I find the lyrics to Eleanor Rigby?”: clarity 8.08, human predicted effective
Result: Clarity scores are significantly correlated with human predictions
Distribution of Clarity Scores
Class                  Number  Average Clarity  P-Value
Predicted effective       223  2.03             0.00026
Predicted ineffective     222  1.81             0.00020
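The talk does not name the significance test behind these P-values. As one standard choice for checking that the predicted-effective class scores higher, a nonparametric two-sample test could be run on the two groups' clarity scores; the lists below are hypothetical stand-ins for the 223 and 222 real values:

```python
from scipy.stats import mannwhitneyu

# Hypothetical clarity scores for the two human-labeled classes.
predicted_effective   = [2.3, 1.9, 2.5, 2.0, 1.7]
predicted_ineffective = [1.6, 1.9, 1.4, 2.1, 1.5]

# One-sided test: do predicted-effective questions score higher?
stat, p = mannwhitneyu(predicted_effective, predicted_ineffective,
                       alternative="greater")
print(f"U = {stat}, P = {p:.3g}")
```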
Predicting When to Expand Questions
- Best simple strategy: always use expanded questions (e.g. always use relevance model retrieval)
- But some questions do not work well when expanded (an NRRC workshop is looking at this)
- Can clarity scores be used to predict which?
  - Initial idea: “Do ambiguous queries get worse when expanded?” Not always.
  - New idea: perform the expansion retrieval. “Can we use a modified clarity score to guess if the expansion helped?” Yes.
Using Clarity to Predict Expansion
- Evaluated using TREC ad-hoc data
- Choice: query-likelihood retrieval or relevance model retrieval
- Ranked-list clarity: measure the coherence of the ranked list
  - Mix documents according to their rank alone
  - For example: top 600 documents, linearly decreasing weights
- Compute the improvement in ranked-list clarity scores (see the sketch below)
  - First thought: if the difference is positive, choose the relevance model results
  - Best thought: if the difference is higher than some threshold, choose the relevance model results
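A minimal sketch of ranked-list clarity and the threshold rule, reusing clarity_score from the earlier sketch; the document weighting follows the top-600 linearly decreasing scheme above, while the 0.1 threshold is purely illustrative:

```python
def ranked_list_clarity(ranked_docs, collection_model, n=600, lam=0.6):
    # Mix documents by rank alone: top n docs with linearly decreasing
    # weights, normalized to a distribution over documents.
    docs = ranked_docs[:n]
    weights = [n - i for i in range(len(docs))]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Rank-based weights stand in for P(A|Q) in the clarity computation.
    return clarity_score(docs, weights, collection_model, lam)

def choose_results(ql_list, rm_list, collection_model, threshold=0.1):
    # Keep the relevance-model ranking only when it improves ranked-list
    # clarity by more than the threshold (threshold value is illustrative).
    delta = (ranked_list_clarity(rm_list, collection_model)
             - ranked_list_clarity(ql_list, collection_model))
    return rm_list if delta > threshold else ql_list
```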
Clarity and Expansion Results
- Choosing expansion using this method produces 51% of the optimal improvement for TREC-8
- Choosing when to expand has more impact in TREC-8, where expanded query performance is more mixed (only marginally better, on average, than unexpanded)
- In TREC-7, only 4 queries perform really badly with the relevance model, and the clarity method predicts 2 of them
Collection  Baseline LM  Relevance Model  Clarity Prediction (predict best)  Optimal (choose best)
TREC-7      0.188        0.237            0.239                              0.262
TREC-8      0.247        0.248            0.269                              0.289
Predicting Expansion Improvements

[Figure: Change in Ave. Precision (-0.3 to 0.4) vs. Original Clarity (-0.3 to 3.7); labeled queries: “killer bee attacks”, “Legionnaires disease”, “tourists, violence”, “women clergy”, “Stirling engine”, “cosmic events”]
Predicting Expansion Improvements

[Figure: Change in Ave. Precision (-0.3 to 0.4) vs. Change in Clarity (new ranked list minus old, -0.3 to 0.9); labeled queries: “killer bee attacks”, “Legionnaires disease”, “tourists, violence”, “women clergy”, “Stirling engine”, “cosmic events”]
Future Work
- Continue expansion experiments with queries and questions
- Understand the role of the corpus
  - predicting when coverage is inadequate
  - more experiments on the Web and heterogeneous collections
- Provide a Clarity tool
  - user interface, or data for a QA system?
  - efficiency
- Better measures...