Predicting Question Quality
Bruce Croft and Stephen Cronen-Townsend
University of Massachusetts Amherst
Topics
- Clarity applied to TREC QA questions
- Clarity applied to Web questions
- Clarity used to predict query expansion
- Actually predicting quality of retrieved passages (or documents)
Basic result: We can predict retrieval performance (with some qualifications)
- Works well on TREC ad-hoc queries
- Can set thresholds automatically
- Works with most TREC QA question classes
For example:
- “Where was Tesla born?” (clarity score 3.57)
- “What is sake?” (clarity score 1.28)
Predicting Question Quality
[Figure: log P of terms under the question language model (estimated from the top 5 passages) vs. the TREC-9 collection model; log probabilities range from -8 to 0]
Clarity score computation

[Diagram: question Q (text) → retrieve → passages A ranked by P(A|Q) → estimate question-related language model → compute divergence from the collection model → clarity score]
[Figure: log P vs. terms for “Where was Tesla born?”; highlighted terms include “nikola”, “tesla”, “born”, “yugoslavia”, “unit”, and “film”]
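In code, the pipeline above reduces to a few steps: estimate a question model from the top-ranked passages, smooth each passage model with the collection, and take the KL divergence from the collection model. A minimal Python sketch; all names are hypothetical, the retrieval weights are assumed normalized, and the smoothing weight lam is an illustrative choice, not a value from the talk:

```python
import math
from collections import Counter

def clarity_score(ranked_passages, p_passage_given_q, collection_model, lam=0.6):
    """Clarity of a question given its ranked passages.

    ranked_passages   : list of token lists, ranked by P(A|Q)
    p_passage_given_q : P(A|Q) for each passage (assumed to sum to 1)
    collection_model  : term -> P(term | collection)
    lam               : Jelinek-Mercer smoothing weight (illustrative)
    """
    # Question-related language model: P(w|Q) = sum_A P(w|A) * P(A|Q),
    # where each passage model P(w|A) is smoothed with the collection model.
    question_model = Counter()
    for passage, p_a in zip(ranked_passages, p_passage_given_q):
        counts, length = Counter(passage), len(passage)
        for w in collection_model:
            p_w_a = lam * counts[w] / length + (1 - lam) * collection_model[w]
            question_model[w] += p_w_a * p_a

    # Clarity = KL divergence (in bits) between question and collection models.
    return sum(p * math.log2(p / collection_model[w])
               for w, p in question_model.items() if p > 0)

# Toy usage with a three-term vocabulary (illustrative numbers only).
coll = {"tesla": 0.01, "born": 0.02, "film": 0.97}
passages = [["tesla", "born"], ["tesla", "tesla", "born"]]
print(clarity_score(passages, [0.7, 0.3], coll))
```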
Predicting Ad-Hoc Performance
Correlations with Av. Precision for TREC Queries
[Figure: Av. Precision vs. Clarity for 100 TREC title queries; optimal and automatic threshold values shown]
Collection  Queries  Num.  R      P-Value
AP88+89     101-200  100   0.368  1.2 · 10^-4
TREC-4      201-250   50   0.490  3.0 · 10^-4
TREC-5      251-300   50   0.459  6.5 · 10^-4
TREC-7      351-400   50   0.577  2.7 · 10^-5
TREC-8      401-450   50   0.494  2.7 · 10^-4
TREC-7+8    351-450  100   0.536  4.8 · 10^-8
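Given per-query clarity scores and average precision values, numbers like the R and P-Value columns can be reproduced with a standard rank-correlation test (the per-type table later labels R a rank correlation; Spearman's R is assumed here). The arrays below are hypothetical stand-ins for the per-query data:

```python
from scipy.stats import spearmanr

# Hypothetical per-query data: (clarity score, average precision) pairs.
clarity  = [3.57, 1.28, 2.10, 0.95, 2.80]
avg_prec = [0.62, 0.10, 0.35, 0.05, 0.48]

r, p_value = spearmanr(clarity, avg_prec)  # rank correlation and its P-value
print(f"R = {r:.3f}, P = {p_value:.3g}")
```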
Passage-Based Clarity
Passages:
- Whole sentence based, 250 character maximum (sketched below)
- From top retrieved docs
- Passage models smoothed with all of TREC-9
Measuring performance:
- Average precision (rather than MRR)
- Top ranked passages used to estimate clarity scores
- Top 100 gives 99% of max correlation
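As a concrete reading of the passage definition above, here is a minimal sketch; the greedy sentence-packing rule is an assumption, since the talk specifies only whole sentences and a 250-character maximum:

```python
def make_passages(sentences, max_chars=250):
    # Pack whole sentences greedily into passages of at most max_chars
    # characters; a single sentence longer than max_chars becomes its
    # own (over-long) passage.
    passages, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            passages.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        passages.append(current)
    return passages
```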
Correlation by Question Type

Question Type  # of Qs  Rank Correlation (R)  P-Value
Amount            35    0.171                 0.16
Famous            76    0.148                 0.10
Location         100    0.308                 0.0011
Person            90    0.245                 0.010
Time              48    0.350                 0.0082
Miscellaneous    139    0.266                 0.00090
- Strong on average (R = 0.255, P = 10^-8)
- Allows prediction of question performance
- Challenging cases: Amount and Famous
- General comments on difficulty:
  - Questions have been preselected to be good questions for the TREC QA track
  - Questions are less ambiguous in general than short queries
Correlation Analysis

[Figure: Average Precision (0 to 1) vs. Clarity Score (1 to 4) for Location questions; labeled points include “Where was Tesla born?”, “Where is Venezuela?”, “What is the location of Rider College?”, and “What was Poe’s birthplace?”]
Predictive Mistakes

High clarity, low ave. prec.: answerless, coherent context
- “What was Poe’s birthplace?”: “birthplace” and “Poe” do not co-occur, so candidate passages are bad
- The variant “Where was Poe born?” performs well and predicts well

Low clarity, high ave. prec.: very rare, often few correct passages
- “What is the location of Rider College?”: one passage contains the correct answer, which cannot increase language coherence among the retrieved passages
- That passage is ranked first, so average precision is 1
[Figure: Ave. Precision vs. Clarity Score for the mistake cases above]
Challenging Types: Famous

- “Who is Zebulon Pike?”: many correct answers decrease the clarity of a good ranked list
- “Define thalassemia.”: passages using the term are highly coherent, but often do not define it

[Figure: Average Precision (0 to 1) vs. Clarity Score (0 to 6); labeled points: “Who is Zebulon Pike?”, “Define thalassemia.”]
Web Experiments
- 445 well-formed questions randomly chosen from the Excite log
- WT10g test collection
- Human predicted values of quality
- “Where can I purchase an inexpensive computer?”: clarity 0.89, human predicted ineffective
- “Where can I find the lyrics to Eleanor Rigby?”: clarity 8.08, human predicted effective
Result: Clarity scores are significantly correlated with human predictions
Distribution of Clarity Scores
Class                  Number  Average Clarity  P-Value
Predicted effective       223  2.03             0.00026
Predicted ineffective     222  1.81             0.00020
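The talk does not name the significance test behind these P-values. As one standard choice for checking that the predicted-effective class scores higher, a nonparametric two-sample test could be run on the two groups' clarity scores; the lists below are hypothetical stand-ins for the 223 and 222 real values:

```python
from scipy.stats import mannwhitneyu

# Hypothetical clarity scores for the two human-labeled classes.
predicted_effective   = [2.3, 1.9, 2.5, 2.0, 1.7]
predicted_ineffective = [1.6, 1.9, 1.4, 2.1, 1.5]

# One-sided test: do predicted-effective questions score higher?
stat, p = mannwhitneyu(predicted_effective, predicted_ineffective,
                       alternative="greater")
print(f"U = {stat}, P = {p:.3g}")
```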
Predicting When to Expand Questions
- Best simple strategy: always use expanded questions (e.g. always use relevance model retrieval)
- But some questions do not work well when expanded (an NRRC workshop is looking at this)
- Can clarity scores be used to predict which?
  - Initial idea: “Do ambiguous queries get worse when expanded?” Not always.
  - New idea: perform the expansion retrieval. “Can we use a modified clarity score to guess if the expansion helped?” Yes.
Using Clarity to Predict Expansion
- Evaluated using TREC ad-hoc data
- Choice: query-likelihood retrieval or relevance model retrieval
- Ranked-list clarity: measure the coherence of the ranked list
  - Mix documents according to their rank alone
  - For example: top 600 documents, linearly decreasing weights
- Compute the improvement in ranked-list clarity scores (see the sketch below)
  - First thought: if the difference is positive, choose the relevance model results
  - Best thought: if the difference is higher than some threshold, choose the relevance model results
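A minimal sketch of ranked-list clarity and the threshold rule, reusing clarity_score from the earlier sketch; the document weighting follows the top-600 linearly decreasing scheme above, while the 0.1 threshold is purely illustrative:

```python
def ranked_list_clarity(ranked_docs, collection_model, n=600, lam=0.6):
    # Mix documents by rank alone: top n docs with linearly decreasing
    # weights, normalized to a distribution over documents.
    docs = ranked_docs[:n]
    weights = [n - i for i in range(len(docs))]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Rank-based weights stand in for P(A|Q) in the clarity computation.
    return clarity_score(docs, weights, collection_model, lam)

def choose_results(ql_list, rm_list, collection_model, threshold=0.1):
    # Keep the relevance-model ranking only when it improves ranked-list
    # clarity by more than the threshold (threshold value is illustrative).
    delta = (ranked_list_clarity(rm_list, collection_model)
             - ranked_list_clarity(ql_list, collection_model))
    return rm_list if delta > threshold else ql_list
```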
Clarity and Expansion Results
- Choosing expansion using this method produces 51% of the optimal improvement for TREC-8
- Choosing when to expand has more impact in TREC-8, where expanded query performance is more mixed (only marginally better, on average, than unexpanded)
- In TREC-7, only 4 queries perform really badly with the relevance model, and the clarity method predicts 2 of them
Collection  Baseline LM  Relevance Model  Clarity Prediction (predict best)  Optimal (choose best)
TREC-7      0.188        0.237            0.239                              0.262
TREC-8      0.247        0.248            0.269                              0.289
Predicting Expansion Improvements

[Figure: Change in Ave. Precision (-0.3 to 0.4) vs. Original Clarity (-0.3 to 3.7); labeled queries: “killer bee attacks”, “Legionnaires disease”, “tourists, violence”, “women clergy”, “Stirling engine”, “cosmic events”]
Predicting Expansion Improvements

[Figure: Change in Ave. Precision (-0.3 to 0.4) vs. Change in Clarity (new ranked list minus old, -0.3 to 0.9); labeled queries: “killer bee attacks”, “Legionnaires disease”, “tourists, violence”, “women clergy”, “Stirling engine”, “cosmic events”]
Future Work
- Continue expansion experiments with queries and questions
- Understand the role of the corpus
  - predicting when coverage is inadequate
  - more experiments on the Web and heterogeneous collections
- Provide a Clarity tool
  - user interface, or data for a QA system?
  - efficiency
- Better measures...