This paper describes the participation of the uc3m team in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality-control mechanism at the task level based on a simple reading comprehension test. For the second task we also submitted three runs: one with a stepwise execution of the GetAnotherLabel algorithm and two others with a rule-based and an SVM-based model. According to the NIST gold labels, our runs performed very well in both tasks, ranking at the top for most measures.
In a Nutshell
3 runs on Amazon Mechanical Turk, using External HITs.
One HIT for each set of 5 documents = 435 HITs (2,175 judgments).
$0.20 per HIT = $0.04 per document.
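The budget figures above can be sanity-checked with a little arithmetic. A minimal sketch; the 10% fee rate is an assumption about Amazon's commission at the time, chosen because it reproduces the reported total of $95.7 per run:

```python
# Sanity check of the crowdsourcing budget.
# fee_rate is an assumption (MTurk's commission), not stated in the slides.
hits = 435
docs_per_hit = 5
pay_per_hit = 0.20
fee_rate = 0.10

judgments = hits * docs_per_hit            # 435 * 5 = 2175 judgments
cost_per_doc = pay_per_hit / docs_per_hit  # $0.20 / 5 = $0.04 per document
total_cost = hits * pay_per_hit * (1 + fee_rate)

print(judgments, cost_per_doc, round(total_cost, 2))
```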
Run 3: Stepwise execution of the GetAnotherLabel algorithm.
Hypothesis: workers who are bad on one type of topic are not necessarily bad on others.
For each worker w_i, compute the expected quality q_i over all topics and the quality q_ij on each topic type t_j. For topics of type t_j, use only workers with q_ij > q_i.
Topic categorization: TREC category (closed, advice, navigational, etc.), topic subject (politics, shopping, etc.), and rarity of the topic words.
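The filtering rule above can be sketched as follows; the function name, worker IDs, and quality values are hypothetical, and the qualities would come from an estimator such as GetAnotherLabel:

```python
from collections import defaultdict

def workers_for_type(q_overall, q_by_type):
    """Run 3's filtering rule: for each topic type t_j, keep only workers
    whose quality on that type beats their overall quality (q_ij > q_i).
    q_overall: {worker: q_i}; q_by_type: {(worker, type): q_ij}."""
    keep = defaultdict(set)
    for (worker, ttype), q_ij in q_by_type.items():
        if q_ij > q_overall[worker]:
            keep[ttype].add(worker)
    return keep

# Hypothetical quality estimates:
q = {"w1": 0.70, "w2": 0.60}
q_t = {("w1", "navigational"): 0.80, ("w1", "advice"): 0.50,
       ("w2", "navigational"): 0.55, ("w2", "advice"): 0.90}
# w1's labels are used for navigational topics, w2's for advice topics.
print(dict(workers_for_type(q, q_t)))
```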
Runs 1 & 2: Train rule-based and SVM-based ML models. Features:
• Worker confusion matrix from GetAnotherLabel:
  • For all workers, average posterior probability of relevant/nonrelevant
  • For all workers, average correct-to-incorrect ratio when saying relevant or not
• For the document, relevant-to-nonrelevant ratio
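A minimal sketch of the last two kinds of features; the label names ('rel'/'non'), function names, and the row-normalized layout of the confusion matrix are assumptions, not the paper's exact representation:

```python
def confusion_features(conf, eps=1e-9):
    """Correct-to-incorrect ratios from a worker's row-normalized confusion
    matrix conf[says][truth]: how often the worker is right vs. wrong
    when saying 'relevant' and when saying 'nonrelevant'."""
    when_rel = conf["rel"]["rel"] / (conf["rel"]["non"] + eps)
    when_non = conf["non"]["non"] / (conf["non"]["rel"] + eps)
    return when_rel, when_non

def doc_label_ratio(labels):
    """Relevant-to-nonrelevant ratio among the labels a document received."""
    rel = sum(1 for lab in labels if lab == "rel")
    return rel / max(len(labels) - rel, 1)

# Hypothetical worker: right 80% of the time when saying relevant,
# 90% of the time when saying nonrelevant.
conf = {"rel": {"rel": 0.8, "non": 0.2}, "non": {"rel": 0.1, "non": 0.9}}
print(confusion_features(conf), doc_label_ratio(["rel", "rel", "non"]))
```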
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
Julián Urbano, Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns
Gaithersburg, USA, November 16th, 2011
                                     run 1      run 2       run 3
Hours to complete                    8.5        38          20.5
HITs submitted (overhead)            438 (+1%)  535 (+23%)  448 (+3%)
Submitted workers (just previewers)  29 (102)   83 (383)    30 (163)
Average documents per worker         76         32          75
Total cost (including fees)          $95.7      $95.7       $95.7
External HITs gave much better control of the whole process; the pay proved fair for most workers in previous trials.
2. Display Modes
• With images
• Black & white: same layout but no images
• Topic key terms (run 3)
3. Task Focus: keywords (runs 1 & 2) or relevance (run 3)
4. Tabbed Design
5. Quality Control
Worker level: at most 50 HITs per worker; at least 100 approved HITs and a 95% approval rate (98% in run 3).
Implicit task level (work time): at least 4.5 s per document (preview + work).
Explicit task level (comprehension test): Which set of keywords best describes the document?
• Correct option: top 3 terms by TF, plus 2 from the next 5
• Incorrect option: 5 random terms from the last 25
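The construction of the two keyword sets can be sketched as below; the function name and the toy vocabulary are hypothetical, and the input is assumed to be the document's terms already sorted by descending term frequency:

```python
import random

def comprehension_test(terms_by_tf, rng=random):
    """Build the two keyword sets for the comprehension question.
    Correct set: top 3 terms plus 2 drawn from the next 5 (ranks 4-8).
    Incorrect set: 5 terms drawn at random from the last 25."""
    correct = terms_by_tf[:3] + rng.sample(terms_by_tf[3:8], 2)
    incorrect = rng.sample(terms_by_tf[-25:], 5)
    return correct, incorrect

terms = [f"term{i:02d}" for i in range(40)]  # toy vocabulary, most frequent first
good, bad = comprehension_test(terms, random.Random(7))
print(good, bad)
```

Mixing the top-3 terms with two mid-rank ones matters because, as the slides note, subjects always recognize the top 1-2 terms by TF, so the correct option must not be identifiable from those alone.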
Notes: some workers start working while still previewing; subjects always recognize the top 1-2 terms by TF.
Rejecting & Blocking
Action   Failure   run 1   run 2   run 3
Reject   Keyword   1       0       1
         Time      2       1       1
Block    Keyword   1       1       1
         Time      2       1       1

                  run 1    run 2      run 3
HITs rejected     3 (1%)   100 (23%)  13 (3%)
Workers blocked   0 (0%)   40 (48%)   4 (13%)
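A sketch of how such thresholds could drive the decision, under one possible reading of the table: the numbers are taken here as the failures tolerated before each action fires, which is an interpretation, and the function name and rule encoding are hypothetical:

```python
def moderation(kw_failures, time_failures, tolerated):
    """Decide on a submission. tolerated = {"reject": (kw, time),
    "block": (kw, time)}: failures beyond a threshold trigger that action.
    Blocking is the stronger action, so it is checked first."""
    for action in ("block", "reject"):
        kw_t, time_t = tolerated[action]
        if kw_failures > kw_t or time_failures > time_t:
            return action
    return "approve"

# Run 2 under this reading: any keyword failure -> reject, two -> block.
run2 = {"reject": (0, 1), "block": (1, 1)}
print(moderation(0, 0, run2), moderation(1, 0, run2), moderation(2, 0, run2))
```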
7. Relevance Labels
Binary:
• run 1: bad = 0, fair or good = 1
• runs 2 & 3: normalize the slider range to [0, 1]; if value > 0.4 then 1, else 0
Ranking:
• run 1: order by relevance, then by failures in the keyword test, then by time spent
• runs 2 & 3: explicit in the sliders
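The slider-to-binary mapping for runs 2 & 3 can be sketched as follows; the function name and the raw slider range parameters (lo, hi) are assumptions:

```python
def slider_label(value, lo, hi, threshold=0.4):
    """Runs 2 & 3: normalize a raw slider reading from [lo, hi] to [0, 1]
    and binarize it: > 0.4 means relevant (1), otherwise nonrelevant (0)."""
    norm = (value - lo) / (hi - lo)
    return 1 if norm > threshold else 0

# With an assumed raw range of 0-100: 70 -> relevant, 40 -> nonrelevant.
print(slider_label(70, 0, 100), slider_label(40, 0, 100))
```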
Task I

        Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median  .623   .729   .773   .536   .931   .922
run 1   .748   .802   .841   .632   .922   .958
run 2   .690   .720   .821   .607   .889   .935
run 3   .731   .737   .857   .728   .894   .932

Task II

        Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median  .640   .754   .625   .560   .111   .359
run 1   .699   .754   .679   .644   .166   .415
run 2   .714   .750   .700   .678   .082   .331
run 3   .571   .659   .560   .484   .060   .299
according to WordNet
unbiased majority voting
1. Document Preprocessing
Cleanup for smooth loading and safe rendering: remove everything unrelated to style or layout.
6. Relevance: run 1, run 2, run 3
* Unofficial, as per NIST gold labels