
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track


Abstract

This paper describes the participation of the uc3m team in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments based on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level based on a simple reading comprehension test. For the second task we also submitted three runs: one with a stepwise execution of the GetAnotherLabel algorithm and two others with a rule-based and an SVM-based model. According to the NIST gold labels, our runs performed very well in both tasks, ranking at the top for most measures.


In a Nutshell: 3 runs, Amazon Mechanical Turk, External HITs

One HIT for each set of 5 documents = 435 HITs (2175 judgments)

$0.20 per HIT = $0.04 per document
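A quick cost check, assuming the 10% Mechanical Turk commission that applied at the time: 435 HITs × $0.20 = $87.00 in rewards, and $87.00 × 1.10 = $95.70, which matches the total cost (including fees) reported in the table below.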

Task 2, Run 3: Stepwise execution of the GetAnotherLabel algorithm.

Hypothesis: workers who are bad for one type of topic are not necessarily bad for others.

For each worker w_i, compute the expected quality q_i over all topics and the quality q_ij on each topic type t_j. For topics of type t_j, use only workers with q_ij > q_i.
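A minimal sketch of this per-topic-type filtering, assuming worker quality scores have already been estimated (e.g. by GetAnotherLabel); the data structures and names are hypothetical:

```python
def filter_labels_by_topic_type(labels, overall_quality, type_quality, topic_type):
    """Keep, for each topic type, only labels from workers whose estimated
    quality on that type exceeds their overall quality.

    labels:          list of (worker_id, topic_id, doc_id, label) tuples
    overall_quality: dict worker_id -> q_i (quality over all topics)
    type_quality:    dict (worker_id, type_id) -> q_ij (quality on topic type t_j)
    topic_type:      dict topic_id -> type_id
    """
    kept = []
    for worker, topic, doc, label in labels:
        t = topic_type[topic]
        q_i = overall_quality[worker]
        q_ij = type_quality.get((worker, t), q_i)
        if q_ij > q_i:  # worker is better than usual on this topic type
            kept.append((worker, topic, doc, label))
    return kept
```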

Topic categorization: TREC category (closed, advice, navigational, etc.), topic subject (politics, shopping, etc.) and rarity of the topic words.

Task 2, Runs 1 & 2: Train rule-based and SVM-based ML models. Features (see the sketch after this list):

• Worker confusion matrix from GetAnotherLabel:

• For all workers, average posterior probability of relevant/nonrelevant

• For all workers, average correct-to-incorrect ratio when saying relevant or not

• For the document, relevant-to-nonrelevant ratio
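A minimal sketch of how such features could feed an SVM classifier; the feature layout and values are hypothetical, and scikit-learn is used only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical feature matrix: one row per document, with columns such as
#   [avg posterior P(relevant), avg posterior P(nonrelevant),
#    avg correct/incorrect ratio when labeling relevant,
#    avg correct/incorrect ratio when labeling nonrelevant,
#    relevant-to-nonrelevant label ratio for the document]
X_train = np.array([
    [0.82, 0.18, 3.1, 2.4, 4.0],
    [0.35, 0.65, 1.2, 2.8, 0.5],
    [0.70, 0.30, 2.5, 2.0, 2.0],
    [0.20, 0.80, 0.9, 3.0, 0.2],
])
y_train = np.array([1, 0, 1, 0])  # gold labels for the training documents

clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)

# Predict a consensus relevance label for an unseen document
x_new = np.array([[0.60, 0.40, 2.0, 2.2, 1.5]])
print(clf.predict(x_new))        # predicted binary label
print(clf.predict_proba(x_new))  # class probabilities
```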

The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track

Julián Urbano, Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns

Gaithersburg, USA November 16th, 2011

                                      run 1      run 2       run 3
Hours to complete                     8.5        38          20.5
HITs submitted (overhead)             438 (+1%)  535 (+23%)  448 (+3%)
Submitted workers (just previewers)   29 (102)   83 (383)    30 (163)
Average documents per worker          76         32          75
Total cost (including fees)           $95.7      $95.7       $95.7

much better control of the whole process

fair for most workers (previous trials)

2. Display Modes
• With images
• Black & white, same layout but no images
• Topic key terms (run 3)

3. Task focus: keywords (runs 1 & 2) or relevance (run 3)

4. Tabbed design

5. Quality Control
Worker Level: 50 HITs at most; at least 100 approved HITs and 95% approval rate (98% in run 3)
Implicit Task Level: Work Time, at least 4.5 s/document (preview + work)

Explicit Task Level: Comprehension. Which set of keywords best describes the document?
• Correct: top 3 by TF + 2 from the next 5
• Incorrect: 5 random from the last 25
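A minimal sketch of how such a comprehension test could be built from term frequencies; the tokenization and function names are hypothetical and simplified (no stopword removal or stemming):

```python
import random
import re
from collections import Counter

def build_comprehension_test(document_text, seed=None):
    """Build one correct and one incorrect keyword set for a document.
    Assumes the document has enough distinct terms (>= 33) to sample from."""
    rng = random.Random(seed)
    tokens = re.findall(r"[a-z]+", document_text.lower())
    ranked = [term for term, _ in Counter(tokens).most_common()]

    # Correct option: top 3 terms by TF plus 2 drawn from the next 5
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect option: 5 terms drawn at random from the last 25
    incorrect = rng.sample(ranked[-25:], 5)
    return correct, incorrect
```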

some folks work while previewing

subjects always recognize top 1-2 by TF

Rejecting & Blocking

Action   Failure    run 1   run 2   run 3
Reject   Keyword    1       0       1
         Time       2       1       1
Block    Keyword    1       1       1
         Time       2       1       1

HITs rejected      3 (1%)    100 (23%)   13 (3%)
Workers blocked    0 (0%)    40 (48%)    4 (13%)
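A minimal sketch of how the thresholds in the table above could be applied, assuming each table entry is the number of keyword or time failures that triggers the action; the threshold values shown are run 1's and the function names are hypothetical:

```python
# Run 1 policy (from the table above): reject a HIT after 1 keyword failure
# or 2 time failures; block a worker after the same counts accumulated
# across all of their HITs.
REJECT_AFTER = {"keyword": 1, "time": 2}
BLOCK_AFTER = {"keyword": 1, "time": 2}

def should_reject(hit_failures):
    """hit_failures: dict like {'keyword': 0, 'time': 1} for one HIT."""
    return any(hit_failures.get(k, 0) >= v for k, v in REJECT_AFTER.items())

def should_block(worker_failures):
    """worker_failures: accumulated failure counts for one worker."""
    return any(worker_failures.get(k, 0) >= v for k, v in BLOCK_AFTER.items())

# Example: a HIT with one keyword failure is rejected under this policy
print(should_reject({"keyword": 1, "time": 0}))  # True
```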

7. Relevance Labels
Binary
• run 1: bad = 0, fair or good = 1
• runs 2 & 3: normalize the slider range to [0, 1]; if the value > 0.4 then 1, else 0
Ranking
• run 1: order by relevance, then by failures in keywords, and then by time spent
• runs 2 & 3: explicit in the sliders
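A minimal sketch of the slider-to-binary mapping used in runs 2 & 3; the raw slider scale (0-100 here) and the function name are assumptions for illustration:

```python
def binary_label(slider_value, slider_min=0, slider_max=100, threshold=0.4):
    """Normalize a raw slider position to [0, 1] and threshold it at 0.4."""
    normalized = (slider_value - slider_min) / (slider_max - slider_min)
    return 1 if normalized > threshold else 0

# Example: a slider at 55 out of 100 normalizes to 0.55 -> relevant (1)
print(binary_label(55))  # 1
print(binary_label(30))  # 0
```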

Task I*
          Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median    .623   .729   .773   .536   .931   .922
run 1     .748   .802   .841   .632   .922   .958
run 2     .690   .720   .821   .607   .889   .935
run 3     .731   .737   .857   .728   .894   .932

Task II*
          Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median    .640   .754   .625   .560   .111   .359
run 1     .699   .754   .679   .644   .166   .415
run 2     .714   .750   .700   .678   .082   .331
run 3     .571   .659   .560   .484   .060   .299
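For reference, a minimal sketch of how the set-based measures in the tables above are computed from a confusion matrix against the gold labels; these are the standard definitions, not code from the paper:

```python
def set_based_measures(tp, fp, tn, fn):
    """Accuracy, recall, precision and specificity from confusion-matrix counts."""
    return {
        "Acc.":  (tp + tn) / (tp + fp + tn + fn),
        "Rec.":  tp / (tp + fn),   # relevant documents correctly labeled relevant
        "Prec.": tp / (tp + fp),   # labeled-relevant documents that are truly relevant
        "Spec.": tn / (tn + fp),   # nonrelevant documents correctly labeled nonrelevant
    }

# Example with hypothetical counts
print(set_based_measures(tp=80, fp=20, tn=60, fn=40))
```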

according to WordNet

unbiased majority voting

1. Document Preprocessing: cleanup for smooth loading and safe rendering, removing everything unrelated to style or layout.
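A minimal sketch of this kind of cleanup, assuming the documents are HTML pages; the choice of tags and attributes to strip is illustrative, and BeautifulSoup is used only for convenience:

```python
from bs4 import BeautifulSoup

# Tags that contribute nothing to style or layout and can break rendering
STRIP_TAGS = ["script", "iframe", "object", "embed", "applet", "form"]

def clean_document(html):
    """Strip active/external content so the page loads smoothly and renders safely."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(STRIP_TAGS):
        tag.decompose()                  # remove the tag and its contents
    for tag in soup.find_all(True):
        for attr in ("onclick", "onload", "onerror"):
            tag.attrs.pop(attr, None)    # drop inline event handlers
    return str(soup)
```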

6. Relevance: run 1, run 2, run 3

* Unofficial, as per NIST gold labels