
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track


Abstract

This paper describes the participation of the uc3m team in both tasks of the TREC 2011 Crowdsourcing Track. For the first task we submitted three runs that used Amazon Mechanical Turk: one where workers made relevance judgments based on a 3-point scale, and two similar runs where workers provided an explicit ranking of documents. All three runs implemented a quality control mechanism at the task level based on a simple reading comprehension test. For the second task we also submitted three runs: one with a stepwise execution of the GetAnotherLabel algorithm and two others with a rule-based and an SVM-based model. According to the NIST gold labels, our runs performed very well in both tasks, ranking at the top for most measures.


In a Nutshell: 3 runs, Amazon Mechanical Turk, External HITs

One HIT for each set of 5 documents = 435 HITs (2175 judgments)

$0.20 per HIT = $0.04 per document
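A quick cost check, assuming the 10% Mechanical Turk commission that applied at the time: 435 HITs × $0.20 = $87.00 in rewards, and $87.00 × 1.10 = $95.70, which matches the total cost (including fees) reported in the table below.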

Task 2, Run 3: Stepwise execution of the GetAnotherLabel algorithm.

Hypothesis: workers who are bad for one type of topic are not necessarily bad for others.

For each worker w_i, compute the expected quality q_i over all topics and the quality q_ij on each topic type t_j. For topics of type t_j, use only workers with q_ij > q_i.
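A minimal sketch of this per-topic-type filtering, assuming worker quality scores have already been estimated (e.g. by GetAnotherLabel); the data structures and names are hypothetical:

```python
def filter_labels_by_topic_type(labels, overall_quality, type_quality, topic_type):
    """Keep, for each topic type, only labels from workers whose estimated
    quality on that type exceeds their overall quality.

    labels:          list of (worker_id, topic_id, doc_id, label) tuples
    overall_quality: dict worker_id -> q_i (quality over all topics)
    type_quality:    dict (worker_id, type_id) -> q_ij (quality on topic type t_j)
    topic_type:      dict topic_id -> type_id
    """
    kept = []
    for worker, topic, doc, label in labels:
        t = topic_type[topic]
        q_i = overall_quality[worker]
        q_ij = type_quality.get((worker, t), q_i)
        if q_ij > q_i:  # worker is better than usual on this topic type
            kept.append((worker, topic, doc, label))
    return kept
```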

Topic categorization: TREC category (closed, advice, navigational, etc.), topic subject (politics, shopping, etc.) and rarity of the topic words.

Task 2, Runs 1 & 2: Train rule-based and SVM-based ML models. Features (see the sketch after this list):

• Worker confusion matrix from GetAnotherLabel:

• For all workers, average posterior probability of relevant/nonrelevant

• For all workers, average correct-to-incorrect ratio when saying relevant or not

• For the document, relevant-to-nonrelevant ratio
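A minimal sketch of how such features could feed an SVM classifier; the feature layout and values are hypothetical, and scikit-learn is used only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical feature matrix: one row per document, with columns such as
#   [avg posterior P(relevant), avg posterior P(nonrelevant),
#    avg correct/incorrect ratio when labeling relevant,
#    avg correct/incorrect ratio when labeling nonrelevant,
#    relevant-to-nonrelevant label ratio for the document]
X_train = np.array([
    [0.82, 0.18, 3.1, 2.4, 4.0],
    [0.35, 0.65, 1.2, 2.8, 0.5],
    [0.70, 0.30, 2.5, 2.0, 2.0],
    [0.20, 0.80, 0.9, 3.0, 0.2],
])
y_train = np.array([1, 0, 1, 0])  # gold labels for the training documents

clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)

# Predict a consensus relevance label for an unseen document
x_new = np.array([[0.60, 0.40, 2.0, 2.2, 1.5]])
print(clf.predict(x_new))        # predicted binary label
print(clf.predict_proba(x_new))  # class probabilities
```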

The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track

Julián Urbano, Mónica Marrero, Diego Martín, Jorge Morato, Karina Robles and Juan Lloréns

Gaithersburg, USA November 16th, 2011

                                      run 1      run 2       run 3
Hours to complete                     8.5        38          20.5
HITs submitted (overhead)             438 (+1%)  535 (+23%)  448 (+3%)
Submitted workers (just previewers)   29 (102)   83 (383)    30 (163)
Average documents per worker          76         32          75
Total cost (including fees)           $95.7      $95.7       $95.7

much better control of the whole process

fair for most workers (previous trials)

2. Display Modes
• With images
• Black & white, same layout but no images
• Topic key terms (run 3)

3. Task focus: keywords (runs 1 & 2) or relevance (run 3)

4. Tabbed design

5. Quality Control
Worker Level: 50 HITs at most; at least 100 approved HITs and 95% approval rate (98% in run 3)
Implicit Task Level: Work Time, at least 4.5 s/document (preview + work)

Explicit Task Level: Comprehension. Which set of keywords best describes the document?
• Correct: top 3 by TF + 2 from the next 5
• Incorrect: 5 random from the last 25
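A minimal sketch of how such a comprehension test could be built from term frequencies; the tokenization and function names are hypothetical and simplified (no stopword removal or stemming):

```python
import random
import re
from collections import Counter

def build_comprehension_test(document_text, seed=None):
    """Build one correct and one incorrect keyword set for a document.
    Assumes the document has enough distinct terms (>= 33) to sample from."""
    rng = random.Random(seed)
    tokens = re.findall(r"[a-z]+", document_text.lower())
    ranked = [term for term, _ in Counter(tokens).most_common()]

    # Correct option: top 3 terms by TF plus 2 drawn from the next 5
    correct = ranked[:3] + rng.sample(ranked[3:8], 2)
    # Incorrect option: 5 terms drawn at random from the last 25
    incorrect = rng.sample(ranked[-25:], 5)
    return correct, incorrect
```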

some folks work while previewing

subjects always recognize top 1-2 by TF

Rejecting & Blocking

Action   Failure    run 1   run 2   run 3
Reject   Keyword    1       0       1
         Time       2       1       1
Block    Keyword    1       1       1
         Time       2       1       1

HITs rejected      3 (1%)    100 (23%)   13 (3%)
Workers blocked    0 (0%)    40 (48%)    4 (13%)
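A minimal sketch of how the thresholds in the table above could be applied, assuming each table entry is the number of keyword or time failures that triggers the action; the threshold values shown are run 1's and the function names are hypothetical:

```python
# Run 1 policy (from the table above): reject a HIT after 1 keyword failure
# or 2 time failures; block a worker after the same counts accumulated
# across all of their HITs.
REJECT_AFTER = {"keyword": 1, "time": 2}
BLOCK_AFTER = {"keyword": 1, "time": 2}

def should_reject(hit_failures):
    """hit_failures: dict like {'keyword': 0, 'time': 1} for one HIT."""
    return any(hit_failures.get(k, 0) >= v for k, v in REJECT_AFTER.items())

def should_block(worker_failures):
    """worker_failures: accumulated failure counts for one worker."""
    return any(worker_failures.get(k, 0) >= v for k, v in BLOCK_AFTER.items())

# Example: a HIT with one keyword failure is rejected under this policy
print(should_reject({"keyword": 1, "time": 0}))  # True
```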

7. Relevance Labels
Binary
• run 1: bad = 0, fair or good = 1
• runs 2 & 3: normalize the slider range to [0, 1]; if the value > 0.4 then 1, else 0
Ranking
• run 1: order by relevance, then by failures in keywords, and then by time spent
• runs 2 & 3: explicit in the sliders
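A minimal sketch of the slider-to-binary mapping used in runs 2 & 3; the raw slider scale (0-100 here) and the function name are assumptions for illustration:

```python
def binary_label(slider_value, slider_min=0, slider_max=100, threshold=0.4):
    """Normalize a raw slider position to [0, 1] and threshold it at 0.4."""
    normalized = (slider_value - slider_min) / (slider_max - slider_min)
    return 1 if normalized > threshold else 0

# Example: a slider at 55 out of 100 normalizes to 0.55 -> relevant (1)
print(binary_label(55))  # 1
print(binary_label(30))  # 0
```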

Task I*
          Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median    .623   .729   .773   .536   .931   .922
run 1     .748   .802   .841   .632   .922   .958
run 2     .690   .720   .821   .607   .889   .935
run 3     .731   .737   .857   .728   .894   .932

Task II*
          Acc.   Rec.   Prec.  Spec.  AP     NDCG
Median    .640   .754   .625   .560   .111   .359
run 1     .699   .754   .679   .644   .166   .415
run 2     .714   .750   .700   .678   .082   .331
run 3     .571   .659   .560   .484   .060   .299
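For reference, a minimal sketch of how the set-based measures in the tables above are computed from a confusion matrix against the gold labels; these are the standard definitions, not code from the paper:

```python
def set_based_measures(tp, fp, tn, fn):
    """Accuracy, recall, precision and specificity from confusion-matrix counts."""
    return {
        "Acc.":  (tp + tn) / (tp + fp + tn + fn),
        "Rec.":  tp / (tp + fn),   # relevant documents correctly labeled relevant
        "Prec.": tp / (tp + fp),   # labeled-relevant documents that are truly relevant
        "Spec.": tn / (tn + fp),   # nonrelevant documents correctly labeled nonrelevant
    }

# Example with hypothetical counts
print(set_based_measures(tp=80, fp=20, tn=60, fn=40))
```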

according to WordNet

unbiased majority voting

1. Document Preprocessing: cleanup for smooth loading and safe rendering, removing everything unrelated to style or layout.
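A minimal sketch of this kind of cleanup, assuming the documents are HTML pages; the choice of tags and attributes to strip is illustrative, and BeautifulSoup is used only for convenience:

```python
from bs4 import BeautifulSoup

# Tags that contribute nothing to style or layout and can break rendering
STRIP_TAGS = ["script", "iframe", "object", "embed", "applet", "form"]

def clean_document(html):
    """Strip active/external content so the page loads smoothly and renders safely."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(STRIP_TAGS):
        tag.decompose()                  # remove the tag and its contents
    for tag in soup.find_all(True):
        for attr in ("onclick", "onload", "onerror"):
            tag.attrs.pop(attr, None)    # drop inline event handlers
    return str(soup)
```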

6. Relevance: run 1, run 2, run 3

* Unofficial, as per NIST gold labels