
Science-Driven Data with Amazon Mechanical Turk


Alexander Sorokin, D.A. Forsyth
Department of Computer Science, University of Illinois at Urbana-Champaign

What is Mechanical Turk

Amazon Mechanical Turk is an example of crowdsourcing: we pay internet users to do small tasks, such as segmenting an image. Mechanical Turk is simply a broker of tasks with a large community of workers.

Amazon Mechanical Turk at a glance:

• We define the task and set the pay

• Workers choose to do our tasks

• We accept (or reject) the work and pay only for accepted work (see the sketch below)
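
As a rough, end-to-end illustration of this workflow, here is a minimal sketch using the modern boto3 MTurk client (not the toolchain used for the poster); the annotation URL, reward, and task text are placeholders for the example, not values from the poster.

    import boto3

    # Sandbox endpoint; drop endpoint_url to post real, paid tasks.
    mturk = boto3.client(
        'mturk',
        region_name='us-east-1',
        endpoint_url='https://mturk-requester-sandbox.us-east-1.amazonaws.com',
    )

    # "We define the task and set the pay": the task is an external web page
    # (a hypothetical annotation URL here) shown to the worker inside a frame.
    external_question = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.org/annotate?image=0001.jpg</ExternalURL>
      <FrameHeight>800</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title='Outline the person in the image',
        Description='Place a polygon around each person in the photo.',
        Keywords='image, annotation, segmentation',
        Reward='0.02',                    # dollars per assignment
        MaxAssignments=1,                 # how many workers may do this task
        LifetimeInSeconds=3 * 24 * 3600,  # how long the task stays available
        AssignmentDurationInSeconds=600,  # time a worker has to finish it
        Question=external_question,
    )
    print('HIT created:', hit['HIT']['HITId'])

    # "We accept (or reject) the work": after a worker submits an assignment,
    # pay for it or decline it with feedback.
    # mturk.approve_assignment(AssignmentId=assignment_id)
    # mturk.reject_assignment(AssignmentId=assignment_id,
    #                         RequesterFeedback='Polygon does not follow the object boundary.')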

Use cases


Task               Throughput         Number of tasks                         Cost
People (pose)      300/hour           2595 people                             $40
Segmentations      ~50/hour           9000 segmentations                      $300
Visual attributes  282 objects/hour   15000 objects (100K selected labels)    $550

• Annotate human pose

• Align landmarks

• Segment images

• Annotate visual attributes

Quality

Quality doesn’t come for free: if results are blindly accepted, up to 50% are bad.

Options to control quality:

• Filter workers: performance history

When a worker has only 50% of tasks accepted, you should expect to reject half of their work.

In “alignment”, requiring a 90% approval rate instead of 25% raised the fraction of accepted submissions from 55% to 88% (see the sketch after this list).

• Filter workers: qualification

Use a qualification test to check that workers understand the instructions. In “attributes”, this improved agreement from 65% to 81% (vs. an expert at 84%).

• Do automatic verification

In “alignment”, the residual distance between landmarks predicts quality very well: a threshold on it catches 70% of bad submissions while flagging only 5% of good ones.

• Do manual grading.

In most cases manual grading is easy: a few bad submissions stand out in a large pool of good ones.

Manual evaluation by a trusted worker may be the only way to guarantee high quality.

• Request multiple annotations

o When the task is opinion-based, e.g. “Select good examples of a category.”

o When the task is too difficult to grade: grading required as much attention as performing the task.
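
A minimal sketch of two of these controls, assuming the modern boto3 MTurk API: filtering workers by approval rate with the built-in system qualification, and an automatic check that thresholds the residual landmark distance. The threshold value, the custom qualification id, and the function name are illustrative assumptions, not values from the poster.

    import math

    # Filter workers by performance history: require a 90%+ approval rate via
    # MTurk's built-in "PercentAssignmentsApproved" system qualification.
    approval_rate_requirement = {
        'QualificationTypeId': '000000000000000000L0',  # system qualification id
        'Comparator': 'GreaterThanOrEqualTo',
        'IntegerValues': [90],
    }

    # Filter workers by qualification: a custom test created with
    # mturk.create_qualification_type(...); 'QUAL_TYPE_ID' is a placeholder.
    instructions_test_requirement = {
        'QualificationTypeId': 'QUAL_TYPE_ID',
        'Comparator': 'GreaterThanOrEqualTo',
        'IntegerValues': [80],
    }

    # Both are passed when the HIT is created:
    # mturk.create_hit(..., QualificationRequirements=[approval_rate_requirement,
    #                                                  instructions_test_requirement])

    # Automatic verification for "alignment": flag a submission whose mean
    # residual distance to reference landmarks is too large. The 15-pixel
    # threshold is an illustrative guess, not the poster's value.
    def looks_bad(submitted, reference, threshold_px=15.0):
        residuals = [math.dist(p, q) for p, q in zip(submitted, reference)]
        return sum(residuals) / len(residuals) > threshold_px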

Annotation Toolkit and Future work

We developed a generic tool that easily adapts to different tasks (a sketch of retrieving submitted results follows the list below):

• Simple XML definition of the annotation task

• Checkboxes, landmarks, bounding boxes, polygons.

• Static images for now

• Instructions, source code, etc. at http://vision.cs.uiuc.edu/annotation/

Future work:

• More annotation tools for other projects. Ideas are welcome.

• Real-time annotation server for a robot.

• “Annotate” Matlab / OpenCV call

• True online learning
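
To round out the pipeline, here is a hedged sketch of retrieving submitted work so it can be graded automatically or by a trusted worker; it reuses the `mturk` client and `hit` from the first sketch and assumes answers come back in the standard QuestionFormAnswers XML.

    import xml.etree.ElementTree as ET

    # Fetch assignments that workers have submitted for one HIT.
    response = mturk.list_assignments_for_hit(
        HITId=hit['HIT']['HITId'],
        AssignmentStatuses=['Submitted'],
    )

    for assignment in response['Assignments']:
        # The Answer field is QuestionFormAnswers XML; match on local tag names
        # so the XML namespace can be ignored.
        root = ET.fromstring(assignment['Answer'])
        answers = {}
        for element in root.iter():
            if element.tag.split('}')[-1] != 'Answer':
                continue
            fields = {child.tag.split('}')[-1]: child.text for child in element}
            answers[fields.get('QuestionIdentifier')] = fields.get('FreeText')
        print(assignment['WorkerId'], answers)

        # After grading, approve (pay) or reject with feedback:
        # mturk.approve_assignment(AssignmentId=assignment['AssignmentId'])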

Volume, Speed and Cost

Mechanical Turk proved to be both cheap and scalable; throughput comes with scale:

• Large number of tasks attracts more workers

• Burn-in effect: workers spend time understanding the task

• Better pay attracts more workers and ensures higher quality

Demo