© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Peter W. Hallinan, Ph.D.
A9.com
November 30, 2016
MAC201
Getting to Ground Truth with
Amazon Mechanical Turk
What to expect from the session
• How to use Mechanical Turk to build ML datasets
• Best practices at smaller and larger scales
• Lessons learned building datasets for an AWS service
What not to expect
• Detailed tutorial on how to use Mechanical Turk
Machine learning requires large-scale data production pipelines
• Readily available: algorithms and compute clusters
• Not readily available: large scale, high quality datasets
Amazon Mechanical Turk can help you build your dataset
• Training data is the key differentiator
• Success depends on the quality, scale, and throughput of your dataset production pipelines
www.mturk.com
What is Amazon Mechanical Turk (MTurk)?
• A marketplace for getting simple tasks done in parallel by humans
• Launched in 2005, one of the first AWS services
• Basic unit of work is a HIT, a single, self-contained task
• Example: “How many wolves are in this photo?”
• Requesters use website or APIs to publish HITs to workers and
consume results
• One or more workers per HIT
• Rapid response times
• Simple workflow: HTML template, csv in, csv out
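The "HTML template, csv in, csv out" workflow can be sketched in a few lines. This is a minimal illustration, not MTurk's API: the template text and field names here are hypothetical, and in practice the rendered question body is wrapped in an HTMLQuestion and published via the Requester website or API.

```python
from string import Template

# Hypothetical HIT template (field names are illustrative):
# the same HTML body is rendered once per input CSV row.
HIT_TEMPLATE = Template("""
<p>How many wolves are in this photo?</p>
<img src="$image_url" alt="photo to label">
""")

def render_hits(rows):
    """Render one HIT question body per input record."""
    return [HIT_TEMPLATE.substitute(row) for row in rows]

# Rows as they would come out of csv.DictReader on the input file.
rows = [{"image_url": "https://example.com/photo1.jpg"},
        {"image_url": "https://example.com/photo2.jpg"}]
print(len(render_hits(rows)))  # one HIT per row
```

Worker answers then come back as one result row per assignment, which is the "csv out" half of the loop.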
What datasets can you build with MTurk?
Almost any domain
• Vision, NLP, psych, etc.
All key task types
• Open-ended questions
• Structured questions
• Binary verifications
ImageNet
• Prof. Fei Fei Li, Stanford AI Lab
• 21,841 WordNet categories
• 14.1 MM total images
• 1 MM localized examples
• ImageNet Challenge 2010-2016
Search Google Scholar for “Mechanical Turk” + “machine learning”
Result: 6000+ citations
L. Fei-Fei, ImageNet: crowdsourcing, benchmarking and other
cool things, CMU VASC Seminar, March, 2010.
What kind of quality can you expect?
1. How much do your
categories intrinsically
overlap?
2. How representative is
your “golden set”?
3. How well can workers
solve your specific HIT?
[Figure: wolves vs. dogs plotted in feature space (Feature A × Feature B), illustrating intrinsic category overlap]
[Figure: the same feature space comparing (1) true wolves, (2) golden-set wolves, and (3) worker-labeled wolves]
Rapid prototyping:
Smaller scale datasets
Dataset construction is highly iterative...
… so MTurk supports
rapid iterations
Source data
Define HIT & golden set
Evaluate HIT results
Augment data
Train & test ML algorithm
Define objectives
MTurk
Example: Build a wristwatch classifier
• ML objective: Label wristwatch “shape”
• Dataset objective: ~2000 training examples
• Data source: Amazon Catalog
Experiment A: 1 shape feature, 3 categories
Rectangular
Circular
Other
Experiment A: HIT design
Experiment A: Accuracy is 97.29%
                 Golden set
MTurk            Circle  Rect  Other   Total  Precision  Recall
Circle              504     3     10     517        97%     99%
Rect                  0    64      2      66        97%     94%
Other                 3     1    113     117        97%     90%
Total (700)         507    68    125     681 correct
Prevalence          72%   10%    18%     Accuracy: 681/700 = 97.29%
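Experiment A's headline numbers can be recomputed directly from its confusion matrix (rows are the MTurk majority label, columns the golden label):

```python
# Experiment A confusion matrix: rows = MTurk label, columns = golden label.
matrix = {
    "Circle": {"Circle": 504, "Rect": 3, "Other": 10},
    "Rect":   {"Circle": 0,   "Rect": 64, "Other": 2},
    "Other":  {"Circle": 3,   "Rect": 1,  "Other": 113},
}
labels = list(matrix)

total = sum(sum(row.values()) for row in matrix.values())   # 700 golden items
correct = sum(matrix[c][c] for c in labels)                 # diagonal = 681
accuracy = correct / total

def precision(label):
    # Fraction of MTurk votes for `label` that were right (row-wise).
    return matrix[label][label] / sum(matrix[label].values())

def recall(label):
    # Fraction of golden `label` items that MTurk found (column-wise).
    return matrix[label][label] / sum(matrix[r][label] for r in labels)

print(f"accuracy = {accuracy:.2%}")  # 97.29%
for c in labels:
    print(f"{c}: precision {precision(c):.0%}, recall {recall(c):.0%}")
```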
Can we do better?
Experiment B: 2 shape features, 16 categories
Dial face shape: Circle/oval, Tonneau, Rectangle, Other
Casing shape: Circle/oval, Tonneau, Rectangle, Other
(4 × 4 = 16 combined categories)
Experiment B: HIT design
Experiment B: Accuracy drops 4.72% to 92.57%
Per-category results on the 700-item golden set (MTurk label = casing shape / dial shape; the full 16×16 confusion matrix is on the slide):

Case/Dial   Votes  Precision  Recall
C/C           474        96%     97%
T/C            34        56%     86%
R/C             2        50%     20%
O/C             9        44%     31%
T/T             4        75%    100%
R/T             1         0%      -
T/R             0          -      0%
R/R            57        93%     93%
O/R             2       100%     40%
C/O             1         0%      -
O/O           116        97%     90%
(C/T, O/T, C/R, T/O, R/O received no votes)

Golden-set prevalence: C/C 67%, T/C 3%, R/C 1%, O/C 2%, R/R 8%, O/R 1%, O/O 18% (others ≈0%)
Overall accuracy: 648/700 = 92.57%
A second “fuzzy” feature creates more opportunity for disagreement
C = Circle/Oval; T = Tonneau; R = Rectangular; O = Other
Experiment B: Filtering workers adds 0.14%
Per-category results after filtering (same golden set and prevalence as above):

Case/Dial   Votes  Precision  Recall
C/C           476        96%     97%
T/C            33        58%     86%
R/C             3        33%     20%
O/C             8        50%     31%
T/T             4        75%    100%
R/T             1         0%      -
T/R             0          -      0%
R/R            56        93%     91%
O/R             2       100%     40%
C/O             1         0%      -
O/O           116        97%     90%
(C/T, O/T, C/R, T/O, R/O received no votes)

Overall accuracy: 649/700 = 92.71%
Only 5/124 workers are in the minority for a majority of their votes
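The filtering rule described here (drop workers who land in the minority on a majority of their votes) is a short computation. A sketch on made-up votes, with hypothetical HIT and worker IDs:

```python
from collections import Counter, defaultdict

# Made-up votes: hit_id -> {worker_id: answer}.
votes = {
    "h1": {"w1": "C/C", "w2": "C/C", "w3": "T/C"},
    "h2": {"w1": "R/R", "w2": "R/R", "w3": "R/R"},
    "h3": {"w1": "O/O", "w2": "O/O", "w3": "C/O"},
}

minority = defaultdict(int)   # times each worker disagreed with the majority
answered = defaultdict(int)   # total HITs each worker answered
for hit, answers in votes.items():
    majority_answer, _ = Counter(answers.values()).most_common(1)[0]
    for worker, answer in answers.items():
        answered[worker] += 1
        if answer != majority_answer:
            minority[worker] += 1

# Filter workers who were in the minority on a majority of their HITs.
filtered = {w for w in answered if minority[w] / answered[w] > 0.5}
print(filtered)  # w3 disagrees on 2 of 3 HITs
```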
Many possible experiments; we’ve reported three
Quality levers               Lever settings / experiments
ML objectives                1 feature / 3 categories; 2 features / 16 categories
Data sources and segments    Held constant
Golden set                   # examples: 100, 700; # annotators: 1, 3
HIT design / instructions    Picture only; text only; picture and text
Worker selection             Prequalified
Workers per HIT              3, 5
HIT aggregation rules        Majority vote vs. majority of filtered workers
Worker feedback              None
Quality control levers
ML objectives
Data sources and segments
Golden sets
HIT design / instructions
Worker selection
Workers per HIT
HIT aggregation rules
Worker feedback
Myth: Dataset quality is an intrinsic property of the
MTurk marketplace
Throughput control levers
• HIT price
• HIT publication rate
MTurk provides you with control
levers to optimize dataset
quality and throughput
Best practices: HIT design
• Simplicity of question vs. clarity of answers
• Prefer questions with limited option sets to open-ended questions
• Prefer mutually exclusive, collectively exhaustive option sets
• Prefer smaller option sets to larger ones
• Ease of learning vs. time to complete HIT
• Prefer more questions and simpler instructions to fewer questions and more complex instructions
• Prefer that each possible answer set costs the same time to provide
• Workers optimize their behaviors to your design; don’t “tweak” too much
Best practices: Selecting workers
Using MTurk criteria
• Geography
• Worker approval rating
• Total # HITs approved
• Masters status
• Mobile device user
• Political affiliation
• High school graduate
• Bachelor’s degree
• Marital status
• Parenthood status
• Voted in 2012 presidential election
• Smoker
• Car owner
• Handedness
Using your own criteria
• Past performance on your HITs
• Custom tests of domain specific knowledge
• Custom tests of decision-making ability
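A custom qualification test boils down to scoring a worker's answers against a key and granting the qualification above a threshold. A requester-side sketch; the answer key, question IDs, and 80% threshold here are all hypothetical:

```python
# Hypothetical answer key for a domain-knowledge qualification test.
ANSWER_KEY = {"q1": "tonneau", "q2": "circle", "q3": "rectangle"}

def passes_qualification(responses, threshold=0.8):
    """Grant the qualification if the worker scores at or above threshold."""
    correct = sum(responses.get(q) == a for q, a in ANSWER_KEY.items())
    return correct / len(ANSWER_KEY) >= threshold

print(passes_qualification({"q1": "tonneau", "q2": "circle", "q3": "other"}))
# 2/3 correct is below the 80% threshold, so this worker is not qualified
```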
Best practices: Assessing results
Aggregating results
• When using multiple workers per HIT, aggregate results with a voting scheme
• Align voting with prevalence:
• Moderate prevalence => majority voting
• Low prevalence => any yes vote
• Either drop split decisions, or force them into a category that can be split later
Worker feedback
• Approve and reject HITs carefully
• Automatic rejections require
ironclad reasons
• Adjust selection criteria
• Monitor emails and forums
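The two aggregation rules above (majority vote for moderate-prevalence labels, "any yes" for rare ones) can be sketched for a yes/no HIT; the split-decision handling is one illustrative choice:

```python
def aggregate(worker_answers, low_prevalence=False):
    """Aggregate yes/no votes for one HIT.

    Majority vote for moderate-prevalence labels; 'any yes' when the
    target label is rare and a single positive vote is worth keeping.
    """
    yes = sum(1 for a in worker_answers if a == "yes")
    if low_prevalence:
        return "yes" if yes > 0 else "no"
    if 2 * yes == len(worker_answers):
        return "split"  # drop, or route to a follow-up HIT for splitting later
    return "yes" if 2 * yes > len(worker_answers) else "no"

print(aggregate(["yes", "no", "no"]))                       # no
print(aggregate(["yes", "no", "no"], low_prevalence=True))  # yes
print(aggregate(["yes", "no", "yes", "no"]))                # split
```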
Best practices: Boosting quality
• Separate your categories
• Scrutinize false positives and negatives
• Simplify and clarify instructions
• Optimize worker qualifications
• Experiment, experiment, experiment!
Dataset accuracy puts an upper bound on system performance
Scaling Up:
Larger scale datasets
Challenge: Measuring quality over time
Example
• 1 MM HITs @ 100K / week
• 99% confidence level
• Confidence interval (CI)
varies with sample size
Three strategies
• Scrutinize @ fixed CI
• Scrutinize @ decreasing CI
• Scrutinize and trust
Potential tactic
• Partition workers and
interleave golden sets
[Chart: Credible Region Width vs. # Answers Checked — width falls from roughly 8% toward 1% as the number of answers checked grows from a few hundred to 6,000]
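The shape of that curve can be approximated with a frequentist stand-in for the Bayesian credible region on the slide: the normal-approximation confidence interval for an observed error rate p narrows as 1/√n. A sketch at the worst case p = 0.5:

```python
import math

Z99 = 2.576  # two-sided z-score for a 99% confidence level

def ci_width(p, n):
    """Full width of the approximate 99% confidence interval for rate p
    estimated from n checked answers (normal approximation)."""
    return 2 * Z99 * math.sqrt(p * (1 - p) / n)

# Width shrinks roughly as 1/sqrt(n), matching the chart's shape.
for n in (500, 1000, 3000, 6000):
    print(f"n={n}: width = {ci_width(0.5, n):.1%}")
```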
Myth: Workers always fatigue/satisfice with time
Worker accuracy is
stable and predictable
Check forums for
Worker discussions of
your HITs
Kenji Hata, Ranjay Krishna, Li Fei-Fei, and Michael S. Bernstein. “A Glimpse Far into the Future:
Understanding Long-term Crowd Worker Accuracy.” arXiv preprint arXiv:1609.04855 (2016).
Challenge: Minimizing cost per training example
Top down (20 questions)
• Build classifier with 100 root categories, not 20K leaf categories
• HIT 1 labels candidate root examples (only 10-20K examples required)
• HIT 2 verifies root members
• Split root categories
• Repeat
Bottom up (data mining)
• Mine your source data for clusters
• HIT 1 assigns labels to clusters
• HIT 2 verifies members of clusters
• Delete cluster members from source data
• Repeat
Divide and conquer to maximize validation rates
ML goal: recognize 20K categories; Dataset goal: 1K examples / category.
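Why divide and conquer matters: a verification HIT is wasted whenever the candidate doesn't actually belong to the category, so cost per validated example scales inversely with the validation rate. A back-of-envelope sketch; the 5% and 50% rates below are hypothetical, not figures from the talk:

```python
def verify_hits_needed(examples_needed, validation_rate):
    """Verification HITs required to confirm the target number of examples."""
    return examples_needed / validation_rate

# Hypothetical rates: candidates drawn blindly per leaf category validate
# rarely; candidates pre-sorted into root categories validate far more often.
flat = verify_hits_needed(1000, 0.05)
routed = verify_hits_needed(1000, 0.50)
print(flat, routed)  # 20000.0 vs 2000.0 verification HITs per category
```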
Challenge: Success
• Eventually, user input will
differ from the data you
trained on
• Assess your actual
recognition rates
• Use errors to guide
expansion of your test and
training datasets
Amazon Rekognition:
Lessons Learned
Ranju Das
Amazon Rekognition
Amazon Rekognition
A deep learning-based image recognition service
Search, verify, and organize millions of images
Object and scene detection | Facial analysis | Face comparison | Facial recognition
What do people see?
• People see a lot more than what is imaged on the retina
• Vision involves a process called “unconscious inference” in neuroscience
• The largely unconscious nature of the inferences is confirmed by the study of optical illusions
• In order for a human observer to recognize an image, two neuronal processes come together:
• Sensory activation from the eyes (referential system)
• Information from past experience that is stored in distributed regions across the brain (inferential system)
What do you see in the yellow bounding box (“region proposal”)?
“a hat”?
What do you see in the yellow bounding box (“region proposal”)?
“a baby”!
We “know” from correlation with other image crops and past experience that it’s a baby
People don’t classify “region proposals” in isolation
Examples: Common inferences
• Adding “must be” invisible objects: baby, fish, ring
• Betting whole from parts: baby, family, balloon
• Reading (and trusting) text hints: farm, chocolate, pizza
• Reading (and trusting) stereotypes and symbols: beer, 4th of July, party
• Gambling on the past and future: swimming, wedding, camping
Verification example
Bounding box example
Group image verification example
[Screenshot: group verification HIT — “Does each image contain a cat?” with Yes/No buttons, sample images, descriptions, and a progress indicator (6/200)]
How to improve quality and consistency
Make the HITs “bite”-sized
Create clear and concise instructions
Ask multiple people and build consensus
Include control images to measure performance of workers
Use qualifications and white/black lists to control workforce
Thank you!
Remember to complete
your evaluations!