
Utility data annotation with Amazon Mechanical Turk

Anonymous CVPR submission

Paper ID ****

Abstract

We show how to outsource data annotation to Amazon Mechanical Turk. Doing so has produced annotations in quite large numbers relatively cheaply. The quality is good, and can be checked and controlled. Annotations are produced quickly. We describe results for several different annotation problems. We describe some strategies for determining when the task is well specified and properly priced.

1. Introduction

Big annotated image datasets now play an important role in Computer Vision research. Many of them were built in-house ([18, 11, 12, 3, 13, 5] and many others). This consumes significant amounts of highly skilled labor, requires much management work, is expensive and creates a perception that annotation is difficult. Another successful strategy is to make the annotation process completely public ([24]) or even entertaining ([26, 27]), at the cost of diminished control over what annotations are produced and the centralization necessary to achieve a high volume of participation. Finally, dedicated annotation services ([28]) can produce high-volume, quality annotations, but at a high price.

We show that image annotation work can be efficiently outsourced to an online worker community (currently Amazon Mechanical Turk [2]) (sec. 2). The resulting annotations are good (sec. 2.3.2), cheap (sec. 2.3.1) and can be aimed at specific research issues.

2. How to do it

Each annotation task is converted into a Human Intelligence Task (HIT). The tasks are submitted to Amazon Mechanical Turk (MT). Online workers choose to work on the submitted tasks. Every worker opens our web page with a HIT and does what we ask them to do. They "submit" the result to Amazon. We then fetch all results from Amazon MT and convert them into annotations. The core tasks for a researcher are: (1) define an annotation protocol and (2) determine what data needs to be annotated.
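As a rough illustration of this submit-and-fetch loop, the sketch below uses today's boto3 MTurk client; boto3, the sandbox endpoint, the prices and the ANNOTATION_URL are assumptions for illustration, not part of the paper's 2008-era bash/python/Matlab toolchain.

```python
import boto3

# Sketch only: boto3 and the sandbox endpoint are stand-ins for the 2008 MTurk API.
mturk = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

ANNOTATION_URL = "https://example.org/annotate"  # hypothetical location of the annotation module

def external_question(image_id):
    # MTurk's ExternalQuestion embeds our annotation module in an IFRAME on the worker's page.
    return (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{ANNOTATION_URL}?image={image_id}</ExternalURL>"
        "<FrameHeight>600</FrameHeight>"
        "</ExternalQuestion>"
    )

def submit_hits(image_ids, reward="0.02", copies=3):
    """Turn each image into a HIT; `copies` requests multiple annotations per image (sec. 2.1)."""
    hit_ids = []
    for image_id in image_ids:
        response = mturk.create_hit(
            Title="Mark the person in the image",
            Description="Click on every site overlapping with the person",
            Reward=reward,                       # price per assignment, in USD
            MaxAssignments=copies,
            AssignmentDurationInSeconds=600,
            LifetimeInSeconds=3 * 24 * 3600,
            Question=external_question(image_id),
        )
        hit_ids.append(response["HIT"]["HITId"])
    return hit_ids

def fetch_annotations(hit_ids):
    """Fetch submitted work so it can be converted into annotations."""
    for hit_id in hit_ids:
        result = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted", "Approved"]
        )
        for assignment in result["Assignments"]:
            yield hit_id, assignment["WorkerId"], assignment["Answer"]  # Answer is an XML string
```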

Exp   Task   img    labels   cost (USD)   time    effective pay/hr
1     1      170    510      $8           750m    $0.76
2     2      170    510      $8           380m    $0.77
3     3      305    915      $14          950m    $0.41 (1)
4     4      305    915      $14          150m    $1.07
5     4      337    1011     $15          170m    $0.90
Total:       982    3861     $59

Table 1. Collected data. In our five experiments we have collected 3861 labels for 982 distinct images for only US $59. In experiments 4 and 5 the throughput exceeds 300 annotations per hour even at a low ($1/hour) hourly rate. We expect a further increase in throughput as we increase the pay to the effective market rate. (1) This number includes around 30% of poor annotations.

The annotation protocol should be implemented within an IFRAME of a web browser. We call the implementation of a protocol an annotation module. The most common implementation choices are an HTML/JS interface, a Java applet or a Flash applet. An annotation module must be developed for every radically new annotation protocol. We have already built four different annotation modules (in Flash) for labeling images of people. As the design process is quite straightforward, we aim to accommodate requests to build annotation modules for various research projects.

Our architecture requires very few resources administered by the researcher (bash, python, Matlab and a web server or Amazon S3).

2.1. Quality assurance

There are three distinct aspects of quality assurance: (a) ensuring that the workers understand the requested task and try to perform it well; (b) cleaning up occasional errors; (c) detecting and preventing cheating in the system. We discuss three viable strategies for QA: multiple annotations, grading and gold standard evaluation (with immediate feedback).

The basic strategy is to collect multiple annotations for every image. This will account for the natural variability of human performance, reduce the influence of occasional errors and allow us to catch malicious users. However, this increases the cost of annotation.

The second strategy is to perform a separate grading task. A worker looks at several annotated images and scores every annotation. We get explicit quality assessments at a fraction of the cost, because grading is easy.

The third strategy is to build a gold standard: a collection of images with trusted annotations. Images from the gold standard are injected into the annotation process. The worker does not know whether an image comes from the new data or from the gold standard. If the annotations provided by the worker deviate significantly from the gold standard, we suspect that the worker is not doing what we asked for. We reveal the gold standard annotation to the worker after they submit their own annotation. This immediate feedback clarifies what we expect and encourages workers to follow the protocol. This strategy is again cheap, as only a fraction of the images comes from the gold standard.
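A minimal sketch of this check for the site-labeling protocols is given below; the disagreement measure follows the XOR/OR score of sec. 2.3.2, while the flagging threshold is a hypothetical choice not specified in the paper.

```python
from collections import defaultdict

def disagreement(annotation, gold):
    """Fraction of sites on which an annotation disagrees with the gold standard
    (site-label protocols; other protocols would substitute their own score)."""
    annotation, gold = set(annotation), set(gold)
    return len(annotation ^ gold) / max(len(annotation | gold), 1)

def flag_workers(results, gold_standard, threshold=0.5):
    """results: iterable of (worker_id, image_id, annotation);
    gold_standard: dict mapping image_id -> trusted annotation.
    Returns workers whose mean disagreement on injected gold images exceeds the
    (hypothetical) threshold."""
    per_worker = defaultdict(list)
    for worker_id, image_id, annotation in results:
        if image_id in gold_standard:  # only gold-standard injections are scored
            per_worker[worker_id].append(disagreement(annotation, gold_standard[image_id]))
    return {w for w, scores in per_worker.items() if sum(scores) / len(scores) > threshold}
```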

It is most important to ensure that contributors with high impact understand the task and follow the requested protocol. As can be seen in fig 2, the bulk of annotation is produced by a few contributors. In our experiments we collected multiple annotations to study consistency. In only one experiment did we have a significant contributor providing poor annotations (Fig 2, experiment 3, see the low times among the first contributors; see also figure 5, experiment 3, example "G", yellow curve).

2.2. Annotation protocols

We implemented four annotation protocols (fig 1): two coarse object segmentation protocols, polygonal labeling and 14-point human landmark labeling. The object segmentation protocols show an image to the worker and a small image of the query (person). We ask the worker to click on every circle (site) overlapping with the query (person). Protocol one places sites on a regular grid, whereas protocol two places sites at the centers of superpixels (computed with [19, 17]).
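For illustration, candidate sites for the two protocols could be generated along these lines; this is a sketch in which the grid step and segment count are hypothetical, and scikit-image's SLIC is a stand-in for the superpixel code of [19, 17].

```python
import numpy as np
from skimage.segmentation import slic
from skimage.measure import regionprops

def grid_sites(height, width, step=25):
    """Protocol 1: sites on a regular grid (the step size in pixels is an assumption)."""
    rows, cols = np.mgrid[step // 2:height:step, step // 2:width:step]
    return np.stack([rows.ravel(), cols.ravel()], axis=1)

def superpixel_sites(image, n_segments=200):
    """Protocol 2: sites at superpixel centers; SLIC stands in for the segmentation of [19, 17]."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    # regionprops ignores label 0, so shift the labels up by one before measuring centroids.
    return np.array([region.centroid for region in regionprops(labels + 1)])
```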

The third protocol, polygonal labeling, is very similar to the one adopted in LabelMe [24]. We ask the worker to trace the boundary of the person in the image.

The fourth protocol labels the landmarks of the human body used for pose annotation in [23]. We ask the worker to click on the locations of the 14 points in the specified order: right ankle, right knee, right hip, left hip, left knee, left ankle, right wrist, right elbow, right shoulder, left shoulder, left elbow, left wrist, neck and head. The worker is always reminded what the next landmark is.
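This ordering can be pinned down as a constant so that submitted click sequences decode consistently; a trivial sketch, with the list taken directly from the order above.

```python
# Protocol 4 click order (the pose landmarks of [23]); each submitted answer is
# decoded assuming exactly this sequence of 14 clicks.
LANDMARK_ORDER = [
    "right ankle", "right knee", "right hip",
    "left hip", "left knee", "left ankle",
    "right wrist", "right elbow", "right shoulder",
    "left shoulder", "left elbow", "left wrist",
    "neck", "head",
]
```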

2.3. Annotation results

So far we have run five annotation experiments using data collected from Youtube (experiments 1, 2, 5), the dataset of people from [23] (exp. 3, 4) and a small sample of data from LabelMe [24], Weizmann [6] and our own dataset (exp. 5). In all experiments we are interested in people. As shown in table 1, we have a total of 3861 annotations for 982 distinct images collected for a total cost of US $59. This is very cheap, as discussed in section 2.3.1. We describe the quality of annotations in section 2.3.2.

We present sample annotation results (fig 1, 4, 5) to show representative annotations and highlight the most prominent failures. We are extremely satisfied with the quality of the annotations, taking into account that workers receive no feedback from us. We are currently implementing the QA strategies described above to provide feedback to workers so we can stop using the multiple duplicate annotations strategy.

2.3.1 Pricing

The work throughput is elastic and depends on the price of the task. If the price is too low, workers will participate out of curiosity and for entertainment, but may feel underpaid and will lose motivation. If the price is too high, we could be wasting resources and possibly attracting inefficient workers. As table 1 shows, the hourly pay in experiments 4 and 5 was roughly $1/hour. In these experiments we had a comments field, and some comments suggested that the pay should be increased by a factor of 3. From this we conclude that the perceived fair pricing is about US $3/hour. The fact that our experiments 1-5 finished completely shows the elasticity of the workforce. We note that even at US $1/hour we had a high throughput of 300 annotations per hour.
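As a worked example of the arithmetic behind such rates, an effective hourly rate follows from the per-assignment reward and the net working time (net = last_click − first_click, as logged by the annotation module; see fig. 2); the numbers below are hypothetical.

```python
def effective_hourly_pay(reward_usd, net_seconds):
    """Hourly rate implied by the per-assignment reward and the net working time."""
    return reward_usd / net_seconds * 3600.0

# Hypothetical numbers: a $0.02 assignment that takes 72 seconds of net work.
print(effective_hourly_pay(0.02, 72))  # 1.0 dollars/hour
```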

2.3.2 Annotation quality

To understand the quality of annotations we use three simple consistency scores for a pair of annotations (a1 and a2) of the same type. For protocols 1, 2 and 3 we divide the area where the annotations disagree by the area marked by either of the two annotations. We can think of this as XOR(a1,a2)/OR(a1,a2). For protocols 1 and 2, XOR counts the sites with different annotations and OR counts the sites marked by either of the two annotations a1 and a2. For protocol 3, XOR is the area of the symmetric difference and OR is the area of the union. For protocol 4 we measure the average distance between the selected landmark locations. Ideally, the locations coincide and the score is 0.
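A sketch of the three scores follows; the concrete representations (sets of site indices, rasterized boolean masks for the polygons, 14x2 point arrays) are our assumptions about the data layout, not specified in the paper.

```python
import numpy as np

def site_score(a1, a2):
    """Protocols 1 and 2: XOR(a1, a2) / OR(a1, a2) over sets of selected site indices."""
    a1, a2 = set(a1), set(a2)
    return len(a1 ^ a2) / max(len(a1 | a2), 1)

def polygon_score(mask1, mask2):
    """Protocol 3: area of the symmetric difference over the area of the union,
    with the two polygons rasterized into boolean masks of the same shape."""
    mask1, mask2 = np.asarray(mask1, bool), np.asarray(mask2, bool)
    return np.logical_xor(mask1, mask2).sum() / max(np.logical_or(mask1, mask2).sum(), 1)

def landmark_score(points1, points2):
    """Protocol 4: mean Euclidean distance in pixels between corresponding landmarks,
    given as (14, 2) arrays; 0 means the two annotations coincide."""
    diffs = np.asarray(points1, float) - np.asarray(points2, float)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))
```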

We then select the two best annotations for every image by simply taking the pair with the lowest score, i.e. we take the most consistent pair of annotations. For protocol 3 we further assume that the polygon with more vertices is a better annotation and we put it first in the pair. The distribution of scores and a detailed analysis appear in figures 4, 5. We show all scores ordered from the best (lowest) on the left to the worst (highest) on the right. We select the 5:15:95 percentiles of quality (5 through 95 with step 15) and show the respective annotations.
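Selecting the most consistent pair then reduces to a minimum over all pairs; a sketch, usable with any of the score functions above:

```python
from itertools import combinations

def most_consistent_pair(annotations, score_fn):
    """Return the pair of annotations with the lowest consistency score for one image.
    For protocol 3 the caller can then reorder the pair so that the polygon with more
    vertices comes first."""
    return min(combinations(annotations, 2), key=lambda pair: score_fn(*pair))
```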


Figure 1. Example results obtained from the annotation experiments. Rows, top to bottom: Protocol 1, Protocol 2, Protocol 3, Protocol 4. The first column is the implementation of the protocol, the second column shows obtained results, the third column shows some poor annotations we observed. The user interfaces are similar, simple and easy to implement. The total cost of annotating the images shown in this figure was US $0.66.

Looking at the images we see that the workers mostly try to accomplish the task. Some of the errors come from sloppy annotations (especially in the heavily underpaid experiment 3, polygonal labeling). Most of the disagreements come from difficult cases, when the question we ask is difficult to answer. Consider figure 5, experiment 2, sample "G", leftmost circle. One annotator decided to mark the bat, while the other decided not to. This is not the fault of the annotators, but rather a sign for us to give better instructions. The situation is even more difficult in experiment 4, where we ask workers to label landmarks that are not immediately visible. In figure 6 we show the consistency of the annotations of each landmark between the 35th and the 65th percentile of figure 5. It is obvious from this figure that hips are much more difficult to localize than shoulders, knees, elbows, wrists, ankles, the head and the neck.

3. Related work

A crisp understanding of the purpose of annotated data is crucial. When it is clear what annotations should be made, quite large annotated datasets appear [16, 15, 4, 22, 25, 18]. Such datasets last for a long time and allow for significant advances in methods and theories. For object recognition, there isn't really a consensus on what should be annotated and what annotations are required, so we have a large number of competing datasets.

To build large scale datasets, researchers have had people label images for free. LabelMe [24] is a public online image annotation tool. LabelMe has over 11845 images and 18524 video frames with at least one object labeled [24]. The current web site counter displays 222970 labelled objects. The annotation process is simple and intuitive; users can browse existing annotations to get an idea of what kind of annotations are required. The dataset is freely available for download and comes with a handy Matlab toolbox to browse and search the dataset. The dataset is semi-centralized: MIT maintains a publicly-accessible repository, accepts images to be added to the dataset and distributes the source code to allow interested parties to set up a similar repository. To our knowledge this is the most open project. On the other hand, LabelMe has no explicit annotation tasks or annotation batches, so progress can only be measured in the number of images annotated. In contrast, we aim at annotating project-specific data in well-defined batches. We also minimized the need for maintenance of a centralized database. An annotation project can run with only a researcher's laptop and computing utility services easily accessible online.

The ESP game [26] and Peekaboom [27] are interactive games that collect image annotations by entertaining people. The players cooperate by providing textual and location information that is likely to describe the content of the image to the partner. The games are a great success: they are known to have produced over 37 million [8] and 1 million [27] annotations respectively. The Peekaboom project recently released a collection of 57797 images annotated through gameplay. The game-based approach has two inconveniences. The first is centralization: to achieve proper scale, it is necessary to have a well-attended game service that features the game. This constrains the publishing of a new game to obtain project-specific annotations. The second is the game itself: to achieve reasonable scale one has to design a game, and the game must be entertaining or nobody will play it. This requires creativity and experimentation to create an appropriate annotation interface. In contrast, our model serves as a drop-in, minimum effort, utility annotation.

Building in-house datasets has been another common strategy. The most prominent examples include the Berkeley segmentation dataset [18], Caltech 5/101 [11]/256 [12], the Pascal VOC datasets [10, 9], the UIUC car dataset [1], the MIT [20] and INRIA [7] pedestrian datasets, the Yale face dataset [4], FERET [22], CMU PIE [25] and (Labeled [13]) Faces in the Wild [5]. Every dataset above is a focused data collection targeted at a specific research problem: segmentation, car detection, pedestrian detection, face detection and recognition, object category recognition. These datasets are relatively small compared to those produced by large scale annotation projects.

Finally, dedicated annotation services can provide quality and scale, but at a high price. ImageParsing.com has built one of the world's largest annotated datasets [28]. With over 49357 images, 587391 video frames and 3,927,130 annotated physical objects [28], this is an invaluable resource for vision scientists. At the same time, the cost of entry is steep: obtaining standard data would require at least a US $1000 investment and custom annotations would require at least US $5000 [14]. In contrast, our model will produce 1000 images with custom annotations for under US $40. ImageParsing.com provides high quality annotations and has a large number of images available for free. It is important to note that [28] presents probably the most rigorous and the most varied definition of the image labeling task. Their definitions might not fit every single research project, but we argue that this degree of rigor must be embraced and adopted by all researchers.

4. Discussion

We presented a data annotation framework to obtain project-specific annotations very quickly on a large scale. It is important to turn the annotation process into a utility, because this will make researchers answer the important research questions: "What data to annotate?" and "What type of annotations to use?". As annotation happens quickly, cheaply and with minimal participation of the researchers, we can allow for multiple runs of annotation to iteratively refine the precise definition of annotation protocols. Finally, we shall ask "What happens when we get 1/10/100 million annotated images?".

We plan to implement more annotation protocols ([18, 3, 28, 9, 21]; other suggestions are welcome) and the quality assurance strategies we discussed. We will make all the code and data available online.


[Figure 2 plots. Panels: Experiments 1-5 (Click on sites, Click on sites, Trace boundary, Click on 14pts, Click on 14pts); axes: the rank of the worker vs. number of images labeled; legend: number of submitted HITs, time spent in minutes (gross = submit - load), time spent in minutes (net = last_click - first_click); low quality annotations are marked. Sixth panel: percentage of workers vs. percentage of contribution.]

Figure 2. Contributions. The first five graphs plot the contribution and the time spent against the rank of the worker. The rank is determined by the total amount of the contribution by a particular worker: the lower the rank, the higher the contribution. Note that the scales differ from experiment to experiment because of the different complexity of the tasks. The sixth graph plots the total contribution against the percentage of the top workers. It is really astonishing how closely the curves follow each other. These graphs give insight into the job distribution among the workers: (1) single top contributors produce very significant amounts, spending hours on the task; (2) top contributors are very effective in performing the tasks; and (3) the top 20% of annotators produce 70% of the data.

[Figure 3 plots. Panels: Experiments 1-5 (Click on sites, Click on sites, Trace boundary, Click on 14pts, Click on 14pts); axes: submission time (minutes) vs. worker rank.]

Figure 3. Temporal structure of annotations. We show a scatterplot of all submitted annotations. The horizontal axis is the time in minutes when we receive the annotation. The vertical axis is the rank of the worker who produced the annotation. The bottom lines have many dots, as they show when the most significant contributors participated in the annotation process. Note the different scales of the scatterplots: the horizontal scale reflects the total time of the annotation while the vertical scale reflects the total number of people who participated in the annotation. The plots show how interesting the tasks are to the workers. In experiments 4 and 5 the workers start early and participate until the available tasks are exhausted: the dots all end at the same time, when no more tasks are left. In experiments 1, 2 and 3 it takes much longer for significant annotators to arrive. This is a direct consequence of the task pricing (sec 2.3.1). Experiments 1 and 2 pay 30% less than experiments 4 and 5, while experiment 3 pays 50% less.

References

[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475-1490, November 2004.

[2] Amazon Mechanical Turk. http://www.mturk.com/.

[3] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold. Evaluation of localized semantics: Data, methodology, and experiments. IJCV, 2008.


[Figure 4 plots. Experiment 1: click on the sites overlapping with the person; score = #disagree / (#disagree + #agree), the lower the better; mean 0.14, std 0.12, median 0.12. Experiment 2: click on the sites (superpixels) overlapping with the person; mean 0.09, std 0.09, median 0.07. Samples A-G are rendered at the selected percentiles.]

Figure 4. Quality details. We present a detailed analysis of annotation quality for experiments 1 and 2. For every image the best fitting pair of annotations is selected. The score of the best pair is shown in the figure. We count the number of sites where the two annotators disagree and divide by all sites labeled by at least one of the two annotators. The scores are ordered low (best) to high (worst). This is effectively a cumulative distribution function of the annotation scores. For clarity we render annotations at the 5:15:95 percentiles of the score. Blue and red dots show annotations provided by annotator 1. Yellow circles show the disagreement. Not surprisingly, superpixels make annotations more consistent compared to a regular grid.

[4] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI, 19(7):711-720, 1997. Special Issue on Face Recognition.

[5] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth. Who's in the picture? In Proc. NIPS, 2004.

[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, pages 1395-1402, 2005.


[Figure 5 plots. Experiment 3: trace the boundary of the person; score = area(XOR)/area(OR), the lower the better; mean 0.21, std 0.14, median 0.16. Experiment 4: click on 14 landmarks; score = mean error in pixels between annotation points, the lower the better; mean 8.71, std 6.29, median 7.35. Samples A-G are rendered at the selected percentiles.]

Figure 5. Quality details. We present a detailed analysis of annotation quality for experiments 3 and 4. For every image the best fitting pair of annotations is selected. The score of the best pair is shown in the figure. For experiment 3 we score annotations by the area of their symmetric difference (XOR) divided by the area of their union (OR). For experiment 4 we compute the average distance between the marked points. The scores are ordered low (best) to high (worst). For clarity we render annotations at the 5:15:95 percentiles of the score. The blue curve and dots show annotation 1, the yellow curve and dots show annotation 2 of the pair. For experiment 3 we additionally assume that the polygon with more vertices is a better annotation, so annotation 1 (blue) always has more vertices.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[8] ESP game. www.espgame.org, 2008.

[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.


[Figure 6 plots. Panels: (rAnkle, rKnee, lKnee, lAnkle), (rWrist, rElbow, lElbow, lWrist), (rHip, lHip, rShoulder, lShoulder), (Neck, Head).]

Figure 6. Quality details per landmark. We present an analysis of annotation quality per landmark in experiment 4. We show scores of the best pair for all annotations between the 35th and 65th percentiles, i.e. between points "C" and "E" of experiment 4 in fig. 5. All the plots have the same scale: from image 100 to 200 on the horizontal axis and from 3 pixels to 13 pixels of error on the vertical axis. These graphs show annotators have greater difficulty choosing a consistent location for the hip than for any other landmark; this may be because some place the hip at the point a tailor would use and others mark the waist, or because the location of the hip is difficult to decide under clothing.

[10] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.

[11] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28(4):594-611, 2006.

[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.

[13] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.

[14] ImageParsing. ImageParsing.com, 2008.

[15] Linguistic Data Consortium. www.ldc.upenn.edu/.

[16] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1994.

[17] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.

[18] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision, 2001.

[19] G. Mori, X. Ren, A. Efros, and J. Malik. Recovering human body configurations: Combining segmentation and recognition. In CVPR, 2004.

[20] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 2000.

[21] The PASCAL Visual Object Classes Challenge 2008. http://www.pascal-network.org/challenges/VOC/voc2008/index.html.

[22] P. J. Phillips, A. Martin, C. Wilson, and M. Przybocki. An introduction to evaluating biometric systems. Computer, 33(2):56-63, 2000.

[23] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, pages 1129-1136, 2007.

[24] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77(1-3):157-173, 2008.

[25] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In AFGR, 2002.

[26] L. von Ahn and L. Dabbish. Labeling images with a computer game. In ACM CHI, 2004.

[27] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A game for locating objects in images. In ACM CHI, 2006.

[28] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In EMMCVPR, 2007.
