
DEPARTMENT OF INFORMATICS

UNIVERSITY OF FRIBOURG (SWITZERLAND)

Quality of Service in

Crowd-Powered Systems

THESIS

Presented to the Faculty of Science of the University of Fribourg (Switzerland)

in consideration for the award of the academic grade of

Doctor scientiarum informaticarum

by

DJELLEL EDDINE DIFALLAH

from

ALGIERS, ALGERIA

Thesis No. 1912

UniPrint

2015


Weak human + machine + superior process was greater

than a strong computer and, remarkably, greater than

a strong human + machine with an inferior process.

— Garry Kasparov


Acknowledgements

First, I would like to extend my deepest gratitude to my advisor, Philippe Cudré-Mauroux, who

gave me the opportunity to work with him and provided me all the necessary ingredients to

grow as a researcher. His wisdom, generosity and kindness will always be an inspiration to me.

I am especially grateful to Gianluca Demartini, who instilled in me his passion for the topic,

and with whom I had exhilarating discussions on how to shape the future of crowdsourcing. I

would also like to thank the rest of my thesis committee, Panos Ipeirotis, Béat Hirsbrunner,

and the jury president Ulrich Ultes-Nitsche, for their availability, insights and questions, which

helped me to create the final form of this thesis.

I have been extremely fortunate to be surrounded by brilliant colleagues and friends at the

eXascale Infolab: Marcin Wylot, Jean-Gérard Pont, Roman Prokofyev, Alberto Tonon, Martin Grund, Victor Felder, Michael Luggen, Ruslan Mavlyutov, Artem Lutov, Mourad Khayati,

Dingqi Yang and not forgetting our friends Michele Catasta from EPFL and Monica Noselli. I

truly appreciate all the stimulating discussions, criticism, and motivation you provided

during my PhD.

I am also thankful for the time, support and encouragement of Carlo Curino who gave me the

opportunity to spend three months at Microsoft CISL. I would also like to thank the rest of the

CISL team: Chris Douglas, Russell Sears, Sriram Rao and Raghu Ramakrishnan, for their help,

availability and mentorship.

In addition, I would like to extend a warm thank you to all the extraordinary researchers I had

the chance to collaborate with and visit on multiple occasions and for numerous projects; in particular, I would be remiss not to thank Andy Pavlo, Eugene Wu, and Sean McKenna.

Finally, this work would not have been possible without my family and friends who always

encouraged me to pursue my dreams, and provided unconditional support throughout the

years. For their love, support and patience, I will eternally be thankful.

Fribourg, September 2015 Djellel Eddine Difallah


Abstract

Human-machine computation opens the door to a new class of applications that combine the scalability of computers with the as yet unmatched cognitive abilities of the human brain.

Such a new synergy is today possible thanks to the advent of programmable micro-task

crowdsourcing platforms that facilitate the recruitment and compensation of online users.

Today, crowdsourcing is leveraged in many fields including data management, information

retrieval and machine learning. For example, a crowd-powered data management system

makes it possible to process new types of tasks including subjective sorting, semantic joins, or

complex data integration. While the use of human-machine computation fills in a significant

gap in intelligent data processing, it often raises concerns about the overall Quality of Service

(QoS) guarantees that such hybrid systems can offer to the end-users, in terms of efficiency

and effectiveness of the collected results.

In this thesis, we investigate, design, and evaluate several techniques and algorithms that

improve the efficiency and the effectiveness of crowd-powered systems. We tackle the following crowdsourcing-specific QoS aspects: quality of responses, progress of batches of tasks,

and load-balancing of heterogeneous tasks among crowd workers. In order to improve those

aspects, we explore techniques stemming from expert finding, human resources, and cluster

management practices to derive our solutions that take into account inherent human-machine

differences, e.g., unpredictability, preferences, and poor context switching. Specifically, we

make the following contributions: (1) We reduce the error on multiple-choice tasks by continuously assessing the quality of the workers using probabilistic inference. Our model uses

signals from test tasks, peer consensus, and confidence scores obtained from machine-based

solvers. (2) We propose a task assignment model (push-crowdsourcing) that matches tasks

with potentially better-suited users. For that purpose, we index the workers’ profiles based on

their provided social network information. (3) We avoid the stagnation of a crowdsourcing

campaign by providing monetary incentives favoring worker retention. (4) We load-balance

tasks across multiple workers to improve the efficiency of a multi-tenant system. Here, we

adopt cluster scheduling approaches for their scalability and adapt them to reduce context

switching for the workers.

We experimentally show that, by using such approaches, one can improve the quality of

the answers provided by the crowd, boost the speed of crowdsourcing campaigns, and load-

balance the crowd workforce across heterogeneous tasks.

Keywords: Crowdsourcing, Crowd-Powered Systems, Human Computation, Quality of Service.


Zusammenfassung

Human-machine computation opens up a new class of applications that combine the scalability of computers with the so far unmatched cognitive abilities of the human brain. This synergy is possible today thanks to new programmable micro-task crowdsourcing platforms, which make it possible to recruit and compensate workers on the Internet. Today, crowdsourcing is used in many different fields such as data management, information retrieval, and machine learning. For example, a data management system supported by the crowd can carry out new kinds of processes such as subjective sorting, semantic joins, or complex data integration. Involving humans to enable intelligent data processing fills a significant gap, but it does so at the expense of the Quality of Service (QoS) guarantees that such a hybrid system can offer to the end user, particularly with respect to the efficiency and effectiveness of the collected answers.

In this thesis, we investigate and develop several methods and algorithms that improve the efficiency and the effectiveness of crowd-powered systems. We address the following crowdsourcing-specific QoS aspects: the task error rate, the quality of the answers, the progress of batches of tasks, and the distribution of heterogeneous tasks among crowd workers. We tackle these topics by drawing on techniques from expert finding, human resources, and cluster management practices, and our resulting solutions account for inherent human-machine differences such as unpredictability, preferences, and poor context switching. We demonstrate progress on the following points: 1) We reduce the error rates of multiple-choice tasks by continuously assessing the quality of the workers with probabilistic inference. Our model uses signals from test tasks, from peer consensus, and from confidence scores produced by machine-based solvers. 2) We present a task assignment model (push-crowdsourcing) that matches tasks with potentially better-suited users. For this purpose, we draw on the profiles that the workers provide on social networks. 3) We prevent the stagnation of a crowdsourcing campaign through monetary incentives that improve worker retention. 4) We load-balance tasks across multiple workers to improve the efficiency of a multi-tenant system. For this, we adopt cluster scheduling approaches for their scalability and further adapt them to minimize context switching for the workers.

In experiments, we show that by using these approaches we can improve the quality of the answers provided by the crowd, shorten the execution of crowdsourcing campaigns, and distribute the crowd workforce across heterogeneous tasks.

Keywords: Crowdsourcing, Crowd-Powered Systems, Human Computation, Quality of Service.


Résumé

Combining the unmatched cognitive abilities of the human brain with the computing power of machines paves the way for a new type of intelligent system, known as hybrid human-machine systems. Such a synergy is now possible and accessible thanks to the advent of crowdsourcing platforms, which facilitate the recruitment and compensation of transient online workers. Nowadays, crowdsourcing is used in various domains, including data management, information retrieval, and machine learning. For example, a hybrid human-machine data management system makes it possible to process new types of queries, including subjective sorting, semantic joins, or the integration of the most complex data sources. While the use of crowdsourcing fills an important need in intelligent data processing, this practice raises concerns about the Quality of Service (QoS) that can be guaranteed to its users.

In this thesis, we study and design several techniques and algorithms whose goal is to improve the efficiency and effectiveness of hybrid human-machine systems. We target the following crowdsourcing-specific QoS aspects: the error rate, the quality of the answers, the progress of batches of tasks, and the balancing of the workload among workers. To address these questions, we explore techniques ranging from expert finding to human resources practices and cluster management; our solutions thus take into account the inherent differences between humans and machines, for example unpredictability, preferences, and context switching. More precisely, we make the following contributions: (1) We reduce the error rate on multiple-choice tasks by continuously evaluating the quality of the workers using statistical models. (2) We propose a task distribution model (push-crowdsourcing) that matches a task with the best potential worker. To do so, we index the workers' profiles based on information extracted from social networks. (3) We avoid the stagnation of a crowdsourcing campaign by providing bonuses that favor worker retention. (4) We distribute tasks across multiple workers to improve the efficiency of a multi-tenant system. In this case, we reuse cluster management approaches and adapt them to reduce context switching for each worker.

In our experiments, we show that by using such methods we can improve the quality of the answers provided by anonymous workers, increase the speed of crowdsourcing campaigns, and balance the load of heterogeneous tasks.

Keywords: Crowdsourcing, Hybrid Human-Machine Systems, Quality of Service.


Contents

Acknowledgements v

Abstract (English/Deutsch/Français) vii

List of figures xvii

List of tables xxi

1 Introduction 1

1.1 Crowdsourcing and Human Computation . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Human Computation and Micro-tasks . . . . . . . . . . . . . . . . . . . . 2

1.1.2 The Amazon Mechanical Turk Marketplace . . . . . . . . . . . . . . . . . 3

1.2 Crowd-powered Algorithms and Systems . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Quality of Service in Crowd-powered Systems . . . . . . . . . . . . . . . . . . . . 4

1.4 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4.1 Additional Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 What this Thesis is Not About . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Background in Crowd-Powered Systems 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Crowd-Powered Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Crowd-Powered Database Systems . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.2 Crowd-Powered Database Operators . . . . . . . . . . . . . . . . . . . . . 11

2.2.3 Crowd-Powered Systems in Other Communities . . . . . . . . . . . . . . . 12

2.2.4 Languages and Toolkits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Task Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 Task Repetitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.2 Test Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.3 Result Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.4 Task Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 Task Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


3 An Analysis of the Amazon Mechanical Turk Crowdsourcing Marketplace 19

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 The Evolution of Amazon MTurk From 2009 to 2014 . . . . . . . . . . . . . . . . 21

3.3.1 Crowdsourcing Platform Dataset . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.2 A Data-driven Analysis of Platform Evolution . . . . . . . . . . . . . . . . 21

3.4 Large-Scale HIT Type Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4.1 Supervised HIT Type Classification . . . . . . . . . . . . . . . . . . . . . . 26

3.4.2 Task Type Popularity Over Time . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5 Analyzing the Features Affecting Batch Throughput . . . . . . . . . . . . . . . . . 28

3.5.1 Machine Learning Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.5.2 Throughput Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5.3 Features Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.6 Market Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.6.1 Supply Attracts New Workers . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6.2 Demand and Supply Periodicity . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Human Intelligence Task Quality Assurance 37

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.1.1 The Entity Linking and Instance Matching Use-Cases . . . . . . . . . . . 38

4.2 Preliminaries on the EL and IM Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 ZenCrowd Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3.2 LOD Index and Graph Database . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3.3 Probabilistic Graph & Decision Engine . . . . . . . . . . . . . . . . . . . . 44

4.3.4 Extractors, Algorithmic Linkers & Algorithmic Matchers . . . . . . . . . . 44

4.3.5 Three-Stage Blocking for Crowdsourcing Optimization . . . . . . . . . . . 45

4.3.6 Micro-Task Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.4 Effective Instance Matching based on Confidence Estimation and Crowdsourcing 46

4.4.1 Instance-Based Schema Matching . . . . . . . . . . . . . . . . . . . . . . . 47

4.4.2 Instance Matching with the Crowd . . . . . . . . . . . . . . . . . . . . . . . 48

4.5 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.5.1 Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.5.2 Reaching a Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5.3 Updating the Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5.4 Selective Model Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.6 Experiments on Instance Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.6.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.6.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


4.7 Experiments on Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.8 Related Work on Entity Linking and Instance Matching . . . . . . . . . . . . . . . 68

4.8.1 Instance Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.8.2 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5 Human Intelligence Task Routing 71

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2.2 HIT Generation, Difficulty Assessment, and Reward Estimation . . . . . 73

5.2.3 Crowd Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.4 Worker Profile Linker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2.5 Worker Profile Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2.6 HIT Assigner and Facebook App . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2.7 HIT Result Collector and Aggregator . . . . . . . . . . . . . . . . . . . . . . 77

5.3 HIT Assignment Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.3.1 Category-based Assignment Model . . . . . . . . . . . . . . . . . . . . . . 77

5.3.2 Expert Profiling Assignment Model . . . . . . . . . . . . . . . . . . . . . . 78

5.3.3 Semantic-Based Assignment Model . . . . . . . . . . . . . . . . . . . . . . 79

5.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.4.2 Motivation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.4.3 SocialBrain{r} Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4.4 Evaluation of HIT Assignment Models . . . . . . . . . . . . . . . . . . . . . 83

5.4.5 Comparison of HIT Assignment Models . . . . . . . . . . . . . . . . . . . . 84

5.5 Related Work in Task Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.5.1 Crowdsourcing over Social Networks . . . . . . . . . . . . . . . . . . . . . 85

5.5.2 Task Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.5.3 Expert Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6 Human Intelligence Task Retention 89

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.2 Worker Retention Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2.2 Pricing Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2.3 Visual Reward Clues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.2.4 Pricing Schemes for Different Task Types . . . . . . . . . . . . . . . . . . . 93

6.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


6.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.3.3 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.5 Related Work on Worker Retention and Incentives . . . . . . . . . . . . . . . . . 99

6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7 Human Intelligence Task Scheduling 103

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.1.1 Motivating Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.2 Scheduling on Amazon MTurk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.2.1 Execution Patterns on Micro-Task Crowdsourcing Platforms . . . . . . . 106

7.2.2 A Crowd-Powered DBMS Scheduling Layer on top of AMT . . . . . . . . . 107

7.3 HIT Scheduling Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.3.1 HIT Scheduling: Problem Definition . . . . . . . . . . . . . . . . . . . . . . 109

7.3.2 HIT Scheduling Requirement Analysis . . . . . . . . . . . . . . . . . . . . 109

7.3.3 Basic Space-Sharing Schedulers . . . . . . . . . . . . . . . . . . . . . . . . 111

7.3.4 Fair Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.3.5 Gang Scheduling for Collaborative HITs . . . . . . . . . . . . . . . . . . . . 113

7.3.6 Crowd-aware Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

7.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.4.2 Micro Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.4.3 Scheduling HITs for the Crowd . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.4.4 Live Deployment Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 122

7.5 Related Work on Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

8 Conclusions 127

8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

8.1.1 Toward Crowdsourcing Platforms with an Integrated CrowdManager . . . 128

8.1.2 Worker Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

8.1.3 HIT Recommender System . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

8.1.4 Crowd-Powered Big Data Systems . . . . . . . . . . . . . . . . . . . . . . . 130

8.1.5 Social and Mobile Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . 130

8.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


List of Figures

1.1 Examples of Human Computation applications combining the efficiency of

machines with the effectiveness of humans. . . . . . . . . . . . . . . . . . . . . . 2

1.2 The Mturk worker main interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 The CrowdManager interface with the four components that we propose in

this thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 A Human Intelligence Task mockup. . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 Batch throughput versus number of HITs available in the batch. The red line

corresponds to the maximum throughput we could have observed due to the

tracker periodicity constraints. For readability, this graph represents a subset of

3 months (January-March 2014), and HITs with rewards $0.05 and less. . . . . . 22

3.2 The use of keywords to annotate HITs. Frequency corresponds to how many times a keyword was used, and AverageReward corresponds to the average

monetary reward of batches that listed the keyword. The size of the bubbles

indicates the average batch size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3 HITs with specific country requirements. On the left-hand side, the countries

with the most HITs dedicated to them. On the right-hand side, the time evolution

(x-axis) of country-specific HITs with volume (y-axis) and reward (size of data

point) information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4 Keywords for HITs restricted to specific countries. . . . . . . . . . . . . . . . . . . 24

3.5 Popularity of HIT reward values over time. . . . . . . . . . . . . . . . . . . . . . . 25

3.6 Requester activity and total reward on the platform over time. . . . . . . . . . . 25

3.7 The distribution of batch sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.8 Average and maximum batch size per month. The monthly median is 1. . . . . 27

3.9 Popularity of HIT types over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.10 Predicted vs actual batch throughput values for δ = 4 hours. The prediction

works best for larger batches having a large momentum. . . . . . . . . . . . . . . 30

3.11 Computed feature importance when considering a larger training window for

batch throughput prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.12 The effect of newly arrived HITs on the work supplied. Here, the supply is ex-

pressed as the percentage of HITs completed in the market. . . . . . . . . . . . . 33


3.13 Computed autocorrelation on the number of HITs available and on the weekly

moving average of the completed reward (N.B., autocorrelation’s Lag is computed

in Hours). In both cases, we clearly see a weekly periodicity (0-250 Hours). . . . 34

4.1 The architecture of ZenCrowd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 The Label-only instance matching HIT interface, where entities are displayed as

textual labels linking to the full entity descriptions in the LOD cloud. . . . . . . 49

4.3 The Molecule instance matching HIT interface, where the labels of the entities

as well as related property-value pairs are displayed. . . . . . . . . . . . . . . . . 50

4.4 The Entity Linking HIT interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.5 An entity factor-graph connecting two workers (wi), six clicks (cij), and three candidate matchings (mj). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.6 Maximum achievable Recall by considering top-K results from the inverted

index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.7 Precision and Recall as compared to Matching confidence values. . . . . . . . . 57

4.8 Number of tasks generated for a given confidence value. . . . . . . . . . . . . . . 58

4.9 ZenCrowd money saving by considering results from top-K workers only. . . . . 59

4.10 Distribution of the workers’ precision using the Molecule design as compared to

the number of tasks performed by the workers. . . . . . . . . . . . . . . . . . . . 60

4.11 Average Recall of candidate selection when discriminating on max relevance

probability in the candidate URI set. . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.12 Performance results (Precision, Recall) for the automatic approach. . . . . . . . 64

4.13 Per document task effectiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.14 Crowdsourcing results with two different textual contexts. . . . . . . . . . . . . . 65

4.15 Comparison of three linking techniques. . . . . . . . . . . . . . . . . . . . . . . . 66

4.16 Distribution of the workers’ Precision for the Entity Linking task as compared to

the number of tasks performed by the worker (top) and task Precision with top k

workers (bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.17 Number of HITs completed by each worker for both IM and EL ordered by most

productive workers first. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.1 Pick-A-Crowd Component Architecture. Task descriptions, Input Data, and a

Monetary Budget are taken as input by the system, which creates HITs, estimates

their difficulty and suggests a fair reward based on the skills of the crowd. HITs

are then pushed to selected workers and results get collected, aggregated, and

finally returned back to the requester. . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Screenshots of the SocialBrain{r} Facebook App. Above, the dashboard displaying

HITs available to a specific worker. Below, a HIT about actor identification

assigned to a worker who likes several actors. . . . . . . . . . . . . . . . . . . . . 76

5.3 An example of the Expert Finding Voting Model. . . . . . . . . . . . . . . . . . . . 78

5.4 Crowd performance on the cricket task. Square points indicate the 5 workers

selected by our graph-based model that exploits entity type information. . . . . 81


5.5 Crowd performance on the movie scene recognition task as compared to movie

popularity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.6 SocialBrain{r} Crowd age distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.7 SocialBrain{r} Notification click rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.8 SocialBrain{r} Crowd Accuracy as compared to the number of relevant Pages a

worker likes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1 The classic distribution of work in crowdsourced tasks follows a long-tail dis-

tribution where few workers complete most of the work while many workers

complete just one or two HITs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.2 Screenshot of the Bonus Bar used to show workers their current and total reward. 93

6.3 Screenshot of the Bonus Bar with next milestone and bonus. . . . . . . . . . . . 93

6.4 Effect of different bonus pricing schemes on worker retention over three different

HIT types. Workers are ordered by the number of completed HITs. . . . . . . . . 94

6.5 Average of the HITs execution time with standard error ordered by their sequence

in the batch. Results are grouped by worker category (long, medium and short

term workers). In many cases, the Long term workers improve their HIT time

execution. This is expected to have a positive impact on the overall batch latency. 96

6.6 Overall precision per worker and category of worker for the Butterfly Classifica-

tion task (using Increasing Bonus). . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.7 Results of five independent runs of A, B and C setups. Type A batches include the

retention focused incentive while Type B is the standard approach using fixed

pricing, Batch C uses a higher fixed pricing – but leveraging the whole bonus

budget. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.1 Caption for LOF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.2 The role of the HIT Scheduler in a Multi-Tenant Crowd-Powered System Archi-

tecture (e.g., a DBMS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7.3 Results of a crowdsourcing experiment involving 100+ workers concurrently

working in a controlled setting on a HIT-BUNDLE containing heterogeneous

HITs (B1-B5, see section 7.4) scheduled with FS. (a) Throughput (measured in

HITs/minute) increases with an increasing number of workers involved. (b)

Amount of work done by each worker. . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.4 A performance comparison of batch execution time using different grouping

strategies publishing a large batch of 600 HITs vs smaller batches (From B6). . . 117

7.5 A performance comparison of batch execution time using different grouping

strategies publishing two distinct batches of 192 HITs separately vs combined

inside an HIT-BUNDLE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.6 Average Execution time for each HIT submitted from the experimental groups

RR, SEQ10 and SEQ25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.7 Scheduling approaches applied to the crowd. . . . . . . . . . . . . . . . . . . . . 120

7.8 (a) Effect of increasing B2 priority on batch execution time. (b) Effect of varying

the number of crowd workers involved in the completion of the HIT batches. . 121


7.9 An example of a successful scheduling of a collaborative task involving 3 workers

within a window of 10 seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

7.10 Accuracy and precision of gang scheduling methods. . . . . . . . . . . . . . . . . 122

7.11 Average execution time per HIT under different scheduling schemes. . . . . . . 123

7.12 CDF of different batch sizes and scheduling schemes. . . . . . . . . . . . . . . . 124

7.13 Worker allocation with FS, WCFS and classical individual batches in a live de-

ployment of a large workload derived from crowdsourcing platform logs. Each

color represents a different batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

8.1 The concept of the Flow Theory [42]. . . . . . . . . . . . . . . . . . . . . . . . . . 129


List of Tables

3.1 Gini importance of the top 2 features used in the prediction experiment. A large

mean indicates a better overall contribution to the prediction. A positive slope

indicates that the feature is gaining in importance when the considered time

window is larger. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 Top ranked schema element pairs in DBPedia and Freebase for the Person,

Location, and Organization instances. . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2 Crowd Matching Precision over two different HIT design interfaces (Label-only

and Molecule) and two different aggregation methods (Majority Vote and Zen-

Crowd). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.3 Matching Precision for purely automatic and hybrid human/machine approaches. 57

4.4 Correct and incorrect matchings as by crowd Majority Voting using two different

HIT designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.5 Performance results for the candidate selection approach. . . . . . . . . . . . . . 62

4.6 Performance results for crowdsourcing with majority vote over linkable entities. 64

4.7 Performance results for crowdsourcing with ZenCrowd over linkable entities. . 65

5.1 A comparison of the task accuracy for the AMT HIT assignment model assigning

each HIT to the first 3 and 5 workers and to AMT Masters. . . . . . . . . . . . . . . 83

5.2 A comparison of the effectiveness for the category-based HIT assignment models

assigning each HIT to 3 and 5 workers with manually selected categories. . . . . 84

5.3 Effectiveness for different HIT assignments based on the Voting Model assigning

each HIT to 3 and 5 workers and querying the Facebook Page index with the task

description q = ti and with candidate answers q = Ai respectively. . . . . . . . . 84

5.4 Effectiveness for different HIT assignments based on the entity graph in the

DBPedia knowledge base assigning each HIT to 3 and 5 workers. . . . . . . . . . 85

5.5 Average Accuracy for different HIT assignment models assigning each HIT to 3

and 5 workers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6.1 Statistics for the three different HIT types. . . . . . . . . . . . . . . . . . . . . . . 94

6.2 Statistics of the second experimental setting – English Essay Correction . . . . . 98

7.1 Description of the batches constituting the dataset used in our experiments. . 116


1 Introduction

Big data is revolutionizing the way businesses operate by supporting decision processes

thanks to massive data gathering and advanced data analysis. Beyond some of the identified

properties of big data (volume, velocity, variety) [149], the complexity of certain classes of

content and the ad-hoc nature of some analytical requests pose further challenges not yet

solved by using fully-automated algorithms. Still, unlocking the real potential of data often

resides in processing complex pieces of information that only humans can fully comprehend.

To that end, some companies outsource or hire full-time employees to perform tasks such as

data entry, data pre-processing and data integration. However, this approach quickly shows

its limitations as the volume of data to be processed increases and turnaround times become

critical.

Crowdsourcing has emerged as an alternative to outsourcing. It is defined as the act of creating

an open call to perform a job that anyone on the Internet can do [74]. In order to scale, such a

job is usually broken into micro-tasks that the crowd can perform in parallel, hence producing

faster results. More complex crowdsourcing scenarios can be put in place, e.g., job pipelines,

where the output of a task is used as input to another one.

The major counter-argument to the use of crowdsourcing is its inefficiency. In fact, asking

the crowd to process all the records of a large database is not only costly (although the result

can be of a higher value), it is also inherently bound by the crowd size and the speed of the

workers. This makes crowdsourcing impractical for cases where high data velocity and billions

of records are the norms. Moreover, crowd workers usually exhibit high error rates, which can

be rooted in multiple factors like fatigue, subjectivity, priming and even willingness to cheat.

In order to leverage crowdsourcing as a viable solution to complex data processing needs,

and potentially create an added value for the end-users (often called requesters), it is worth

investigating methods to integrate the effectiveness of crowdsourcing with the efficiency of

machines to maintain a high user experience in terms of latency, minimizing the overall cost

and maximizing the quality of crowdsourcing results.


Figure 1.1 – Examples of Human Computation applications combining the efficiency of machines with the effectiveness of humans: protein folding (FoldIT), image tagging (ESP Game), and data management (CrowdDB).

1.1 Crowdsourcing and Human Computation

First coined by Jeff Howe in his article “The Rise of Crowdsourcing” [74], the term crowdsourcing

is nowadays used to describe several types of activities that involve the crowd with different

incentives and expectations. Crowdfunding, for instance, consists in raising money from the

crowd to support a project [99]. Another example is Citizen Journalism, where the crowd

contributes pieces of information like reports, photos, and videos to create novel channels for

news gathering [11].

1.1.1 Human Computation and Micro-tasks

In our context, we refer to crowdsourcing as the general paradigm that leverages human

abilities to solve problems that a computer is not yet capable of solving with acceptable

precision (if at all); we commonly call this concept Human Computation (HC) [157]. In order

to tap into the power of Human Computation at scale, one needs to offer proper incentives to

the crowd, e.g., monetary reward, fun, altruism or social recognition. Figure 1.1 illustrates the

basic interaction between a backend system and a crowdsourcing platform.

In this thesis, we are interested in paid micro-task crowdsourcing, where the crowd is asked

to perform short tasks, also known as Human Intelligence Tasks (HITs), in exchange for a

small monetary reward per unit. Popular examples of such tasks include: spell checking of

short paragraphs, sentiment analysis of tweets, rewriting product reviews, or transcription of

scanned shopping receipts.

Micro-task crowdsourcing has gained momentum with the emergence of online labor marketplaces, which facilitate the interaction between requesters and potential workers. A typical

crowdsourcing platform would work as follows: First the requesters design the HIT interface

based on their input data and desired outcome. Next, they publish the HIT onto the crowdsourcing platform, specifying the promised monetary reward in exchange for the completion

of each HIT. Next, those workers willing to perform the published HITs complete the tasks and

submit their results back to the requester, who obtains the desired output and compensates

the workers accordingly. There are many popular platforms that offer such services including

Amazon Mechanical Turk (AMT) [1], ClickWorker [2], CloudFactory [3], and CrowdFlower [4].
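This requester-side workflow can be scripted end to end against the platform's API. The snippet below is a minimal sketch using the boto3 MTurk client; it is not the tooling used in this thesis, and the title, reward, lifetime, and question form are placeholder values chosen purely for illustration.

    import boto3

    def run_batch(question_xml, n_hits, reward="0.05"):
        """Publish a small batch of identical HITs, then collect and approve submitted work."""
        mturk = boto3.client("mturk", region_name="us-east-1")

        # 1) The requester publishes the HITs, promising a per-assignment reward.
        hit_ids = []
        for _ in range(n_hits):
            hit = mturk.create_hit(
                Title="Label a short text snippet",
                Description="Read one sentence and pick the best label.",
                Reward=reward,                    # paid per approved assignment
                MaxAssignments=3,                 # redundancy: 3 workers per HIT
                LifetimeInSeconds=24 * 3600,      # how long the HIT stays on the marketplace
                AssignmentDurationInSeconds=300,  # time a worker gets per assignment
                Question=question_xml,            # an HTMLQuestion/QuestionForm XML document
            )
            hit_ids.append(hit["HIT"]["HITId"])

        # 2) Workers pick the HITs from the marketplace; collect what they submitted so far.
        answers = {}
        for hit_id in hit_ids:
            submitted = mturk.list_assignments_for_hit(
                HITId=hit_id, AssignmentStatuses=["Submitted"]
            )["Assignments"]
            answers[hit_id] = [a["Answer"] for a in submitted]
            # 3) Approving an assignment triggers the payment to the worker.
            for a in submitted:
                mturk.approve_assignment(AssignmentId=a["AssignmentId"])
        return answers

In practice, a requester has to poll repeatedly (or register for notifications), since nothing guarantees when, or whether, the assignments will be completed; this lack of guarantees is precisely the Quality of Service problem studied in this thesis.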

1.1.2 The Amazon Mechanical Turk Marketplace

Most of the experiments in this thesis were conducted on Amazon Mechanical Turk. AMT is the oldest and most popular micro-task crowdsourcing platform, and it has a continuous flow

of workers and requesters. AMT provides programmatic Application Programming Interfaces

(APIs) as well as a Web interface for requesters to design and deploy online tasks, and its

activity logs are available to the public [76] and were used in the context of this thesis to

perform an analysis tracing its evolution (see Chapter 3).

AMT adopts a pull methodology, where all the published tasks are publicly presented on a

search-based dashboard (see Figure 1.2). The workers can pick their preferred tasks on a

first-come-first-served basis.

Figure 1.2 – The MTurk worker main interface.

From a requester perspective, the pull crowdsourcing approach has several advantages, including simplicity and the minimization of task completion times, since any available worker from the crowd can pick and perform any HIT, provided that they meet some prerequisites set

by the requester. From a worker perspective, it creates competition among requesters, and

potentially leads to high HIT standards in terms of interface design, quality, and pricing.

On the other hand, pull crowdsourcing limits the platform's ability to offer any

form of service guarantees to its customers (i.e., the requesters). For example, this mechanism

cannot guarantee priority to a requester who has a deadline, and often the only effective

lever consists in increasing the unit reward of the HITs to attract more workers [12]. It also

cannot guarantee that the worker who performs the task is the best fit, as more knowledgeable

workers might be available within the crowd, but are unable to pick the task on time.

1.2 Crowd-powered Algorithms and Systems

Modern crowdsourcing platforms offer programmatic APIs in order to post HITs, monitor their

progress, collect the results and distribute the rewards. Hence, the idea of combining human

computation and computers to produce a new breed of hybrid human-machine algorithms has found an opportunity to materialize: not only can the crowd be invoked programmatically using a declarative language, but the process itself can be parametrized, monitored, and embedded in long-running jobs.

A direct application of this idea goes naturally with the class of machine-learning algorithms

that produce their results along with a confidence score. A generic hybrid scheme consists in



falling back to the crowd to increase the precision of the results whenever the confidence of

the generated solution falls below a predefined threshold. Another application is in active

learning, where a classification algorithm would repeatedly collect training labels from the

crowd – as opposed to a limited number of human operators [123]. Likewise, we refer to the

class of computer systems that would involve the crowd at some point in their execution

as Crowd-Powered Systems. A canonical example is CrowdDB [64], a relational database

management system with an augmented SQL syntax that would trigger queries to execute on

AMT, asking the crowd to perform a predefined data processing operation.
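As a deliberately simplified illustration of this fallback scheme, the sketch below keeps the machine-generated label when its confidence is high enough and routes the item to the crowd otherwise. The classifier, the crowd-query function, and the 0.9 threshold are hypothetical stand-ins, not components of any specific system discussed later.

    from typing import Callable, List, Tuple

    Label = str

    def hybrid_classify(
        items: List[str],
        machine: Callable[[str], Tuple[Label, float]],  # returns (label, confidence)
        crowd: Callable[[str], Label],                  # e.g., posts a HIT and aggregates answers
        threshold: float = 0.9,
    ) -> List[Label]:
        """Keep machine labels when they are confident enough, otherwise ask the crowd."""
        labels = []
        for item in items:
            label, confidence = machine(item)
            if confidence < threshold:
                # Low machine confidence: fall back to human computation for this item.
                label = crowd(item)
            labels.append(label)
        return labels

The threshold trades cost and latency (how often the crowd is invoked) against the precision gained from human judgment.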

1.3 Quality of Service in Crowd-powered Systems

Humans and machines behave fundamentally differently: While machines can deal with

large volumes of data, with real-time streams, and with flocks of concurrent users interacting

with the system, crowdsourcing is currently seen mostly as a batch-oriented, offline data

processing paradigm. Today, crowdsourcing platforms do not provide any guarantees on

task completion times due to the unpredictability of the crowd workers, who are free to come

and go at any moment, and to selectively focus on an arbitrary subset of the available tasks

only. Moreover, the quality of the provided answers can vary dramatically for the same worker

and across workers. These are the hazards that any crowd-powered system needs to deal with

automatically in order to provide better services to its end-users.

Quality of Service (QoS) is a concept that is mostly used in telephony and computer networks.

It refers to the measures taken to improve (and sometimes guarantee) the overall performance

perceived by the users in terms of throughput, error rate, and latency, among other domain-specific

metrics. In this thesis, we specifically investigate effectiveness and efficiency as QoS aspects



that need to be improved in a crowd-powered system. We define these two aspects and the

scope that we consider in Section 1.4.

We note that the Quality of Service that we are targeting in this thesis is best effort. This

limitation is due to several inherent reasons: 1) the crowd is not employed and thus not bound

by any contract, and 2) the size of the available workforce can vary widely throughout the day.

Under these conditions, we can potentially model and predict the execution time of a batch

of HITs [62], but not enforce a given promise to the requesters (e.g., in order to finish a task

before a given deadline).

1.4 Summary of Contributions

The goal of this thesis is to: “Investigate, design, and evaluate methods and algorithms that

improve the effectiveness and efficiency of crowd-powered systems”. In practice, we implement

several modules as part of a CrowdManager which, in essence, can be thought of as a smart

network interface that manages and improves the exchanges between the backend system

and the target crowdsourcing platform (see Figure 1.3).

Figure 1.3 – The CrowdManager interface with the four components that we propose in this thesis.

In the following, we detail our contributions and list the associated conference and journal

papers that we have published along our research work. We tackle two major areas that pertain

to the Quality of Service in crowdsourcing, namely: Effectiveness and Efficiency.

A) Effectiveness designates the ability of a system to produce the desired results. Our focus

in that regard is to ensure the high quality of the collected results; in that context, we

investigate the following approaches:

HIT Quality Assurance: We first tackle the issue of aggregating HIT responses obtained

from multiple crowd workers. We propose an aggregation mechanism to lower

error rates, specifically for tasks with multiple-choice questions. Our approach

consists in using a probabilistic model and weighted voting when aggregating the

responses of multiple workers for a given task.

Our first work, ZenCrowd, focused on an entity linking use-case, where the aggregation mechanism was used to enhance the results of an automatic entity

linker:

Demartini, Gianluca, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. "ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-

scale entity linking." Proceedings of the 21st international conference on World Wide

Web. ACM, 2012.

In follow-up work, we extend the use-case to cover instance matching tasks:

Demartini, Gianluca, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. "Large-

scale linked data integration using probabilistic reasoning and crowdsourcing." The

VLDB Journal 22.5 (2013): 665-687.
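ZenCrowd's actual model is a probabilistic factor graph over workers, their clicks, and the candidate answers (see Chapter 4). The sketch below only conveys the weighted-voting intuition behind it: each worker's vote counts proportionally to a reliability estimate that is assumed to be maintained elsewhere (e.g., from test questions), and the option with the largest accumulated weight wins.

    from collections import defaultdict
    from typing import Dict, Hashable

    def weighted_vote(
        answers: Dict[str, Hashable],    # worker id -> selected option
        reliability: Dict[str, float],   # worker id -> estimated probability of answering correctly
        default_reliability: float = 0.5,
    ) -> Hashable:
        """Aggregate one multiple-choice task by reliability-weighted voting."""
        scores = defaultdict(float)
        for worker, option in answers.items():
            scores[option] += reliability.get(worker, default_reliability)
        # Return the option with the largest accumulated weight.
        return max(scores, key=scores.get)

    # Example: two unreliable workers are outvoted by a single reliable one.
    print(weighted_vote({"w1": "A", "w2": "A", "w3": "B"},
                        {"w1": 0.4, "w2": 0.4, "w3": 0.9}))  # prints B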

HIT Routing: Next, we explore an alternative approach to pull crowdsourcing that

actively selects and pushes tasks to specific crowd workers who might be able to

provide better answers to particular tasks. Although HITs usually do not require

any expertise, we can nevertheless leverage the general knowledge of the workers to

find a match. For example, one can assign a novel translation task to a worker who

likes novels. Then, we apply expert finding techniques to match HITs to our crowd

participants based on indexed profiles that we build from their social network

information. Our experimental system, Pick-A-Crowd, is a custom Facebook [5]

application that assigns tasks to its users automatically based on what they liked

on the social network.

Difallah, Djellel Eddine, Gianluca Demartini, and Philippe Cudré-Mauroux. "Pick-

a-crowd: tell me what you like, and i’ll tell you what to do." Proceedings of the 22nd

international conference on World Wide Web. ACM, 2013.

B) Efficiency designates the ability of a system to make the best use of the time, effort, and

budget in carrying out the task at hand. In this thesis, we are interested in reducing the latency and in enabling HIT prioritization in batches of homogeneous or heterogeneous

HITs. We focus specifically on:

HIT Retention: Batches of tasks published on a crowdsourcing platform might be sub-

ject to slow progress and even to stagnation, especially when only a few tasks are

left. We investigate worker retention as a new dimension in increasing crowdsourc-

ing throughput and avoiding stagnation. In our work, we achieve worker retention

by granting punctual bonuses to the active workers.

Difallah, Djellel Eddine, Michele Catasta, Gianluca Demartini, and Philippe Cudré-

Mauroux. "Scaling-Up the Crowd: Micro-Task Pricing Schemes for Worker Retention

and Latency Improvement." Second AAAI Conference on Human Computation and

Crowdsourcing. 2014.

HIT Scheduling: Crowd-powered systems can be multi-tenant, i.e., supporting work-

loads generated by concurrent users. A traditional approach would publish a new

batch of tasks on the crowdsourcing platform for each incoming query. We ar-

gue that this is suboptimal for the overall efficiency of the system. Instead, we

propose to bundle heterogeneous tasks in a single batch. We take control of the


HIT serving schedule in order to seamlessly load-balance the available workers

on multiple heterogeneous HITs. This work is currently submitted for peer review

(see Chapter 7).

1.4.1 Additional Contributions

In addition to the core contributions of this thesis, which are listed above, we also published

the following pieces of work related to crowdsourcing.

1. We studied the data collected from AMT over the past five years, and analyzed a number

of key dimensions of the platform (see Chapter 3).

Difallah, Djellel Eddine, Michele Catasta, Gianluca Demartini, Panagiotis G. Ipeirotis,

and Philippe Cudré-Mauroux. "The Dynamics of Micro-Task Crowdsourcing – The Case

of Amazon MTurk". Proceedings of the 24th international conference on World Wide Web.

ACM, 2015.

2. We presented a position paper where we first review the techniques currently used to

detect spammers and malicious workers (whether they are bots or humans) randomly

or semi-randomly completing tasks. Then, we describe the limitations of existing

techniques by proposing approaches that individuals, or groups of individuals, could

use to attack a task on a crowdsourcing platform.

Difallah, Djellel Eddine, Gianluca Demartini, and Philippe Cudré-Mauroux. "Mechani-

cal Cheat: Spamming Schemes and Adversarial Techniques on Crowdsourcing Platforms."

CrowdSearch. 2012.

3. We contributed to Hippocampus, a “Transactive Search” system that answers memory-

based queries by involving a group of people who have vivid memories of an event or

an interaction. In that work, we compare automated methods, the AMT crowd, and personal

social networks.

Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer,

and Philippe Cudre-Mauroux. "Hippocampus: answering memory queries using transac-

tive search." Proceedings of the companion publication of the 23rd international confer-

ence on World wide web companion. International World Wide Web Conferences Steering

Committee, 2014.

4. As an extension to Transactive Search, we describe the necessary components, the

architecture and the research directions for building a “transactive data management

system” that leverages social networks and crowdsourcing. TransactiveDB allows users

to pose different types of transactive queries in order to reconstruct collective human

memories.

Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer,

and Philippe Cudre-Mauroux. "TransactiveDB: Tapping into Collective Human Memo-

ries." Proceedings of the VLDB Endowment 7.14 (2014).


1.5 What this Thesis is Not About

Human computation is a multidisciplinary field, spanning from Human-Computer Inter-

action (HCI) to game theory, computer science, business, economics, social science, and

psychology. There might be a benefit in integrating multiple aspects from those research

areas to achieve better QoS in crowd-powered systems. In fact, other research agendas aim

at designing better user interfaces to obtain faster response times for a particular task [100],

adding gamification elements to eliminate or reduce the cost [156], or leveraging social networks

to increase the audience [54]. However, given that the field is still in its infancy, we choose to

focus on problems that aim at enhancing such systems by solely considering paid micro-task

crowdsourcing as a paradigm, and without necessarily integrating other techniques.

We do not build specific crowd-powered operators (see Section 2.2.2) but rather assume

the ones supported by the crowd-powered system. Often, in the literature, such operators

propose ad-hoc quality assurance methods. We believe that these methods must be the raison

d’être of a separate and extensible quality assurance module.

Finally, we do not aim at creating a new crowdsourcing platform, although we often felt that some key features that could greatly benefit the QoS were missing from AMT, and in a few cases we built custom solutions to showcase those benefits (see Chapter 5).

1.6 Outline

We organize the rest of this thesis as follows. Chapter 2 reviews relevant work on crowd-

powered systems and algorithms, in addition to existing methods tackling effectiveness and

efficiency in crowdsourcing. Next, in Chapter 3 we delve into an analysis of AMT; understanding

and characterizing our target crowdsourcing platform will eventually help us formulate some

of our design choices. In Chapter 4 we present and evaluate our quality assurance method. We

center our study around a data integration use-case, where we create a hybrid human-machine

system for entity linking and instance matching. Chapter 5 introduces push crowdsourcing,

a model that “routes” tasks to better fit workers. Our prototype is a Facebook application

that suggests tasks to its users based on their social profiles. In Chapter 6 we consider worker

retention as a new dimension in speeding up the execution of a batch of tasks and minimizing

its stagnation. We evaluate several bonus schemes that aim at retaining crowd workers longer

on a given batch. Chapter 7 introduces and evaluates scheduling algorithms that optimize

the execution of heterogeneous tasks while minimizing context switching for the workers. We

conclude with Chapter 8, which summarizes our main findings, future directions, as well as our

outlook for future developments in crowdsourcing.


2 Background in Crowd-Powered Systems

2.1 Introduction

Since its introduction in 2005 by Amazon Mechanical Turk, paid micro-task crowdsourcing has

been studied and applied for a range of purposes including entity resolution, entity linking,

schema matching, association rule mining, word sense disambiguation, relevance judgement,

and query answering. Such hybrid human-machine systems use crowdsourcing in order to

provide better solutions as compared to purely machine-based systems.

In the following, we give some background on algorithms and systems leveraging human

computation, and the techniques used for quality assurance, routing and scheduling of Human

Intelligence Tasks (HITs) which are the main themes covered in this thesis. Additional related

work is also covered in the corresponding chapters.

2.2 Crowd-Powered Systems

2.2.1 Crowd-Powered Database Systems

In the database community, hybrid human-machine data management systems were pro-

posed with CrowdDB [64], Qurk [117] and Deco [128]. While the architectural details of those

systems differ, their core concept suggests adding new modules to the system that interact

with the crowd via the target crowdsourcing platform API. The query language is extended

in order to support declarative operators that process selected records with the help of the

crowd. The query execution engine supports new query operators – usually in the form of

User Defined Functions (UDFs) – that encode a template comprised of a visual interface

descriptor1 to display the HIT, an operator signature that defines the input and the output

of each HIT, the task reward, optional HIT pre-requisites, in addition to any processing logic.

When such operators are invoked, their execution triggers the generation of HITs onto the

platform, through the API, along with the provided input.

1 Any format that the target crowdsourcing platform API supports, e.g., HTML, XML, JSON.

Figure 2.1 – A Human Intelligence Task mockup. Task: Gender Identification in CCTV Images. Instructions: “Please, look at the picture and select the correct matching gender of the individual appearing in it. If unsure, select ‘Unknown’.” Options: Female / Male / Unknown. Reward: $0.05. Remaining: 30582.

Consider the following use-case: The security cameras of a mall capture and store snapshots

into a database relation VISITORS. During a security check, the administrator needs to find

pictures of ’All male visitors from the last hour’. Since the gender information is not present

in any column of the relation, nor are the snapshots annotated, the only way to uncover this

information is by checking the image field for each record. Given the size of the table (several thousand records) and a short time window, the administrator decides to run his query using

a crowd-operator getGender(Type Image) which was added to the system beforehand. This

operator takes the snapshot as input, and is expected to return the gender of the pictured

subject. A mockup of the HIT interface as presented to the crowd is shown in Figure 2.1, and the query written by the user in Listing 2.1.

SELECT * FROM VISITORS v
WHERE getGender(v.picture) LIKE "male"
  AND v.date >= DATE_SUB(NOW(), INTERVAL 1 HOUR);

Listing 2.1 – Sample operator of a crowd-powered DBMS.
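To make the mechanism concrete, the following sketch shows how such a crowd operator could be wired to a platform API. CrowdClient, post_hit, and get_answers are hypothetical placeholders rather than an actual AMT or CrowdDB interface, and the aggregation step is a plain majority vote.

# Hypothetical sketch of a crowd-powered UDF: getGender(image).
# The crowd client and its methods are illustrative placeholders,
# not a real crowdsourcing platform API.
from collections import Counter

class GetGenderOperator:
    def __init__(self, crowd_client, reward=0.05, repetitions=3):
        self.crowd = crowd_client          # wrapper around the platform API (assumed)
        self.reward = reward               # reward per HIT, in USD
        self.repetitions = repetitions     # distinct workers per record

    def __call__(self, image_url):
        # 1. Generate one HIT per record, using the operator's interface template.
        hit_id = self.crowd.post_hit(
            title="Gender Identification in CCTV Images",
            template="gender_identification.html",
            input={"image": image_url},
            reward=self.reward,
            assignments=self.repetitions,
        )
        # 2. Wait for all assignments to complete (a real engine would batch
        #    records and poll the platform asynchronously).
        answers = self.crowd.get_answers(hit_id)
        # 3. Aggregate the noisy answers, here with a simple majority vote.
        return Counter(answers).most_common(1)[0][0]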

In the realm of database systems, crowdsourcing can be used to fill-in null values in tuples, or

to define subjective ‘ORDER BY’ operators that allow users to express queries such as ‘Sort

by scariest movie’. CrowdDB, in particular, goes beyond the usual closed world assumption

of database systems, which states that what is not present in the database is assumed not to exist.

In fact, CrowdDB supports operators that can add new tuples to a relation, e.g., ‘Insert the

name, address and phone number of bakeries in Boston, MA’. This presents new challenges

to database systems in how to handle query optimization, especially since the cardinality

of such tables is not previously known [114]. Selke et al. [141] extend these ideas to cover malleable database schemas, that is, the ability to add new columns by probing the crowd when relevant information might be associated with a record. More recently, Catasta et al. proposed

TransactiveDB [37], a system that reconstructs non-transcribed information from collective

memories; here, social networks and personal acquaintances are leveraged to find pieces of

information.


Jeffery et al. [82] propose to hide the process of crowdsourcing from the user by defining

the concept of Labor Independence. Their goal is to simplify the declarative language used by

systems like CrowdDB which forces the users to be aware of the underlying crowdsourcing

process at the record level. Instead, their system Arnold takes generic parameters (expected

quality and total budget) to automatically crowdsource records within those criteria.

2.2.2 Crowd-Powered Database Operators

A natural development in crowd-powered database systems was the study of SQL-like oper-

ators tailored to the crowd. For instance, Parameswaran et al. [129] investigate the filtering

operation, which consists in applying a set of conditions (or predicates) to filter out unmatching tuples. The main issue they tackle is how to reach a consensus from multiple noisy answers collected from the crowd, and to run additional tasks if required. In their work, they

propose both an optimal and a heuristic strategy.

Marcus et al. [116] studied the Join and Sort operators, where they conclude that for a join

operation, a one-to-many join interface was optimal – as compared to a full pairwise cross-

join. On the other hand, for sort operations, they show that using a rating system instead

of a pairwise record comparison required far fewer HITs, while producing similar results. In a

subsequent work [115], Marcus et al. created a Count operator which again leverages batching

as an efficient technique to dramatically lower the number of HITs to crowdsource. Here,

batching consists in showing multiple records to workers and asking them to provide a close

estimate count. Wang et al. [165] take advantage of transitive relation properties to further

reduce the set of elements to crowdsource in the case of Join operators.

Guo et al. [69] focus on the Max operator, which finds the maximum element in a set using pairwise comparisons. The problem is far from obvious when optimizing the pairwise comparison operations; some workers might take longer to answer a HIT, or might provide an incorrect

answer; thus, the query is executed by arbitrary pairwise comparisons rather than a predefined

tournament-like order. The authors show that the problem at hand is NP-hard and provide

a heuristic to estimate the max, and a method to decide what pair to crowdsource next in

order to improve the results. Venetis et al. [154] develop a set of generic parameterized max

algorithms considering time, cost, and quality tradeoffs. Along the same line, top-k algorithms,

combining heuristics and crowdsourcing, have been proposed in [47, 109, 124, 131].

As introduced in the previous section, CrowdDB allows some operators to add new rows to a

relation. A sample application to this is item enumeration, with the popular query “List all

possible ice cream flavors”. Trushkowsky et al. [152] used a statistical approach, inspired by

species estimation algorithms, to reason about the progress of an enumeration query and

estimate the size of the set.
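As an illustration of the species-estimation idea, the sketch below applies the classical bias-corrected Chao1 richness estimator to a stream of crowd answers; the estimator itself is standard, but its use here is only meant to illustrate the approach, not to reproduce the exact method of [152].

from collections import Counter

def chao1_estimate(answers):
    """Estimate the total number of distinct items (e.g., ice cream flavors)
    from a stream of crowd answers, using the Chao1 species-richness estimator."""
    counts = Counter(answers)
    s_obs = len(counts)                              # distinct items seen so far
    f1 = sum(1 for c in counts.values() if c == 1)   # items seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # items seen exactly twice
    # Bias-corrected Chao1; close to s_obs once every item has been seen often.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Example: 8 distinct flavors observed, several only once -> estimate above 8.
print(chao1_estimate(["vanilla", "chocolate", "mint", "vanilla", "pistachio",
                      "lemon", "chocolate", "mango", "fig", "basil"]))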

The previous operators are mainly static, fulfilling their objective on a fixed set of elements.

Mahmood et al. [111] propose to use a crowdsourced index for operations that would require


search, update and deletion of records. Their Palm-Tree index is constructed with the help of

the crowd and is based on a B+Tree.

2.2.3 Crowd-Powered Systems in Other Communities

In the Information Retrieval community, crowdsourcing is especially appealing for creating

relevance judgments to evaluate the results of a search system. In fact, this operation is usually

carried out by expert judges. However, the latter cannot handle the requests of all researchers.

As such, crowdsourcing has been used to produce relevance judgments for documents [12],

books [85, 86], or entities [28].

Another example is the use of crowdsourcing to answer tail queries in Web Search Engines [25].

Tail queries are keywords that rarely appear in search engine logs, as opposed to more popular

terms. Here the goal is to ask the crowd to select the most appropriate link to a tail query

within a set of machine-selected candidate Web pages. More recently, Demartini et al. [51]

proposed CrowdQ, a system that helps answer search queries by leveraging the cognitive abilities of crowd workers. Although the system does not operate in real time, the crowd helps create generic templates that can be applied to future queries.

In the Semantic Web community, crowdsourcing has also been recently considered, for in-

stance, to link [50] or map [139] entities. In both cases, the use of crowdsourcing can signifi-

cantly improve the quality of generated links, or mappings, as compared to purely automatic

approaches. In the context of Natural Language Processing, games to crowdsource the Word

Sense Disambiguation task [140] have recently been proposed.

Amsterdamer et al. [13] introduce the concept of Crowd Mining, that is, retrieving interesting

facts and rules directly from the crowd. Specifically, they study the case of association rule

mining without a pre-existing database of transactions. Their system, CrowdMiner, asks the

crowd to provide directly such rules from their own experience, which is an interesting case

leveraging humans’ ability to summarise information and to infer facts.

In the case of large enterprises, knowledge is often distributed across a number of employees.

Crowdsourcing within an enterprise (i.e., when the crowd is composed of the company employees) is becoming popular and can benefit from the fact that employees are domain experts

and can solve tasks better and faster than anonymous crowds. In this case, crowdsourcing can

be used, for example, to efficiently find solutions to operational issues [160]. Crowdsourcing

has also been used in the biomedical domain where, for example, ontological relations among

diseases can be validated by the crowd [122, 132].

2.2.4 Languages and Toolkits

Apart from the regular AMT API, a whole ecosystem of tools and paradigms similar to programming languages tailored to crowdsourcing has started appearing [121]. Such methods expose


easier abstractions over the crowd, allowing system designers to transparently crowdsource

some of their processing needs.

For example, the requesters might have complex jobs that can be decomposed into pipelines of

micro-tasks (or HIT workflows). CrowdForge [91] provides a framework for task decomposition

and merging using map-reduce style of programming. The crowd is not only asked to complete

tasks but also to partition larger ones and merge results. CrowdWeaver [89] is a visual interface

that simplifies the management of such workflows and allows progress monitoring of the

whole system.

TurKit [108] automates the execution of iterative tasks without the manual intervention of an

operator. As an example, consider the case of spell checking a short paragraph. We can quickly

create a single HIT containing the original text. After the first iteration, TurKit takes the output

produced by the previous worker and creates a new HIT engaging a different worker. This

process then continues until we reach a predefined number of iterations.
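A minimal sketch of this iterative pattern follows; post_improvement_hit is a hypothetical helper that publishes a single HIT asking a (different) worker to improve the given text and blocks until the revision is returned.

def iterative_improvement(text, iterations, post_improvement_hit):
    # TurKit-style iteration: each worker refines the output of the previous one.
    # post_improvement_hit(text) is a hypothetical helper that publishes one HIT
    # and returns the revised text produced by a new worker.
    current = text
    for _ in range(iterations):
        current = post_improvement_hit(current)
    return current

# Example usage with a stand-in "worker" that fixes one known typo per pass.
fix_pass = lambda t: t.replace("teh", "the", 1)
print(iterative_improvement("teh cat sat on teh mat", 2, fix_pass))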

As AMT is used more and more by non-programmers, especially in research areas such as behavioural science and psychology, PsiTurk [8] is an automation tool that lowers the entry

barrier to AMT by providing an open platform for exchanging reusable code and designs of

experiments.

2.3 Task Quality

Quality control is a common issue of paid crowdsourcing. This is the case for several reasons

including:

• Human intrinsic factors (e.g., fatigue, boredom, priming, bias, hastiness) which can

affect some answers given by the workers.

• The results are hardly verifiable, and the requesters cannot check the answers one by

one, as this would defeat the whole purpose of crowdsourcing the task in the first place.

• There are some unfaithful workers whose intent is to game the system in order to collect

the monetary reward without properly completing the task [55].

2.3.1 Task Repetitions

In order to avoid bad quality answers, the same HIT is usually offered multiple times to distinct workers; once all the tasks are completed, the requester decides what answers to

pick and how to aggregate them. The primary goal of task repetition is to diversify the output

by asking different workers and potentially improve the quality of a single HIT – the error

of the involved workers is usually assumed to be independent. Task repetition comes with

the price of multiplying the cost by the number of required repetitions. In some cases it is

possible to automatically decide if a new repetition of the HIT is needed and thus can be done

on demand (we refer the reader to the related work on crowdsourcing labels for supervised


machine learning [23, 78, 144, 148]).

2.3.2 Test Questions

Next, a simple screening process could be used to quickly identify malicious workers [60]. The

requester can use k HITs2 as test tasks that the worker has to pass before, or during, a work

session. We can categorize test questions as follows:

• Gold Standard Tasks: The requester can create and add a set of indistinguishable yet

verifiable HITs into his larger batch. Those HITs can be inserted randomly during the

work session.

• Qualification Tasks: AMT provides the ability to request specific qualifications from the

workers. Those qualifications can either be drawn from the gold standards or, might

consist of more generic tasks, e.g., verify that a worker is fluent in French.

• Turing Test Tasks: Such questions (e.g., Captcha) are widely used to stop bots, and they can

also be generated indefinitely. Here, the requester will not have to worry about creating

a test set of questions.

Test questions are powerful tools to detect malicious workers, especially when they cannot be differentiated from regular tasks. However, they come at a cost: for large batches of HITs, a bigger gold standard collection is needed to prevent workers from spotting recurrent questions.

Moreover, test questions should be selected carefully so that they do not trick real workers and are not easy for robots to answer.
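As a sketch, under simplified assumptions, gold-standard answers can be used to screen workers before aggregation: any worker whose accuracy on the hidden test HITs falls below a threshold is discarded. The data structures and the 0.7 threshold below are illustrative.

def screen_workers(responses, gold, threshold=0.7):
    """responses: {worker_id: {hit_id: answer}}.
    gold: {hit_id: correct_answer} for the hidden test HITs.
    Returns the set of workers whose accuracy on the gold HITs is acceptable."""
    trusted = set()
    for worker, answers in responses.items():
        seen = [h for h in gold if h in answers]
        if not seen:
            continue  # the worker saw no test question; handle separately
        accuracy = sum(answers[h] == gold[h] for h in seen) / len(seen)
        if accuracy >= threshold:
            trusted.add(worker)
    return trusted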

2.3.3 Result Aggregation

The aggregation of the final results is a well-studied topic; the most straightforward approach is to proceed with a majority vote, a simple yet rather effective method [104]. The authors

of [73] formalized the majority vote approach and proposed the use of a control group that

double-checks the answers of a prior run.

The distribution of workers and the number of tasks they perform are usually characterized

by a power law distribution [12] where many workers do few tasks, and few workers do many

tasks. In such a context, the quality of aggregated results (e.g., with a majority vote) depends only on the judgments collected for the task. However, one can capture many more signals that can

help make a better aggregation decision. Using machine learning algorithms and statistics,

one can model tasks complexity, workers’ error rate, skills and maliciousness [49, 50, 59, 79,

135, 162, 166, 168]. Sheshadri et al. surveyed and assessed many of these methods in their

SQUARE Benchmark [145].
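The sketch below contrasts a plain majority vote with a weighted vote in which each ballot is scaled by an estimate of the worker's reliability (for instance, accuracy on gold questions); it is only an illustration of the general idea, not the probabilistic model evaluated later in this thesis.

from collections import defaultdict

def majority_vote(answers):
    """answers: list of (worker_id, label). Unweighted majority."""
    votes = defaultdict(float)
    for _, label in answers:
        votes[label] += 1.0
    return max(votes, key=votes.get)

def weighted_vote(answers, reliability, default=0.5):
    """Each worker's vote is weighted by an estimate of their accuracy."""
    votes = defaultdict(float)
    for worker, label in answers:
        votes[label] += reliability.get(worker, default)
    return max(votes, key=votes.get)

# Two unreliable workers can outvote one reliable worker under a plain
# majority vote, but not under the weighted scheme.
answers = [("w1", "male"), ("w2", "female"), ("w3", "female")]
reliability = {"w1": 0.95, "w2": 0.50, "w3": 0.40}
print(majority_vote(answers), weighted_vote(answers, reliability))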

While requesters can directly benefit from those methods without any additional cost, their complexity, and sometimes meager improvements, often make majority voting the preferred approach.

2 k is usually significantly smaller than the total number of HITs.

2.3.4 Task Design

The HIT design is usually the responsibility of the requester. It encompasses the visual representation of the tasks, the clarity of the instructions, the communication style, and even

bonus mechanisms, which can be critical to boost quality, speed or both for any crowdsourc-

ing campaign. A study on the impact of incentives was recently conducted in [142]. The

authors observed that crowdsourcing platforms favor monetary incentives instead of social

ones and hypothesized that explicit worker conditioning (e.g., inform the worker that upon

disagreement with other workers on the same task they will be sanctioned) in addition to

quality control, can lead to better result quality.

Kittur et al. [88] stressed the importance of task formulation and presented their results with

two variants of a given task formulated differently. On a different note, Eickhoff et al. [61]

observed that malicious workers are less attracted to “Novel tasks that involve creativity and

abstract thinking”.

2.4 Task Routing

Another mechanism used for quality assurance is routing tasks to workers who might possess

some knowledge or background to provide a higher quality answer. Donmez et al. [58] use an exploration-exploitation technique where they first estimate the accuracy of the crowd workers, and early in the process drop those workers who fall below a threshold while optimizing their task-to-worker assignment.

While the task assignment process can be controlled to some extent within a batch on AMT, a more generic approach is to use the concept of push crowdsourcing that we introduce in [56], where the crowdsourcing platform itself maintains worker profiles and can actively assign a HIT to them if it sees fit. Pushing tasks can even be done offline, i.e., a task can be assigned to a

worker who is not currently on the platform. A particularly appealing application of pushing tasks is in mobile-based crowdsourcing [94].
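A sketch of the underlying matching step, assuming worker profiles have already been indexed as sets of terms (e.g., pages they liked): the plain term-overlap score below is an illustrative stand-in for the expert-finding models discussed in Chapter 5.

def route_hit(hit_keywords, worker_profiles, top_n=3):
    """worker_profiles: {worker_id: set of terms describing the worker
    (e.g., liked pages, declared interests)}.
    Returns the top_n workers whose profile overlaps most with the HIT keywords."""
    hit_terms = set(hit_keywords)
    scored = []
    for worker, profile in worker_profiles.items():
        overlap = len(hit_terms & profile)
        scored.append((overlap, worker))
    scored.sort(reverse=True)
    return [worker for score, worker in scored[:top_n] if score > 0]

profiles = {
    "w1": {"novels", "literature", "french"},
    "w2": {"sports", "football"},
    "w3": {"translation", "french", "movies"},
}
print(route_hit(["french", "novel", "translation"], profiles, top_n=2))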

In its early version, MobileWorks [6] described an architecture for adding tasks to a queue and then “routing” them to one or multiple adequate workers [97]. However, their technique for identifying worker expertise is not described.

Karger et al. [84] show that an adaptive task assignment system, one that dynamically decides

to whom to assign the task next, is not optimal if the crowd workers are ephemeral. Among

their conclusions is the following: “building a reliable worker-reputation system is essential to

fully harnessing the potential of adaptive designs.”

Ipeirotis et al. [77] proposed a novel crowdsourcing system based on Google Ads to target a


crowd population interested in some niche domains. The crowd participation is voluntary, but the system still incurs the advertising cost. The latter can be alleviated, or suppressed, in the case of non-profit use.

2.5 Task Scheduling

In this thesis, we consider a system-oriented definition of task scheduling, where a set of heterogeneous HITs (processes), possibly submitted by multiple requesters (tenants), are given different priorities to get executed by the available crowd on the platform (resources). The

goal here is to achieve load balancing and minimize the overall latency that requesters might

experience otherwise. To our knowledge, there is little work in this area thus far. Nonetheless,

in the following, we refer to related work where the term “scheduling” was used even if it does

not strictly match our definition.
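A minimal sketch of such a scheduler, assuming the system controls which HIT is served next whenever a worker asks for work; the "least-served batch first" policy below merely illustrates load balancing across tenants and is not one of the algorithms evaluated in Chapter 7.

import heapq

class FairHitScheduler:
    """Serve the batch that has received the least work so far,
    so that no tenant's batch starves behind a very large one."""
    def __init__(self, batches):
        # batches: {batch_id: [hits]}; heap entries are (hits_served_so_far, batch_id).
        self.pending = {bid: list(hits) for bid, hits in batches.items()}
        self.heap = [(0, bid) for bid in self.pending]
        heapq.heapify(self.heap)

    def next_hit(self):
        """Called when a worker requests a task; returns (batch_id, hit) or None."""
        while self.heap:
            served, bid = heapq.heappop(self.heap)
            if self.pending[bid]:
                hit = self.pending[bid].pop(0)
                heapq.heappush(self.heap, (served + 1, bid))
                return bid, hit
        return None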

So far, scheduling HITs for the crowd has been addressed in the context of work quality, and

often mixed with task routing (see Section 2.4). CrowdControl [134], for example, proposes

scheduling approaches that take into account the history of the workers to understand how

to assign HITs best to workers based on how they learn doing tasks. Similarly, SmartCrowd

[137] considers task assignment as an optimization problem based on worker skills and their

reward requirements. Another work looking at the quality dimension is [125] where authors

look at scheduling tasks according to the required skills and previous feedback of requesters.

Khazankin et al. [87] proposed an architecture for a crowdsourcing platform that can provide

Service Level Agreements to its requesters, with a particular focus on work quality. The authors show,

by means of simulations, how approaches that take into account worker skills outperform

standard scheduling methods.

A different type of scheduling has been addressed in [53], where authors look at crowdsourcing

tasks that need to take place in a specific real-world geographical location. In this case it is

necessary to schedule tasks for workers in order to minimize physical movement by taking into

account their geographical location and path.

Task allocation in teams has been studied in [14] where authors defined the problem, examined

its complexity, and proposed greedy methods to allocate tasks to teams and adjust team size.

Team formation given a task has been studied in [15] looking at worker skills.

2.6 Conclusions

Human Computation (HC) has been intensively studied over the past few years. One of the

major trends in that context is the study and characterization of HC processes from a com-

puter science perspective. Davis recently proposed the concept of Human co-Processing Units

(HPUs) [48] to model HC components along with CPUs or GPUs on computational platforms.


Many other researchers, on the other hand, believe that humans behave fundamentally differ-

ently from machines and that radically new abstractions are required in order to characterize

(and potentially predict) the behavior of the crowd. While this debate is mostly conceptual and ethical, our contributions are mainly technical: we design new algorithms to manage the crowd more efficiently and effectively, and experimentally compare the effects of various HIT scheduling, pricing, and routing techniques on crowd-powered systems. We hope that the results gathered in this context will be instrumental in better understanding and managing the crowd.

In the next chapter we start by analyzing AMT, the main target crowdsourcing platform used in

our evaluations, to develop a better understanding of the different characteristics that drive

the performance of the crowdsourcing campaigns running on it.


3 An Analysis of the Amazon Mechanical Turk Crowdsourcing Marketplace

3.1 Introduction

The efficiency and effectiveness of a crowd-powered system heavily depend on the target

crowdsourcing platform, partly because of differences in crowd demographics, crowd size, available work, and competing requesters. Such factors can have a significant influence on

the quality of the results and the speed of a crowdsourcing campaign. In this thesis, we mainly

used AMT as a crowdsourcing platform for our empirical evaluations. In this section, we start

by analyzing this platform using a five-year log containing information about the posted HITs

and their progress status obtained from mturk-tracker.com [76]. We report key findings of

some of the factors that shape the dynamics of this platform. Such findings will eventually

help us explain or design some of the proposed methods in this thesis. Moreover, using

features derived from a large-scale analysis of these logs, we propose a method to predict

the throughput of a batch of HITs published by a certain requester at a certain point in time.

This prediction is based on several features including the current platform load and task

types. Using this prediction method, we try to understand the impact of each feature that we

consider, and its scope over time.

The main findings of our analysis are: 1) the types of tasks published on the platform have changed over time, with content creation HITs being the most popular today; 2) the HIT pricing approach evolved towards larger and higher-paid HITs; 3) geographical restrictions are applied to certain task types (e.g., surveys for US workers); 4) we observe a consistent growth in the number of new requesters who use the platform; 5) we identify the size of the batch as the main feature that impacts the progress of a given batch; 6) we observe that supply (workforce) has little control over driving the price of demand (posted HITs).

In summary, the main contributions of this analysis are:

• An analysis of the evolution of a popular micro-task crowdsourcing platform looking at

dimensions like topics, reward, worker location, task types, and platform throughput.

• A large-scale classification of 2.5M HIT types published on AMT.


• A predictive analysis of HIT batch progress using more than 29 different features.

• An analysis of the crowdsourcing platform as a market (demand and supply).

The rest of the chapter is structured as follows. In Section 3.2, we overview recent work on

micro-task crowdsourcing specifically focusing on how micro-task crowdsourcing has been

used and on how it can be improved. Section 3.3 presents how AMT has evolved over time in

terms of topics, reward, and requesters. Section 3.4 summarizes the results of a large-scale

analysis on the types of HIT that have been requested and completed over time. Based on

the previous findings, Section 3.5 presents our approach to predicting the throughput of the

crowdsourcing platform for a batch of published HITs. Section 3.6 studies the AMT market and

how different events correlate (e.g., new HITs attracting more workers to the platform). We

discuss our main findings in Section 3.7 before concluding in Section 3.8.

3.2 Related Work

AMT Market Analysis and Prediction

An initial work analyzing the AMT market was done in [76]; here, we extend on this work by considering the time dimension and analyzing long-term trend changes. Faradani et al. [62] proposed a model to predict the completion time of a batch. Our prediction endeavor is however different, in the sense that we try to predict the immediate throughput based on the current market conditions and try to understand which features have more impact than others.

The Future of Crowdsourcing Platforms

In [90] authors provide their own view on how the crowdsourcing market should evolve in the

future, specifically focusing on how to support full-time crowd workers. Similarly to them, our

goal is to identify ways of improving crowdsourcing marketplaces by understanding the dy-

namics of such platforms—based on historical data and models. Our work is complementary

to existing work as we present a data-driven study of the evolution of micro-task crowdsourc-

ing over five years. Our findings can be used as support evidence to the ongoing efforts in

improving crowdsourcing quality and efficiency that are described above. Our work can be

also used to support requesters in publishing HITs on these platforms and getting results more

efficiently.

Online Reputation

Many AMT workers share their experience about HITs and requesters through dedicated web

forums and ad-hoc websites [80]. Requester “reviews” serve as a way to measure the reputation of the requesters among workers and are assumed to influence the latency of the tasks

published [146], as workers are naturally more attracted by HITs published by requesters with

a good reputation.


3.3 The Evolution of Amazon MTurk From 2009 to 2014

In this section, we start by describing our dataset and extract some key information and

statistics that we will use in the rest of the chapter.

3.3.1 Crowdsourcing Platform Dataset

Over the past five years, we have periodically collected data about HITs published on AMT. The

data that we collect from the platform is available at http://mturk-tracker.com/.

In this work, we consider hourly aggregated data that includes the available HIT batches and

their metadata (title, description, rewards, required qualifications, etc.), in addition to their

progress over time, that is, the temporal variation of the set of HITs available. In fact, one of

the main metrics that we leverage (see Section 3.5) is the throughput of a batch, i.e., how many

HITs get completed between two successive observations. In Figure 3.1, we plot the number

of HITs available in a given batch versus its throughput. An interesting observation that can

be made is that large batches can achieve high throughput (thousands of HITs per minute).

In total, our dataset covers more than 2.5M different batches with over 130M HITs. We note

that the tracker only reports data periodically and does not reflect fine-grained information

(e.g., real-time variations). We believe however that it captures enough information to perform

meaningful, long-term trend analyses and to understand the dynamics and interactions within

the crowdsourcing platform.
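Concretely, given successive hourly snapshots of the same batch, the throughput used in the rest of this chapter can be derived as in the following sketch (pandas-based; the column names and values are illustrative, not the tracker's actual schema).

import pandas as pd

# Illustrative schema: one row per (batch, hourly observation).
snapshots = pd.DataFrame({
    "batch_id":       ["b1", "b1", "b1", "b2", "b2"],
    "observed_at":    pd.to_datetime(["2014-01-01 10:00", "2014-01-01 11:00",
                                      "2014-01-01 12:00", "2014-01-01 10:00",
                                      "2014-01-01 11:00"]),
    "hits_available": [30582, 29100, 27950, 500, 500],
})

snapshots = snapshots.sort_values(["batch_id", "observed_at"])
# Throughput = HITs completed between two successive observations of a batch.
snapshots["throughput"] = (snapshots.groupby("batch_id")["hits_available"]
                                    .diff()
                                    .mul(-1)
                                    .clip(lower=0))  # ignore newly added HITs
print(snapshots)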

3.3.2 A Data-driven Analysis of Platform Evolution

First, we identify trends obtained from aggregated information over time, keywords, and

countries associated to the published HITs. Each of the following analyses is also available as

an interactive visualization over the historical data on http://xi-lab.github.io/mturk-mrkt/.

Topics Over Time First, we want to understand how different topics have been addressed

by means of micro-task crowdsourcing over time. In order to run this analysis, we look at the

keywords associated with published HITs. We observe the evolution of keyword popularity

and associated reward on AMT. Figure 3.2 shows this behavior. Each point in the plot represents

a keyword associated to the HITs with its frequency (i.e., number of HITs with this keyword)

on the x-axis, and the average reward in a given year on the y-axis. The path connecting data

points indicates the time evolution, starting in 2009, with one point representing the keyword

usage over one year.

Figure 3.1 – Batch throughput versus number of HITs available in the batch. The red line corresponds to the maximum throughput we could have observed due to the tracker periodicity constraints. For readability, this graph represents a subset of 3 months (January-March 2014), and HITs with rewards of $0.05 and less.

We observe that the frequency of the ‘audio’ and ‘transcription’ keywords (i.e., the blue and red paths from left to right) has substantially increased over time. They have become the most popular keywords in the last two years and are paid more than $1 on average. HITs with the

‘video’ tag have also increased in number with a reward that has reached a peak in 2012 and

decreased after that. HITs tagged as ‘categorization’ have been paid consistently in the range

of $0.10-$0.30 on average, except in 2009 where they were rewarded less than $0.10 each.

HITs tagged as ‘tweet’ have not increased in number but have been paid more over the years,

reaching $0.90 on average in 2014: This can be explained by more complex tasks being offered

to workers, such as sentiment classification or writing of tweets.

Preferred Countries by Requesters Over Time Figure 3.3 shows the requirements set by

requesters with respect to the countries they wish to select workers from. The left part of

Figure 3.3 shows that most HITs are to be completed exclusively by workers located in the

US, India, or Canada. The right part of Figure 3.3 shows the evolution over time of the

country requirement phenomenon. The plot shows the number of HITs with a certain country

requirement (on the y-axis) and its time evolution (on the x-axis) with yearly steps. The size of

the data points indicates the total reward associated to those HITs.

We observe that US-only HITs dominate, both in terms of their large number as well as in

terms of the reward associated to them. Interestingly, we notice how HITs for workers based in

India have been decreasing over time. On the other hand, HITs for workers based in Canada

have been increasing over time, becoming in 2014 larger than those exclusively available to

Figure 3.2 – The use of keywords to annotate HITs. Frequency corresponds to how many times a keyword was used, and Average Reward corresponds to the average monetary reward of batches that listed the keyword. The size of the bubbles indicates the average batch size.

workers based in India. We also see that the reward associated to them is smaller than the

budget for India-only HITs. As of 2014, HITs for workers based in either Canada or the UK are more numerous than those for workers based in India. Overall, 88.5% of the HIT batches that were posted in the considered time period did not require any specific worker location. 86% of those which did imposed a constraint requesting US-based workers.

Figure 3.4 shows the top keywords attached to HITs restricted to specific locations. We observe

that the most popular keywords (i.e., ‘audio’ and ‘transcription’) do not require country-specific

workers. We also note that US-only HITs are most commonly tagged with ‘survey’.

HIT Reward Analysis Figure 3.5 shows the most frequent rewards assigned to HITs over

time.1 We observe that while in 2011 the most popular reward was $0.01, recently HITs paid

$0.05 are getting more frequent. This can be explained both by how workers search for HITs on

AMT and by the AMT fee scheme. Requesters now prefer to publish more complex HITs possibly

with multiple questions in them and grant a higher reward: This also attracts those workers

who are not willing to complete a HIT for small rewards and reduces the fees paid to AMT,

which are computed based on the number of HITs published on the platform.

Requester Analysis In order to be sustainable, a crowdsourcing platform needs to retain

requesters over time or get new requesters to replace those who do not publish HITs anymore.

1 Data for 2014 has been omitted as it was not comparable with other year values.

Figure 3.3 – HITs with specific country requirements. On the left-hand side, the countries with the most HITs dedicated to them. On the right-hand side, the time evolution (x-axis) of country-specific HITs with volume (y-axis) and reward (size of data point) information.

Figure 3.4 – Keywords for HITs restricted to specific countries (panels: NO-Location, NON-US, US; axes: Keywords vs. Count).

Figure 3.6 shows the number of new requesters who used AMT and the overall number of

active requesters at a certain point in time. We can observe an increasing number of active

requesters over time and a constant number of new requesters who join the platform (at a rate

of 1,000/month over the last two years).

It is also interesting to look at the overall amount of reward for HITs published on the platform,

as platform revenues are computed as a function of HIT reward. From the bottom part of Figure

3.6, we observe a linear increase in the total reward for HITs on the platform. Interestingly, we

also observe some seasonality effects over the years, with October being the month with the

highest total reward and January or February being the month with minimum total reward.

HIT Batch Size Analysis When a lot of data needs to be crowdsourced (e.g., when many

images need to be tagged), multiple tasks containing similar HITs can be published together.

We define a batch of HITs as a set of similar HITs published by a requester at a certain point in

time.

Figure 3.5 – Popularity of HIT reward values over time.

Figure 3.6 – Requester activity and total reward on the platform over time.

Figure 3.7 shows the distribution of batch sizes in the period from 2009 to 2014. We can observe that most of the batches were of size 1 (more than 1M), followed by a long tail of larger,

but less frequent, batch sizes.

Figure 3.8 shows how batch size has changed over time. We observe that the average batch

size has slightly decreased. The monthly median is 1 (due to the heavily skewed distribution).

Another observation that can be made is that in 2014 very large batches containing more than 200,000 HITs have appeared on AMT.

3.4 Large-Scale HIT Type Analysis

In this section, we present the results of a large-scale analysis of the evolution of HIT types

published on the AMT platform. For this analysis, we used the definition of HIT types proposed

by [65], in which the authors perform an extensive study involving 1,000 crowd workers to understand their working behavior, and categorize the types of tasks that the crowd performs into six top-level “goal-oriented” tasks, each containing further sub-classes.

Figure 3.7 – The distribution of batch sizes.

We briefly describe the six top-level classes introduced by [65] below.

• Information Finding (IF): Searching the Web to answer a certain information need. For

example, “Find the cheapest hotel with ocean view in Monterey Bay, CA”.

• Verification and Validation (VV): Verifying certain information or confirming the validity

of a piece of information. Examples include checking Twitter accounts for spamming

behaviors.

• Interpretation and Analysis (IA): Interpreting Web content. For example, “Categorize

product pictures in a predefined set of categories”, or “Classify the sentiment of a tweet”.

• Content Creation (CC): Generating new content. Examples include summarizing a

document or transcribing an audio recording.

• Surveys (SU): Answering a set of questions related to a certain topic (e.g., demographics

or customer satisfaction).

• Content Access (CA): Accessing some Web content. Examples include watching online

videos or clicking on provided links.

3.4.1 Supervised HIT Type Classification

Using the various definitions of HIT types given above, we trained a supervised machine

learning model to classify HIT types based on their metadata. The features we used to train

the Support Vector Machine (SVM) model are: HIT title, description, keywords, reward, date,

allocated time, and batch size.

Figure 3.8 – Average and maximum batch size per month. The monthly median is 1.

To train and evaluate the supervised model, we created labelled data: We uniformly sampled 5,000 HITs over the entire five-year dataset and manually labelled their type by means of crowdsourcing. In detail, we asked workers on MTurk to assign each HIT to one of the predefined classes by presenting them with the title, description, keywords, reward, date,

allocated time, and batch size for the HIT. The instructions also contained the definition and

examples for each task type. Workers could label tasks as ‘Others’ when unsure or when the

HIT did not fit in any of the available options.

After assigning each HIT to be labelled to three different workers in the crowd, a consensus on the

task type label was reached in 89% of the cases (leaving 551 cases with no clear majority). A

consensus was reached when at least two out of three workers agreed on the same HIT type

label. The other cases, that is, when the workers provided different labels or when they were

not sure about the HIT type, have then been removed from our labelled dataset.

Using the labelled data, we trained a multi-class SVM classifier for the 6 different task types

and evaluated its quality with 10-fold cross validation over the labelled dataset. Overall, the

trained classifier obtained a Precision of 0.895, a Recall of 0.899, and an F-Measure of 0.895.

Most of the classifier errors (i.e., 66 cases) were caused by incorrectly classifying IA instances

as CC jobs.
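A minimal sketch of this kind of pipeline with scikit-learn; the actual feature encoding and hyper-parameters behind the reported numbers are not shown here, so the snippet is only an indicative reconstruction on a toy sample.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative labelled sample: HIT metadata flattened into one text field,
# labels drawn from {IF, VV, IA, CC, SU, CA}.
texts = [
    "transcribe a two minute audio recording, reward 1.00",
    "classify the sentiment of a tweet, reward 0.05",
    "answer a short demographics survey, reward 0.50",
    "find the official website of a company, reward 0.10",
]
labels = ["CC", "IA", "SU", "IF"]

# Multi-class linear SVM over TF-IDF features; on the real labelled set the
# model was evaluated with 10-fold cross validation.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["verbatim transcription of a voicemail, reward 0.80"]))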

Performing feature selection for the HIT type classification problem, we observed that the

best features based on information gain are the HIT allotted time and reward: This indicates

that HITs of different types are associated with different levels of reward as well as different

task durations (i.e., longer and better-paid tasks versus shorter and worse-paid ones). The most

distinctive keywords for identifying HIT types are ‘transcribe’, ‘audio’, and ‘survey’, which

clearly identify CC and SU HITs.

Using the classifier trained over the entire labelled dataset, we then performed a large-scale

classification of the types for all 2.5M HITs in our collection. This allows us to study the

evolution of the task types over time on the AMT platform, which we describe next.

Figure 3.9 – Popularity of HIT types over time.

3.4.2 Task Type Popularity Over Time

Using the results of the large-scale classification of HIT types, we analyze which types of HITs

have been published over time. Figure 3.9 shows the evolution of task types published on

AMT. We can observe that, in general, the most popular type of task is Content Creation. In

terms of observable trends, we note that, while there is a general increase in the volume of tasks on the platform, CA tasks have been decreasing over time. This can be explained by the

enforcement of AMT terms of service, which state that workers should not be asked to create

accounts on external websites or be identified by the requester. In the last three years, SU and

IA tasks have seen the biggest increase.

3.5 Analyzing the Features Affecting Batch Throughput

Next, we turn our attention to analyzing the factors that influence the progress (or the pace) of

a batch, how those factors influence each other and how their importance changes over time.

In order to conduct this analysis, we carry out a prediction experiment on the batch’s through-

put, that is, the number of HITs that will be completed for a given batch within the next time

frame of 1 hour (i.e., the DIFF_HIT feature is the target class). Specifically, we model this

task as a regression problem using 29 features; some of them were used in the previous section

to classify the HIT type; we describe the remaining ones in Section 3.5.1.

3.5.1 Machine Learning Features

The following is the list of features associated to each batch. We used these features in our

machine learning approach to predict batch throughput for the next hourly observation:

• HIT_available: Number of available HITs in the batch.

• start_time: The time of an observation.

• reward: HIT Reward in USD.

• description: String length of the batch’s description.


• title: String length of the batch’s title.

• keywords: Keywords (space separated).

• requester_id: ID of the requester.

• time_alloted: Time allotted per task.

• tasktype: Task class (as per our classification in 3.4).

• ageminutes: Age since the Batch was posted (minutes).

• leftminutes: Time left before expiration (minutes).

• location: The requested worker’s Location (e.g., US).

• totalapproved: Batch requirement on the number of total approved HITs.

• approvalrate: Batch requirement on the percentage of workers approval.

• master: Worker is a master.

• hitGroupsAvailableUI: Number of batches as reported on Mturk dashboard.

• hitsAvailableUI: Number of HITs available as reported on Mturk dashboard.

• hitsArrived: Number of new HITs arrived.

• hitsCompleted: Number of HITs completed.

• rewardsArrived: Sum of rewards associated with the HITs arrived.

• rewardsCompleted: Sum of rewards associated with the HITs completed.

• percHitsCompleted: Ratio of HITs completed and total HITs available.

• percHitsPosted: Ratio of new HITs arrived and total HITs available.

• diffHits: hitsCompleted-hitsArrived.

• diffHitsUI: Difference in HITs observed from Mturk dashboard.

• diffGroups: Computed difference in number of completed and arrived batches.

• diffGroupsUI: Difference in number of completed and arrived batches observed from

Mturk dashboard.

• diffRewards: Difference in rewards = (rewardsArrived-rewardsCompleted).

• DIFF_HIT: Number of HITs completed since the last observation.

3.5.2 Throughput Prediction

To predict the throughput of a batch at time T , we train a Random Forest Regression model

with samples taken in the range [T −δ,T ) where δ is the size of the time window that we are

considering directly prior to time T . The rationale behind this approach is that the throughput

should be directly correlated to the current and recent market situations.
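As an illustration, the following is a minimal sketch of this sliding-window training step, assuming the hourly observations are available as a pandas DataFrame with a start_time column, the feature columns of Section 3.5.1, and the DIFF_HIT target; the helper and column names are ours and not the exact ones used in our pipeline:

    # Sketch: train on observations in [T - delta, T) and predict the next-hour
    # throughput (DIFF_HIT) of the batches observed at time T. Names are illustrative.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    def predict_throughput_at(obs: pd.DataFrame, feature_cols, T, delta=pd.Timedelta(hours=4)):
        """Fit a Random Forest on the recent window and predict DIFF_HIT at time T."""
        train = obs[(obs["start_time"] >= T - delta) & (obs["start_time"] < T)]
        test = obs[obs["start_time"] == T]
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(train[feature_cols], train["DIFF_HIT"])
        return model, model.predict(test[feature_cols])

The predicted values can then be compared to the observed DIFF_HIT values at time T with sklearn.metrics.r2_score, which is the evaluation measure used below.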

We considered data from June to October 2014, and hourly observations (see Section 3.3.1),

from which we uniformly sampled 50 test time points for evaluation purposes. In our experi-

ments, the best prediction results, in terms of R-squared2, were obtained using δ = 4 hours.

For that window, our predicted versus actual throughput values are shown in Figure 3.10. The

figure suggests that the prediction works best for larger batches having a large momentum.

2 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

Figure 3.10 – Predicted vs. actual batch throughput values for δ = 4 hours. The prediction works best for larger batches having a large momentum.

In order to understand which features contribute significantly to our prediction model, we proceed by feature ablation. For this experiment, we computed the prediction evaluation score

R-squared, for 1,000 randomly sampled test time points and kept those where the prediction

worked reasonably, i.e., having R-squared > 0, that is, 327 samples. Next, we reran the prediction

on the same samples by removing one feature at a time. The results revealed that the features

HIT_available (i.e., the number of tasks in the batch) and Age_minutes (i.e., how long

ago the batch was created) were the only ones having a statistically significant impact on the

prediction score with p < 0.05 and p < 0.01 respectively.
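A minimal sketch of this ablation loop, reusing the predict_throughput_at helper sketched in Section 3.5.2; the paired significance test against the full-feature baseline (reported above with p < 0.05 and p < 0.01) is omitted here:

    # Sketch: average R-squared obtained when one feature at a time is removed.
    from sklearn.metrics import r2_score

    def ablation_scores(obs, feature_cols, test_times, delta):
        scores = {}
        for removed in feature_cols:
            kept = [f for f in feature_cols if f != removed]
            r2s = []
            for T in test_times:
                _, preds = predict_throughput_at(obs, kept, T, delta)
                truth = obs.loc[obs["start_time"] == T, "DIFF_HIT"]
                r2s.append(r2_score(truth, preds))
            scores[removed] = sum(r2s) / len(r2s)
        return scores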

3.5.3 Features Importance

In order to better grasp the characteristics of the batch throughput, we examine the computed

Gini importance of the features [35]. In this experiment, we varied the training time frame

δ from 1 hour to 24 hours for each tested time point. Figure 3.11 shows the contribution of

our two top features (as concluded from the previous experiment), i.e., HIT_available and Age_minutes, and how their importance varies when we increase the training time frame. These features are also listed in Table 3.1; the slope indicates whether the feature gains importance when the considered time window grows (positive value) or loses importance (negative value).
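For illustration, a minimal sketch of how such Gini importances can be read from the fitted Random Forest while the training window grows, reusing the predict_throughput_at helper sketched above (names are ours):

    # Sketch: per-feature Gini importance (in %) for delta = 1..max_hours hours.
    import pandas as pd

    def importance_vs_window(obs, feature_cols, T, max_hours=24):
        rows = []
        for hours in range(1, max_hours + 1):
            model, _ = predict_throughput_at(obs, feature_cols, T, pd.Timedelta(hours=hours))
            rows.append(dict(zip(feature_cols, 100 * model.feature_importances_)))
        return pd.DataFrame(rows, index=range(1, max_hours + 1))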

The most important feature is HIT_available, that is, the current size of the batch. Indeed, as observed by previous work, larger batches tend to attract more workers [76, 64]. This feature becomes less important when we consider longer periods, partly because of noise, and partly because other features start to encode additional facts. On the other hand, the importance of Age_minutes suggests that the crowd is sensitive to newly posted HITs, i.e., to how fresh the HITs are. To better understand this phenomenon, we conduct an analysis of what attracts the workforce to the platform in the next section.

Figure 3.11 – Computed feature importance when considering a larger training window for batch throughput prediction. [The plot shows the importance (in %) of HIT_available and Age_minutes as the time delta considered grows from 1 to 24 hours.]

Table 3.1 – Gini importance of the top 2 features used in the prediction experiment. A large mean indicates a better overall contribution to the prediction. A positive slope indicates that the feature is gaining in importance when the considered time window is larger.

Feature          mean      stderr    slope     intercept
HIT_available    29.8606   13.4247   -0.0257   34.4940
Age_minutes      12.9087    8.1967   -0.0050   13.8181

3.6 Market Analysis

Finally, we study the demand and supply of the Amazon MTurk marketplace. In the following,

we define Demand as the number of new tasks published on the platform by the requesters.

In addition, we compute the average reward of the tasks that were posted. Conversely, we

define Supply as the workforce that the crowd is providing, concretized as the number of

tasks that got completed in a given time window by the workers. In this section we use hourly

collected data for the time period spanning June to October 2014.


3.6.1 Supply Attracts New Workers

We start by analyzing how the market reacts when new tasks arrive on the platform, in order

to understand the degree of elasticity of the supply. If the supply of work is inelastic, the

amount of work done over time should be independent of the demand for work. So, if the

amount of tasks available in the market (“demand”) increases, then the percentage of work

that gets completed in the market should drop, as the same amount of “work done" gets split

among a higher number of tasks. To understand the elasticity of the supply, we regressed the

percentage of work done in every time period (measured as the percentage of HITs that are

completed) against the number of new HITs that are posted in that period. Figure 3.12 shows

the scatterplot for those two variables.

Our data reveals that an increase in the number of arrived HITs is positively associated with a

higher percentage of completed HITs. This result provides evidence that the new work that is

posted is more attractive than the tasks previously available in the market, and attracts “new

work supply".3

Our regression4 of the “Percent Completed" against “Hits Arrived (in thousands)" indicates an

intercept of 2.5 and a slope of 0.05. To put these numbers in context: On average, there are 300K

HITs available in the market at any given time, and on average 10K new HITs arrive every hour.

The intercept of 2.5 means that 2.5% of these 300K HITs (i.e., 7.5K per hour) get completed, as

a baseline, assuming that no new HIT gets posted. The slope is 0.05, meaning that if 10K new

HITs arrive within an hour, then the completion ratio increases by 0.5%, to 3% (i.e., 9K HITs per

hour). When 50K new HITs arrive within an hour, then the completion percentage increases

to 5% indicating that 15K to 20K HITs get completed. In other words, approximately 20% of

the new demand gets completed within an hour of being posted, indicating that new work

has almost 10x higher attractiveness for the workers than the remaining work that is available

on the platform. This result could be explained by how tasks are presented to workers by AMT.

Workers, when not searching for tasks using specific keywords, are presented with the most

recently published tasks first.
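As a back-of-the-envelope illustration of this regression and of the numbers above, here is a minimal sketch; statsmodels is used for the OLS fit and the variable names are ours:

    # Sketch of the OLS fit behind Figure 3.12 and of the worked numbers above.
    import numpy as np
    import statsmodels.api as sm

    def fit_supply_elasticity(hits_arrived_thousands, percent_completed):
        X = sm.add_constant(np.asarray(hits_arrived_thousands, dtype=float))
        return sm.OLS(np.asarray(percent_completed, dtype=float), X).fit()

    # With the fitted values reported in the text (intercept 2.5, slope 0.05):
    intercept, slope = 2.5, 0.05
    for arrived_k in (0, 10, 50):          # new HITs arriving within one hour (thousands)
        print(arrived_k, intercept + slope * arrived_k)   # -> 2.5%, 3.0%, 5.0% completed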

3.6.2 Demand and Supply Periodicity

On the demand side, some requesters frequently post new batches of recurrent tasks. Hence,

we are interested in the periodicity of such demand in the marketplace and the supply it drives.

To look into this, we consider both the time series of available HITs and the rewards completed.

First, we observe that the demand exhibits a strong weekly periodicity, which is reflected by

the autocorrelation that we compute from the number of available HITs on Amazon Mturk

(see Figures 3.13a and 3.13c). The market seems to have a significant memory that lasts for approximately 7 to 10 days.

3 From the data available, it is not possible to tell whether the new supply comes from distinct workers, from workers that were idle, or from an increased productivity of existing workers.
4 We use Ordinary Least Squares regression.

Figure 3.12 – The effect of newly arrived HITs on the work supplied. Here, the supply is expressed as the percentage of HITs completed in the market. [Scatterplot of Percent HITs Completed against HITs Arrived (in thousands), with the fitted OLS line.]

Conversely, and to check for the periodicity in the supply, we compute an autocorrelation on

the weekly moving average of the completed HITs reward. Figures 3.13b and 3.13d show that

there is a strong weekly periodicity effect, as we observe high values in the range 0-250 hours.
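A minimal sketch of this periodicity check, using the acf function from statsmodels; the 7-day smoothing window and the lag range are assumptions matching the discussion above:

    # Sketch: autocorrelation of an hourly series (e.g., HITs available, or the
    # completed-rewards series with smooth=True to apply the 7-day moving average).
    # A peak around lag 168 (one week) indicates weekly periodicity.
    import pandas as pd
    from statsmodels.tsa.stattools import acf

    def weekly_autocorrelation(hourly_series: pd.Series, smooth=False, max_lag_hours=250):
        series = hourly_series.rolling(window=24 * 7).mean().dropna() if smooth else hourly_series
        return acf(series, nlags=max_lag_hours, fft=True)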

3.7 Discussion

In this section, we summarize the main findings of our study and present a discussion of our

results. We extracted several trends from the five years of data, summarized as follows:

• Tasks related to audio transcription have been gaining momentum in the last years and

are today the most popular tasks on AMT.

• The popularity of Content Access HITs has decreased over time. Surveys are however

becoming more popular over time, especially in the US.

• While most HITs do not require country-specific workers, most of those that do require

US-based workers.

• HITs that are exclusively asking for workers based in India have strongly decreased over

time.

• Surveys are the most popular type of HITs for US-based workers.

• The most frequent HIT reward value has increased over time, and reaches $0.05 in 2014.

• New requesters constantly join AMT, making the total number of active requesters and

the available reward increase over time.

• The average HIT batch size has been stable over time; however, very large batches have

recently started to appear on the platform.


Figure 3.13 – Computed autocorrelation on the number of HITs available and on the weekly moving average of the completed reward (N.B., the autocorrelation lag is computed in hours). In both cases, we clearly see a weekly periodicity (0-250 hours). (a) Number of HITs available (in thousands) over a three-month period. (b) Weekly moving average of rewards completed (in thousand dollars) over a three-month period. (c) Autocorrelation of the HITs available. (d) Autocorrelation of the moving average of rewards completed.

Our batch throughput prediction (Section 3.5) indicates that the throughput of batches can

be best predicted based on the number of HITs available in the batch, i.e., its size, and on its

freshness, i.e., for how long the batch has been on the platform.

Finally, we analyzed AMT as a marketplace in terms of demand (new HITs arriving) and supply

(HITs completed). We observed some strong weekly periodicity both in demand and in supply.

We can hypothesize that many requesters might have repetitive business needs following

weekly trends, while many workers work on AMT on a regular basis during the week.

3.8 Conclusions

We studied data collected from a popular micro-task crowdsourcing platform, AMT, and ana-

lyzed a number of key dimensions of the platform, including: topic, task type, reward evolu-

tion, platform throughput, and supply and demand. The results of our analysis can serve as a

starting point for improving existing crowdsourcing platforms and for optimizing the overall


efficiency and effectiveness of human computation systems. The evidence presented above

indicates how requesters should use crowdsourcing platforms to obtain the best out of them: by engaging with workers and publishing large volumes of HITs at specific points in time.

Future research based on this work might look at different directions. On one hand, novel

micro-task crowdsourcing platforms need to be designed based on the findings identified in

this work, such as the need for supporting specific task types like audio transcription or surveys.

Additionally, analyses that look at specific data could provide a deeper understanding of the

micro-task crowdsourcing universe. Examples include per-requester or per-task analyses of

the publishing behavior rather than looking at the entire market evolution as we did in this

work. Similarly, a worker-centered analysis could provide additional evidence of the existence

of different classes of workers, e.g., full-time vs. casual workers, or workers specializing in

specific task types as compared to generalists who are willing to complete any available task.

In the next chapter, we will start by investigating a technique to aggregate the results of a HIT

(run with multiple repetitions) in order to improve the quality of the end results.


4 Human Intelligence Task Quality Assurance

4.1 Introduction

One of the most significant benefits of crowdsourcing is the ability to tap into human compu-

tation at scale. Hundreds, or even thousands, of crowd workers can participate in a crowd-

sourcing campaign, thus contributing to its quick completion. The only caveat is that the

collected answers are not verified one by one (as it defeats the purpose of crowdsourcing),

and are usually subject to high error rates. In fact, some workers might be malicious and try

to do the tasks quickly by providing random answers in order to collect the rewards with the

least effort. One strategy that can be used to avoid high error rates is to use test questions to stop poorly performing workers. However, human error is not always a sign of maliciousness;

it can simply be due to fatigue, a defect in the system, bias, over-confidence, or any other temporary factor. Even the most honest workers cannot consistently perform at 100% all the time; hence, stopping workers can be considered an extreme measure.

Another compatible method is to use repetitions, i.e., ask multiple workers for the same

task and then automatically decide which answer to pick based on some form of agreement

scheme. Majority vote is the simplest approach to use; it consists in selecting the answer that

most of the workers selected. However, 1) majority vote can be easily cheated, e.g., multiple

malicious workers can agree on an answer, and 2) it gives all the workers the same weight,

regardless of whether we have any prior knowledge about the workers' reliability.

In this chapter, we investigate a Bayesian framework to dynamically assess the results of

tasks with Multiple Choice Questions obtained from arbitrary human workers operating on a

crowdsourcing platform. We show that we can effectively combine workers' answers by taking

into account an adaptive weight associated with each worker in addition to any available prior

output of an algorithmic pre-processing step. In the following, we focus on two use-cases,

namely Entity Linking and Instance Matching (see Section 4.1.1 for an overview), for which we also develop a hybrid human-machine system that we describe.


4.1.1 The Entity Linking and Instance Matching Use-Cases

Semi-structured data is becoming more prominent on the Web as more and more data is

either interweaved or serialized in HTML pages. The Linked Open Data (LOD) community1,

for instance, is bringing structured data to the Web by publishing datasets using the RDF

formalism and by interlinking pieces of data coming from heterogeneous sources. As the

LOD movement gains momentum, linking traditional Web content to the LOD cloud is giving

rise to new possibilities for online information processing. For instance, identifying unique

real-world objects, persons, or concepts, in textual content and linking them to their LOD

counterparts (also referred to as Entities), opens the door to automated text enrichment (e.g.,

by providing additional information coming from the LOD cloud on entities appearing in the

HTML text), as well as streamlined information retrieval and integration (e.g., by using links to

retrieve all text articles related to a given concept from the LOD cloud).

As more LOD datasets are being published on the Web, unique entities are getting described

multiple times by different sources. It is therefore critical that such openly available datasets

are interlinked to each other in order to promote global data interoperability. The interlinking of

datasets describing similar entities enables Web developers to cope with the rapid growth of

LOD data, by focusing on a small set of well-known datasets (such as DBPedia2 or Freebase3)

and by automatically following links from those datasets to retrieve additional information

whenever necessary.

Automating the process of instance matching (IM) across heterogeneous LOD datasets and the process of linking entities (EL) appearing in HTML pages to their correct LOD counterparts is currently drawing a lot of attention (see the Related Work section below). These processes remain, however, highly challenging, as instance matching is known to be extremely

difficult even in relatively simple contexts. Some of the challenges that arise in this context are

1) to identify entities appearing in natural text, 2) to cope with the large-scale and distributed

nature of LOD, 3) to disambiguate candidate concepts, and 4) to match instances across

datasets.

The current matching techniques used to relate an entity extracted from text to corresponding

entities from the LOD cloud as well as those used to identify duplicate entities across datasets

can be broadly classified into two groups:

Algorithmic Matching: Given the scale of the problem (that could potentially span the entire

HTML Web), many efforts are currently focusing on designing and deploying scalable

algorithms to perform the matching automatically on very large corpuses.

Manual Matching: While algorithmic matching techniques are constantly improving, they

are still at this stage not as reliable as humans. Hence, many organizations are still today

1 http://linkeddata.org/
2 http://www.dbpedia.org
3 http://freebase.org


appointing individuals to manually link textual elements to concepts. For instance, the

New York Times employs a whole team whose sole responsibility is to manually create

links from news articles to NYT identifiers4.

ZenCrowd is a system that we have developed in order to create links across large datasets

containing similar instances and to semi-automatically identify LOD entities from textual

content. Our system gracefully combines algorithmic and manual integration, by first taking

advantage of automated data integration techniques, and then by improving the automatic

results by involving human workers.

The ZenCrowd approach addresses the scalability issues of data integration by proposing a

novel three-stage blocking technique that incrementally combines three very different ap-

proaches together. In a first step, we use an inverted index built over the entire dataset to

efficiently determine potential candidates and to obtain a first ranked list of potential results.

Top potential candidates are then analyzed further by taking advantage of a more accurate

(but also more costly) graph-based instance matching technique (a similar structured/un-

structured hybrid approach has been taken in [151]). Finally, results yielding low confidence

values (as determined by probabilistic inference) are used to dynamically create micro tasks

published on a crowdsourcing platform, the assumption being that the tasks in question do not

need special expertise to be performed.

ZenCrowd does not focus on the algorithmic problems of instance matching and entity linking

per se. However, we make a number of key contributions at the interface of algorithmic and

manual data integration, and discuss in detail how to most effectively and efficiently combine

scalable inverted indices, structured graph queries and human computation in order to match

large LOD datasets. The contributions described in this chapter include:

• a new system architecture supporting algorithmic and manual instance matching as

well as entity linking in concert.

• a new three-stage blocking approach that combines highly scalable automatic filtering

of semi-structured data together with more complex graph-based matching and high-

quality manual matching performed by the crowd.

• a new probabilistic inference framework to dynamically assess the results of arbitrary

human workers operating on a crowdsourcing platform, and to effectively combine

their (potentially conflicting) answers taking into account the results of the automatic

stage output.

• an empirical evaluation of our system in a real deployment over different Human In-

telligence Task interfaces showing that ZenCrowd combines the best of both worlds,

in the sense that our combined approach turns out to be more effective than both (a)

pure algorithmic matching, by improving the accuracy, and (b) fully manual matching, by being

cost-effective while mitigating the workers’ uncertainty.

4 See http://data.nytimes.com/


The rest of this chapter is structured as follows: Section 4.2 introduces the terminology used

throughout the chapter. Section 4.3 gives an overview of the architecture of our system, includ-

ing its algorithmic matching interface, its probabilistic inference engine, and its templating

and crowdsourcing components. Section 4.4 presents our graph-based matching confidence

measure as well as different methods to crowdsource instance matching and entity linking

tasks. We describe our formal model to combine both algorithmic and crowdsourcing results

using Probabilistic Networks in Section 4.5. We introduce our evaluation methodology and

discuss results from a real deployment of our system for the Instance Matching task in Sec-

tion 4.6 and for the Entity Linking task in Section 4.7. We review the state of the art in instance

matching and entity linking in Section 4.8, before concluding in Section 4.9.

4.2 Preliminaries on the EL and IM Tasks

As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the

general problem of Entity Resolution [66].

We define Instance Matching as the task of identifying two instances following different

schemas (or ontologies) but referring to the same real-world object. Within the database

literature, this task is related to Record Linkage [39], Duplicate Detection [27], or Entity Identi-

fication [107] when performed over two relational databases. However, in our setting, the main

goal is to create new cross-dataset <owl:sameAs> RDF statements. As commonly assumed for

Record Linkage, we also assume that there are no duplicate entities within the same source and

leverage this assumption when computing the final probability of a match in our probabilistic

reasoning step.

We define Entity Linking as the task of assigning a URI selected from a background knowledge

base for an entity mentioned in a textual document. This task is also known as Entity Resolu-

tion [66] or Disambiguation [36] in the literature. In addition to the classic entity resolution

task, the objective of our task is not only to understand which possible interpretation of the

entity is correct (Michael Jordan the basketball player as compared to the UC Berkeley pro-

fessor), but also to assign a URI to the entity, which can be used to retrieve additional factual

information about it.

Given two LOD datasets U1 = {u11, ..., u1n} and U2 = {u21, ..., u2m} containing structured entity descriptions uij, where i identifies the dataset and j the entity URI, we define Instance Matching as the identification of each pair (u1i, u2j) of entity URIs from U1 and U2 referring to

the same real-world entity, and we call such a pair a match. An example of a match is given by the

pair u11 = <http://dbpedia.org/resource/Tom_Cruise> and u21 = <http://www.freebase.com/

m/07r1h> where U1 is the DBPedia LOD dataset and U2 is the Freebase LOD dataset.

Given a document d and a LOD dataset U1 = {u11, ..,u1n}, we define Entity Linking as the task

of identifying all entities in U1 from d and of associating the corresponding identifier u1i to

each entity.


These two tasks are highly related: Instance Matching aims at creating connections between

different LOD datasets that describe the same real-world entity using different vocabularies.

Such connections can then be used to run Entity Linking on textual documents. Indeed, ZenCrowd

uses existing <owl:sameAs> statements as probabilistic priors to take a final decision about

which links to select for an entity appearing in a textual document.

Hence, we use in the following the term entity to refer to a real-world object mentioned in a

textual document (e.g., a news article) while we use the term instance to refer to its structured

description (e.g., a set of RDF triples), which follows the well-defined schema of a LOD dataset.

Our system relies on LOD datasets for both tasks. Such Linked Datasets describe intercon-

nected entities that are commonly mentioned in Web content. As compared to traditional

data integration tasks, the use of LOD data may support integration algorithms by means of

its structured entity descriptions and entity interlinking within and across datasets.

In our work, we make use of Human Intelligence at scale to, first, improve the quality of

such links across datasets and, second, to connect unstructured documents to the structured

representation of the entities they mention. To improve the result for both tasks, we selectively

use paid micro-task crowdsourcing. To do this, we create HITs on a crowdsourcing platform.

For the Entity Linking task, a HIT consists of asking which of the candidate links is correct for

an entity extracted from a document. For the Instance Matching task, a HIT consists in finding

which instance from a target dataset corresponds to a given instance from a source dataset.

See Figures 4.2, 4.3, and 4.4, which give examples of such tasks.

Paid crowdsourcing presents enormous advantages for high quality data processing. The

disadvantages, however, potentially include: high financial cost, low availability of workers,

and poor workers’ skills or honesty. To overcome those shortcomings, we alleviate the financial

cost using an efficient decision engine that selectively picks tasks that have a high improve-

ment potential. Our present assumption is that entities extracted from HTML news articles

could be recognized by the general public, especially when provided with sufficient contextual

information. Furthermore, each task is shown to multiple workers to balance out low quality

answers.

4.3 ZenCrowd Architecture

ZenCrowd is a hybrid human-machine architecture that takes advantage of both algorithmic

and manual data integration techniques simultaneously. Figure 4.1 presents a simplified

architecture of our system. We start by giving an overview of our system below in Section 4.3.1,

and then describe in more detail some of its components in Sections 4.3.2 to 4.3.4.


[Architecture diagram: HTML pages and input dataset pairs feed the ZenCrowd Entity Extractors, Algorithmic Linkers and Algorithmic Matchers, which rely on the LOD Index and the Graph DB built over the LOD Open Data Cloud; the Probabilistic Network and Decision Engine drive the Micro-Task Manager, which publishes micro matching tasks on the crowdsourcing platform and collects the workers' decisions; the outputs are new <owl:sameAs> matchings and HTML + RDFa pages.]

Figure 4.1 – The architecture of ZenCrowd: For the Instance Matching task (green pipeline), the system takes as input a pair of datasets to be interlinked and creates new links between the datasets using <owl:sameAs> RDF triples. ZenCrowd uses a three-stage blocking procedure that combines both algorithmic matchers and human workers in order to generate high quality results. For the Entity Linking task (orange pipeline), our system takes as input a collection of HTML pages and enriches them by extracting textual entities appearing in the pages and linking them to the Linked Open Data cloud.

4.3.1 System Overview

In the following we describe the different components of the ZenCrowd system focusing first

on the Instance Matching and then on the Entity Linking pipeline.

Instance Matching Pipeline

In order to create new links, ZenCrowd takes as input a pair of datasets from the LOD cloud.

Among the two datasets, one is selected as the source dataset and one as the target dataset.

Then, for each instance of the source dataset, our system tries to come up with candidate

matches from the target dataset.

First, the label used to name the source instance is used to query the LOD Index (see Sec-

tion 4.3.2) in order to obtain a ranked list of candidate matches from the target dataset. This

can efficiently and cheaply filter out numerous clear non-matches from the potentially very large number (in the order of hundreds of millions for some LOD datasets) of available instances. Next,

top-ranked candidate instances are further examined in the graph database. This step is taken

to obtain more complete information about the target instances, both to compute a more

accurate matching score as well as to provide information to the Micro-Task Manager (see

Figure 4.1), which has to fill the HIT templates for the crowd (see Section 4.3.5, which describes


our three-stage blocking methodology in more detail). At this point, the candidate matches

that have a low confidence score are sent to the crowd for further analysis. The Decision Engine

collects confidence scores from the previous steps in order to decide what to crowdsource, together with data from the graph database to construct the HITs.

Finally, we gather the results provided by the crowd into the Probabilistic Network component,

which combines them to come up with a final matching decision. The generated matchings

are then given as output by ZenCrowd in the form of RDF <owl:sameAs> links that can be

added back to the LOD cloud.

Entity Linking Pipeline

The other task ZenCrowd performs is Entity Linking, that is, identifying occurrences of LOD

entities in textual content and creating links from the text to corresponding instances stored

in a database. ZenCrowd takes as input sets of HTML pages (that can for example be provided

by a Web crawler). The HTML pages are then passed to Entity Extractors that inspect the

pages and identify potentially relevant textual entities (e.g., persons, companies, places, etc.)

mentioned in the page. Once detected, the entities are fed into Algorithmic Linkers that

attempt to automatically link the textual entities to semantically similar instances from the

LOD cloud. As querying the Web of data dynamically to link each entity would incur a very

high latency, we build a local cache (called LOD Index in Figure 4.1) to locally retrieve and

index relevant information from the LOD cloud. Algorithmic linkers return lists of top-k links

to LOD entities, along with a confidence value for each potentially relevant link.

The results of the algorithmic linkers are stored in a Probabilistic Network, and are then

combined and analyzed using probabilistic inference techniques. ZenCrowd treats the results

of the algorithmic linkers in three different ways depending on their quality. If the algorithmic

results are deemed excellent by our Decision Engine, the results (i.e., the links connecting a

textual entity extracted from an HTML page to the LOD cloud) get stored in a local database

directly. If the results are deemed useless (e.g., when all the links picked by the linkers have a

low confidence value), the results get discarded. Finally, if the results are deemed promising but

uncertain (for example because several algorithmic linkers disagree on the links, or because

their confidence values are relatively low), they are then passed to the Micro-Task Manager,

which extracts relevant snippets from the original HTML pages, collects all promising links,

and dynamically creates a micro-task using a templating engine. An example of micro-task for

the entity linking pipeline is shown in Figure 4.4. Once created, the micro-task is published

on a crowdsourcing platform, where it is handled by the crowd workers. When the human

workers have performed their task (i.e., when they have picked the relevant links for a given

textual entity), workers results are fed back to the Probabilistic Network. When all the links

are available for a given HTML page, an enriched HTML page—containing both the original

HTML code as well as RDFa annotations linking the textual entities to their counterpart from

the LOD cloud—is finally generated.


4.3.2 LOD Index and Graph Database

The LOD index engine is an information retrieval engine that we built on top of the LOD

dataset to speed up the entity retrieval process. While most LOD datasets provide a public

SPARQL interface, they are in practice very cumbersome to use due to the very high latency

(from several hundreds of milliseconds to several seconds) and bandwidth consumption they

impose. Instead of querying the LOD cloud dynamically for each new instance to be matched,

ZenCrowd caches locally pertinent information from the LOD cloud. Our LOD Index engine

receives as input a list of SPARQL endpoints or LOD dumps as well as a list of triple patterns,

and iteratively retrieves all corresponding triples from the LOD datasets. Using multiple LOD

datasets improves the coverage of our system, since some datasets cover only geographical

locations, while other datasets cover the scientific domain or general knowledge. The infor-

mation thus extracted is cached locally in two ways: in a local graph query engine—offering

a SPARQL interface—and in an inverted index to provide efficient support for unstructured

queries.

After ranked results are obtained from the LOD index, a more in-depth analysis of the candidate

matches is performed by means of queries to a graph database. This component stores and

indexes data from the LOD datasets and accepts SPARQL queries to retrieve predicate value

pairs attached to the query node. This component is used both to define the confidence

scoring function by means of schema-matching results (Section 4.4.1) as well as to compute

confidence scores for candidate matches and to show matching evidence to the crowd (Section

4.4.2).

4.3.3 Probabilistic Graph & Decision Engine

Instead of using heuristics or arbitrary rules, ZenCrowd systematizes the use of Probabilistic

Networks to make sensible decisions about the potential instance matches and entities links.

All evidences gathered both from the algorithmic methods and the crowd are fed into a

Probabilistic Network, and used by our decision engine to process all entities accordingly. Our

probabilistic models are described in detail in Section 4.5.

4.3.4 Extractors, Algorithmic Linkers & Algorithmic Matchers

The Extractors and Algorithmic Linkers are used exclusively by the Entity Linking pipeline (see

Figure 4.1). The Entity Extractors receive HTML as input, and extract named entities appearing

in the HTML content as output. Entity Extraction is an active area of research and a number of

advances have recently been made in that field (using for instance third-party information or

novel NLP techniques). Entity extraction is not the focus of our work in ZenCrowd. However,

we support arbitrary entity extractors through a generic interface in our system and union

their respective output to obtain additional results.

Once extracted, the textual entities are inspected by algorithmic linkers, whose role is to


find semantically related entities from the LOD cloud. ZenCrowd implements a number of

state of the art linking techniques (see Section 4.7 for more details) that take advantage of

the LOD Index component to efficiently find potential matches. Each matcher also imple-

ments a normalized scoring scheme, whose results are combined by our Decision Engine (see

Section 4.5).

4.3.5 Three-Stage Blocking for Crowdsourcing Optimization

For the Instance Matching pipeline, a naive implementation of an Algorithmic Matcher would

check each pair of instances from two input datasets. However, the problem of having to

deal with too many candidate pairs rapidly surfaces. Moreover, crowdsourcing all possible

candidate pairs is unrealistic: For example, matching two datasets containing just 1’000

instances each would cost $150’000 if we crowdsource 1’000’000 possible pairs to 3 workers

paying $0.05 per task. Instead, we propose a three-stage blocking approach.

A common way to deal with the quadratic number of potential comparisons is blocking (see

Section 4.8). Basically, blocking groups promising candidate pairs together in sets using a

computationally inexpensive method (e.g., clustering) and, as a second step, performs all

possible comparisons within such sets using a more expensive method (e.g., string similarity).

ZenCrowd uses a three-stage blocking approach that involves crowdsourcing as an additional

step in the blocking process (see the three stages in Figure 4.1). Crowdsourcing the instance

matching process is expensive both in terms of latency as well as financially. For this reason,

only a very limited set of candidate pairs should be crowdsourced when matching large

datasets.

Given a source instance from a dataset, ZenCrowd considers all instances of the target dataset

as possible matches. The first blocking step is performed by means of an inverted index

over the labels of all instances in the target dataset. This allows us to produce a list of instances

ranked by a scoring function that measures the likelihood of matching the source instance

very efficiently (i.e., in the order of milliseconds).

As a second step, ZenCrowd computes a more accurate but also more computationally ex-

pensive matching confidence for the top-ranked instances generated by the first step. This

confidence value is computed based on schema matching results among the two datasets and

produces a score in [0,1]. This value is not computed on all instances of the target dataset

but rather for those that are likely to be a good match as given by the first blocking step (see

Section 4.4.1).

This hybrid approach exploiting the interdependence of unstructured indices as well as

structured queries against a graph database is similar to the approach taken in [151] where,

for the task of Ad-Hoc Object Retrieval, a ranked list of results is improved by means of an

analysis of the result vicinity in the graph.


The final step consists in asking the crowd about candidate matching pairs. Based on the

confidence score computed during the previous step, ZenCrowd takes a decision about which

HITs to create on the crowdsourcing platform. As the goal of the confidence score is to

indicate how likely it is that a pair is a correct match, the system selects those cases where

the confidence is not already high enough so that it can be further improved by asking the

crowd. Possible instantiations of this step may include the provision of a fixed budget for the

crowdsourcing platform, which the system is allowed to spend in order to optimize the quality

of the results. Generally speaking, the system produces a ranked list of candidate pairs to be

crowdsourced based on the confidence score. Then, given the available resources, top pairs

are crowdsourced by batch to improve the accuracy of the matching process. On the other

hand, improving the task completion time can be obtained by increasing the reward assigned

to workers.

4.3.6 Micro-Task Manager

The micro-task manager is responsible for dynamically creating human computation tasks

that are then published on a crowdsourcing platform. Whenever a match is deemed promising

by our Decision Engine (see below for details), it is sent to the crowd for further examination.

The micro-task manager dynamically builds a Web page to be published on the crowdsourcing

platform using three resources: i) the name of the source instance, ii) some contextual information generated by querying the graph database, and iii) the current top-k matches for the

instance from the blocking process. Once created and published, the matching micro-tasks

can be selected by workers on the crowdsourcing platform, who are then asked to select the

relevant matches (if any) for the source instance, given its name, the contextual information

from the graph database, and the various candidate matches described as in the LOD cloud.

Once performed, the results of the micro-matching tasks are sent back to the Micro-Task

Manager, which inserts them in the Probabilistic Network.

4.4 Effective Instance Matching based on Confidence Estimation and Crowdsourcing

In this section, we describe the final steps of the blocking process that assure high quality

instance matching results. We first define our schema-based matching confidence measure,

which is then used to decide which candidate matchings to crowdsource. Then, we present

different approaches to crowdsourcing instance matching tasks. Specifically, we compare two

different HIT designs where different context information about the instances is presented to

the worker.


Table 4.1 – Top ranked schema element pairs in DBPedia and Freebase for the Person, Location, and Organization instances.

Organization (DBPedia → Freebase):
  http://www.w3.org/2000/01/rdf-schema#label → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/property/established → http://rdf.freebase.com/ns/education.educational_institution.founded
  http://dbpedia.org/property/foundation → http://rdf.freebase.com/ns/business.company.founded
  http://dbpedia.org/property/companyName → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/property/founded → http://rdf.freebase.com/ns/sports.sports_team.founded

Person (DBPedia → Freebase):
  http://www.w3.org/2000/01/rdf-schema#label → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/ontology/birthdate → http://rdf.freebase.com/ns/people.person.date_of_birth
  http://dbpedia.org/property/name → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/property/dateOfBirth → http://rdf.freebase.com/ns/people.person.date_of_birth
  http://dbpedia.org/property/dateOfDeath → http://rdf.freebase.com/ns/people.deceased_person.date_of_death
  http://dbpedia.org/property/birthname → http://rdf.freebase.com/ns/common.topic.alias

Location (DBPedia → Freebase):
  http://www.w3.org/2000/01/rdf-schema#label → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/property/establishedDate → http://rdf.freebase.com/ns/location.dated_location.date_founded
  http://dbpedia.org/ontology/demonym → http://rdf.freebase.com/ns/freebase.linguistic_hint.adjectival_form
  http://dbpedia.org/property/name → http://rdf.freebase.com/ns/type.object.name
  http://dbpedia.org/property/isocode → http://rdf.freebase.com/ns/location.administrative_division.iso_3166_2_code
  http://dbpedia.org/property/areaTotalKm → http://rdf.freebase.com/ns/location.location.area

4.4.1 Instance-Based Schema Matching

While using the crowd to match instances across two datasets typically results in high quality

matchings, it is often infeasible to crowdsource all potential matches because of the very

high financial cost associated. Thus, as a second filtering step, we define a new measure that

computes the confidence of a matching as generated by the initial inverted index blocking

step.

Formally, given a candidate matching pair (i1, i2) we define a function f(i1, i2) that creates a

ranked list of candidate pairs such that the pairs ranked at the top are the most likely to be

correct. In such a way, it is possible to selectively crowdsource candidate matchings with lower

confidence to improve matching precision with a limited cost.

The matching confidence measure used by ZenCrowd is based on schema matching informa-

tion. The first step in the definition of the confidence measure consists in using a training set

of matchings among the two datasets5. Given a training pair (t1, t2) we retrieve all predicates

and values for the instances t1 and t2 and perform an exact string match comparison of their

values. At the end of such process, we rank predicate pairs by the number of times an exact

match on their values has occurred. Table 4.1 gives the top ranked predicate pairs for the

DBPedia and Freebase datasets. We observe that this simple instance-based schema mapping

technique yields excellent results for many LOD schemas. For instance, for the entity type

5 In our experiments, we use 100 ground-truth matchings that are discarded later when evaluating the proposed matching approaches.


person in Table 4.1, where ‘birthdate’ from DBPedia is correctly matched to ‘date_of_birth’

from Freebase.
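A minimal sketch of this instance-based schema matching step, under the assumption that the predicate values of an instance can be fetched as sets of strings; the helper names are ours:

    # Sketch: rank predicate pairs by how often their values match exactly over the
    # ground-truth training pairs (as used to build Table 4.1). get_predicate_values is
    # an assumed helper returning {predicate: set_of_string_values} for an instance URI.
    from collections import Counter

    def rank_predicate_pairs(training_pairs, get_predicate_values):
        counts = Counter()
        for t1, t2 in training_pairs:
            pv1, pv2 = get_predicate_values(t1), get_predicate_values(t2)
            for p1, vals1 in pv1.items():
                for p2, vals2 in pv2.items():
                    if vals1 & vals2:          # at least one exact string match
                        counts[(p1, p2)] += 1
        return counts.most_common()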

After the list of schema elements has been matched across the two datasets, we define the

confidence measure for an individual candidate matching pair. To obtain a confidence score

in [0,1] we compute the average Jaccard similarity among all tokenized values of all matched

schema elements for the two candidate instances u1 and u2. In the case a list of values is

assigned to a schema element (e.g., a DBPedia instance may have multiple labels that represent

the instance name in different languages) we retain the maximum Jaccard similarity value in

the list for that schema element. For example, the confidence score of the following matching

pair will be ((2/3) + 1)/2 = 0.83:

    u1                                      u2
    rdfs:label        barack h. obama       fb:name             barack obama
    dbp:dateOfBirth   08-04-61              fb:date_of_birth    08-04-61
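A minimal sketch of this confidence computation; the helper and argument names are ours and not ZenCrowd's API:

    # Sketch: average, over the matched schema elements, of the maximum Jaccard
    # similarity between tokenized values of the two candidate instances.
    def jaccard(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

    def matching_confidence(u1_values, u2_values, matched_elements):
        """u*_values map a schema element to a list of string values; matched_elements
        lists (element_in_u1, element_in_u2) pairs such as those of Table 4.1."""
        scores = []
        for e1, e2 in matched_elements:
            pairs = [(v1, v2) for v1 in u1_values.get(e1, []) for v2 in u2_values.get(e2, [])]
            if pairs:
                scores.append(max(jaccard(v1, v2) for v1, v2 in pairs))
        return sum(scores) / len(scores) if scores else 0.0

    # Example above: jaccard("barack h. obama", "barack obama") = 2/3 and
    # jaccard("08-04-61", "08-04-61") = 1, so the confidence is (2/3 + 1) / 2 ≈ 0.83.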

4.4.2 Instance Matching with the Crowd

We now turn to the description of two HIT designs we experimented with for crowdsourc-

ing instance matching in ZenCrowd. Previous work also compared different interfaces to

crowdsourcing instance matching tasks [164]. Specifically, the authors compared pairwise and

table-based matching interfaces. Instead, we compare matching interfaces based on different

pieces of information given to the worker directly on the HIT page.

Figures 4.2 and 4.3 show our two different interfaces for the instance matching task. The label-

only matching interface asks the crowd to find a target entity among the proposed matches.

In this case, the target entity is presented as its label with a link to the corresponding LOD

webpage. Then, the top ranked instances from the DBPedia dataset, which are candidates to

match the target entity, are shown. This interface is reminiscent of the automatic approach

based on the inverted index that performs the initial blocking step though on a larger scale

(i.e., only a few candidates are shown to the worker in this case).

The molecule interface also asks the worker to identify the target entity (from Freebase in the

figure) in the table containing top-ranked entities from DBPedia. This second interface defines

a simpler task for the worker by presenting directly on the HIT page relevant information

about the target entity as well as about the candidate matches. In this second version of the

interface, the worker is asked to directly match the instance on the left with the corresponding

instance on the right. Compared to the first matching interface, the molecule interface does

not just display the labels but also additional information (property and value pairs) about

each instance. Such information is retrieved from the graph database and displayed to the

worker.

In both interfaces, the worker can select the “No match” option if no instance matches the target entity. An additional field is available for the worker to leave comments.

Figure 4.2 – The Label-only instance matching HIT interface, where entities are displayed as textual labels linking to the full entity descriptions in the LOD cloud.

Manual inspection of crowdsourcing results has shown that most of the errors on the many-to-

many matching interface were due to the fact that the workers did not match the target entity

but, rather, they correctly matched a different entity. For example, when the NYT target entity

was a city, many workers instead selected from both Freebase and DBPedia tables an instance

about the music festival hosted in that city. Therefore, while the matching between the two

tables is correct as the same instance was identified in both candidate sets, this was not the

target instance the task was asking to match.

4.5 Probabilistic Models

ZenCrowd exploits probabilistic models to make sensible decisions about candidate results.

We describe below the probabilistic models used to systematically represent and combine

information in ZenCrowd, and how those models are implemented and handled by our system.

In the following we use factor-graphs to graphically represent probabilistic variables and

distributions. Note that our approach is not bound to this representation—we could use series

of conditional probabilities only or other probabilistic graphical models—but we decided to

use factor-graphs for their illustrative merits. For an in-depth coverage on factor graphs, we

refer the interested reader to one of the many overviews on this domain, such as [95], or to our

brief introduction made in [50].


Figure 4.3 – The Molecule instance matching HIT interface, where the labels of the entities as well as related property-value pairs are displayed.

4.5.1 Graph Models

We start by describing the probabilistic graphs used to combine all matching evidences

gathered for a given candidate URI. Consider an instance from the source dataset. The

candidate matches are stored as a list of potential matchings mj from a LOD dataset. Each mj has a prior probability distribution pmj computed from the confidence matching function. Each candidate can also be examined by human workers wi performing micro-matching tasks and issuing clicks cij to express the fact that a given candidate matching corresponds (or

not) to the source instance from his/her perspective.

Workers, matchings, and clicks are mapped onto binary variables in our model. Workers

accept two values {Good, Bad} indicating whether they are reliable or not. Matchings can either be Correct or Incorrect. As for click variables, they represent whether worker i considers that the source instance is the same as the proposed matching mj (Correct) or not (Incorrect). We store prior distributions—which represent a priori knowledge obtained for example through training phases or thanks to external sources—for each worker (pwi()) and each matching (pmj()). The clicks are observed variables and are set to Correct or Incorrect

depending on how the human workers clicked on the crowdsourcing platform.

A simple example of such an entity graph is given in Figure 4.5. Clicks, workers, and matchings

are further connected through two factors described below.

The same network can be instantiated for each entity of an Entity Linking task, where the mj are

candidate links from the LOD instead.


Figure 4.4 – The Entity Linking HIT interface.

Matching & Linking Factors

Task-specific (either matching or linking) factors mfj() connect each candidate to its related

clicks and the workers who performed those clicks. Examining the relationships between

those three classes of variables, we make two key observations: i) clicks from reliable workers

should weigh more than clicks from unreliable workers (actually, clicks from consistently

unreliable workers deciding randomly if a given answer is relevant or not should have no

weight at all in our decision process) and ii) when reliable workers do not agree, the likelihood

of the answer being correct should be proportional to the fraction of good workers indicating

the answer as correct. Taking into account both observations, and mapping the value 0 to Incorrect and 1 to Correct, we write the following function for the factor:

mf(w_1, \dots, w_m, c_1, \dots, c_n, m) =
\begin{cases}
0.5, & \text{if } \forall w_i \in \{w_1, \dots, w_m\}: w_i = \text{Bad} \\[6pt]
\dfrac{\sum_i \mathbb{1}(w_i = \text{Good} \,\wedge\, c_i = m)}{\sum_i \mathbb{1}(w_i = \text{Good})}, & \text{otherwise}
\end{cases}
\qquad (4.1)

where 1(cond) is an indicator function equal to 1 when cond is true and 0 otherwise.
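To make the behaviour of this factor concrete, the following sketch (ours, not the ZenCrowd implementation) evaluates Equation 4.1 for a single candidate matching, given hypothetical reliability values and clicks:

```python
# A minimal sketch of the matching factor mf() from Equation 4.1: the weight of
# a value m for a candidate, given the clicks of the workers who evaluated it
# and whether those workers are reliable.

def matching_factor(worker_reliability, clicks, m=True):
    """worker_reliability: list of booleans (True = Good worker),
    clicks: list of booleans (True = worker clicked 'Correct'), aligned with workers,
    m: the value of the matching variable being scored (True = Correct)."""
    good = [c for w, c in zip(worker_reliability, clicks) if w]
    if not good:                      # all workers are Bad: uninformative factor
        return 0.5
    agreeing = sum(1 for c in good if c == m)
    return agreeing / len(good)       # fraction of good workers supporting value m

# Example: two good workers say Correct, one bad worker says Incorrect.
print(matching_factor([True, True, False], [True, True, False], m=True))  # 1.0
```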

Unicity Constraints for Entity Linking

The instance matching task definition assumes that only one instance from the target dataset can be a correct match for the source instance. Similarly, a concept appearing in textual content can only be mapped to a single entity from a given dataset. We can thus rule out all configurations where more than one candidate from the same LOD dataset is considered as Correct. The corresponding factor u() is declared as being equal to 1 and is


defined as follows:

u(m_1, \dots, m_n) =
\begin{cases}
0, & \text{if } \exists\, m_i, m_j \in \{m_1, \dots, m_n\},\ i \neq j,\ \text{such that } m_i = m_j = \text{Correct} \\[6pt]
1, & \text{otherwise}
\end{cases}
\qquad (4.2)

SameAs Constraints for Entity Linking

SameAs constraints are exclusively used in Entity Linking graphs. They exploit the fact that

the resources identified by the links to the LOD cloud can themselves be interlinked (e.g.,

dbpedia:Fribourg is connected through an owl:sameAs link to fbase:Fribourg in the LOD

cloud)6. Considering that the SameAs links are correct, we define a constraint on the variables

connected by SameAs links found in the LOD cloud; the factor sa() connecting those variables

puts a constraint forbidding assignments where the variables would not be set to the same

values as follows:

sa(l_1, \dots, l_n) =
\begin{cases}
1, & \text{if } \forall\, l_i, l_j \in \{l_1, \dots, l_n\}: l_i = l_j \\[6pt]
0, & \text{otherwise}
\end{cases}

We enforce the constraint by declaring sa() = 1. This constraint considerably helps the decision process when strong evidence (good priors, reliable clicks) is available for any of the URIs

connected to a SameAs link. When not all SameAs links should be considered as correct, further

probabilistic analyses (e.g., on the transitive closures of the links as defined in idMesh [45])

can be put into place.
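For illustration, a minimal sketch of the two hard constraints is given below; the function names and boolean encoding (True = Correct, or the same value across SameAs-linked variables) are ours, not the system's:

```python
# Minimal sketches of the two hard constraints in the entity factor-graph:
# unicity (Eq. 4.2) and SameAs agreement. Both return 0/1 and simply forbid
# invalid assignments.

def unicity_factor(matchings):
    """matchings: list of booleans (True = Correct) for candidates coming from
    the same LOD dataset; at most one of them may be Correct."""
    return 1.0 if sum(matchings) <= 1 else 0.0

def sameas_factor(linked_matchings):
    """linked_matchings: booleans of variables connected by owl:sameAs links;
    they must all take the same value."""
    return 1.0 if len(set(linked_matchings)) <= 1 else 0.0

print(unicity_factor([True, False, False]))   # 1.0 (valid assignment)
print(unicity_factor([True, True, False]))    # 0.0 (two Correct candidates)
print(sameas_factor([True, True]))            # 1.0 (consistent)
```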

6 We can already see the benefit of having better matchings across datasets for that matter.


4.5.2 Reaching a Decision

Given the scheme above, we can reach a sensible decision by simply running a probabilistic

inference method (e.g., the sum-product algorithm described above) on the network, and

considering as correct all matchings with a posterior probability P(l = Correct) > 0.5. The

Decision Engine can also consider a higher threshold τ > 0.5 for the decisions in order to

increase the precision of the results.
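As an illustration of the decision step, the sketch below performs exact inference by enumeration on a single candidate matching, using hypothetical worker priors and the factor of Equation 4.1, and then applies the threshold; it is a toy stand-in for the sum-product run over the full network, not the thesis implementation:

```python
# A minimal sketch of the decision step for one candidate matching: exact
# inference by enumeration, using worker priors and the factor of Eq. 4.1.

from itertools import product

def _mf(reliab, clicks, m):
    """Matching factor (Eq. 4.1): fraction of good workers whose click equals m."""
    good = [c for g, c in zip(reliab, clicks) if g]
    return 0.5 if not good else sum(c == m for c in good) / len(good)

def posterior_correct(worker_priors, clicks, prior_m=0.5):
    scores = {True: 0.0, False: 0.0}
    for m in (True, False):                        # candidate Correct / Incorrect
        p_m = prior_m if m else 1.0 - prior_m
        for reliab in product((True, False), repeat=len(worker_priors)):
            p_w = 1.0
            for good, prior in zip(reliab, worker_priors):
                p_w *= prior if good else 1.0 - prior
            scores[m] += p_m * p_w * _mf(reliab, clicks, m)
    return scores[True] / (scores[True] + scores[False])

# Two fairly reliable workers click Correct, one uncertain worker disagrees.
p = posterior_correct([0.8, 0.8, 0.5], [True, True, False])
print(round(p, 2), p > 0.5)   # 0.78 True; a higher tau trades Recall for Precision
```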

4.5.3 Updating the Priors

Our computations always take into account prior factors capturing a priori information about

the workers. As time passes, decisions are reached on the correctness of the various matches,

and the Probabilistic Network iteratively accumulates posterior probabilities on the reliability

of the workers. Actually, the network gets new posterior probabilities on the reliability of the

workers for every new matching decision that is reached. Thus, the Decision Engine can decide

to modify the priors of the workers by taking into account the evidences accumulated thus

far to enhance future results. In a probabilistic graphical model with missing observations, this corresponds to a parameter-learning phase. To tackle this type of problem, we use a simple Expectation-Maximization [44, 52] process, as follows:

- Initialize the prior probability of the workers using a training phase during which workers are evaluated on k matches whose results are known. Initialize their prior reliability to #correct_results/k. If no information is available or no training phase is possible, start with P(w = reliable) = P(w = unreliable) = 0.5 (maximum entropy principle).

- Gather posterior evidences on the reliability of the workers P(w = reliable | m_i = Correct/Incorrect) as soon as a decision is reached on a matching. Treat these evidences as new observations on the reliability of the workers, and update their prior beliefs iteratively as follows:

P(w = \text{reliable}) = \frac{1}{k} \sum_{i=1}^{k} P_i(w = \text{reliable} \mid m_i) \qquad (4.3)

where i runs over all evidences gathered so far (from the training phase and from the posterior

evidences described above). Hence, we make the prior values slowly converge to their maxi-

mum likelihood to reflect the fact that more and more evidences are being gathered about the

mappings as we reach more decisions on the instances. This technique can also be used to

identify and blacklist unreliable workers dynamically.
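A minimal sketch of this prior update, with hypothetical class and attribute names, could look as follows; the prior is simply the running average of all reliability evidences gathered so far (Equation 4.3):

```python
# A minimal sketch of the iterative prior update of Eq. 4.3: the worker prior
# is the running average of all posterior reliability estimates gathered so
# far, starting from the training-phase estimate.

class WorkerPrior:
    def __init__(self, correct_in_training=0, k_training=0):
        # Maximum-entropy start (0.5) if no training information is available.
        self.evidences = ([1.0] * correct_in_training +
                          [0.0] * (k_training - correct_in_training)) or [0.5]

    def update(self, posterior_reliable):
        """Add P(w = reliable | m_i) obtained after a new matching decision."""
        self.evidences.append(posterior_reliable)

    @property
    def prior(self):
        return sum(self.evidences) / len(self.evidences)

w = WorkerPrior(correct_in_training=4, k_training=5)   # training prior = 0.8
w.update(0.9)                                          # posterior from a new decision
print(round(w.prior, 3))                               # averaged evidence: 0.817
```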

4.5.4 Selective Model Instantiation

The framework described above actually creates a gigantic probabilistic graph, where all in-

stances, clicks, and workers are indirectly connected through various factors. However, only a

small subset of the variables need to be considered by the inference engine at any point in


time. Our system updates the various priors iteratively, but only instantiates the handful of

variables useful for reaching a decision on the entity currently examined. It thus dynamically

instantiates instance matching and entity linking factor-graphs, computes posterior probabili-

ties for the matchings and linking, reaches a decision, updates the priors, and stores back all

results before de-instantiating the graph and moving to the next instance/entity.
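Structurally, this processing loop can be sketched as follows; the store and infer interfaces are hypothetical placeholders for the actual system components:

```python
# A structural sketch of the selective instantiation loop; `store` and `infer`
# are hypothetical interfaces standing in for the actual system components.

def process_entity(entity, store, infer, tau=0.5):
    graph = store.instantiate_subgraph(entity)      # only the variables relevant
                                                    # to this entity are created
    posteriors = infer(graph)                       # e.g., sum-product inference;
                                                    # returns {candidate: P(Correct)}
    decisions = {cand: p > tau for cand, p in posteriors.items()}
    store.update_worker_priors(graph, posteriors)   # Eq. 4.3-style prior update
    store.save_decisions(entity, decisions)
    store.release(graph)                            # de-instantiate the sub-graph
    return decisions
```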

4.6 Experiments on Instance Matching

In this section, we experimentally evaluate the effectiveness of ZenCrowd for the Instance

Matching (IM) task. ZenCrowd is a relatively sophisticated system involving many components.

In the following, we present and discuss the results of a series of focused experiments, each

designed to illustrate the performance of a particular feature of our IM pipeline. We present

extensive experimental results evaluating the Entity Linking pipeline (depicted using an

orange background in Figure 4.1) in Section 4.7. Though many other experiments could have

been performed, we believe that the set of experiments presented below gives a particularly

accurate account of the performance of ZenCrowd for the IM task. We start by describing our

experimental setting below.

4.6.1 Experimental Setting

To evaluate the ZenCrowd IM pipeline based on Probabilistic Networks as well as on crowd-

sourcing, we use the following datasets: The ground truth matching data comes from the

Data Interlinking task from the Instance Matching track of the Ontology Alignment Evaluation

Initiative (OAEI) in 2011⁷. In this competition, the task was to match a given New York Times (NYT) URI⁸ to the corresponding URI in DBpedia, Freebase, and Geonames. The evaluation of

automatic systems is based on manual matchings created by the NYT editorial team. Starting

from such data, we obtained the corresponding Freebase-to-DBpedia links via transitivity

through NYT instances. Thus, the ground truth is available for the task of matching a Freebase

instance to the corresponding one in DBPedia, which is more challenging than the original

task as both Freebase and DBPedia are very large datasets generated semi-automatically as

compared to NYT data which is small and manually curated.

In addition, we use a standard graph dataset containing data about all instances in our test-

set (that is, the Billion Triple Challenge BTC 2009 dataset⁹) in order to run our graph-based

schema matching approach and to retrieve data that is presented to the crowd. The BTC 2009

consists of a crawl of RDF data from the Web containing more than one billion facts about 800

million instances.

7 http://oaei.ontologymatching.org/2011/instance/
8 http://data.nytimes.com/
9 http://km.aifb.kit.edu/projects/btc-2009/


First Blocking Phase: LOD Indexing and Instance Ranking. In order to select candidate

matchings for the source instance, we adopt IR techniques similar to those that have been

used by participants of the Entity Search evaluation at the Semantic Search workshop for

the AOR task, where a string representing an entity (i.e., the query) is used to rank URIs that

identify the entity. We build an inverted index over 40 million instance labels in the considered

LOD datasets, and run queries against it using the source instance labels in our test collection.

Unless specified otherwise, the top-5 results ranked by TF-IDF are used as candidates for the

crowdsourcing task after their confidence score has been computed.
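As an illustration of this blocking step, the sketch below (toy labels, assuming scikit-learn is available; the actual system uses its own inverted index over 40 million labels) ranks candidate instance labels by TF-IDF cosine similarity and keeps the top-5:

```python
# A minimal sketch of the first blocking phase: rank candidate instance labels
# by TF-IDF similarity to the source label and keep the top-k as candidates.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lod_labels = ["Fribourg (Switzerland)", "Freiburg im Breisgau",
              "University of Fribourg", "Fribourg District", "Canton of Fribourg"]

vectorizer = TfidfVectorizer().fit(lod_labels)       # stand-in for the inverted index
index = vectorizer.transform(lod_labels)

def top_k_candidates(source_label, k=5):
    scores = cosine_similarity(vectorizer.transform([source_label]), index)[0]
    ranked = sorted(zip(lod_labels, scores), key=lambda x: -x[1])
    return ranked[:k]

print(top_k_candidates("Fribourg"))
```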

Micro-Task Generation and ZenCrowd Aggregation. To evaluate the quality of each step

in the ZenCrowd IM pipeline, we selected a subset of 300 matching pairs from the ground

truth of different categories (100 persons, 100 locations, and 100 organizations). Then we

crowdsourced the entire collection to compare the quality of crowd matching against other

automatic matching techniques and their combinations.

The crowdsourcing tasks were run over Amazon Mechanical Turk¹⁰ as two independent

experiments for the two proposed matching interfaces (see Section 4.4.2). Each matching task

has been assigned to five different workers and remunerated with $0.05 each, employing a total of 91 workers¹¹.

We aggregate the results from the crowd using the method described in Section 4.5, with

an initial training phase consisting of 5 entities, and a second, continuous training phase,

consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a

task whose solution is known by the system every 20 tasks on average).

Evaluation Measures. In order to evaluate the effectiveness of the different components, we

compare—for each instance—the selected matches against the ground truth that provides

matching/non-matching data for each source instance. Specifically, we compute (P)recision

and (R)ecall which are defined as follows: We consider as true positives (tp) all cases where

both the ground truth and the approach select the same matches, false positives (fp) the cases

where the approach selects a match which is not considered as correct by the ground truth,

and false negatives (fn) the cases where the approach does not select a match while the ground truth does. Then, Precision is defined as P = tp/(tp + fp) and Recall as R = tp/(tp + fn).

In the following, all the final matching approaches (automatic, crowd majority vote, and

ZenCrowd) are optimized to return high precision values. We decided to focus on Precision

from the start since, from our experience, it is the most useful metric in practice; we have nonetheless observed that high Recall is obtained in most configurations.
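For clarity, the two measures can be computed as in the following sketch (toy decisions and ground truth, hypothetical pair identifiers):

```python
# A minimal sketch of the evaluation measures: Precision and Recall computed
# from per-instance matching decisions against the ground truth.

def precision_recall(decisions, ground_truth):
    """decisions / ground_truth: dicts mapping (source, candidate) -> bool."""
    tp = sum(1 for pair, sel in decisions.items() if sel and ground_truth.get(pair))
    fp = sum(1 for pair, sel in decisions.items() if sel and not ground_truth.get(pair))
    fn = sum(1 for pair, sel in ground_truth.items() if sel and not decisions.get(pair))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gt  = {("nyt:instance_A", "dbpedia:Entity_A"): True,
       ("nyt:instance_B", "dbpedia:Entity_B"): True}
dec = {("nyt:instance_A", "dbpedia:Entity_A"): True,
       ("nyt:instance_B", "dbpedia:Entity_C"): True}
print(precision_recall(dec, gt))   # (0.5, 0.5)
```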

10 http://www.mturk.com
11 The test-set we have created together with the matching results from the crowd are available for download at the page: http://exascale.info/ZenCrowd


4.6.2 Experimental Results

In the following we report the experimental results aiming at comparing the effectiveness

of different matching techniques at different stages of the blocking process. In detail, we

compare the results of our inverted index based matching, which is highly scalable but not

particularly effective, the matching based on schema information, and the matching provided

by the crowd, whose results are excellent but which is neither cost- nor time-efficient because of the high monetary cost it necessitates and the high latency it generates.

Recall of the First Blocking Phase. The first evaluation we perform is centered on the initial

blocking phase based on keyword queries over the inverted index. It is critical that such a

step, while being efficiently performed over a large amount of potential candidate matchings,

preserves as many correct results as possible in the generated ranked list (i.e., high Recall) in

order for the subsequent matching phases to be effective. This allows the graph and crowd

based matching schemes to focus on high Precision in turn.

Figure 4.6 shows how Recall varies by considering the top-N results as ranked by the inverted

index using TF-IDF values. As we can see, we already retrieve the correct matches for all the instances in our test-set within the top five candidates.

Figure 4.6 – Maximum achievable Recall by considering top-K results from the inverted index (x-axis: top-N TF-IDF results; y-axis: Recall, %).

Second Blocking Phase: Matching Confidence Function. The second blocking step in-

volves the use of a matching confidence measure. This function measures the likelihood

of a match given a pair of instances based on schema matching results and string comparison

on the values directly attached to the instances in the graph (see Section 4.3.5). The goal of

such a function is to identify the matching pairs that are worth crowdsourcing in

order to improve the effectiveness of the system.

Figure 4.7 shows how Precision and Recall vary by considering matching pairs that match best

according to our schema-based confidence measure. Specifically, by setting a threshold on

the confidence score we can let the system focus either on high Precision or on high Recall.

For instance, if we only trust matches with a confidence value of 1.0, then Precision is at its maximum (100%) but Recall is low (25%). That is, we would need to initiate many

crowdsourcing tasks to compensate.


Figure 4.7 – Precision and Recall as compared to Matching confidence values.

Final Phase: Crowdsourcing and Probabilistic Reasoning. After the confidence score has

been computed and the matching pairs have been selected, our system makes it possible to

crowdsource some of the results and aggregate them into a final matching decision. A standard

approach to aggregate the results from the crowd is majority voting: the 5 automatically

selected candidate matchings are all proposed to 5 different workers who have to decide which

matching is correct for the given instance. After the task is completed, the matching with the most votes is selected as the valid matching. The approach used by ZenCrowd, instead, is to aggregate

the crowd results by means of the Probabilistic Network described in Section 4.5.
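A minimal sketch of the majority-vote baseline (toy votes, hypothetical URIs) is given below:

```python
# A minimal sketch of the majority-vote baseline: each candidate matching
# proposed for an instance receives the votes of five workers, and the
# candidate with the most votes wins.

from collections import Counter

def majority_vote(votes):
    """votes: list of candidate URIs, one per worker (their selected matching)."""
    counts = Counter(votes)
    winner, _ = counts.most_common(1)[0]
    return winner

worker_votes = ["dbpedia:Fribourg", "dbpedia:Fribourg",
                "dbpedia:Freiburg_im_Breisgau", "dbpedia:Fribourg",
                "dbpedia:Canton_of_Fribourg"]
print(majority_vote(worker_votes))   # dbpedia:Fribourg
```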

Table 4.2 shows the precision values of the crowd on all the matching pairs in our test-set.

Table 4.3 shows the precision values of the automatic approaches and their combinations with

the crowd results based both on Majority Voting as well as using ZenCrowd.

HIT          Aggregation   Organizations   People   Locations
Label-only   Maj. Vote     0.67            0.70     0.65
Label-only   ZenCrowd      0.77            0.75     0.73
Molecule     Maj. Vote     0.74            0.85     0.73
Molecule     ZenCrowd      0.81            0.87     0.81

Table 4.2 – Crowd Matching Precision over two different HIT design interfaces (Label-only and Molecule) and two different aggregation methods (Majority Vote and ZenCrowd).

Approach                  Organizations   People   Locations
Inverted Index Baseline   0.78            0.98     0.89
Majority Vote             0.87            0.98     0.96
ZenCrowd                  0.89            0.98     0.97

Table 4.3 – Matching Precision for purely automatic and hybrid human/machine approaches.

From Table 4.2, we observe that i) the crowd performance improves by using the Molecule

interface, that is, displaying data about the matching candidates directly from the graph

database leads to higher Precision consistently across different entity types as compared to

the interface that only displays the instance name and lets the worker click on a link to obtain additional information; we also observe that ii) the Probabilistic Network used by

ZenCrowd to aggregate the outcome of crowdsourcing outperforms the standard Majority

Vote aggregation scheme in all cases.


Figure 4.8 – Number of tasks generated for a given confidence value (x-axis: confidence value; y-axis: number of HITs).

From Table 4.3 we can see that ZenCrowd outperforms i) the purely automatic matching base-

line based on the inverted index ranking function as well as ii) the hybrid matching approach

based on automatic ranking, schema-based matching confidence, and crowdsourcing. Addi-

tionally we observe that the most challenging type of instances to match in our experiment

is Organizations while People can be matched with high Precision using automatic methods

only. On average over the different entity types, we could match data with a 95% accuracy¹²

(as compared to the initial 88% average accuracy of the purely automatic baseline).

Crowdsourcing Cost Optimization. In addition to being interested in the effectiveness of

the different matching methods, we are also interested in their cost in order to be able to

select the best trade-off among the available combinations. In the following, we report on

results focusing on an efficient selection of the matching pairs that the system crowdsources.

After the initial blocking step based on the inverted index (that is able to filter out most of the

non-relevant instances) we compute a confidence matching score for all top ranked instances

using the schema-based method. This second blocking step allows ZenCrowd to select, based

on a threshold on the computed confidence score, which matching pairs to crowdsource.

Setting a threshold thus allows us to crowdsource only the cases with low confidence.
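The selection itself reduces to a simple threshold test, as sketched below with toy candidate pairs and hypothetical confidence scores:

```python
# A minimal sketch of the cost-optimization step: only the candidate pairs
# whose matching confidence falls below a threshold are sent to the crowd;
# the rest are decided automatically.

def split_by_confidence(candidates, threshold):
    """candidates: dict mapping (source, target) -> matching confidence in [0, 1]."""
    automatic = {p: c for p, c in candidates.items() if c >= threshold}
    to_crowdsource = {p: c for p, c in candidates.items() if c < threshold}
    return automatic, to_crowdsource

pairs = {("fbase:instance_A", "dbpedia:Entity_A"): 0.95,
         ("fbase:instance_B", "dbpedia:Entity_B"): 0.40}
auto, crowd = split_by_confidence(pairs, threshold=0.5)
print(len(auto), len(crowd))   # 1 1 -> one matching pair becomes a HIT
```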

Figure 4.8 shows how many HITs are generated by ZenCrowd by varying the threshold on

the confidence score. As we can see when we set the confidence threshold to 0 then we

trust completely the automatic approach and crowdsource no matching. By increasing the

threshold on the matching confidence we are required to crowdsource matchings for more

than half of our test-set instances. Compared to Figure 4.7, we can see that the increase in the

gap between Precision and Recall corresponds to the number of crowdsourced tasks: if Recall

is low we need to crowdsource new matching tasks to obtain results about those instances the

automatic approach could not match with high confidence.

Crowd Performance Analysis. We are also interested in understanding how the crowd per-

forms on the instance matching task. Figure 4.9 shows the trade-off between the crowdsourc-

ing cost and the matching precision. We observe that our system is able to improve the overall

matching Precision by rewarding more workers (i.e., we select top-K workers based on their prior probability, which is computed according to their past performance). On the other hand, it is possible to reduce the cost (as compared to the original 5-worker setup) with a limited loss in Precision by considering fewer workers.

Figure 4.9 – ZenCrowd money saving by considering results from top-K workers only.

12 This is the average accuracy over all entity types reported in Table 4.3.

Table 4.4 compares the crowd performance over the two different HIT designs. When compar-

ing the two designs, we can observe that more errors are made with the Label-only interface

(i.e., 66 vs 38) as the workers do not have much information directly on the HIT page. Interest-

ingly, we can also see that the common errors are minimal (i.e., 20 out of 300) which motivates

further analysis and possible combinations of the two designs.

                     Label-only Correct   Label-only Wrong
Molecule Correct     176                  66
Molecule Wrong       38                   20

Table 4.4 – Correct and incorrect matchings as by crowd Majority Voting using two different HIT designs.

Figure 4.10 presents the worker accuracy as compared to the number of tasks performed by

the worker. As we can see most of the workers reach Precision values higher than 50% and the

workers who contributed most provide high quality results. When compared with the worker

Precision over the Entity Linking task (see Figure 4.16 top) we can see that while the Power Law

distribution of completed HITs remains (see Figure 4.17), the crowd Precision on the Instance

Matching task is clearly higher than on the Entity Linking task.

Finally, we briefly comment on the efficiency of our IM approach. In its current implemen-

tation, ZenCrowd takes on average 500ms to select and rank candidate matchings out of

the inverted index, 125ms to obtain instance information from the graph DB, and 500ms to

generate a micro-matching task on the crowdsourcing platform. The decision process takes

on average 100ms. Without taking into account any parallelization, our system can thus offer

a new matching task to the crowd roughly every second, which in our opinion is sufficient

for most applications. Once on the crowdsourcing platform, the tasks have a much higher

latency (several minutes to a few hours), latency which is however mitigated by the fact that

instance matching is an embarrassingly parallel operation on crowdsourcing platforms (i.e.,

large collections of workers can work in parallel at any given point in time).


Figure 4.10 – Distribution of the workers’ precision using the Molecule design as compared to the number of tasks performed by the workers (x-axis: number of HITs per worker; y-axis: precision of the worker, in %).

4.6.3 Discussion

Looking back at the experimental results presented so far, we first observe that crowdsourcing

instance matching is useful to improve the effectiveness of an instance matching system. State-of-the-art majority voting crowdsourcing techniques yield a relative Precision improvement of up to 12% over a purely automatic baseline (going from 0.78 to 0.87). ZenCrowd takes advantage of a probabilistic framework for making decisions and performs even better, leading to a relative performance improvement of up to 14% over our best automatic matching approach (going from 0.78 to 0.89)¹³.

A more general observation is that instance matching is a challenging task, which can rapidly

become impractical when errors are made at the initial blocking phases. Analyzing the

population of workers on the crowdsourcing platform (see Figure 4.17), we observe that the

number of tasks performed by a given worker exhibits a long-tail distribution (i.e., few workers

perform many tasks, while many workers perform a few tasks only). Also, we observe that

the average precision of the workers is broadly distributed between [0.5,1] (see Figure 4.10).

As workers cannot be selected dynamically for a given task on the current crowdsourcing

platforms (all we can do is prevent some workers from receiving any further task through

blacklisting, or decide not to reward workers who consistently perform badly), obtaining perfect

matching results is thus in general unrealistic for non-controlled settings.

4.7 Experiments on Entity Linking

4.7.1 Experimental Setting

Dataset Description. In order to evaluate ZenCrowd on the Entity Linking (EL) task, we

created an ad-hoc test collection¹⁴. The collection consists of 25 news articles written in

English from CNN.com, NYTimes.com, washingtonpost.com, timesofindia.indiatimes.com,

13 The improvement is statistically significant (t-test p < 0.05).
14 The test collection we created is available for download at: http://exascale.info/zencrowd/.


and swissinfo.com, which were manually selected to cover global interest news (10), US local

news (5), India local news (5), and Switzerland local news (5). After the full text of the articles

has been extracted from the HTML page [93], 489 entities were extracted from it using the

Stanford Parser [92] as entity extractor. The collection of candidate URIs is composed of all

entities from DBPedia¹⁵, Freebase¹⁶, Geonames¹⁷, and NYT¹⁸, summing up to approximately

40 million entities (23M from Freebase, 9M from DBPedia, 8M from Geonames, 22K from

NYT). Expert editors manually selected the correct URIs for all the entities in the collection to

create the ground truth for our experiments. Crowdsourcing was performed using the Amazon

MTurk¹⁹ platform, where 80 distinct workers were employed. A single task, paid $0.01,

consisted of selecting the correct URIs out of the proposed five URIs for a given entity.

In the following, we present and discuss the results of a series of focused experiments, each

designed to illustrate the performance of a particular feature of our EL pipeline or of related

techniques. We start by describing a relatively simple base-configuration for our experimental

setting below.

LOD Indexing, Entity Linking and Ranking. In order to select candidate URIs for an entity,

we adopt the same IR techniques used for the IM task. We build an inverted index over 40

million entity labels in the considered LOD datasets, and run queries against it using the

entities extracted from the news articles in the test collection. Unless specified otherwise, the

top 5 results ranked by TF-IDF are used as candidates for the crowdsourcing task.

Micro-Task Generation. We dynamically create a task on MTurk for each entity sent to

the crowd. We generate a micro-task where the entity (possibly with some textual context)

is shown to the worker who has then to select all the URIs that match the entity, with the

possibility to click on the URI and visit the corresponding webpage. If no URI matches the

entity, the worker can select the “None of the above” answer. An additional field is available

for the worker to leave comments.

Evaluation Measures. In order to evaluate the effectiveness of our EL methods we compare,

for each entity, the selected URIs against the ground truth which provides matching/non-

matching information for each candidate URI. Similarly to what we did for the IM task eval-

uation, we compute (P)recision, (R)ecall, and (A)ccuracy which are defined as follows: We

consider as true positives (tp) all cases where both the ground truth and the approach select

the URI, true negatives (tn) the cases where both the ground truth and the approach do not se-

lect the URI for the entity, false positives (fp) the cases where the approach selects a URI which

15http://dbpedia.org/16http://www.freebase.com/17http://www.geonames.org/18http://data.nytimes.com/19http://www.mturk.com


All Entities Linkable EntitiesP R P R

GL News 0.27 0.67 0.40 1.0US News 0.17 0.46 0.36 1.0IN News 0.22 0.62 0.36 1.0SW News 0.21 0.63 0.34 1.0All News 0.24 0.63 0.37 1.0

Table 4.5 – Performance results for the candidate selection approach.

is not considered correct by the ground truth, and false negatives (fn) the cases where the approach does not select a URI that is correct in the ground truth. Then, Precision is defined as P = tp/(tp + fp), Recall as R = tp/(tp + fn), and Accuracy as A = (tp + tn)/(tp + tn + fp + fn).

In the following, all the final EL approaches (automatic, majority vote, and ZenCrowd) are op-

timized to return high precision values. We decided to focus on precision from the start, since

from our experience it is the most useful metric in practice (i.e., entity linking applications

typically tend to favor precision to foster correct information processing capabilities, at the

expense of not linking some of the entities).

4.7.2 Experimental Results

Entity Extraction and Linkable Entities. We start by evaluating the performance of the

entity extraction process. As described above, we use a state of the art extractor (the Stanford

Parser) for this task. According to our ground truth, 383 out of the 488 automatically extracted

entities can be correctly linked to URIs in our experiments, while the remaining ones are either

wrongly extracted, or are not available in the LOD cloud we consider. Unless stated otherwise,

we average our results over all linkable entities, i.e., all entities for which at least one correct

link can be picked out (we disregard the other entities for several experiments, since they were

wrongly extracted from the text or are not at all available in the LOD data we consider and

thus can be seen as a constant noise level in our experiments).

Candidate Selection. We now turn to the evaluation of our candidate selection method. As

described above, candidate selection consists, in the present case, in ranking URIs using TF-IDF given an extracted entity²⁰. We focus on high Recall for this phase (i.e., we aim at keeping

as many potentially interesting candidates as possible), and decided to keep the top-5 URIs

produced by this process. Thus, we aim at preserving as many correct URIs as possible for

later linking steps (e.g., in order to provide good candidate URIs to the crowd). We report on

the performance of candidate selection in Table 4.5.

20 Our approach is hence similar to [29], though we do not use BM25F as a ranking function.


Figure 4.11 – Average Recall of candidate selection when discriminating on max relevance probability in the candidate URI set (x-axis: max matching probability; y-axis: Recall of top-5).

As we can observe, results are consistent with our goal since all interesting candidates are

preserved by this method (Recall of 1 for the linkable entities set).

Then, we examine the potential role of the highest confidence scores in the candidate selection

process. This analysis helps us decide when crowdsourcing an EL task is useful and when it is

not. In Figure 4.11, we report on the average recall of the top-5 candidates when classifying

results based on the maximum confidence score obtained (top-1 score). The results are

averaged over all extracted entities²¹.

As expected, we observe that high confidence values for the candidates selection lead to high

recall and, therefore, to candidate sets which contain many of the correct URIs. For this reason,

it is useful to crowdsource EL tasks only for those cases exhibiting relatively high confidence

values (e.g., > 0.5). When the highest confidence value in the candidate set is low, it is then

more likely that no URI will match the entity (because the entity has no URI in the LOD cloud

we consider, or because the entity extractor extracted the entity wrongly).

On the other hand, crowdsourcing might be unnecessary for cases where the Precision of the

automatic candidate selection phase is already quite high. The automatic selection techniques

can be adapted to identify the correct URIs in a completely automatic fashion. In the following,

we automatically select top-1 candidates only (i.e., the link with the highest confidence),

in order to focus on high Precision results as required by many practical applications. A

different approach focusing on recall might select all candidates with a confidence higher

than a certain threshold. Figure 4.12 reports on the performance of our fully automatic entity

linking approaches. We observe that when the top-1 URI is selected, the automatic approach

reaches a Precision value of 0.70 at the cost of low Recall (i.e., fewer links are picked). As later results will show, crowdsourcing techniques can improve both Precision and Recall over this automatic entity linking approach in all cases.

21 Confidence scores have all been normalized to [0,1] by manually defining a transformation function.


Figure 4.12 – Performance results (Precision, Recall) for the automatic approach (left: Precision/Recall vs. matching probability threshold; right: Precision/Recall vs. top-N results).

Entity Linking using Crowdsourcing with Majority Vote. We now report on the perfor-

mance of a state of the art crowdsourcing approach based on majority voting: the 5 au-

tomatically selected candidate URIs are all proposed to 5 different workers who have to decide

which URI(s) is (are) correct for the given entity. After the task is completed, the URIs with

at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2

in the end since it leads to the highest precision scores while keeping good recall values for

our experiments). We report on the performance of this crowdsourcing technique in Table

4.6. The values are averaged over all linkable entities for different document types and worker

communities.

           US Workers            Indian Workers
           P     R     A         P     R     A
GL News    0.79  0.85  0.77      0.60  0.80  0.60
US News    0.52  0.61  0.54      0.50  0.74  0.47
IN News    0.62  0.76  0.65      0.64  0.86  0.63
SW News    0.69  0.82  0.69      0.50  0.69  0.56
All News   0.74  0.82  0.73      0.57  0.78  0.59

Table 4.6 – Performance results for crowdsourcing with majority vote over linkable entities.

The first question we examine is whether there is a difference in reliability between the various

populations of workers. In Figure 4.13 we show the performance for tasks performed by

workers located in USA and India (each point corresponds to the average Precision and Recall

over all entities in one document). On average, we observe that tasks performed by workers

located in the USA lead to higher precision values. As we can see in Table 4.6, Indian workers

obtain higher Precision and Recall on local Indian news as compared to US workers. The

biggest difference in terms of accuracy between the two communities can be observed on the

global interest news.

A second question we examine is how the textual context given for an entity influences the

worker performance. In Figure 4.14, we compare the tasks for which only the entity label is


Figure 4.13 – Per document task effectiveness (Precision vs. Recall for US and Indian workers).

Figure 4.14 – Crowdsourcing results with two different textual contexts (Precision per document, Simple vs. Snippet contexts).

           US Workers            Indian Workers
           P     R     A         P     R     A
GL News    0.84  0.87  0.90      0.67  0.64  0.78
US News    0.64  0.68  0.78      0.55  0.63  0.71
IN News    0.84  0.82  0.89      0.75  0.77  0.80
SW News    0.72  0.80  0.85      0.61  0.62  0.73
All News   0.80  0.81  0.88      0.64  0.62  0.76

Table 4.7 – Performance results for crowdsourcing with ZenCrowd over linkable entities.

given (simple) to those for which a context consisting of all the sentences containing the entity

is shown to the worker (snippets). Surprisingly, we could not observe a significant difference

in effectiveness caused by the different textual contexts given to the workers. Thus, we focus

on only one type of context for the remaining experiments (we always give the snippet context).

Entity Linking with ZenCrowd. We now focus on the performance of the probabilistic infer-

ence network as proposed in this chapter. We consider the method described in Section 4.5,

with an initial training phase consisting of 5 entities, and a second, continuous training phase,

consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a

task whose solution is known by the system every 20 tasks on average).


Figure 4.15 – Comparison of three linking techniques (Precision per document for Agr. Vote, ZenCrowd, and Top 1).

In order to reduce the number of tasks having little influence on the final results, a simple technique for blacklisting bad workers is used. A bad worker (who can be considered as a

spammer) is a worker who randomly and rapidly clicks on the links, hence generating noise in

our system. In our experiments, we consider that 3 consecutive bad answers in the training

phase are enough to identify the worker as a spammer and to blacklist him/her. We report the

average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting

in Table 4.7. As we can observe, precision and accuracy values are higher in all cases when

compared to the majority vote approach (see Table 4.6).
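The blacklisting rule itself is straightforward; a minimal sketch with hypothetical bookkeeping could look as follows:

```python
# A minimal sketch of the spammer-blacklisting rule used in the experiments:
# a worker who gives three consecutive wrong answers on training tasks is
# blacklisted.

class SpamFilter:
    def __init__(self, max_consecutive_bad=3):
        self.max_bad = max_consecutive_bad
        self.consecutive_bad = {}      # worker id -> current streak of bad answers
        self.blacklist = set()

    def record_training_answer(self, worker, is_correct):
        streak = 0 if is_correct else self.consecutive_bad.get(worker, 0) + 1
        self.consecutive_bad[worker] = streak
        if streak >= self.max_bad:
            self.blacklist.add(worker)
        return worker in self.blacklist

f = SpamFilter()
for answer in (False, False, False):            # three bad training answers
    blacklisted = f.record_training_answer("worker_42", answer)
print(blacklisted)                               # True
```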

Finally, we compare ZenCrowd to the state of the art crowdsourcing approach (using the

optimal majority vote) and our best automatic approach on a per-task basis in Figure 4.15.

The comparison is given for each document in the test collection. We observe that in most

cases the human intelligence contribution improves the precision of the automatic approach.

We also observe that ZenCrowd dominates the overall performance (it is the best performing

approach in more than 3/4 of the cases).

Efficiency. Finally, we briefly comment on the efficiency of our approach. In its current im-

plementation, ZenCrowd takes on average 200ms to extract an entity from text, 500ms to select

and rank candidate URIs, and 500ms to generate a micro-linking task. The decision process

takes on average 100ms. Without taking into account any parallelization, our system can thus

offer a new entity to the crowd roughly every second, which in our opinion is sufficient for

most applications (e.g., enriching newspaper articles or internal company documents). Once

on the crowdsourcing platform, the tasks have a much higher latency (several minutes to a few

hours), latency which is however mitigated by the fact that entity linking is an embarrassingly

parallel operation on crowdsourcing platforms (i.e., large collections of workers can work in

parallel at any given point in time).


Figure 4.16 – Distribution of the workers’ Precision for the Entity Linking task as compared to the number of tasks performed by the worker (top) and task Precision with top-k workers (bottom).

4.7.3 Discussion

Looking at the experimental results about the EL task presented above, we observe that the

crowdsourcing step improves the overall EL effectiveness of the system.

Standard crowdsourcing techniques (i.e., using majority vote aggregation) yield a relative

improvement of 6% in Precision (from 0.70 to 0.74). ZenCrowd, by leveraging the probabilistic

framework for making decisions, performs better, leading to a relative performance improve-

ment ranging between 4% and 35% over the majority vote approach, and on average of 14%

over our best automatic linking approach (from 0.70 to 0.80). In both cases, the improvement

is statistically significant (t-test p < 0.05).

Analyzing worker activities on the crowdsourcing platform (see Figure 4.17), we observe that

the number of tasks performed by a given worker is Zipf-distributed (i.e., few workers perform

many tasks, while many workers perform a few tasks only).

Augmenting the number of workers performing a given task is not always beneficial: Figure 4.16, bottom, shows how the average Precision of ZenCrowd varies when (virtually) employing the

available top-k workers for a given task. As can be seen from the graph, the quality of the

results gets worse after a certain value of k, as more and more mediocre workers are picked

out. As a general rule, we observe that limiting the number of workers to 4 or 5 good workers

for a given task gives the best results.

The intuition behind using the Probabilistic Network is that a worker who proves that he is

good, i.e., has a high prior probability, should be trusted for future jobs. Furthermore, his/her

answer should always prevail and help identify other good workers. Also, the Probabilistic

Network takes advantage of constraints to help the decision process.

While the datasets used for the IM and EL evaluations are different, we can make some

observations on the average effectiveness reached for each task. On average, the effectiveness

of the workers on the IM task is higher than that on the EL task. However, we observe that


ZenCrowd is able to exploit the work performed by the most effective workers (e.g., the top US worker in Figure 4.16, top, or the highly productive workers in Figure 4.10).

Figure 4.17 – Number of HITs completed by each worker for both IM and EL, ordered by most productive workers first.

4.8 Related Work on Entity Linking and Instance Matching

4.8.1 Instance Matching

The first task that we address is that of matching instances of multiple types between two

datasets. Thanks to the LOD movement, many datasets describing instances have been

created and published on the Web.

A lot of attention has been put on the task of automatic instance matching, which is defined as

the identification of the same real world object described in two different datasets. Classical

matching approaches are based on string similarities (“Barack Obama” vs. “B. Obama”) such

as the edit distance [106], the Jaro similarity [81], or the Jaro-Winkler similarity [169]. More

advanced techniques, such as instance Group Linkage [126], compare groups of records to find

matches. A third class of approaches uses semantic information. Reference Reconciliation

[57], for example, builds a dependency graph and exploits relations to propagate information

among entities. Recently, approaches exploiting Wikipedia as background corpus have been

proposed as well [36, 43]. In [72], the authors propose entity disambiguation techniques using

relations between entities in Wikipedia and concepts. The technique uses, for example, the links between “Michael Jordan” and “University of California, Berkeley” or “basketball” on Wikipedia.

The number of candidate matching pairs between two datasets grows rapidly (i.e., quadrat-

ically) with the size of the data, making the matching task rapidly intractable in practice.

Methods based on blocking [167, 127] have been proposed to tackle scalability issues. The

idea is to adopt a computationally inexpensive method to first group together candidate

matching pairs and, as a second step, to adopt a more accurate and expensive measure to

compare all possible pairs within the candidate set.


Crowdsourcing techniques have already been leveraged for instance matching. In [164], the

authors propose a hybrid human-machine approach that exploits both the scalability of

automatic methods as well as the accuracy of manual matching. The focus of their work is

on how to best present the matching task to the crowd. Instead, our work focuses on how to

combine automated and manual matching by means of a three-stage blocking technique and

a Probabilistic Network able to identify and weight-out low quality answers.

In idMesh[45], the authors built disambiguation graphs based on the transitive closures of

equivalence links for networks containing uncertain information. Our present work focuses

on hybrid matching techniques for LOD datasets, combining both automated processes and

human computation in order to obtain a system that is both scalable and highly accurate.

4.8.2 Entity Linking

The other task performed by ZenCrowd is Entity Linking, that is, identifying instances from

textual content and linking them to their description in a database. Entities, that is, real world

objects described by a given schema/ontology, are increasingly becoming a first-class citizen

on the Web. A large amount of online search queries are about entities [133], and search

engines exploit entities and structured data to build their result pages [70]. In the field of

Information Retrieval (IR), a lot of attention has been given to entities: at TREC²², the main IR evaluation initiative, the tasks of Expert Finding, Related Entity Finding, and Entity List Completion have been studied [17, 19].

The problem of assigning identifiers to instances mentioned in textual content (i.e., entity

linking) has been widely studied by the database and the Semantic Web research communities.

A related effort has for example been carried out in the context of the OKKAM project23, which

suggested the idea of an Entity Name System (ENS) to assign identifiers to entities on the Web

[30].

The first step in entity linking consists in extracting entities from textual content. Several

approaches developed within the NLP field provide high-quality entity extraction for persons,

locations, and organizations [21, 40]. State of the art techniques are implemented in tools like

Gate [46], the Stanford parser [92] (which we use in our experiments), and Extractiv24. Once

entities are extracted, they still need to be disambiguated and matched to semantically similar

but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and

“President of the USA”).

The final step in entity linking is that of deciding which links to retain in order to enrich the

entity. Systems performing such a task are available as well (e.g., Open Calais25, DBPedia

Spotlight [118]). Relevant approaches aim for instance at enriching documents by automat-

22 http://trec.nist.gov
23 http://www.okkam.org
24 http://extractiv.com/
25 http://www.opencalais.com/


ically creating links to Wikipedia pages [120, 143], which can be seen as entity identifiers.

While previous work selects Uniform Resource Identifiers (URIs) from a specific corpus (e.g.,

DBPedia, Wikipedia), our approach is to assign entity identifiers from the larger LOD cloud26

instead.

4.9 Conclusions

We have presented a data integration system, ZenCrowd, based on a probabilistic framework

leveraging both automatic techniques and punctual human intelligence feedback captured

on a crowdsourcing platform. ZenCrowd adopts a novel three-stage blocking process that can

deal with very large datasets while at the same time minimizing the cost of crowdsourcing by

carefully selecting the right candidate matches to crowdsource.

As our approach incorporates a human intelligence component, it typically cannot perform

instance matching and entity linking tasks in real-time. However, we believe that it can still be

used in most practical settings, thanks to the embarrassingly parallel nature of data integration

in crowdsourcing environments. ZenCrowd provides a reliable approach to entity linking

and instance matching, which exploits the trade-off between large-scale automatic instance

matching and high-quality human annotation, and which according to our results improves

the precision of the results by up to 14% over our best automatic matching approach for the instance matching task. For the Entity Linking task, ZenCrowd improves

the precision of the results by 4% to 35% over a state of the art and manually optimized

crowdsourcing approach, and on average by 14% over our best automatic approach.

Finally, we can generalize the Data Integration use-case of ZenCrowd to any task with Multiple

Choice Questions stemming from a hybrid human-machine algorithm. The probabilistic framework that we have built can deal with noisy worker answers by assigning weights (or priors) to them based on test questions. If a worker did not pass a test question, he will be assigned a score based on his peers (the ones answering the same task). Other priors and constraints, coming out of the algorithm's pre-processing step, can be added to the inference

framework if available.

The crowdsourcing model used in ZenCrowd (through AMT) provides no guarantees on which

crowd workers will perform a given task. As such, our algorithm can only be executed as a

post processing step to perform result aggregation. In the next chapter, we will investigate a

technique that will allow us to select which crowd workers to ask for a given task.

26 http://linkeddata.org/


5 Human Intelligence Task Routing

5.1 Introduction

Human Intelligence Tasks are simple tasks that anyone (with the required cognitive abilities)

can perform. In some cases, it can be beneficial for the worker to be acquainted with the task

(or even trained) in order to provide more accurate, and possibly faster answers. Take, for

example, the case of marine life image labeling. It is natural that an expert, or even an amateur,

in marine species, can perform the task much more easily and quickly than a crowd worker with access to an encyclopedia to double-check his answers. Similarly, in the previous chapter, we

have asked crowd workers to perform IM/EL tasks on news articles. We suspect that many

workers would be more effective if they had some background knowledge about the articles

(e.g., news articles from their respective countries, or area of interest).

Current approaches to crowdsourcing adopt a pull methodology where tasks are published

on specialized Web platforms where workers can pick their preferred tasks on a first-come-

first-served basis. While this approach has many advantages, such as simplicity and short

completion times, it does not guarantee that the task is performed by the most suitable worker.

In this chapter, we propose and evaluate Pick-A-Crowd, a software architecture to crowdsource

micro-tasks based on pushing tasks to specific workers. Our system constructs user profiles

for each worker in the crowd in order to assign HITs to the most suitable available worker.

We build such worker profiles based on information available on social networks using, for

instance, information about the worker personal interests. The underlying assumption is that

if a potential worker is interested in several specific categories (e.g., movies), he/she will be

more competent at tackling HITs related to that category (e.g., movie genre classification). In

our system, workers and HITs are matched based on an underlying taxonomy that is defined

on categories extracted both from the tasks at hand and from the workers’ interests. Entities

appearing in the users’ social profiles are linked to the Linked Open Data (LOD) cloud1, where

they are then matched to related tasks that are available on the crowdsourcing platform. We

1 http://linkeddata.org/


experimentally evaluate our push methodology and compare it against traditional crowd-

sourcing approaches using tasks of varying types and complexity. Results show that the quality

of the answers is significantly higher when using a push methodology.

In summary, the contributions described in this chapter are:

• a Crowdsourcing framework that focuses on pushing HITs to the crowd.

• a software architecture that implements the newly proposed push crowdsourcing method-

ology.

• category-based, text-based, and graph-based approaches to assign HITs to workers

based on links in the LOD cloud.

• an empirical evaluation of our method in a real deployment over different crowds

showing that our Pick-A-Crowd system is on average 29% more effective than traditional

pull crowdsourcing platforms over a variety of HITs.

The rest of this chapter is structured as follows: Section 5.2 gives an overview of the architecture

of our system, including its HIT publishing interface, its crowd profiling engine, and its HIT

assignment and reward estimation components. We introduce our formal model to match

human workers to HITs using category-based, text-based, and graph-based approaches in

Section 5.3. We describe our evaluation methodology and discuss results we obtained from a

real deployment of our system in Section 5.4. We review the related work on Recommender Systems and Expert Finding in Section 5.5, before concluding in Section 5.6.

5.2 System Architecture

In this section, we describe the Pick-A-Crowd framework and provide details on each of its

components.

5.2.1 System Overview

Figure 5.1 gives a simplified overview of our system. Pick-a-Crowd receives as input tasks

that need to be completed by the crowd. The tasks are composed of a textual description,

which can be used to automatically select the right crowd for the task, actual data on which to

run the task (e.g., a Web form and set of images with candidate labels), as well as a monetary

budget to be spent to get the task completed. The system then creates the HITs, and predicts

the difficulty of each micro-task based on the crowd profiles and on the task description.

The monetary budget is split among the generated micro-tasks according to their expected

difficulty (i.e., a more difficult task will be given a higher reward). The HITs are then assigned

to selected workers from the crowd and published on the social network application. Finally,

answers are processed as a stream from the crowd, aggregated and sent back to the requester.

We detail the functionalities provided by each component of the system in the following.


Figure 5.1 – Pick-A-Crowd Component Architecture. Task descriptions, Input Data, and a Monetary Budget are taken as input by the system, which creates HITs, estimates their difficulty and suggests a fair reward based on the skills of the crowd. HITs are then pushed to selected workers and results get collected, aggregated, and finally returned back to the requester.

5.2.2 HIT Generation, Difficulty Assessment, and Reward Estimation

The first pipeline in the system is responsible for generating the HITs given some input data

provided by the requester. HITs can for instance be generated from i) a Web template to classify

images in pre-defined categories, together with ii) a set of images and iii) a list of pre-defined

categories. The HIT Generator component dynamically creates as many tasks as required (e.g.,

one task per image to categorize) by combining those three pieces of information.

Next, the HIT Difficulty Assessor takes each HIT and determines a complexity score for it.

This score is computed based on both the specific HIT (i.e., description, keywords, candidate

answers, etc.) and on the worker profiles (see Section 5.3 for more detail on how such profiles

are constructed). Different algorithms can be implemented to assess the difficulty of the tasks

in our framework. For example, a text-based approach can compare the textual description of

the task with the skill description of each worker and compute a score based on how many

workers in the crowd could perform well on such HITs.

An alternative, more advanced prediction method can exploit entities involved in the task

and known by the crowd. Entities are extracted from the textual descriptions of the tasks and

disambiguated to LOD entities. The same can be performed on the worker profiles: each

Facebook page that is liked by the workers can be linked to its respective LOD entities. Then

the set of entities representing the HITs and the set of entities representing the interests of the

crowd can be directly compared. The task is classified as difficult when the entities involved in

the task heavily differ from the entities liked by the crowd.


A third example of a task difficulty prediction method is based on Machine Learning. A classifier

assessing the task difficulty is trained by means of previously completed tasks, their description

and their result accuracy. Then, the description of a new task is given as a test vector to the

classifier, which returns the predicted difficulty for the new task.
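As an illustration of this third option, the following is a minimal sketch using scikit-learn; the pipeline, the toy task descriptions, and the difficulty labels are assumptions made for the example and do not correspond to the classifier actually trained in Pick-A-Crowd.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: task descriptions labelled with a difficulty class derived
# from the accuracy observed on previously completed tasks (hypothetical labels).
descriptions = [
    "identify the actor shown in the picture",
    "classify the butterfly species in the image",
    "answer this question about international cricket",
    "select the music genre of the band",
]
labels = ["easy", "medium", "hard", "easy"]

difficulty_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
difficulty_clf.fit(descriptions, labels)

# The description of a new task is given as a test vector to the classifier,
# which returns the predicted difficulty class.
print(difficulty_clf.predict(["name the bowler shown in this cricket scene"]))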

Finally, the Reward Estimation component takes as input a monetary budget B and the results

of the HIT assessment to determine a reward value for each HIT hi .

A simple way to redistribute the available monetary budget is to reward the same amount of money for each task of the same type. A second example of a reward estimation function is:

reward(h_i) = \frac{B \cdot d(h_i)}{\sum_j d(h_j)} \qquad (5.1)

which takes into account the difficulty d() of the HIT h_i as compared to the others and assigns a higher reward to more difficult tasks.

A third approach computes a reward based on both the specific HIT as well as the worker

who performs it. In order to do this, we can exploit the HIT assignment models adopted by

our system. These models generate a ranking of workers by means of computing a function

match(w_j, h_i) for each worker w_j and HIT h_i (see Section 5.3). Given such a function, we can assign a higher reward to better suited workers by

reward(h_i, w_j) = \frac{B \cdot match(w_j, h_i)}{\sum_{k,l} match(w_k, h_l)} \qquad (5.2)

More advanced reward schemes can be applied as well. For example, in [83], authors propose

game theoretic based approaches to compute the optimal reward for paid crowdsourcing

incentives in the presence of workers who collude in order to game the system.

Exploring and evaluating different difficulty prediction and reward estimation approaches is

not our focus and is left as future work.
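To make the budget-splitting step concrete, here is a minimal Python sketch of Equations 5.1 and 5.2; the function names and the use of plain dictionaries for difficulty and match scores are illustrative assumptions rather than the actual Pick-A-Crowd implementation.

def reward_by_difficulty(budget, difficulty):
    """Equation 5.1: split the budget across HITs proportionally to their
    estimated difficulty. `difficulty` maps a HIT id to d(h_i)."""
    total = sum(difficulty.values())
    return {hit: budget * d / total for hit, d in difficulty.items()}

def reward_by_match(budget, match_scores):
    """Equation 5.2: split the budget across (worker, HIT) pairs proportionally
    to the match score. `match_scores` maps (w_j, h_i) to match(w_j, h_i)."""
    total = sum(match_scores.values())
    return {pair: budget * m / total for pair, m in match_scores.items()}

# Example: three HITs of increasing estimated difficulty and a $3.00 budget.
print(reward_by_difficulty(3.0, {"h1": 1.0, "h2": 2.0, "h3": 3.0}))
# {'h1': 0.5, 'h2': 1.0, 'h3': 1.5}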

5.2.3 Crowd Profiler

The task of the Crowd Profiler component is to collect information about each available

worker in the crowd. Pick-A-Crowd uses contents available on the social network platform as

well as previously completed HITs to construct the workers’ profiles. Those profiles contain

information about the skills and interests of the workers and are used to match HITs with

available workers in the crowd.

In detail, this module generates a set of worker profiles C = {w_1, ..., w_n} where w_i = {P, T}, P is the set of worker interests (e.g., when applied on top of the Facebook platform, p_i ∈ P are the Facebook pages the worker likes) and T_i = {t_1, ..., t_n} is the set of tasks previously completed by


w_i. Each Facebook page p_i belongs to a category in the Facebook Open Graph (https://developers.facebook.com/docs/concepts/opengraph/).
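For illustration, a worker profile as kept by the Crowd Profiler could be represented along the following lines; the field names and the use of plain Python sets are assumptions of this sketch.

from dataclasses import dataclass, field

@dataclass
class WorkerProfile:
    """Sketch of w_i = {P, T}: P is the set of liked Facebook pages (interests)
    and T the set of previously completed tasks."""
    worker_id: str
    liked_pages: set = field(default_factory=set)       # P, e.g. {"MTV", "FC Barcelona"}
    completed_tasks: set = field(default_factory=set)   # T, ids of finished HITs

crowd = {
    "w1": WorkerProfile("w1", {"MTV", "Music & top artists"}, {"t1"}),
    "w2": WorkerProfile("w2", {"FC Barcelona"}, set()),
}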

5.2.4 Worker Profile Linker

This component is responsible for linking each Facebook page liked by some worker to the

corresponding entity in the LOD cloud. Given the page name and, possibly, a textual descrip-

tion of the page, the task is defined as identifying the correct URI among all the ones present

in the LOD graph using, for example, a similarity measure based on adjacent nodes in the

graph. This is a well studied problem where both automatic [71] or crowdsourcing-based

techniques [50] can be used.

5.2.5 Worker Profile Selector

HITs and workers are matched based on the profiles described above. Intuitively, a worker

who only likes many music bands will not be assigned a task that asks him/her to identify who

is the movie actor depicted in the displayed picture. The similarity measure used for matching

workers to tasks takes into account the entities included in the workers’ profiles but is also

based on the Facebook categories their liked pages belong to. For example, it is possible to

use the corresponding DBPedia entities and their YAGO type. The YAGO knowledge-base

provides a fine-grained high-accuracy entity type categorization which has been constructed

by combining Wikipedia category assignments with WordNet synset information. The YAGO

type hierarchy can help the system better understand which type of entity correlates with the

skills required to effectively complete a HIT (see also Section 5.3 for a formal definition of such

methods). For instance, our graph-based approach concludes that for our music related task,

the top Facebook pages that indicate expertise on the topic are ‘MTV’ and ‘Music & top artists’.

A generic similarity measure to match workers and tasks is defined as

sim(w_j = \{P, T\}, h_i = \{t, d, A, Cat\}) = \frac{\sum_{k,l} sim(p_k, a_l)}{|P| \cdot |A|}, \quad \forall p_k \in P,\ a_l \in A \qquad (5.3)

where A is the set of candidate answers for task h_i and sim() measures the similarity between the worker profile and the task description.
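As an illustration, the following sketch implements the matching score of Equation 5.3; the pairwise sim() between a liked page and a candidate answer is deliberately left abstract in the model, and the simple token-overlap (Jaccard) similarity used below is an assumption of this sketch, not the measure used by Pick-A-Crowd.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity used as a stand-in for sim(p_k, a_l)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_score(liked_pages, candidate_answers) -> float:
    """Equation 5.3: average pairwise similarity between the worker's liked
    pages P and the HIT's candidate answers A."""
    if not liked_pages or not candidate_answers:
        return 0.0
    total = sum(jaccard(p, a) for p in liked_pages for a in candidate_answers)
    return total / (len(liked_pages) * len(candidate_answers))

print(match_score({"Tom Hanks", "Julia Roberts"}, ["Meg Ryan", "Tom Hanks"]))  # 0.25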

5.2.6 HIT Assigner and Facebook App

The HIT Assigner component takes as input the final HITs with the defined reward and

publishes them onto the Facebook App. We developed a dedicated, native Facebook App called

SocialBrain{r} (http://apps.facebook.com/socialbrainr/) to implement this final component of the Pick-A-Crowd platform. Figure 5.2

shows a few screenshots of SocialBrain{r}. As any other application on the Facebook platform,

it has access to several pieces of information about the users that accept to use it. We follow a


Figure 5.2 – Screenshots of the SocialBrain{r} Facebook App. Above, the dashboard displaying HITs available to a specific worker. Below, a HIT about actor identification assigned to a worker who likes several actors.

non-intrusive approach: in our case, the liked pages for each user are stored in an external database that is used to create a worker profile containing his/her interests. The application we developed also adopts crowdsourcing incentive schemes different from the pure financial one. For example, we use the fan incentive, where a competition involving several workers competing on trivia questions on their favorite topic can be organized. The app also allows workers to directly challenge other social network contacts by sharing the task, which is also helpful

to enlarge the application user base. While from the worker point of view this represents a

friendly challenge, from a platform point of view this means that the HIT will be pushed to

another expert worker, following the assumption that a worker would challenge someone who

is also knowledgeable about the topic addressed by the task.


5.2.7 HIT Result Collector and Aggregator

The final pipeline is composed of stream processing modules, where the Facebook App an-

swers are being streamed from the crowd to the answer creation pipeline. The first component

collects the answers from the crowd and is responsible for a first quality check based on

potentially available gold answers for a small set of training questions. Then, answers that are

considered to be valid (based on available ground-truth data) are forwarded to the HIT Result

Aggregator component, which collects and aggregates them in the final answer for the HIT.

When a given number of answers has been collected (e.g., five answers), then the component

outputs the partial aggregated answer (e.g., based on majority vote) back to the requester. As

more answers reach the aggregation component, the aggregated answer presented to the re-

quester gets updated. Additionally, as answers are collected, the workers’ profiles get updated

and the reward gets granted to the workers who performed the task through the Facebook

App.

5.3 HIT Assignment Models

In this section, we define the HIT assignment tasks and describe several approaches for

assigning workers to such tasks. We focus on HIT assignment rather than on other system

components as the ability to assign tasks automatically is the most original feature of our

system as compared to other crowdsourcing platforms.

Given a HIT h_i = {t_i, d_i, A_i, Cat_i} from the requester, the task of assigning it to some workers is defined as ranking all available workers C = {w_1, ..., w_n} on the platform and selecting the top-n ranked workers. A HIT consists of a textual description t_i (e.g., the task instruction provided to the workers; when applied to hybrid human-machine systems, t_i can be defined as the data context of the HIT, for example the name of the column or table a crowdsourced database HIT is about), a data field d_i that is used to provide the context for the task to the worker (e.g., the container for an image to be labelled), and, optionally, the set of candidate answers A_i = {a_1, ..., a_n} for multiple-choice tasks (e.g., a list of music genres used to categorize a singer) and a list of target Facebook categories Cat_i = {c_1, ..., c_n}.

A worker profile w_j = {P, T} is assigned a score based on which it is ranked for the task h_i. This score is determined based on the likelihood of matching w_j to h_i. Thus, the goal is to define a scoring function match(w_j, h_i) based on the worker profile, the task description and, possibly, external resources such as the LOD datasets or a taxonomy.

5.3.1 Category-based Assignment Model

The first approach we define to assign HITs to workers is based on the same idea that Facebook

uses to target advertisements to its users. A requester has to select the target community of

users who should perform the task by means of selecting one or more Facebook pages or

page categories (in the same way as someone who wants to place an ad). Such categories are


Figure 5.3 – An example of the Expert Finding Voting Model.

defined in a 2-level structure with 6 top-level categories (e.g., “Entertainment”, “Company”), each of

them having several sub-categories (e.g., “Movie”, “Book”, “Song”, etc. are sub-categories of

“Entertainment”).

Once some second-level categories are selected by the requester, the platform can generate a

ranking of users based on the pages they like. More formally, given a set of target categories Cat = {c_1, ..., c_n} from the requester, we define P(c_i) = {p_1, ..., p_n} as the set of pages belonging to category c_i. Then, for each worker w_j ∈ C we take the set of pages he/she likes, P(w_j), and measure its intersection with the pages belonging to any category selected by the requester, RelP = ∪_i P(c_i). Thus, we can assign a score to the worker based on the overlap between the likes and the target categories, |P(w_j) ∩ RelP|, and rank all w_j ∈ C based on such scores.
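A minimal sketch of this category-overlap scoring follows; representing liked pages and category memberships as plain Python sets is an assumption made for the example.

def category_based_ranking(workers, pages_by_category, target_categories):
    """Rank workers by |P(w_j) ∩ RelP|: the overlap between the pages they like
    and the pages belonging to the requester-selected categories."""
    # RelP: union of all pages belonging to any selected category.
    rel_pages = set().union(*(pages_by_category.get(c, set()) for c in target_categories))
    scores = {w: len(liked & rel_pages) for w, liked in workers.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

workers = {"w1": {"MTV", "FC Barcelona"}, "w2": {"Monarch butterflies"}}
pages_by_category = {"Musician/band": {"MTV"}, "Sport": {"FC Barcelona"}}
print(category_based_ranking(workers, pages_by_category, ["Musician/band", "Sport"]))
# [('w1', 2), ('w2', 0)]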

5.3.2 Expert Profiling Assignment Model

A second approach we propose to rank workers given a HIT hi is to follow an expert finding

approach. Specifically, we define a scoring function based on the Voting Model for expert

finding [110]. For the HIT we want to assign, we take the set of its candidate answers Ai , when

available. Then, we define a disjunctive keyword query based on all the terms composing

the answers q =∧i ai . In case Ai is not available, for example because the task is asking an

open-ended question, then q can be extracted out of ti by mining entities mentioned in its

content. The query q is then used to rank Facebook pages using an inverted index built over

the collection of documents ∪_i P_i, ∀ w_j ∈ C. We consider each ranked page as a vote for the workers who like them on Facebook and rank workers accordingly. That is, if RetrP is the set of pages retrieved with q, we can define a worker ranking function as |P(w_j) ∩ RetrP|. More

interestingly, we can take into account the ranking generated by q and give a higher score to

workers liking pages that were ranked higher. An example of how to rank workers following

the voting model is depicted in Figure 5.3.
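The following sketch shows one way to turn retrieved pages into votes for workers; it assumes that some external retrieval component has already returned the pages ranked by relevance to q, and the 1/rank discount is an illustrative weighting choice rather than the exact one used in the thesis.

from collections import defaultdict

def voting_model_ranking(retrieved_pages, workers):
    """Each retrieved page counts as a vote for every worker who likes it;
    votes are discounted by the page's (1-based) retrieval rank."""
    scores = defaultdict(float)
    for rank, page in enumerate(retrieved_pages, start=1):
        for worker, liked in workers.items():
            if page in liked:
                scores[worker] += 1.0 / rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

retrieved = ["MTV", "Music & top artists", "FC Barcelona"]  # pages ranked for query q
workers = {"w1": {"MTV", "Music & top artists"}, "w2": {"FC Barcelona"}}
print(voting_model_ranking(retrieved, workers))  # w1 collects more, higher-ranked votes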


5.3.3 Semantic-Based Assignment Model

The third approach we propose is based on third-party information. Specifically, we first link

candidate answers and pages to an external knowledge base (e.g., DBPedia) and exploit its

structure to better assign HITs to workers. For a given HIT hi , the first step is to identify the

entity corresponding to each a j ∈ Ai (if Ai is not available, entities in ti can be used instead).

This task is related to entity linking [50] and ad-hoc object retrieval [133, 151] where the goal is

to find the correct URI for a description of the entity using keywords. In this work, we take

advantage of state-of-the-art techniques for this task but do not focus on improving over

such techniques. Then, we identify the entity that represents each page liked by the crowd

whenever it exists in the knowledge base. Once both answers and pages are linked to their

corresponding entity in the knowledge base, we exploit the underlying graph structure to

determine the extent to which entities that describe the HIT and entities that describe the

interests of the worker are similar. Specifically, we define two scoring methods based on the

graph.

The first scoring method takes into account the vicinity of the entities in the entity graph.

We measure how many worker entities are directly connected to HIT entities using SPARQL

queries over the knowledge base as follows:

SELECT ?x
WHERE { <uri(a_i)> ?x <uri(p_i)> }

This follows the assumption that a worker who likes a page is able to answer questions about

related entities. For example, if a worker likes the page ‘FC Barcelona’, then he/she might be a

good candidate worker to answer a question about ‘Lionel Messi’ who is a player of the soccer

team liked by the worker.

Our second scoring function is based on the type of entities. We measure how many worker

entities have the same type as the HIT entity using SPARQL queries over the knowledge base

as follows:

SELECT ?x
WHERE { <uri(a_i)> <rdf:type> ?x .
        <uri(p_i)> <rdf:type> ?x }

The underlying assumption in that case is that a worker who likes a page is able to answer

questions about entities of the same type. For example, if a worker likes the pages ‘Tom Hanks’

and ‘Julia Roberts’, then he/she might be a good candidate worker to answer a question about

‘Meg Ryan’ as it is another entity of the same type (i.e., actor).
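As an illustration of the type-based scoring method, the following sketch queries the public DBpedia SPARQL endpoint with the SPARQLWrapper Python library; the endpoint URL and the example URIs are assumptions, and the ASK query is a simplified boolean variant of the SELECT queries shown above.

from SPARQLWrapper import SPARQLWrapper, JSON

DBPEDIA = SPARQLWrapper("https://dbpedia.org/sparql")
DBPEDIA.setReturnFormat(JSON)

def share_a_type(answer_uri: str, page_uri: str) -> bool:
    """True if the HIT answer entity and the liked-page entity share at least
    one rdf:type in the knowledge base."""
    DBPEDIA.setQuery(f"""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        ASK {{ <{answer_uri}> rdf:type ?t .
               <{page_uri}>   rdf:type ?t . }}
    """)
    return DBPEDIA.query().convert()["boolean"]

# A worker who likes 'Tom Hanks' asked about 'Meg Ryan' (both typed as actors).
print(share_a_type("http://dbpedia.org/resource/Meg_Ryan",
                   "http://dbpedia.org/resource/Tom_Hanks"))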


5.4 Experimental Evaluation

Given that the main innovation of Pick-A-Crowd as compared to classic crowdsourcing plat-

forms such as AMT is the ability to push HITs to workers instead of letting the workers select

the HITs they wish to work on, we focus in the following on the evaluation of different HIT assignment techniques and compare them in terms of work quality against a

classic crowdsourcing platform.

5.4.1 Experimental Setting

The Facebook app SocialBrain{r} we have implemented within the Pick-A-Crowd framework

currently counts more than 170 workers who perform HITs requiring to label images contain-

ing popular or less popular entities and to answer open-ended questions. Overall, more than

12K distinct Facebook pages liked by the workers have been crawled over the Facebook Open

Graph. SocialBrain{r} is implemented using cloud-based storage and processing back-end to

ensure scalability with an increasing number of workers and requesters. SocialBrain{r} workers

have been recruited via AMT, thus making a direct experimental comparison to standard AMT

techniques more meaningful.

The type of task categories we evaluate our approaches on are: actors, soccer players, anime

characters, movie actors, movie scenes, music bands, and questions related to cricket. Our

experiments cover both multiple answer questions as well as open-ended questions: Each

task category includes 50 images for which the worker either has to select the right answer

among 5 candidate answers or to answer 20 open-ended questions related to the topic. Each

type of question can be skipped by the worker in case he/she has no idea about that particular

topic.

In order to analyze the performance of workers in the crowd, we measure Precision, Recall (as

the worker is allowed to skip questions when he/she does not know the answer), and Accuracy

of their answers for each HIT obtained via majority vote over 3 and 5 workers (the set of HITs and correct answers used in our experiments is available for comparative studies online at http://exascale.info/PickACrowd).

5.4.2 Motivation Examples

As we can see from Figure 5.4, the HITs asking questions about cricket clearly show how

workers can perform differently in terms of accuracy. There are 13 workers out of 35 who

were not able to provide any correct answer while the others spread over the Precision/Recall

spectrum, with the best worker performing at 0.9 Precision and 0.9 Recall. This example

motivates the need to selectively assign the HIT to the most appropriate worker rather than following a first-come-first-served approach as adopted, for example, by AMT. Thus, the goal

of Pick-A-Crowd is to adopt HIT assignment models that are able to identify the workers in


Figure 5.4 – Crowd performance on the cricket task. Square points indicate the 5 workers selected by our graph-based model that exploits entity type information.

Figure 5.5 – Crowd performance on the movie scene recognition task as compared to movie popularity.

the top-right area of Figure 5.4, based solely on their social network profile. As an anecdotal

observation, a worker from AMT left the following feedback in the comment field of the cricket task: “I had no idea what to answer to most questions...”, which clearly demonstrates that for tasks requiring background knowledge, not all workers

are a good fit.

An interesting observation is the impact of the popularity of a question. Figure 5.5 shows the

correlation between task accuracy on the movie scene recognition task and the popularity of

the movie based on the overall number of Facebook likes on the IMDB movie page. We can

observe that when a movie is popular, then workers easily recognize it. On the other hand,

when a movie is not so popular it becomes more difficult to find knowledgeable workers for

the task.


Figure 5.6 – SocialBrain{r} Crowd age distribution.

Figure 5.7 – SocialBrain{r} Notification click rate.

5.4.3 SocialBrain{r} Crowd Analysis

Figure 5.6 shows some statistics about the user base of SocialBrain{r}. The majority of workers

are in the age interval 25-34 and are from the United States.

Another interesting observation can be made about the Facebook Notification click rate.

Once the Pick-A-Crowd system selects a worker for a HIT, the Facebook app SocialBrain{r}

sends a notification to the worker with information about the newly available task and its

reward. Figure 5.7 shows a snapshot of the notifications clicked by workers as compared to the

notification sent by SocialBrain{r} over a few days. We observe an average rate of 57% clicks

per notification sent.

A third analysis looks at how the relevant likes of a worker correlate with his/her accuracy on

the task. Figure 5.8 shows a distribution of worker accuracy over the relevant pages liked using

the category-based HIT assignment model to define the relevance of pages. At first glance, we do not see a perfect correlation between the number of likes and the worker accuracy for any

task. On the other hand, we observe that when many relevant pages are in the worker profile

(e.g., >30), then accuracy tends to be high (i.e., the bottom-right part of the plot is empty).

However, when only a few relevant pages belong to the worker profile, then it becomes difficult

to predict his/her accuracy. Note that not-liking relevant pages is not an indication of being

unsuitable for a task: Having an incomplete profile just does not allow the system to model the worker

and to assign him/her the right tasks (i.e., the top-left part of the plot contains high-accuracy

workers with incomplete profiles). Having worker profiles containing several relevant pages is

not problematic when the crowd is large enough (as it is on Facebook).


Figure 5.8 – SocialBrain{r} Crowd Accuracy as compared to the number of relevant Pages a worker likes.

Task            AMT 3   AMT 5   AMT Masters 3
Soccer          0.8     0.8     0.1
Actors          0.82    0.82    0.9
Music           0.76    0.7     0.7
Book Authors    0.7     0.5     0.58
Movies          0.6     0.64    0.66
Anime           0.94    0.86    0.1
Cricket         0.004   0       0.72

Table 5.1 – A comparison of the task accuracy for the AMT HIT assignment model assigning each HIT to the first 3 and 5 workers and to AMT Masters.

5.4.4 Evaluation of HIT Assignment Models

In the literature, common crowdsourcing tasks usually adopt 3 or 5 assignments of the same

HIT in order to aggregate the answers from the crowd, for example by majority vote. In the

following, we compare different assignment models evaluating both the cases where 3 and

5 assignments are considered for a given HIT. As a baseline, we compare against the AMT

model that assigns the HIT to the first n workers performing the task. We also compare against

AMT Masters, who are workers awarded a special status by Amazon based on their past performance (note that, in order to recruit enough Masters for our tasks, we had to reward $1.00 per task as compared to the $0.25 granted to standard workers). Our proposed models first rank workers in the crowd based on their estimated

accuracy and then assign the task to the top-3 or top-5 workers.

Table 5.1 presents an overview of the performance of the assignment model used by AMT. We

observe that while on average there is not a significant difference between using 3 or 5 workers,

Masters perform better than the rest of the AMT crowd on some tasks but do not outperform

the crowd on average (0.54 versus 0.66 Accuracy). A per-task analysis shows that some tasks

are easier than others: While tasks about identifying pictures of popular actors obtain high

accuracy for all three experiments, topic-specific tasks such as cricket questions may lead to a

very low accuracy.


Task            Requester Selected Categories                        Category-based 3   Category-based 5
Soccer          Sport, Athlete, Public figure                        0.94               0.98
Actors          Tv show, Comedian, Movie, Artist, Actor/director     0.94               0.96
Music           Musician/band, Music                                 0.96               0.96
Book Authors    Author, Writer, Book                                 0.98               0.94
Movies          Movie, Movie general, Movies/music                   0.44               0.74
Anime           Games/toys, Entertainment                            0.62               0.7
Cricket         Sport, Athlete, Public figure                        0.63               0.54

Table 5.2 – A comparison of the effectiveness for the category-based HIT assignment models assigning each HIT to 3 and 5 workers with manually selected categories.

Task            Voting Model q=t_i 3   Voting Model q=t_i 5   Voting Model q=A_i 3   Voting Model q=A_i 5
Soccer          0.92                   0.92                   0.86                   0.86
Actors          0.92                   0.94                   0.92                   0.88
Music           0.96                   0.96                   0.76                   0.78
Book Authors    0.94                   0.96                   0.3                    0.84
Movies          0.70                   0.60                   0.70                   0.42
Anime           0.54                   0.84                   0.56                   0.54
Cricket         0.63                   0.72                   0.72                   0.72

Table 5.3 – Effectiveness for different HIT assignments based on the Voting Model assigning each HIT to 3 and 5 workers and querying the Facebook Page index with the task description q = t_i and with candidate answers q = A_i, respectively.

Table 5.2 gives the results we obtained by assigning tasks based on the Facebook Open Graph

categories manually selected by the requester. We observe that the Soccer and Cricket tasks

have been assigned to the same Facebook category which does not distinguish between

different types of sports. Nevertheless, we can see that for the cricket task the category-based method does not perform well, as the pages contained in the categories cover many different

sports and, according to our crowd at least, soccer-related tasks are simpler than cricket-

related tasks.

Table 5.3 presents the results when assigning HITs following the Voting Model for expert finding. We observe that in the majority of cases, assigning each task to 5 different workers selected using the Facebook Page index with the task description as query leads to the best results.

Table 5.4 shows the results of our graph-based approaches. We observe that in the majority

of these cases, the graph-based approach that follows the entity type (“En. type”) edges and

selects workers who like Pages of the same type as the entities involved in the HIT outperforms

the approach that considers the directly-related entities within one step in the graph (“1-step”).

5.4.5 Comparison of HIT Assignment Models

Table 5.5 presents the average Accuracy obtained over all the HITs in our experiments (which

makes a total of 320 questions) by each HIT assignment model. As we can see, our proposed


Task            En. type 3   En. type 5   1-step 3   1-step 5
Soccer          0.98         0.92         0.86       0.86
Actors          0.92         0.92         0.92       0.90
Music           0.62         0.68         0.64       0.54
Book Authors    0.28         0.50         0.50       0.82
Movies          0.70         0.78         0.46       0.62
Anime           0.46         0.90         0.62       0.62
Cricket         0.63         0.82         0.63       0.63

Table 5.4 – Effectiveness for different HIT assignments based on the entity graph in the DBPedia knowledge base assigning each HIT to 3 and 5 workers.

Assignment Method     Average Accuracy
AMT 3                 0.66
AMT 5                 0.62
AMT Masters 3         0.54
Category-based 3      0.79
Category-based 5      0.83
Voting Model t_i 3    0.80
Voting Model t_i 5    0.85
Voting Model A_i 3    0.69
Voting Model A_i 5    0.72
En. type 3            0.66
En. type 5            0.79
1-step 3              0.66
1-step 5              0.71

Table 5.5 – Average Accuracy for different HIT assignment models assigning each HIT to 3 and 5 workers.

HIT assignment models outperform the standard first-come-first-served model adopted by

classic crowdsourcing platforms such as AMT. On average over the evaluated tasks, the best

performing model is the one based on the Voting Model defined for the expert finding problem

where pages relevant to the task are seen as votes for the expertise of the workers. Such an

approach obtains on average a 29% relative improvement over the best accuracy obtained by

the AMT model.

5.5 Related Work in Task Routing

5.5.1 Crowdsourcing over Social Networks

A first attempt to crowdsource micro-tasks on top of social networks has been proposed by

[54], where authors describe a framework to post questions as tweets that users can solve by


tweeting back an answer. As compared to this early approach, we propose a more controlled

environment where workers are known and profiled in order to push tasks to selected users.

Crowdsourcing over social networks is also used by CrowdSearcher [31, 32, 33], which im-

proves automatic search systems by means of asking questions to personal contacts. The

crowdsourcing architecture proposed in [34] considers the problem of assigning tasks to se-

lected workers. However, authors do not evaluate automatic assignment approaches but only

let the requesters manually select the individual workers to whom they want to push the task.

Instead, in this work, we assess the feasibility and effectiveness of automatically mapping HITs

to workers based on their social network profiles.

Also related to our system is the study of trust in social networks. Golbeck [68], for instance,

proposes different models to rank social network users based on trust and applies them to

recommender systems as well as other end-user applications.

5.5.2 Task Recommendation

Assigning HITs to workers is similar to the task performed by recommender systems (e.g.,

recommending movies to potential customers). We can categorize recommender systems

into content-based and collaborative filtering approaches. The former approaches exploit

the resources' contents and match them to user interests. The latter ones only use the similarity

between user profiles constructed out of their interests (see [130] for a survey). Recommended

resources are those already consumed by similar users. Our systems adopts techniques from

the field of recommender systems as it aims at matching HITs (i.e., tasks) to human workers

(i.e., users) by constructing profiles that describe worker interests and skills. Such profiles are

then matched to HIT descriptions that are either provided by the task requester or by analyzing

the questions and potential answers included in the task itself (see Section 5.3). Recommender

systems built on top of social networks already exist. For example, in [9], authors propose a

news recommendation system for social network groups based on community descriptions.

5.5.3 Expert Finding

In order to push tasks to the right worker in the crowd, our system aims at identifying the most

suitable person for a given task. To do so, our Worker Profile Selector component generates

a ranking of candidate workers who can be contacted for the HIT. This is highly related to

the task of Expert Finding studied in Information Retrieval. The Enterprise track at the TREC

evaluation initiative (http://trec.nist.gov) has constructed evaluation collections for the task of expert finding

within an organizational setting [20]. The studied task is that of ranking candidate experts

(i.e., employees of a company) given a keyword query describing the required expertise. Many

approaches have been proposed for such tasks (see [18] for a comprehensive survey). We can

classify most of them as either document-based, when document ranking is performed before


identifying the experts, or as candidate-based, when expert profiles are first constructed before

being ranked given a query. Our system follows the former approach by ranking online social

network pages and using them to assign work to the best matching person.

5.6 Conclusions

A simplistic task allocation procedure, such as pull crowdsourcing, is suboptimal when it comes to efficiently leveraging individual workers' skills and points of interest to obtain high-quality answers. For this reason, we proposed Pick-A-Crowd, a novel crowdsourcing scheme

focusing on pushing tasks to the right worker rather than letting the workers spend time

finding tasks that suit them. We described a novel crowdsourcing architecture that builds

worker profiles based on their online social network activities and tries to understand the

skills and interests of each worker. Thanks to such profiles, Pick-A-Crowd can assign each task

to the right worker dynamically.

To demonstrate and evaluate our proposed architecture, we have developed and deployed SocialBrain{r}, a native Facebook application that pushes crowdsourced tasks to selected workers

and collects the resulting answers. We additionally proposed and extensively evaluated HIT

assignment models based on 1) Facebook categories manually selected by the task requester,

2) methods adapted from an expert finding scenario in an enterprise setting, and 3) methods

based on graph structures borrowed from external knowledge bases. Experimental results

over the SocialBrain{r} user-base show that all of the proposed models outperform the classic

first-come-first-served approach used by standard crowdsourcing platforms such as Amazon

Mechanical Turk. Our best approach provides on average 29% better results than the AMT model.

A potential limitation of our approach is that it may lead to longer task completion times: While

on pull crowdsourcing platforms the tasks get completed quickly (since any available worker

can perform the task), following a push methodology may lead to delays in the completion of

the tasks. In the next chapters, we will investigate techniques that will allow us to improve the

efficiency of a crowdsourcing campaign for both pull and push crowdsourcing.


6 Human Intelligence Task Retention

6.1 Introduction

We now turn our attention to improving the execution time for crowdsourcing tasks where

timely completion is hardly guaranteed and many factors influence the progression pace, including crowd availability, time of day [136, 76], the amount of the micro-payments [62], the number of remaining tasks in a given batch, concurrent campaigns, and the reputation of

the publisher [80]. A common observation that is often made when running a crowdsourcing

campaign on micro-task crowdsourcing platforms is the long-tail distribution of work done

by people [64, 50, 76]: Many workers complete just one or a few HITs while a small number

of workers do most of the HITs in a batch (see Figure 6.1). While this distribution has been

repeatedly observed in a variety of settings, we argue in the following that it is hardly the

optimal case from a batch latency point of view.

As shown in previous work [62], long batches of Human Intelligence Tasks (HITs) submitted to crowdsourcing platforms tend to attract more workers as compared to shorter batches. As a consequence of the long-tail distribution of worker contributions, however, long batches tend to attract

fewer workers towards their end—that is, when only a few HITs are left—as fewer workers are

willing to engage with the almost-completed batch. In this case, it is particularly important

that current workers continue to do as much work as possible before they drop out and prompt

the hiring of new workers for the remaining HITs. In addition, when workers become scarce

(e.g., when the demand is high), such turnovers can become a serious obstacle to rapid batch

completion.

In this chapter, we will explore worker retention as a technique that can be used to improve

batch execution time. For that we introduce a set of pricing schemes designed to improve the

retention rate of workers working in long batches of similar tasks. We show how increasing or

decreasing the monetary reward over time influences the number of tasks a worker is willing

to complete in a batch, as well as how it influences the overall latency. We compare our new

pricing schemes against traditional pricing methods (e.g., constant reward for all the HITs

in a batch) and empirically show how certain schemes effectively function as an incentive


Figure 6.1 – The classic distribution of work in crowdsourced tasks follows a long-tail distribution where few workers complete most of the work while many workers complete just one or two HITs (number of tasks submitted per Worker ID; the plot annotates 'Scale-up' and 'Scale-out' regions of the distribution).

for workers to keep working longer on a given batch of HITs. Our experimental results show

that the best pricing scheme in terms of worker retention is based on punctual bonuses paid

whenever the workers reach predefined milestones.

In summary, the main contributions presented in this chapter are:

• A novel crowdsourcing optimization problem focusing on retaining workers longer in

order to minimize the execution time of long batches.

• A set of new incentives schemes focusing on making individual workers more engaged

with a given batch of HITs in order to improve worker retention rates.

• An open-source software library to embed the proposed schemes inside current HIT interfaces (a library based on the Django framework, available at https://github.com/XI-lab/BonusBar).

• An extensive experimental evaluation of our new techniques over different tasks on a

state-of-the-art crowdsourcing platform.

The rest of the chapter is structured as follows: In section 6.2 we formally define the problem

and introduce different pricing schemes to retain workers longer on a set of HITs given a

fixed monetary budget. Section 6.3.2 presents empirical results comparing the efficiency

of our different pricing schemes and discussing their effect on crowd retention and overall

latency, followed by a discussion in section 6.4. Finally, we review related work focusing on

contributions related to pricing schemes for crowdsourcing platforms and their effects on the

behavioral patterns of the workers, before concluding in section 6.6.


6.2 Worker Retention Schemes

Our main retention incentive is based on compensation of workers engaged in a batch of tasks,

using monetary bonuses and qualifications. We start this section by formally characterizing

our problem below. We then introduce our various pricing schemes, before describing the

visual interface we implemented in order to inform the workers of the monetary rewards, and

the different types of tasks we considered for the HITs.

6.2.1 Problem Definition

Given a fixed retention budget B allocated to pay workers w1, . . . , wm to complete a batch of

n analogous tasks H = {h1, . . . ,hn}, our task is to allocate B for the various HITs in the batch

in order to maximize the average number of tasks completed by the workers. More formally,

our goal is to come up with a function b(h) which, for each h_j ∈ H, gives us the optimal reward upon completion of h_j so as to maximize the average number of tasks completed by the

workers, i.e.:

b(h)_{opt} = \arg\max_{b(h)} \; m^{-1} n^{-1} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \mathbb{1}_C(w_i, h_j, b(h_j)) \qquad (6.1)

where 1_C(w_i, h_j, b(h_j)) is an indicator function equal to 1 when worker w_i completed task h_j under rewarding regime b(h), and to 0 otherwise. For simplicity, we assume in the following that workers complete their HITs sequentially, i.e., that ∀ h_i, h_j ∈ H, h_i is submitted before h_j if i < j, though they can drop out at any point in time in the batch.
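For concreteness, the retention objective of Equation 6.1 can be evaluated for a given rewarding regime from a simple completion log, as in the following sketch; the log layout (a set of worker/HIT pairs) is an assumption made for the example.

def retention_objective(completed, num_workers, num_hits):
    """Equation 6.1: average fraction of HITs completed per worker, i.e.
    (1 / (m * n)) times the sum of the completion indicator over all workers
    and HITs. `completed` is the set of (worker_id, hit_index) pairs observed
    under a given rewarding regime b(h)."""
    return len(completed) / (num_workers * num_hits)

# Two workers and four HITs: w0 finishes the whole batch, w1 drops after one HIT.
log = {("w0", 0), ("w0", 1), ("w0", 2), ("w0", 3), ("w1", 0)}
print(retention_objective(log, num_workers=2, num_hits=4))  # 0.625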

6.2.2 Pricing Functions

Fixed Bonus. The standard pricing scheme used in micro-task crowdsourcing platforms like

AMT is uniform pricing. Under this regime, the worker receives the same monetary bonus for

each completed task in the batch:

b(h_i) = \frac{B}{|H|} \quad \forall\, h_i \in H \qquad (6.2)

Training Bonus. Instead of paying the same bonus for each task in a batch, one might try to

overpay workers at the beginning of the batch in order to make sure that they do not drop out

early as is often the case. This scheme is especially appealing for more complex tasks requiring

the workers to learn some new skill initially, making the first HITs less appealing due to the

initial overhead. This scheme allows the requester to compensate the implicit training phase

by initially fixing a high hourly wage despite the low productivity of the worker. Many different

reward functions can be defined to achieve this goal. In our context, we propose a linearly


decreasing pricing scheme as follows:

b(h_i) = \frac{B}{|H|} + \left( \left\lceil \frac{|H|}{2} \right\rceil - i \right) \cdot \frac{B}{|H| \cdot 2 \cdot |H|} \qquad (6.3)

where we add to the average HIT reward B/|H | a certain bonus payment increment (i.e., the

last term of the equation) a certain number of times based on the current HIT in the batch.

The general idea behind this scheme is to distribute the available budget in a way that HITs

are more rewarded at the beginning and such that the bonus incrementally decreases after

that. One potential advantage of this pricing scheme is the possibility to attract many workers

to the batch due to the initial high pay. On the other hand, retention may not be optimal since

workers could drop out as soon as the bonus gets too low.

Increasing Bonus. By flipping the (+) sign in Equation 6.3 into a (−), we obtain the opposite

effect, that is, a pricing scheme with increasing reward over the batch length. That way, the

requesters are overpaying workers towards the end of the batch instead of at the beginning.

This approach potentially has two advantages: First, as workers get increasingly paid as they

complete more HITs in the batch, they might be motivated to continue longer in order to

complete the most rewarding HITs at the very end of the batch. Second, workers get rewarded

for becoming increasingly trained in the type of task present in the batch. On the other hand,

a possible drawback of this scheme is the fairly low initial appeal of the batch due to the low

bonuses granted at first.

Milestone Bonus. In all the previous schemes, bonuses are attributed after each completed

HIT. However, depending on the budget and the exact bonus function used, the absolute

value of the increments can be very small. To generate bigger bonuses, one could instead

try to accumulate increments over several HITs and distribute bonuses occasionally only.

Following this intuition, we introduce in the following the notion of milestone bonuses. Under

this regime, an accumulated bonus is rewarded punctually after completing a specific number

of tasks. For a fixed interval I, I ≤ n, we formulate this scheme using the following function:

b(h_i) = \begin{cases} \left\lceil \frac{B \cdot I}{|H|} \right\rceil & \text{if } i \bmod I = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6.4)

Qualifications. In addition to the monetary reward that is offered at each interval, the re-

quester can define a qualification level that can be granted after each milestone. Qualifications

are a powerful incentive as they constitute a promise on exclusivity for future work.


Figure 6.2 – Screenshot of the Bonus Bar used to show workers their current and total reward.

Figure 6.3 – Screenshot of the Bonus Bar with next milestone and bonus.

Random Bonus. An additional scheme that we consider is the attribution of a bonus drawn

once at random (the attributed bonus value is removed from the distribution's list to ensure that the budget limit is met) from a predefined distribution of the total retention budget B. In particular,

we consider the Zipf distribution in order to create a lottery effect so that a worker can get a

high bonus at any point while progressing through the batch.
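To summarize the deterministic schemes, here is a minimal sketch of the bonus functions of Equations 6.2-6.4; the 1-based HIT index, the exact increment used for the training/increasing variants, and rounding the milestone bonus up to the nearest cent are assumptions of this sketch rather than the exact values used in our experiments.

import math

def fixed_bonus(i, budget, n):
    """Equation 6.2: the same bonus for every HIT in the batch."""
    return budget / n

def training_bonus(i, budget, n):
    """Equation 6.3: linearly decreasing bonus that overpays the first HITs."""
    increment = budget / (2 * n * n)
    return budget / n + (math.ceil(n / 2) - i) * increment

def increasing_bonus(i, budget, n):
    """Equation 6.3 with the sign flipped: linearly increasing bonus."""
    increment = budget / (2 * n * n)
    return budget / n - (math.ceil(n / 2) - i) * increment

def milestone_bonus(i, budget, n, interval):
    """Equation 6.4: an accumulated bonus paid only every `interval` HITs."""
    if i % interval != 0:
        return 0.0
    return math.ceil(budget * interval / n * 100) / 100  # round up to the cent

n, budget = 50, 0.5  # e.g. the Item Matching batch: 50 HITs, $0.5 bonus budget
print([round(training_bonus(i, budget, n), 4) for i in (1, 25, 50)])
print([milestone_bonus(i, budget, n, interval=10) for i in (10, 15, 20)])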

6.2.3 Visual Reward Clues

On current micro-task crowdsourcing platforms such as Amazon Mechanical Turk, one can

implement the above rewarding schemes by allocating bonuses for the HITs. Hence, workers

complete HITs in a batch in exchange for the usual fixed reward, but get a bonus that possibly varies from one HIT to another. In order to make this scheme clear to the workers, we decided to augment the HIT interface with a Bonus Bar, an open-source toolkit that requesters can easily integrate with their HITs (specifically, AMT requesters can use the toolkit by means of the ExternalQuestion data structure, that is, an externally hosted Web form which is embedded in the AMT webpages). Figure 6.2 gives a visual rendering of the payment information

displayed to a worker completing one of our HITs.

6.2.4 Pricing Schemes for Different Task Types

We hypothesize that the pricing schemes proposed above perform differently based on the task

at hand. In that sense, we decided to address three very different types of tasks and to identify

the most appropriate pricing scheme for each type in order to maximize worker retention.

The first distinction we make for the tasks is based on their length: Hence, we differentiate

short tasks that only require a few seconds each (e.g., matching products) and longer tasks that

require one minute or more (e.g., searching the Web for a customer service phone number).

Note that in any case we only consider micro-tasks, that is, tasks requiring little effort to be

completed by individuals and that can be handled in large batches.

The second distinction we make is based on whether or not the task requires some sort of initial

training. The example we decided to pick for this work is the classification of butterfly images

in a predefined set of classes. We assume that at the beginning of the batch the worker is not

confident in performing the task and repeatedly needs to check the corresponding Wikipedia


Figure 6.4 – Effect of different bonus pricing schemes on worker retention over three different HIT types (panels: Butterfly Classification, Customer Service Phone, and Item Matching; x-axis: Worker ID; y-axis: number of tasks submitted; schemes: Fixed, Training, Increasing, Milestone, and Random Bonus). Workers are ordered by the number of completed HITs.

Batch Type                           #Workers   #HITs   Base Budget   Bonus Budget   Avg. HIT Time   Avg. Hourly Rate
Item Matching                        50         50      $0.5          $0.5           22 sec          $5.3/hr
Butterfly Classification             50         50      $0.5          $0.5           15 sec          $9.4/hr
Customer Care Phone Number Search    50         20      $0.2          $0.4           78 sec          $2.2/hr

Table 6.1 – Statistics for the three different HIT types.

pages in order to correctly categorize the various butterflies. After a few tasks, however, most

workers will have assimilated the key differentiating features of the butterflies and will be able

to perform the subsequent tasks much more efficiently. For such tasks, we expect the training

bonus scheme to be particularly effective since it overpays the worker at the beginning of the

batch as he/she is spending a considerable amount of time to complete each HIT. After the

worker gets trained, one can probably lower the bonuses while still maintaining the same

hourly reward rate.

6.3 Experimental Evaluation

6.3.1 Experimental Setup

In order to experimentally compare the different pricing schemes we introduced above, we

consider three very different tasks:

• Item Matching: Our first batch is a standard dataset of HITs (already used in [164])

asking workers to uniquely identify products that can be referred to by different

names (e.g., ‘iPad Two’ and ‘iPad 2nd Generation’).

• Butterfly Classification: This is a collection of 619 images of six types of butterflies:

Admiral, Black Swallowtail, Machaon, Monarch, Peacock, and Zebra [103]. Each batch

of HITs uses 50 randomly selected images from the collection that are presented to the

workers for classification.


• Customer Care Phone Number Search: In this batch, we ask the workers to find the

customer-care phone number of a given US-based company using the Web.

Our first task is composed of relatively simple HITs that do not require the workers to leave the

HIT page but just to take a decision based on the information displayed. Our second task is

more complex as it requires classifying butterfly images into predefined classes. We assume

that the workers will not be familiar with this task and will have to learn about the different

classes initially. In that sense, we provide workers with links to Wikipedia pages describing

each of the butterfly species. Our third task is a longer task that requires no special knowledge

but rather to spend some time on the Web to find the requested piece of information.

Table 6.1 gives some statistics for each task, including the number of workers and HITs

we considered for each task (always set to 50 for both), the base and bonus budgets, and the

resulting average execution times and hourly rates. All the tasks were run on the AMT platform.

Our main experimental goals are i) to observe the impact of our different pricing schemes

on the total number of tasks completed by the workers in a batch (worker retention) and ii) to

compare the resulting batch execution times. Hence, the first goal of our experiments is not

to complete each batch but rather to observe how long workers keep working on the batch.

Towards that goal, we decided to recruit exactly 50 distinct workers for each batch, and do not

allow the workers to work twice on a given task. We build the backend such that each worker

works on his/her HITs in isolation without any concurrency. This is achieved by allowing 50

repetitions per HIT and recording the worker Id the first time the HIT is accepted, once the

count of Ids reaches 50, any new comer is asked not to accept the HIT. All batches were started

at random times during the day and left online long enough to alleviate any effect due to the

timezones.

6.3.2 Experimental Results

Worker Retention. Figure 6.4 shows the effect of the different pricing schemes on worker

retention for the different types of HITs we consider in this work. The first observation we

can make is that the pricing scheme based on the Milestone Bonus that grants rewards when

reaching predefined goals performs best in terms of worker retention: more workers complete

the batch of tasks as compared to other pricing schemes over all the different task types.

Another observation is that in the Butterfly Classification task the training bonus pricing

scheme retains workers better than the increasing or the fixed bonus scheme. This supports

our assumption that overpaying workers at the beginning of the batch while they are learning

about the different butterfly classes helps them feel rewarded for the learning effort and helps keep them working on the batch longer.

On the other hand, the increasing pricing scheme performed worse both for the Item Matching

and the Butterfly Classification batches. This is probably because workers felt underpaid

for the work they were doing and preferred to drop the batch before its end.


Figure 6.5 – Average HIT execution time (in seconds, with standard error) ordered by task submission sequence in the batch, for the Butterfly Classification, Customer Care Phone, and Item Matching tasks. Results are grouped by worker category (Long, Medium, and Short term workers). In many cases, the Long term workers improve their HIT execution time, which is expected to have a positive impact on the overall batch latency.

The final comment is about the fixed pricing scheme: It shows poor performance in terms
of worker retention over all the task types we have considered. Note that this is the standard
payment scheme used in paid micro-task crowdsourcing platforms like AMT, where each HIT in
a batch is rewarded equally for everyone, independently of how many other HITs the worker
has performed in the batch.

Learning Curve. We report on how the execution time varies across the different task types

in Figure 6.5. We group the results based on three different classes of workers: a) the Short

category, which includes workers having completed 25% or fewer of the tasks in the batch, b) the

Medium category, which includes workers having completed between 25% and 75% of the

HITs in the batch, and c) the Long category, which includes those workers who completed

more than 75% of the tasks.

From the results displayed in Figure 6.5, we observe a significant learning curve for the Butterfly

Classification batch: On average, the first tasks in the batch require a substantially
longer time for workers to complete as compared to the final ones. For the Customer Care Phone Number

Search batch, we see that the task completion time varies from HIT to HIT. We also note that

workers who remained until the end of the batch are becoming slightly faster over time. The

Item Matching batch shows a similar trend, where tasks submitted towards the end of the

batch require on average less time than those submitted initially. Across the different types

of tasks, we also note that workers who are categorized as Short always start slower than

others on average (i.e., workers dropping out early are also slower initially). This is hence an

interesting indicator of potential drop-outs.


Figure 6.6 – Overall precision per worker and category of worker (Short, Medium, Long), as a function of the number of tasks submitted, for the Butterfly Classification task (using Increasing Bonus).

These results are particularly important for our goal of improving latency, since the retained

workers tend to get faster with new HITs performed. This gain is expected to have a direct

impact on the overall execution time of the batch. Next, we check whether this has an impact

on the quality of the submitted HITs.

Impact on Work Quality. We report on the quality of the crowdsourced results in Figure 6.6.

We observe that the average precision of the results does not vary across workers who perform

many or few tasks. We observe however that the standard deviation is higher for the workers

dropping early than for those working longer on the batches. In addition, those workers who

perform most of the HITs in the batch never yield low precision results (the bottom right of

the plot is empty). This could be due to a self-selection phenomenon through which workers

who perform quite badly at the beginning of the batch decide to drop out early.

6.3.3 Efficiency Evaluation

In this final experiment, we evaluate the impact of our best approach (Milestone Bonus) on

the end-to-end execution of a batch of HITs, and we compare with a) the classical approach

with no bonus, and b) using the bonus budget to increase the base reward. In order to get

independent and unbiased results, we decided to create a new task for this experiment4,
which consists in correcting 10 English essays from the ESOL dataset [170]. We run the three

batches on AMT, each having 10 HITs and requiring 3 repetitions, that is, 3 entries are required

from different workers for each HIT. A summary of our setting is shown in Table 6.2. The three

setups differ as follows:

• Batch A (Milestones): Workers who select Batch A are presented with the interface displaying the Bonus Bar, configured with bonuses at the 3, 6 and 10 HITs milestones offering respectively $0.2, $0.4 and $0.8, for a maximum retention budget of $1.4*3=$4.2 (a small payout sketch follows this list).

• Batch B (Classic): Workers who select Batch B are presented with a classical interface and receive a fixed reward of $0.2 for each submission they make.

• Batch C (High Reward): Workers who select Batch C are presented with a classical interface. Here, we use the bonus budget to increase the base reward; thus, workers receive a fixed reward of $0.34 for each submission they make.

4In the previous set of experiments, we hired more than 450 distinct workers.
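To make the milestone configuration of Batch A concrete, the following minimal Python sketch (hypothetical names, not the code used to run the experiment) computes the bonus owed to a single worker under the 3/6/10 milestone scheme:

MILESTONES = {3: 0.20, 6: 0.40, 10: 0.80}  # completed HITs -> bonus in USD

def bonus_due(completed_hits):
    """Total milestone bonus owed to one worker after completed_hits submissions."""
    return sum(bonus for goal, bonus in MILESTONES.items() if completed_hits >= goal)

# A worker finishing the whole batch of 10 HITs earns $1.4 in bonuses; with 3 repetitions
# per HIT, the maximum retention budget is 3 * $1.4 = $4.2, as reported in Table 6.2.
print(bonus_due(4))   # 0.2 (only the 3-HIT milestone has been reached)
print(bonus_due(10))  # ~1.4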

We perform 5 repeated runs as follows: a) we start batches A and B at the same time and
let them run concurrently – this measures the sole effect of retention; b) we launch batch C
separately since it offers a higher base reward and might influence A and B5.

Batch Type       #HITs  #Repetitions  Reward  Base Budget  Bonus Budget  Avg. HIT Time  Avg. Hourly Rate
A (Milestones)   10     3             $0.2    $6           $4.2          268 sec        $5.7/hr
B (Classic)      10     3             $0.2    $6           N/A           310 sec        $2.4/hr
C (High Reward)  10     3             $0.34   $10.2        N/A           302 sec        $3.9/hr

Table 6.2 – Statistics of the second experimental setting – English Essay Correction

Figure 6.7 shows the results of 5 repeated experiments of the above settings. We report the

overall execution time after each batch finishes (i.e., when all the 3*10 HITs are submitted), the

budget used by each run, the number of workers involved and how many HITs each worker

submitted. We can observe the effects of retention in batch A as it involves fewer workers who

submit a greater number of HITs on average as compared to batches B and C. From a latency

perspective, batch A consistently outperforms batch B’s execution time, on average by 33%,

thanks to the retention budget in use. While batch C is faster overall – which can be explained

by the fact that it attracts more workers due to its high reward – it uses the entirety of its budget,

as compared to A that only uses $2.44 on average.

6.4 Discussion

To summarize, the main findings resulting from our experimental evaluation are:

• Giving workers a one-time bonus for reaching a predefined objective, defined as a given
number of completed tasks, improves worker retention.

• Overpaying workers at the beginning of a batch is useful in case the tasks require an

initial training: Workers feel rewarded for their initial effort and usually continue working

for a lower pay after the learning phase.

• While retention comes at a cost, it also improves latency. Based on our experiments

comparing different setups over multiple runs, we observe that the bonus scheme
involves fewer workers who perform more tasks on average. This property is particularly

important when the workforce is limited on the crowdsourcing platform.

5To minimize timezone effects, we run batch C at a similar time of the day as A and B.


Figure 6.7 – Results of five independent runs of the A, B and C setups, reporting the number of tasks submitted per worker, the number of workers per run, the budget (in USD), and the time (in minutes). Type A batches include the retention-focused incentive, while Type B is the standard approach using fixed pricing; Batch C uses a higher fixed pricing – leveraging the whole bonus budget.

6.5 Related Work on Worker Retention and Incentives

A number of recent contributions studied the effect of monetary incentives on crowdsourcing

platforms. In [112], Mao et al. compared crowdsourcing results obtained using both volun-

teers and paid workers. Their findings show that the quality of the work performed by both

populations is comparable, while the results are obtained faster when the crowd is financially

rewarded.

Wang et al. [163] looked at pricing schemes for crowdsourcing platforms focusing on the

quality dimension: The authors proposed methods to estimate the quality of the workers and
introduced new pricing schemes based on their expected contributions. While also

proposing an adaptive pricing strategy for micro-task crowdsourcing, our work focuses instead

on retaining the crowd longer on a given batch of tasks in order to improve the efficiency of

individual workers and to minimize the overall batch execution time.

Another recent piece of work [101] analyzed how task interruption and context switching

decreases the efficiency of workers while performing micro-tasks on a crowdsourcing platform.

This motivates our own work, which aims at providing new incentives to convince the workers

to keep working longer on a given batch of tasks.

Chandler and Horton [38] analyzed (among others) the effect of financial bonuses for crowd-

sourcing tasks that would be ignored otherwise. Their results show that monetary incentives


worked better than non-monetary ones given that they are directly noticeable by the workers.

In our own work, we display bonus bars on top of the task to inform the worker about his/her
hourly rate, fixed pay, and bonuses for the current HITs.

Recently also, Singer et al. [147] studied the problem of pricing micro-tasks in a crowdsourcing

marketplace under budget and deadline constraints. Our approach aims instead at varying

the price of individual HITs in a batch (i.e., by increasing or decreasing the monetary rewards)

in order to retain workers longer.

Faradani et al. [62] studied the problem of predicting the completion of a batch of HITs and of
pricing it given the current marketplace situation. They proposed a new model for predicting

batch completion times showing that longer batches attract more workers. In comparison, we

experimentally validate our work with real crowd workers completing HITs on a micro-task

crowdsourcing platform (i.e., on AMT).

In [113], Mao et al. looked into crowd worker engagement. Their work is highly related to ours

as it aims to characterize how workers perceive tasks and to predict when they are going to

stop performing HITs. The main difference with our work is that [113] looked at a volunteer

crowdsourcing setting (i.e., they used data from Galaxy Zoo where people classify pictures

of galaxies). This is a key difference as our focus is specifically on finding the right pricing

scheme (i.e., the correct financial reward) to engage workers working on a batch of HITs.

Another setting where retaining workers is critical is push crowdsourcing. Push crowdsourcing

[56] is a special type of micro-task platform where the system assigns HITs to selected workers

instead of letting them do any available HIT on the platform. This is done to improve the

effectiveness of the crowd by selecting the right worker in the crowd for a specific type of HIT

based on the worker profile, which may include previous HIT history, skills and preferences.

Since attracting the desired workers is not guaranteed, keeping them on the target task is

essential.

On a separate note, this piece of work was also inspired by studies on talent management in

corporate settings. Companies have long realized the shortage of highly qualified workers and

the fierce competition to attract top talents. In that context, retaining top-performing employ-

ees longer constitutes an important factor of performance and growth [22, 119]. Although our

present setting is radically different from traditional corporate settings, we identified many

cases where the crowdsourcing requesters (acting as a virtual employer) could use common

human resources practices. In this work, we particularly investigate practices such as
training costs, bonuses, and the attribution of qualifications [75, 16].

6.6 Conclusions

In this chapter, we addressed the problem of speeding up a crowdsourcing campaign by in-

centivizing workers such that they keep working longer on a given batch of HITs. Increased


worker retention is valuable in order to avoid the problem of batch starvation (when only a few

remaining HITs are left in a batch and no worker selects them), or if the workforce is limited

on the crowdsourcing platform (a requester tries to keep the workers longer on his batch). We

defined the problem of worker retention and proposed a variety of bonus schemes in order

to maximize retention, including fixed, random, training, increasing, and milestone-based

schemes. We performed an extensive experimental evaluation of our approaches over real

crowds of workers on a popular micro-task crowdsourcing platform. The results of our ex-

perimental evaluation show that the various pricing schemes we have introduced perform

differently depending on the type of task. The best performing pricing scheme in terms of
worker retention is based on milestone bonuses, which are given to the workers as they
reach a predefined goal in terms of the number of completed HITs.

We also observe that our best bonus schemes consistently outperform the classic fixed pricing

scheme, both in terms of worker retention and efficient execution. The main finding is hence
that it is possible to adopt new pricing schemes in order to make workers stay longer on a
given batch of tasks and to obtain results back from the crowdsourcing platform faster.

Worker retention is key in terms of efficiency improvement in the context of hybrid human-

machine systems, and a step towards providing crowdsourcing SLAs for pull-crowdsourcing.

For push-crowdsourcing, we will investigate another technique that relies on task scheduling

in the next chapter.


7 Human Intelligence Task Scheduling

7.1 Introduction

The backend crowdsourced operators of crowd-powered systems typically yield higher laten-

cies than the machine-processable operators, due to inherent efficiency differences between

humans and machines. This problem can be further amplified by the lack of workers on the

target crowdsourcing platform, and/or, if the workers are shared unequally by a number of

competing requesters – including the concurrent users of the same crowd-powered system.

Moreover, in large enterprise settings, it is common that multiple users with different types of

requests submit queries concurrently through the same meta-requester, and end up compet-

ing among themselves. When this happens, it is necessary to correctly manage requests to

avoid latency being impacted any further. Scheduling is the traditional way of tackling such

problems in computer science, by prioritizing access to shared resources to achieve some

quality of service.

In this chapter, we explore and empirically evaluate scheduling techniques that can be used

to manage the internal operations of a crowdsourcing system. More specifically, we focus on

multi-tenant, crowd-powered systems where multiple batches of Human Intelligence Tasks

(HITs) have to be run concurrently. In order to effectively handle the HIT workload generated

by such systems, we implement and empirically compare a series of scheduling techniques

with the aim of improving the overall efficiency of the system. Specifically, we try to answer

the following questions: “Does known scheduling algorithms exhibit their usual properties

when applied to the crowd?" and “What are the adaptations needed to accommodate the

usual crowd work routine?"

Efficiency concerns have so far mostly been tackled by increasing the price of the HITs or by

repeatedly re-posting the HITs on the crowdsourcing platform [62, 26]. Instead, we propose
the use of a HIT-BUNDLE, that is, a group of heterogeneous HITs originating from multiple
clients in a multi-tenant system. This allows us to apply HIT scheduling techniques within the
HIT-BUNDLE and to decide which HIT should be served to the next available worker. While
our focus is on efficiency, the proposed techniques are still compatible with other quality
optimization approaches; merging the two aspects is outside the scope of this work.


7.1.1 Motivating Use-Cases

Example use case 1: reCAPTCHA [159] is a mechanism that protects websites from bots by

presenting a text transcription challenge that only a human can pass. In return, the

collected transcriptions are used to digitize books. If a similar service was open to external

clients having digitization requests, the system would serve the chopped scans of books

according to a scheduling strategy that meets the clients requirements, e.g., throughput

(words/minute), or a deadline target.

Example use case 2: A large organization with multiple departments shares a database system

with crowd-powered user defined functions (UDFs). In our scenario, the marketing and sales

departments issue a series of queries to their system (see Listing 7.1), generating five different

HIT batches on the crowdsourcing platform. Note that with current systems, distinct queries

would generate isolated concurent batches with inter-dependent performances.

-- Marketing Department
-- Q1:
SELECT * FROM clients r
WHERE isFemale(r.document_scan)
  AND r.city = 'Philadelphia'
-- Q2:
SELECT hairColor(p.picture), COUNT(*)
FROM person p
GROUP BY hairColor(p.picture)

-- Sales Department (High Priority)
-- Q3:
SELECT * FROM person p
WHERE isFemale(p.picture)
  AND p.marital_status = 'married'
-- Q4:
SELECT *, findCustomerCarePhone(c.name)
FROM clients c
ORDER BY c.sales DESC
-- Q5:
SELECT *, tagESP(b.scan, 2)
FROM business_cards b

Listing 7.1 – Example queries of a crowd-powered DBMS.

We make the following observations:

• Q1 and Q3 use the same query operators.

• Q2 and Q3 use the same input field.


• While queries with the same UDFs can be merged, in this case Q3 should run with a higher

priority.

• Q4 needs to crowdsource the records of customers with the highest sales first.

• Q5 is a UDF that implements an ESP [156] mechanism for tagging pictures, hence requir-
ing the live collaboration of two workers.

7.1.2 Objective

We believe that posting HITs individually on a shared crowdsourcing platform, as they get

generated by the crowd-powered DBMS, is suboptimal. Rather, we propose in the following to

manage their execution by regrouping the HITs from the different queries into a single batch

that we call a HIT-BUNDLE. Hence, our goal is to create an intermediate scheduling layer that

has the following objectives:

• improving the overall execution time of the generated workload, while

• ensuring fairness among the different users of the system by equitably balancing the

available workforce, and

• avoiding starvation of smaller requests.

7.1.3 Contributions

We experimentally compare the efficiency of various crowd scheduling approaches with real

crowds of workers working on a micro-task crowdsourcing platform by varying the size of the

crowd, the ordering and priority of the tasks, and the size of the HIT batches. In addition, we

take into account some of the unique characteristics of the crowd workers such as the effect

of context switching and work continuity. Our experimental settings include both controlled

settings with a fixed number of workers involved in the experiments as well as real-world

deployments using HIT workloads taken from a commercial crowdsourcing platform log.

The results of our experimental evaluation indicate that using scheduling approaches for

micro-task crowdsourcing can lead to more efficient multi-tenant crowd-powered DBMSs by

providing faster results and minimizing the overall latency of high-priority work published on

the crowdsourcing platform.

7.2 Scheduling on Amazon MTurk

The AMT Platform: In this work, we aim at comparing approaches to improve as much as

possible the platform efficiency given the current workload of HITs from a certain requester.

We chose to design an experimental framework on top of AMT as 1) it is currently the

most popular micro-task crowdsourcing platform, 2) there is a continuous flow of workers

and requesters completing and publishing HITs on the platform, and 3) its activity logs are

available to the public [76].


(a) Batch distribution per size – Most of the batches present on AMT have 10 HITs or less. (b) Cumulative throughput per batch size – The overall platform throughput is dominated by larger batches. Batch size categories: Tiny [0,10], Small [10,100], Medium [100,1000], Large [1000,Inf].

Figure 7.1 – An analysis of a three-month activity log of Amazon MTurk (January-March 2014) obtained from mturk-tracker.com [76]. The crawler frequency is every 20 minutes, hence it might miss some batches. All HITs considered in this plot are rewarded $0.01. Throughput is measured in HITs/minute for HIT batches of different sizes.

Major Requesters and Meta Requesters: In crowdsourcing platforms, businesses that heavily

rely on micro-task crowdsourcing for their daily operations end up competing with themselves:

If a requester runs concurrent campaigns on a crowdsourcing platform, these will end up

affecting each other. For example, a newly posted large batch of HITs is likely to get more

attention than a two-day-old batch waiting to be finished with few HITs remaining (see below

for an explanation on that point).

7.2.1 Execution Patterns on Micro-Task Crowdsourcing Platforms

One of the common phenomena in micro-task crowdsourcing is the presence of long-tail

distributions: In a batch of HITs, the bulk of the work is completed by a few workers who

perform most of the tasks while the rest is performed by many different workers who perform

just a few HITs each (see, e.g., [64]). We observe this property in our experiments as well.

Figure 7.3b shows the amount of work (number of HITs submitted during the experiment)

performed by each worker in the crowd during an experiment involving more than 100 crowd

workers (see Section 7.2.2) containing heterogeneous HITs. We can see a long-tail distribution

where few workers perform most of the tasks while many perform just a few tasks.

Another example of long-tail distribution can be observed when considering the throughput:

Large batches are completed at a certain speed by the crowd, up to a certain point when few

HITs are left in the batch. These final few HITs take a much longer time to be completed as

compared to the majority of HITs in the batch. Such a batch starvation phenomenon has been

observed in a number of recent reports, e.g., in [62, 161] where authors observe that the batch

completion time depends on its size and on HIT pricing. HIT completion starts off quickly

but then loses some momentum. A plot depicting this effect on the AMT platform is shown in

Figure 7.1, where we observe that large batches dominate the throughput of a crowdsourcing

platform even if the vast majority of the running batches are very small (less than 10 HITs).

In that sense, large batches of tasks are able to systematically yield higher throughputs as

more crowd workers can work on them in parallel. We can conjecture that these phenomena


Figure 7.2 – The role of the HIT Scheduler in a Multi-Tenant Crowd-Powered System Architecture (e.g., a DBMS). Diagram components include the multi-tenant crowd-powered DBMS, the HIT Manager, the HIT Scheduler with its Batch Catalog and HIT queue, the HIT-Bundle Manager with its progress monitor and results aggregator, and the crowdsourcing platform serving human workers through an external HIT page.

are partially due to the preference of the crowd towards large batches. Indeed, the workers

tend to explore new batches with many HITs, since they have a high reward potential, without

requiring to search for and select a new HIT context. This is confirmed by our experimental

results (see section 7.4).

Moreover, we can see in Figure 7.3a that the overall throughput of the system increases linearly

with the number of workers involved in a set of batches.

7.2.2 A Crowd-Powered DBMS Scheduling Layer on top of AMT

We now describe the scheduling layer we established on top of AMT to perform our experimen-

tal comparison of different HIT Scheduling techniques. This layer can be used by multi-tenant

crowd-powered DBMSs to efficiently execute user queries.

HIT-BUNDLE: We study scheduling techniques applied to the crowd on AMT by introducing

the notion of a HIT-BUNDLE, that is, a batch container where heterogeneous HITs of com-

parable complexity and reward get published continuously by a given AMT requester, or, in

our case, by the Crowd-powered DBMS. In this section, we describe the main components

of a Multi-Tenant Crowd-Powered DBMS that uses scheduling techniques to optimize the

execution of batches of HITs. Then, we show that having a HIT-BUNDLE not only permits to

apply different scheduling strategies but it also produces a higher overall throughput (see

Section 7.4.2).

Framework: Our general framework is depicted in Figure 7.2. The input comes from the

different queries submitted to the system. The query optimizer has the role of deciding what

to ask the crowd. Subsequently, the HIT Manager generates HIT batches together with

a monetary budget to be spent to obtain the results from the crowd. In traditional crowd-

powered systems, these batches are directly sent to the crowdsourcing platform.

In this work, we consider and experimentally evaluate the performance of an additional compo-

nent, the HIT Scheduler, which aims at improving the execution time of selected HITs. Once


Figure 7.3 – Results of a crowdsourcing experiment involving 100+ workers concurrently working in a controlled setting on a HIT-BUNDLE containing heterogeneous HITs (B1-B5, see section 7.4) scheduled with FS. (a) Throughput (measured in HITs/minute) increases with an increasing number of workers involved. (b) Amount of work (number of HITs submitted) done by each worker.

new HIT batches are generated, they are put in a container of tasks to-be-crowdsourced. The

scheduler is constantly monitoring the crowd workers and assigning to individual workers the

next HIT to work on based on a scheduling algorithm. More specifically, the HIT Scheduler

collects in its Batch Catalog the set of HIT batches generated by the HIT Manager together

with their reward and priorities.

Next, the HIT-BUNDLE Manager creates a crowdsourcing campaign on AMT. Based on the

scheduling algorithm adopted, a HIT queue (specifying which HIT must be served next in the

HIT-BUNDLE) is generated and periodically updated. As soon as a worker is available, the HIT

Scheduler serves the first element in the queue. When HITs are completed, the results are

collected and sent back to the DBMS for aggregation and query answering. Workers are able

to return HITs they find too boring or poorly paid and, obviously, to leave the system at any

point in time. In these cases, the Scheduler takes responsibility for updating the queue and
rescheduling uncompleted HITs.

Next, we describe a number of scheduling algorithms that can be used to generate the HIT

queue for crowdsourcing platforms in section 7.3 and experimentally compare their perfor-

mance in section 7.4.

7.3 HIT Scheduling Models

The rest of this chapter focuses on experimentally evaluating scheduling approaches for crowd-

sourcing platforms within the framework presented above in Section 7.2.2. We revisit below

common scheduling approaches used by popular resource managers in shared environments,

and discuss their advantages and drawbacks when applied to a crowdsourcing platform setting

which, as we show in section 7.4, presents several new dimensions to be taken into account

compared to traditional CPU scheduling.


7.3.1 HIT Scheduling: Problem Definition

We now formally define the problem of scheduling HITs generated by a multi-tenant crowd-

based system on top of a crowdsourcing platform.

A query r submitted to the system and including crowd-powered operators generates a batch

B_j of HITs. We define a batch B_j = {h_1, .., h_n} as a set of HITs h_i. Each batch has additional
metadata attached to it: a monetary budget M_j to be spent for its execution and a priority
score p_j with which it should be completed: Batches with higher priority should be executed

before batches with lower priority. Thus, if a high-priority batch is submitted to the platform

while a low-priority batch is still uncompleted, the HITs from the high-priority batch are to be

scheduled to run first.

The problem of scheduling HITs takes as input a set of available batches {B_1, .., B_n} and a crowd
of workers {w_1, .., w_m} currently active on the platform, and produces as output an ordered list
of HITs from {B_1, .., B_n} to be assigned to workers in the crowd by publishing them as a single
HIT-BUNDLE. Once a worker w_i is available, the system assigns him/her the first task in the

list as decided by the scheduling algorithm.
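For illustration, this problem setting can be captured with a few hypothetical Python data structures (a sketch, not the thesis implementation): batches carry their priority and budget, and a generic serving step hands the next HIT of the best-ranked batch to whichever worker becomes available. The concrete ranking policies are the subject of the rest of this section.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HIT:
    hit_id: str
    batch_id: str

@dataclass
class Batch:
    batch_id: str
    hits: List[HIT]          # pending HITs of this batch
    priority: float = 1.0    # p_j
    budget: float = 0.0      # monetary budget M_j
    running: int = 0         # HITs of this batch currently assigned to workers

def next_hit(batches: List[Batch], order_key: Callable[[Batch], float]) -> HIT:
    """Serve the next HIT to an available worker: pick the batch ranked first by
    order_key (the scheduling policy) among the batches that still have pending HITs."""
    pending = [b for b in batches if b.hits]
    if not pending:
        raise LookupError("no pending HITs in the HIT-BUNDLE")
    best = min(pending, key=order_key)
    best.running += 1
    return best.hits.pop(0)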

Scheduling may need to be repeated over time to update the HIT execution queue. Such

re-scheduling operations are necessary, for example when a worker fails to complete some of

his/her HITs or when a new batch of HITs is submitted by one of the clients.

In this way, we obtain some hybrid pull-push behavior on top of AMT as the workers partici-

pating in the crowdsourcing campaign are shown HITs computed by the scheduler. Workers

are still free to decline the HIT, ask for another one, or simply seek for another requester on

AMT.

Worker Context Switch

From the worker perspective, scheduling can lead to randomly alternating task types that

a single worker might receive. In such a situation, the worker has to adapt to the new task

instructions, interface, questions, etc., and this can be penalizing (see our related work in Section
7.5). This overhead is called a context switch. One of the goals of task scheduling is to improve
the efficiency of each worker by mitigating his/her context switches.

7.3.2 HIT Scheduling Requirement Analysis

Next, we describe which requirements should be taken into account when applying scheduling

in a crowdsourcing setting. We then use some of these requirements to customize known

scheduling techniques for the crowd.

(R1) Runtime Scalability: Unlike parallel DBMS schedulers, where the compiled query plan


dictates where and when the operators should be executed [150], Crowd-Powered

DBMSs are bound to adopt a runtime scheduler that a) dynamically adapts to the

current availability of the crowd, and b) scales to make real-time scheduling decisions as

the work demand grows higher. A similar design consideration is adopted by YARN [153],

the new Hadoop resource manager.

(R2) Fairness: An important feature that any shared system should provide is fairness across

the users of the system. By taking control of the HIT-BUNDLE scheduling, the crowd-

powered system acts as the load balancer of the currently available crowd and the

remaining HITs in the HIT-BUNDLE. For example, the scheduler should provide a steady

progress to large requests without blocking – or starving – the smaller requests.

(R3) Priority: In a multi-tenant system, some queries have a higher priority than others.

For this reason, HITs generated from the queries should be scheduled accordingly. In

the case of high-priority requests, one of the standard SLA requirements is the job

deadline that the requester specifies. In a crowdsourcing scheduling setting, as workers

are not committed to the platform and can leave at any point in time, a Crowd-Powered

DBMS scheduler should be best-effort, that is, the system should do its best to meet the

requester priority requirements without any hard guarantee.

(R4) Multiple Resources: Crowd-Powered UDFs can be designed to include specific require-

ments on resources, e.g., qualifications and number of workers. In that sense, we

consider the very common case of collaborative tasks where multiple crowd workers

are needed concurrently. An example of such a task is the ESP game [158], where two

human individuals have to tag images collaboratively. This problem is analogous to the

gang scheduling problem [63] in machine-based systems, where an algorithm can only

run when a given number of CPUs are reserved for that purpose.

(R5) Need for Speed: In hybrid human-machine systems, the crowd-powered modules are

usually the bottleneck in terms of latency. However, real-time Crowdsourcing is a

necessity for various interactive applications that require human intelligence at scale

to improve what machines can do today. Example applications that require real-time

reactions from the crowd include real-time captioning of speech [98], real-time personal

assistants [102], real-time video filtering [24]. Scheduling HITs belonging to a mixed

workload of real-time and batch jobs is essential to enable real-time crowdsourcing.

(R6) Worker Friendly: Unlike CPUs, human performance is impacted by many
factors, including training effects, boredom, task difficulty and interestingness. Schedul-

ing approaches over the crowd should whenever possible take these factors into account.

In this chapter, we experimentally test worker-conscious scheduling methods that aim

at balancing the trade-off between serving similar HITs to workers and providing fair

execution to different HIT batches.

In addition to the machine-specific requirements listed above, we briefly discuss crowd-

specific features that a scheduler needs to take into account when scheduling HITs on crowd-

sourcing platforms.


(C1) Laggers: It often happens that HITs are assigned to a worker on a crowdsourcing plat-

form but never get completed [62]. In distributed systems, task execution failures are

usually mitigated by opportunistically duplicating the task on an idle resource. In an

architecture partially powered by micro-tasks, duplicating HITs opportunistically di-

rectly leads to unnecessary monetary costs, especially when the lagging workers end

up doing the task. Instead, the HIT should be released from the lagging worker after a

batch-specific timeout, and only then be reassigned to a new worker.

(C2) Better Resources: Some resources in a shared system might be better than others (some

may be more powerful, consume less energy, etc.) [105]; Likewise, in the crowd, some

workers may be more efficient than others or might provide higher-quality results.

Previous work [56] showed how it is possible to predict the quality of the results of a

specific worker on a specific task. Such approaches can be used as additional evidence

for HIT scheduling but are not in the scope of this work. Instead, we focus on scheduling

approaches to improve latency of certain batches in a setting where selected HIT batches

have high priority and need to be executed before others.

7.3.3 Basic Space-Sharing Schedulers

Crowdsourcing platforms usually operate in a non-preemptive mode, that is, they do not allow
interrupting a worker performing a low-priority task to have him/her perform a task of higher
priority, at the risk of reneging.1 In our evaluation, we consider common space-sharing

algorithms where a resource (a crowd worker in this case) is assigned a HIT until he/she

finishes it, or returns it uncompleted to the platform.

FIFO

On crowdsourcing platforms, this scheduling has the effect of serving lists of tasks of the same

batch to the workers until they are finished. By concentrating the entire workforce on a single

job until it is done, FIFO provides the best throughput per batch one can expect from the

platform at a given moment in time.

The potential shortcomings of this scheme are as follows: 1) short jobs and high priority jobs

can get stuck behind long-running tasks, reducing the overall efficiency of the crowdsourcing

system, and 2) when a batch has a large number of tasks, assigned workers can potentially get

bored [138].

Shortest Job First (SJF)

Other simple scheduling schemes offer different tradeoffs depending on the requirements of

the multi-tenant system. Shortest Job First (SJF) offers fast turn-around for short HITs, and

can minimize context switching for part of the crowd, since the shortest jobs are
either quickly finished or scheduled to the first available workers.

1Unless the high-priority task can take up the reneging cost.

However, SJF is not strategy-proof on current crowdsourcing platforms as the requesters

can lie about the expected HIT execution times. Hence, these schemes should mostly be used
in trusted settings (e.g., in enterprise crowd-DBMSs). Moreover, these schemes do
not systematically interweave tasks from different batches, and thus also present the same

shortcomings as FIFO.

Round Robin (RR)

The previous schemes introduce biases, in the sense that they give an advantage to one batch

over the others. Round Robin removes such biases by assigning HITs from batches in a cyclic

fashion. In this way, all the batches are guaranteed to make regular progress. While Round

Robin ensures an even distribution of the workforce and avoids starvation, it does not meet

one of our requirements (R3) since it is not priority-aware: All the batches are treated equally

with the side effect that batches with short HITs would (proportionally) get more workforce

than longer HITs. Another risk is that a worker might find herself bouncing across tasks and

being forced to continuously switch context, hence losing time to understand the specific

instructions of the tasks. The negative effect of context switch is evident from our experimental

results (see section 7.4) and should be avoided.

7.3.4 Fair Schedulers

In order to deal with batches of HITs having different priorities while avoiding starvation, we

also consider scheduling techniques frequently used in cluster computing.

Fair Sharing (FS)

Sharing heterogeneous resources across jobs having different demands is a well-known and

complex problem that has been tackled by the cluster computing community. One popu-

lar approach currently used in Hadoop/Yarn is Fair Scheduling (FS) [67]. In the context of

scheduling HITs on a crowdsourcing platform, we borrow this approach in order to achieve fair

scheduling of micro-tasks: Whenever a worker is available, he/she gets a HIT from the batch

with the lowest number of currently assigned HITs, which we call running_tasks. Unlike

Round Robin, this ensures that all the jobs get the same amount of resources (thus being fair).

Algorithm 1 gives the exact way we considered FS in our context.

Algorithm 1 Basic Fair Sharing
Input: B = {b_1<p_1, r_1>, .., b_n<p_n, r_n>}, the set of batches currently queued, with priority p_i and number of running HITs r_i per batch.
Output: HIT h_i.
1: When a worker is available for a task
2: B_sorted = Sort B by increasing r_i
3: h_i = B_sorted[0].getNextHit()
4: return h_i

Weighted Fair Sharing (WFS)

In order to schedule batches with higher priority first (see R3 in Section 7.3.2), weighted fair
sharing can be used, in order to assign a task from the job with the least running_tasks/task_priority
value. Algorithm 1, line 2, gets in that case updated: Sort B by increasing r_i/p_i. This puts
more weight on batches with few running tasks and a high priority.

The following formula gives the fair share of resources (i.e., number of crowd workers) allocated
to a HIT batch j with priority score p_j and concurrent running batches with priority scores
{p_1, .., p_N} at any given point in time:

w_j = \frac{p_j}{\sum_{i=1}^{N} p_i}.    (7.1)
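As a minimal sketch (hypothetical names, not the thesis code), Basic Fair Sharing and Weighted Fair Sharing can be expressed as ordering keys over the per-batch statistics, together with the fair share of Eq. (7.1):

from dataclasses import dataclass

@dataclass
class BatchStats:
    running: int     # r_i, HITs currently assigned to workers
    priority: float  # p_i

def fs_key(b):
    """Basic Fair Sharing (Algorithm 1, line 2): fewest running HITs first."""
    return b.running

def wfs_key(b):
    """Weighted Fair Sharing: sort by r_i / p_i (few running HITs and high priority first)."""
    return b.running / b.priority

def fair_share(b, batches):
    """Eq. (7.1): w_j = p_j / sum_i p_i, the share of the workforce batch j should receive."""
    return b.priority / sum(x.priority for x in batches)

batches = [BatchStats(running=4, priority=1.0), BatchStats(running=1, priority=3.0)]
served_first = min(batches, key=wfs_key)   # the high-priority batch (1/3 < 4/1)
print(fair_share(served_first, batches))   # 0.75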

7.3.5 Gang Scheduling for Collaborative HITs

A Crowd-powered UDF can be coded to require live collaboration of K workers for each HIT it

creates. A typical example is the design of games with a purpose [156] where the participants

can be hired through a paid crowdsourcing platform (see R4 in Section 7.3.2). This is the

equivalent of Gang Scheduling in the context of system scheduling (e.g., MPI) where a job

will not start if the central scheduler cannot provision the required number of resources (i.e.,

CPUs). Different scheduling approaches are necessary for such HITs, as they require two or

more concurrent workers to be completed.

Naive Gang Scheduling (NGS)

The most common way to achieve gang scheduling is to place a reservation on a resource until

the job acquires all the necessary resources. In order to make this technique applicable to a

crowdsourcing platform, one needs to place a reservation on a worker, making the k readily
available workers wait for the remaining K − k workers. This approach is suboptimal in our

context since recruiting all the necessary workers has no time guarantee, hence the workers

might incur unacceptable idle time which has a negative impact on their revenue. The idle

effect is however mitigated when there is a sufficient number of workers on the platform.

7.3.6 Crowd-aware Scheduling

In addition to the standard scheduling techniques described above, we also evaluate a couple

of approaches aiming at scheduling tasks while taking into account the crowd workers' needs (see R6 in
Section 7.3.2). In that sense, we propose scheduling approaches that offer a tradeoff between
being fair to the batches (by load-balancing the workers) while also being fair to the workers
(by serving HITs with some continuity, if possible, and with minimal wait time).

Algorithm 2 Worker Conscious Fair Share
Input: B = {b_1<p_1, r_1, s_1>, .., b_n<p_n, r_n, s_n>}, the set of batches currently queued, with priority p_i, number of running HITs r_i, and concessions s_i initialized to 0.
Input: K = maximum concession threshold
Output: HIT h_i.
1: When a worker w_j is available for a HIT
2:   b_last = Last batch that w_j did // null if it is a new worker
3:   B_sorted = Sort B by increasing r_i/p_i
4:   if b_last == null then
5:     B_sorted[0].s = 0
6:     return B_sorted[0].getNextHit()
7:   end if
8:   for b in B_sorted do
9:     if b == b_last then
10:      b.s = 0
11:      return b.getNextHit()
12:    else if b.s < K then
13:      b.s++
14:      continue
15:    else
16:      b.s = 0
17:      return b.getNextHit()
18:    end if
19:  end for

Worker Conscious Fair Sharing (WCFS)

Worker Conscious Fair Sharing (WCFS) maximizes the likelihood of a worker receiving a

task from a batch he/she worked on recently, thus avoiding that a worker jumps back and forth
between different tasks (i.e., minimizing context switching). We suggest achieving this by
having top-priority batches concede their positions in favor of one of the next batches in the
queue. Each batch can concede its turn K times, a predefined concession threshold, which is

reset after being scheduled. This approach is the crowd-equivalent of Delay Scheduling [172].
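Read as code, Algorithm 2 amounts to the following minimal Python sketch (hypothetical data structures, not the thesis implementation); the final fallback covers the edge case left implicit in the pseudocode, where the worker's last batch is drained and every other batch still has concessions left:

def wcfs_next_hit(batches, last_batch_id, K):
    """batches: list of dicts with keys 'id', 'running', 'priority', 'concessions', 'queue'.
    last_batch_id: id of the batch the worker last worked on (None for a new worker).
    K: maximum concession threshold."""
    ordered = sorted((b for b in batches if b["queue"]),
                     key=lambda b: b["running"] / b["priority"])
    if not ordered:
        return None
    if last_batch_id is None:                 # new worker: plain weighted fair sharing
        ordered[0]["concessions"] = 0
        return ordered[0]["queue"].pop(0)
    for b in ordered:
        if b["id"] == last_batch_id:          # keep the worker on his/her previous batch
            b["concessions"] = 0
            return b["queue"].pop(0)
        if b["concessions"] < K:              # a higher-ranked batch concedes its turn
            b["concessions"] += 1
            continue
        b["concessions"] = 0                  # concession budget exhausted: schedule it anyway
        return b["queue"].pop(0)
    # Fallback (not in the pseudocode): the previous batch is finished and every remaining
    # batch still had concessions left, so fall back to the fair-share front of the queue.
    ordered[0]["concessions"] = 0
    return ordered[0]["queue"].pop(0)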

Crowd-aware Gang Scheduling (CGS)

Finally, we propose a crowd-aware version of Gang Scheduling for crowdsourcing platforms

by setting a maximum wait time τ any recruited worker might incur. Whenever the scheduler

decides that a HIT with a gang requirement should be executed, it first checks the expected


finish time for all the workers currently active on the platform based on the average finish time

of the HITs they are currently performing. If K workers can be available in a window of time

τ = t_1 − t_0, then the batch is scheduled to start at the beginning of the time window. Hence,

the first worker who gets the task will be joined by the other workers in a maximum of τ time.

We deal with the uncertainty of workers quitting the platform after their last HIT
by over-provisioning, that is, by assigning more workers than required to the collaborative HIT
and by giving the task to any worker who becomes available in the target time window. We

call the resulting technique Crowd-Aware Gang Scheduling (CGS).
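The scheduling decision of CGS thus boils down to a feasibility check over the expected finish times of the HITs currently in progress. The following minimal sketch (hypothetical function and parameters, not the thesis code) illustrates that check, with one extra worker provisioned to absorb drop-outs:

def can_start_gang(expected_finish_times, k, tau, now, overprovision=1):
    """expected_finish_times: predicted completion timestamps of the HITs each currently
    active worker is performing (one entry per worker). Returns True if at least
    k + overprovision workers are expected to be free within the window tau."""
    free_within_window = sum(1 for t in expected_finish_times if t - now <= tau)
    return free_within_window >= k + overprovision

# Example: 3 of the 4 active workers should be free within tau = 60 s, so a 2-worker
# collaborative HIT (plus one over-provisioned worker) can be scheduled to start.
print(can_start_gang([10.0, 40.0, 55.0, 300.0], k=2, tau=60.0, now=0.0))  # True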

7.4 Experimental Evaluation

We describe in the following our experimental results obtained by scheduling HITs on the AMT crowdsourcing platform. The main questions we want to address are the following:

• Do scheduling approaches keep their properties when used for assigning HITs to workers

in the crowd? (section 7.4.3)

• How do different dimensions like batch priority and crowd size affect execution? (sec-

tion 7.4.3)

• How does scheduling for collaborative HITs perform? (section 7.4.3)

• How do worker-aware approaches behave in terms of throughput and latency? (sec-

tion 7.4.4)

• How do scheduling approaches behave on a real deployment over a commercial crowd-

sourcing platform? (section 7.4.4)

• Do larger HIT batches attract more workers in current crowdsourcing platforms? (sec-

tion 7.4.2)

• Does context switching (i.e., working on a sequence of different HIT types) affect worker
efficiency? (section 7.4.2)

As a general experimental setup, we implemented the architecture proposed in Section 7.2.2

on top of AMT’s API. Our implementation and datasets are available as an open-source project

for reproducibility purposes and as a basis for potential extensions2.

7.4.1 Datasets

For our experiments, we created a dataset of 7 batches of varying complexity, sizes, and

reference prices. The data was partly created by us and partly collected from related works;

it includes typical tasks that could have been generated by crowd-powered DBMSs. Table

7.1 gives a summary of our dataset and provides a short description and references when

applicable. We note that for the purpose of our experiments, we vary the batch sizes and prices

according to the setup.

2https://github.com/XI-lab/hitscheduler


B1 – Customer Care Phone Number Search: Find the customer-care phone number of a given US-based company using the Web. ($0.07 per HIT, 50 HITs, avg. 75 sec per HIT)
B2 – Image Tagging: Type all the relevant keywords related to a picture from the ESP game dataset [158]. ($0.02 per HIT, 50 HITs, avg. 40 sec per HIT)
B3 – Sentiment Analysis: Classify the expressed sentiment of a product review (positive, negative, neutral). ($0.05 per HIT, 200 HITs, avg. 22 sec per HIT)
B4 – Type a Short Text: This is a study on short-term memory, where a worker is presented with a text for a few seconds and is then asked to type it from memory [155]. ($0.03 per HIT, 100 HITs, avg. 11 sec per HIT)
B5 – Spelling Correction: A collection of short paragraphs to spell check from StackExchange. ($0.03 per HIT, 100 HITs, avg. 36 sec per HIT)
B6 – Butterfly Classification: Classify a butterfly image into one of 6 species (Admiral, Black Swallowtail, Machaon, Monarch, Peacock, and Zebra) [103]. ($0.01 per HIT, 600 HITs, avg. 15 sec per HIT)
B7 – Item Matching: Uniquely identify products that can be referred to by different names (e.g., 'iPad Two' and 'iPad 2nd Generation') [164]. ($0.01 per HIT, 96 HITs, avg. 22 sec per HIT)

Table 7.1 – Description of the batches constituting the dataset used in our experiments.

7.4.2 Micro Benchmarking

The goal of the following micro-benchmark experiments is to validate some of the hypotheses
that motivate the use of a HIT-BUNDLE and the design of a worker-aware scheduling algorithm
that minimizes task switching for the crowd workers.

Batch Split-up

The first question we address is whether smaller or larger batches of homogeneous HITs

are more attractive to the workers on AMT. We experimentally check if a single large batch

executes faster than when breaking the same batch into smaller ones. To this end, we use the

batch B6 which we split into 1, 10 and 60 individual batches, containing respectively 600, 60

and 10 HITs each. Next, we run all these batches on AMT concurrently with non-indicative

titles and similar unit prices of $0.01. Note that the batch combinations were published at

the same time on the crowdsourcing platform so all the variables like crowd population and

size, concurrent requesters, and rewards are the same across the different settings.

Figure 7.4 – A performance comparison of batch execution time using different grouping strategies: publishing a large batch of 600 HITs vs. smaller batches (from B6).

Figure 7.4 shows how the three different batch splitting strategies executed over time on B6. We

observe that running B6 as one large batch of 600 HITs completed first. We also observe that

the strategy with 10 batches only really kicks off when the large batch finishes (and similarly

for the strategy with 60 batches). From this experiment, we conclude that larger batches

provide a better throughput and constitute a better organizational strategy. This finding is

especially interesting for requesters who would periodically run queries that use a common

crowdsourcing operator (albeit, with a different input), by pushing new HITs into an existing

HIT-BUNDLE.

Merging Heterogeneous Batches

We extend the above experiment to compare the execution of two heterogeneous batches run

separately or within a single HIT-BUNDLE. Unlike the previous experiment, where the fine-

grained batches were one to two orders of magnitude smaller than the larger one, this scenario

involves two batches of type B6 and B7 containing 96 HITs each, versus one HIT-BUNDLE regrouping all 192 HITs. We run the three batches concurrently on AMT, with non-indicative

titles and similar unit prices of $0.01 and without altering the default serving order within the

HIT-BUNDLE3. The results are depicted in Figure 7.5. Again, the HIT-BUNDLE exhibits a faster

throughput as compared to individual batches. Moreover, the embedded batches both finish

before their counterparts that are running separately.

At this point, we have shown that requesters who would run queries invoking different crowd-

sourcing operators can also benefit from pushing their HITs into the same HIT-BUNDLE. Since

a DBMS might support multiple crowdsourcing operators, the next question we explore is

whether context switches (i.e., alternating HIT types) affect worker efficiency.

3We observe that AMT randomly selects the input to serve.


Figure 7.5 – A performance comparison of batch execution time using different grouping strategies: publishing two distinct batches (192 HITs in total) separately vs. combined inside a HIT-BUNDLE.

Workers Sensitivity to Context Switch

The following experimental setup involves three groups of 24 distinct workers each. Each

group was exposed to one of three HIT serving strategies, namely:

• RR: a worker in this group would receive work in an alternating order from types

{B6,B7,B6,B7, ..,B6,B7} etc.

• SEQ10: here the workers will receive 10 tasks from B6 then 10 tasks from B7 then again

10 from B6 and so on.

• SEQ25: similar to SEQ10 but with sequences of 25 tasks. In order to trigger the context

switch, each participant was asked to do at least 10, and up to 100 tasks.

Figure 7.6 shows the average execution time of all the 100 HITs under each execution group.

We observe that the average execution time of HITs is worse when using RR as compared

to workers performing longer alternating sequences in SEQ10 and SEQ25. To test the statistical

significance of these improvements, and since the distribution of HIT execution time cannot

be assumed to be normally distributed, we perform a Wilcoxon signed-rank test. SEQ10
has p=0.09, which is not enough to achieve statistical significance. However, the SEQ25

improvement over RR is statistically significant with p<0.05.
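For reference, such a test can be run as in the minimal sketch below. The per-HIT execution times are placeholders rather than the measurements reported here, and the pairing assumes that the same HITs were timed under both serving strategies:

from scipy.stats import wilcoxon

# Placeholder per-HIT execution times (seconds) for the same HITs under two strategies.
rr_times    = [31.2, 28.4, 35.0, 30.1, 33.7, 29.9]
seq25_times = [25.3, 27.9, 29.4, 26.8, 30.2, 27.1]

stat, p_value = wilcoxon(rr_times, seq25_times)
print(p_value < 0.05)  # significance threshold used in the text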

In conclusion, context switching generates a significant slowdown for the workers, thus reducing

their overall efficiency. Hence, this result motivates the design of a scheduling algorithm that

takes into account workers efficiency by scheduling longer sequences of HITs of the same

type.

7.4.3 Scheduling HITs for the Crowd

We now turn our attention to experimentally comparing the scheduling algorithms used to manage the distribution of HITs within a HIT-BUNDLE.


Figure 7.6 – Average execution time for each HIT submitted by the experimental groups RR, SEQ10 and SEQ25 (significance marker: ** p-value = 0.023).

Controlled Experimental Setup

In order to develop a clear understanding of the properties of classical scheduling algorithms when applied to crowdsourcing, we put in place an experimental setup that mitigates the effects of workforce variability over time.4

In our controlled setting, each experiment involves a number of concurrent crowd workers |workforce| within the range [Min_w, Max_w] at any point in time. To stay within this target range, the workers who arrive first are presented with a reCAPTCHA to solve (paid $0.01 each) until Min_w workers have joined the system, at which point the experiment begins serving tasks. From then on, new workers are still accepted up to the maximum Max_w. If the number of active sessions drops below Min_w, the system starts accepting new sessions again (a minimal sketch of this admission-control loop is given below, after the configuration list). Unless otherwise stated, we use the following configuration:

• |workforce| = [10, 15].

• Fair Sharing, with price as the weighting factor.

• A HIT-BUNDLE of {B1, B2, B3, B4, B5}.

• FIFO order is [B1, B2, B3, B4, B5].

• SJF order is [B4, B3, B5, B2, B1].

Also, we note that each experiment involves a distinct crowd of workers to avoid any further

training effects on the tasks.
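The sketch below illustrates the admission-control policy described above. It is a simplified illustration under our own naming and a simplified reading of the policy; the actual system integrates this logic with AMT session handling.

MIN_W, MAX_W = 10, 15  # target workforce range used in most experiments

class SessionController:
    """Admit or hold crowd workers so that the number of active sessions
    stays within [MIN_W, MAX_W] while an experiment is running."""

    def __init__(self, min_w=MIN_W, max_w=MAX_W):
        self.min_w, self.max_w = min_w, max_w
        self.active = set()
        self.serving = False  # tasks are served only once min_w workers have joined

    def on_worker_arrival(self, worker_id):
        if len(self.active) >= self.max_w:
            return "reject"                 # pool is already at the upper bound
        self.active.add(worker_id)
        if not self.serving and len(self.active) >= self.min_w:
            self.serving = True             # the experiment starts serving tasks
        # before the experiment starts, arriving workers solve a paid reCAPTCHA
        return "serve_task" if self.serving else "serve_recaptcha"

    def on_worker_departure(self, worker_id):
        self.active.discard(worker_id)
        # dropping below min_w re-opens admission for new sessions
        return "accept_new_sessions" if len(self.active) < self.min_w else "steady"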

4 We decided not to run simulations, but rather to report the actual results obtained with human workers as part of the evaluated system.


Figure 7.7 – Scheduling approaches (FIFO, FS, RR, SJF) applied to the crowd: (a) batch latency for batches B1–B5; (b) overall experiment latency. Time is reported in seconds.

Comparing Scheduling Algorithms

First, we compare how the different scheduling algorithms perform from a latency point of view, taking into account the results of individual batches as well as the overall performance. We create a HIT-BUNDLE out of {B1, B2, B3, B4, B5}, which is then published to AMT. In each run, we use a different scheduling algorithm among FIFO, FS, RR, and SJF, with |workforce| = [10, 15]. Figure 7.7 shows the completion time of each batch in our experimental setting as well as the cumulative execution time of the whole HIT-BUNDLE.

FS achieved the best overall performance, thus maximizing the system utility, though at the batch level FS did not always win (e.g., for B2). We see how FIFO simply assigns tasks from a batch until it is completed. In our setup, we used the natural order of the batches, which explains why B1 receives preferential treatment compared to B5, which finishes last. Similarly, SJF performs unfairly over all the batches but manages to get B4 completed extremely fast. In fact, SJF uses statistics collected by the system on the execution speed of each operator (see Table 7.1), which explains the fast execution of B4. On the positive side, we observe that both RR and FS perform best in terms of fairness with respect to the different batches, i.e., there was no preferential treatment.
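To make the compared policies concrete, the sketch below shows one way the next batch could be picked under each scheme. It is our own simplified illustration: the function name, the batch statistics, and the price-weighted fair-share rule are assumptions based on the description above, not the exact implementation.

def pick_next_batch(batches, policy):
    """Pick the batch whose HIT should be served next.

    `batches` is a list of dicts with keys:
      'id', 'remaining', 'served', 'price', 'avg_hit_time', 'arrival_rank'.
    Only batches with remaining HITs are considered."""
    pending = [b for b in batches if b["remaining"] > 0]
    if not pending:
        return None
    if policy == "FIFO":
        # serve batches in arrival order until each is exhausted
        return min(pending, key=lambda b: b["arrival_rank"])
    if policy == "SJF":
        # favor the batch whose operator is fastest on average
        return min(pending, key=lambda b: b["avg_hit_time"])
    if policy == "RR":
        # approximate round-robin: pick the batch served the least so far
        return min(pending, key=lambda b: b["served"])
    if policy == "FS":
        # weighted fair sharing with price as the weight:
        # pick the batch furthest below its price-weighted share.
        return min(pending, key=lambda b: b["served"] / b["price"])
    raise ValueError(f"unknown policy: {policy}")

Under this price-weighted rule, a batch with a higher reward obtains a proportionally larger share of the served HITs, which is the control knob exercised in the next experiment, where we vary the price of B2.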

Varying the Control Factors

In order to test our priority control mechanism across different batches of a HIT-BUNDLE(tuned using the price), we run an experiment with the same setup as in Section7.4.3, but

varying the price attached to B2 and using the FS algorithm only. Figure 7.8 shows that

batches with a higher priority (reward) lead to faster completion times using the FS scheduling

approach (gray bar of batch 2 lower than the black one). This comes at the expense of other

batches being completed later.

Another dimension that we vary is the crowd size. Figure 7.8b shows the batch completion time

of two different crowdsourcing experiments when we vary the crowd size from |wor k f or ce| =[10,15] to |wor k f or ce| = [20,25] (keeping all other settings constant). We can see batches


Figure 7.8 – (a) Effect of increasing B2's priority (price of $0.02 vs. $0.05) on batch execution time. (b) Effect of varying the number of crowd workers (10 vs. 20) involved in the completion of the HIT batches.

Figure 7.9 – An example of a successful scheduling of a collaborative task involving 3 workers within a window of 10 seconds (worker ID over time; assignments of the collaborative batch vs. assignments of normal batches).

However, different batches obtain different levels of improvement.

Gang Scheduling Algorithm

We now turn to gang scheduling. Figure 7.9 shows a crowdsourcing experiment where, in addition to the default 5 HIT types, the HIT-BUNDLE contained one additional collaborative task that required exactly three workers at the same time on the same HIT. In detail, the task asked three workers to collaboratively edit a Google Document to translate a news article. As we can see on the task assignment plot, the gang scheduling algorithm waits for three workers to be available within a time window τ = 10 seconds before assigning the collaborative task (see Section 7.3.5).
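The sketch below illustrates the windowed decision just described: hold back the collaborative HIT until enough idle workers have asked for work within the time window τ. It is a minimal illustration under our own naming, not the exact scheduler implementation.

import time

class GangScheduler:
    """Assign a collaborative HIT only when `gang_size` workers are
    simultaneously available within a time window of `tau` seconds."""

    def __init__(self, gang_size=3, tau=10.0):
        self.gang_size = gang_size
        self.tau = tau
        self.waiting = {}  # worker_id -> timestamp of the work request

    def request_work(self, worker_id, now=None):
        now = time.time() if now is None else now
        self.waiting[worker_id] = now
        # keep only workers who asked for work within the last tau seconds
        self.waiting = {w: t for w, t in self.waiting.items() if now - t <= self.tau}
        if len(self.waiting) >= self.gang_size:
            gang = sorted(self.waiting, key=self.waiting.get)[: self.gang_size]
            for w in gang:
                del self.waiting[w]
            return gang  # assign the collaborative HIT to these workers
        return None      # hold back: serve a regular HIT instead

# Example: the third request within 10 seconds triggers the assignment.
gs = GangScheduler(gang_size=3, tau=10.0)
assert gs.request_work("w1", now=0.0) is None
assert gs.request_work("w2", now=4.0) is None
assert gs.request_work("w3", now=8.0) == ["w1", "w2", "w3"]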

Figure 7.10 compares how two different gang scheduling algorithms behave in terms of accuracy and precision. In this setting, accuracy measures the fraction of correct scheduling and no-scheduling decisions among all the decisions taken by the scheduler. Precision measures the fraction of correct scheduling decisions among all scheduling decisions.
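Formalizing the two definitions above in our own notation: if a correct decision to schedule a gang counts as a true positive (TP), a correct decision not to schedule as a true negative (TN), and their incorrect counterparts as FP and FN, then

accuracy = (TP + TN) / (TP + TN + FP + FN),    precision = TP / (TP + FP).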


Figure 7.10 – Accuracy (a) and precision (b) of the gang scheduling methods Crowd-GS and Naive-GS, for gang sizes of 2 to 5 workers and window sizes of 5 to 20 seconds.

We observe that for a short time window (τ < 4 seconds), CGS schedules HITs with a higher accuracy than NGS. The reason is that CGS decides not to schedule a HIT if the time window is too small (and, in that sense, it makes the correct decision). However, both approaches fall short of good precision given the small window constraint.

As we increase the time window, we observe that precision also increases (i.e., more scheduling decisions are correct), while the accuracy of CGS decreases because the approach starts making wrong scheduling decisions. When the window size becomes larger, precision is high (e.g., it is easy to find 3 workers available within 20 seconds) and accuracy grows again (e.g., for 2 workers and windows of more than 15 seconds). Obviously, larger windows are suboptimal, as they require workers to wait longer before starting a HIT. Also, the larger the gang requirement, the more difficult it becomes to come up with a precise schedule.

7.4.4 Live Deployment Evaluation

After the initial evaluation of the different dimensions involved in scheduling HITs over the crowd, we now evaluate our proposed fair scheduling techniques, FS and WCFS, in an uncontrolled crowdsourcing setting using a HIT-BUNDLE, and compare them against a standard AMT execution.

More specifically, we create a workload that mimics one hour of activity on AMT by a real requester who had 28 batches running concurrently. Since we do not have access to the inputs of those batches, we randomly select batches from all our experimental datasets and adapt their price and size to the actual trace. The resulting trace is composed of 28 batches with similar rewards of $0.01; the largest batch has 45 HITs and the smallest only 1 HIT.


Figure 7.11 – Average execution time per HIT (in seconds) under different scheduling schemes: individual batches, WCFS, and FS.

For analysis purposes, we group batches by size: 16 small batches (1-9 HITs), 8 medium batches (9-15 HITs), and 4 large batches (16-45 HITs). The total size of this trace is 286 HITs.

Live Deployment Experimental Setup

We concurrently publish the 28 batches from the previously described trace as individual batches (the standard approach) as well as in two HIT-BUNDLEs, one using FS and the other using WCFS. The individual batches use meaningful titles and descriptions of their associated HIT types; the HIT-BUNDLEs, on the other hand, inform the crowd workers that they might receive HITs from different categories. Other parameters, such as the requester name and the reward, are kept similar.

Average Execution Time

Figure 7.11 shows the average HIT execution time obtained under the different setups. Confirming the results from Section 7.4.2, we observe that workers perform better when working on individual batches because they do not suffer from the context-switch effect (though the performance difference is minimal). In contrast, when HITs are scheduled, execution time increases, with the benefit of being able to prioritize certain batches. We also see that WCFS provides a trade-off between letting workers work on the same type of HITs for longer and retaining the ability to schedule batches fairly, as we shall see next.
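A minimal sketch of the worker-conscious variant is given below: it follows the price-weighted fair share across batches, but keeps serving a worker HITs from their current batch until a minimum run length is reached, so as to limit context switches. The run-length parameter and the naming are ours; this illustrates the idea rather than the exact WCFS implementation.

def wcfs_pick(batches, worker, min_run=10):
    """Worker-conscious fair pick: keep the worker on their current batch
    for at least `min_run` consecutive HITs, then fall back to the
    price-weighted fair-share choice (least served relative to its price)."""
    pending = [b for b in batches if b["remaining"] > 0]
    if not pending:
        return None
    current = next((b for b in pending if b["id"] == worker.get("current_batch")), None)
    if current is not None and worker.get("run_length", 0) < min_run:
        return current  # avoid a context switch for this worker
    # fair share across batches, weighted by price (higher price => larger share)
    return min(pending, key=lambda b: b["served"] / b["price"])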

Results of the Live Deployment Run

We plot the CDFs of HIT completion per category in Figure 7.12. For example, 25% of the small batches completed within 500 seconds when run individually. For all batch sizes, we observe that individual batches started faster.


Figure 7.12 – CDF of HIT completion over time for different batch sizes (large, medium, small) and scheduling schemes (FS, individual batches, WCFS).

However, in all cases they also finished last, with smaller batches in particular suffering from starvation (i.e., long periods without progress); here, we clearly see the benefits of both FS and WCFS for load balancing.

The final plot (Figure 7.13) shows how a large workload executes over time on the crowdsourcing platform. We can see how many workers are involved in each setting and which HIT batch they are working on (each color represents a different batch). Finally, as expected, the number of active workers varied wildly over time in each setup. Corroborating the results of the previous paragraph, the individual batches received more workforce in the beginning (they start faster); then workers either left or took some time to spill over to the remaining batches in the [11:25 - 11:35] time period. Our main observation is that FS and WCFS i) achieve their desired property of load balancing the batches when there is a sufficient number of workers, and ii) finish all the jobs well before the individual execution (by 10-15 minutes when considering the 95th percentile).

7.5 Related Work on Task Scheduling

Collaborative Crowdsourcing

Some crowdsourcing applications (e.g., games with a purpose [156]) may require multiple persons to complete the task at hand. The most notable example of such applications is the ESP game [158], where two players are presented with a given image and have to type image tags as fast as possible; a tag is accepted only if both players enter it. In this case, scheduling approaches that aim at assigning multiple workers to the same HIT are required. In our work, we studied how gang scheduling can be adapted to the micro-task crowdsourcing setting.

Crowdsourced Workflows

Scheduling HITs is also beneficial in the case of crowdsourced workflows [96]. When more than one HIT batch has to be crowdsourced in order for the system to produce its desired output, it is important to make sure that batches executed in parallel get the right priority over the crowd.


Figure 7.13 – Number of active workers over time with FS, WCFS and classical individual batches in a live deployment of a large workload derived from crowdsourcing platform logs. Each color represents a different batch.

While this is very difficult to ensure in standard micro-task crowdsourcing, we can obtain such prioritization thanks to the techniques proposed in our work.

The Effect of Switching Tasks

When scheduling HITs for the crowd, it is necessary to take the human dimension into account.

Recent work [101] showed how disrupting HIT continuity degrades the efficiency of crowd

workers. Taking this result into account, we designed worker-conscious scheduling approaches

that aim at serving HITs of the same type in sequence to crowd workers in order to leverage

training effects and to avoid the negative effects of context switching.

Studies in the psychology domain have shown that switching between different HIT types has a negative effect on worker reaction time and on the quality of the work done (see, for example, [41]). In addition, in this chapter we show how context switching leads to an overall larger latency in work completion (Section 7.4.2) and propose scheduling techniques that take this human factor into account. The authors of [171] study the effect of monetary incentives on task switching, concluding that providing such incentives can help motivate quality work in a task-switching situation. In our work, we instead aim at reducing task switching by consciously scheduling tasks to workers.


7.6 Conclusions

In a shared crowd-powered system environment, multiple users (or tenants) periodically issue queries that involve a set of crowd operators (as supported by the system), resulting in independent crowdsourcing campaigns published on the crowdsourcing platform. In this chapter, we posit and experimentally show that this divided strategy is not optimal, and that the crowd-powered system can increase its overall efficiency by bundling requests into a single one that we call a HIT-BUNDLE. Our micro-benchmarks show that this approach has two benefits: i) it creates larger batches, which have higher throughput; and ii) it gives the system control over which HIT to push next, a feature that we leverage, for example, to push high-priority requests or to serve specific operator needs (e.g., gang scheduling or workflow management).

Fairness is an important feature that any shared environment, including a crowd-powered system, should support. Thus, we explored the problem of scheduling HITs using weighted Fair Scheduling algorithms, where priority is expressed as a function of price. However, human individuals behave very differently from machines: they are sensitive to the context switches that a regular scheduler might cause. The negative effects of context switching were visible in our micro-benchmarks and are also supported by related studies in psychology.

We proposed Worker-Conscious Fair Scheduling (WCFS), a new scheduling variant that strikes a balance between minimizing context switches and preserving the fairness of the system. We experimentally validated our algorithms over real crowds of workers on a popular paid micro-task crowdsourcing platform, running both controlled and uncontrolled experiments. Our results show that it is possible to achieve i) better system efficiency, as we reduce the overall latency of a set of batches, while ii) providing fair executions across batches, resulting in iii) small jobs that do not starve.


8 Conclusions

In this thesis, we investigated, designed, and evaluated several methods and algorithms that improve efficiency and effectiveness in hybrid human-machine systems. These two dimensions form what we refer to as the Quality of Service of a crowd-powered system. As such, we explored several aspects related to the execution of batches of HITs on a crowdsourcing platform, including quality assurance, routing, retention, and load balancing. All of our proposed methods take into account inherent human properties (e.g., unpredictability, preferences, and poor context switching) in order to achieve their respective goals.

We started, in Chapter 4, by tackling the aggregation of responses to multiple-choice questions in order to lower the error rate. We dynamically assigned ad-hoc weights to crowd workers using probabilistic inference based either on gold standard test questions or on consensus among previously screened workers. We also proposed a novel crowdsourcing mechanism called push in Chapter 5, which matches tasks to crowd participants based on their general interests inferred from their social profiles.

Next, we turned our attention to efficiency. In Chapter 6, we explored worker retention as a means to reduce the execution time of a batch of tasks and to avoid its starvation. We achieved retention using punctual bonuses as an alternative to increasing the overall batch reward. Load balancing is another technique that we investigated in Chapter 7, with the aim of improving the overall efficiency of a shared crowd-powered system that runs several heterogeneous batches of HITs. While this method has previously been applied to CPUs and clusters, applying it on top of a crowdsourcing platform requires careful scheduling decisions that maximize task continuity for each worker.

Finally, the methods explored in this thesis were designed with an eye on scalability; this aspect will prove especially valuable if both demand and supply in the crowdsourcing market grow in the future. In fact, crowdsourcing platforms might have millions of workers requesting new tasks to complete. A smart scheduling system not only has to decide which worker gets which task, but also has to cope with the increasing load (i.e., thousands of scheduling decisions per second). For that purpose, our contributions are modular, scalable,


and can be integrated separately or combined in a CrowdManager – the logical interface that

bridges a computer program with a paid micro-task crowdsourcing platform.

8.1 Future Work

There are many research directions that are worth investigating in order to improve the QoS

in crowd-powered systems. In the following, we present some important ideas that could be

pursued as an extension of this work, together with ideas that would require new platforms

and crowd organizations.

8.1.1 Toward Crowdsourcing Platforms with an Integrated CrowdManager

The CrowdManager components studied throughout this thesis were designed individually; combined, they can form the basis of a novel crowdsourcing platform that offers new capabilities to both requesters and workers. We envision that such a platform will operate in a push-crowdsourcing mode, where tasks will be scheduled to meet the workloads published by the requesters. Our scheduling algorithm will take into account the skills of the prospective workers. Our answer aggregation mechanism will use more precise priors, that is, the skills of the workers. Task pricing will also be dynamic, taking into account both the workload on the platform and the skills of the workers. Crowd workers will automatically receive tasks tailored to their interests and general knowledge, without needing to waste time browsing a long list of tasks on a dashboard, as is the case today. Because these changes require full knowledge of the workload, workforce, worker profiles, etc., we believe that only a full-fledged platform has the power to provide such a deep integration.

8.1.2 Worker Flow

As we saw in Chapter 6, one of the benefits of worker retention is that it can lead to faster completion and non-starving crowdsourcing campaigns. In our system, we retained people by using bonuses. Although this is a common human resources practice, other retention schemes could be investigated.

Flow, in psychology, is a concept that designates the state of mind in which an individual is completely immersed in an activity [42]. As Figure 8.1 illustrates, being in the flow state is a balance between the skills that the person has and the difficulty level of the activity being conducted. As such, if the person is overskilled, i.e., has high skills compared to the given task, he might quickly get bored. Likewise, if the person does not have the skills to conduct a complex task, he might quickly get anxious. We can hypothesize that maintaining flow is desirable for a micro-task worker, both to improve his/her experience and to maintain high answer quality and low response time. Given the repetitive and potentially dull nature of micro-tasks, a batch of HITs can be dynamically altered so that it continuously challenges or relaxes the worker to keep him/her in a flow state.


Figure 8.1 – The concept of the Flow Theory [42]: flow lies between boredom and anxiety, as a balance between skills and difficulty.

The system should automatically sense these conditions and act accordingly in order to help the worker reach and maintain that state. One possible direction is to create a strategy based on the expected response time of each task type. For example, if the worker exhibits a response time that is consistently lower than the mean, then the worker might be too skilled for the task at hand and could eventually get bored. Conversely, if the response time is higher than the mean, then the worker is most probably struggling with the task. The system would then dynamically respond to these signals by proposing easier or, respectively, more challenging tasks.
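A minimal sketch of such a response-time-based policy is given below; the threshold, the notion of "easier" and "harder" task types, and the function names are our own illustrative assumptions.

def adjust_difficulty(worker_times, task_mean, task_std, margin=1.0):
    """Suggest the next task difficulty from recent response times.

    `worker_times` are the worker's recent response times (seconds) for the
    current task type; `task_mean`/`task_std` are that type's population stats."""
    if not worker_times:
        return "same"
    avg = sum(worker_times) / len(worker_times)
    if avg < task_mean - margin * task_std:
        return "harder"   # consistently fast: likely under-challenged (boredom risk)
    if avg > task_mean + margin * task_std:
        return "easier"   # consistently slow: likely struggling (anxiety risk)
    return "same"         # within the expected band: keep the worker in flow

# Example: a worker much faster than the 30 +/- 5 s norm gets harder tasks.
print(adjust_difficulty([22.0, 24.5, 21.0], task_mean=30.0, task_std=5.0))  # harder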

8.1.3 HIT Recommender System

In Chapter 5, we mostly described and evaluated task routing from a system perspective and as a means to improve the quality of the submitted responses. Still, task routing is essentially a recommendation system, one that is beneficial to the crowd workers as well. Effective HIT recommendation would reduce the time needed to find an interesting or suitable batch to work on and would improve the worker's productivity.

Our task matching technique relies on the workers' social profiles; if such information is not available, one can apply machine learning techniques to infer workers' skills and knowledge automatically from historical data, e.g., previously chosen tasks, performance per task type, etc. For example, if a task requires a movie-savvy crowd, we can use a system similar to a movie recommender in order to match the task to prospective workers.

An initial step in this direction is OpenTurk [7], a Chrome extension we built that allows AMT workers to manage their favorite requesters, share HITs they like with other workers, and work on HITs that other workers have liked. OpenTurk has a recommendation tab that recommends tasks to workers. Currently, this feature recommends tasks based on their popularity.
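As an illustration of how such a popularity-based recommendation could be computed (this is our own sketch, not OpenTurk's actual implementation), one could rank batches by how many distinct workers liked or worked on them.

from collections import Counter

def recommend_by_popularity(interactions, top_k=5):
    """Rank HIT batches by the number of distinct workers who liked or
    worked on them. `interactions` is a list of (worker_id, batch_id) pairs."""
    distinct = {(w, b) for w, b in interactions}      # one vote per worker per batch
    counts = Counter(b for _, b in distinct)
    return [batch for batch, _ in counts.most_common(top_k)]

# Example with hypothetical interaction logs.
logs = [("w1", "batchA"), ("w2", "batchA"), ("w1", "batchA"), ("w3", "batchB")]
print(recommend_by_popularity(logs))  # ['batchA', 'batchB']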


8.1.4 Crowd-Powered Big Data Systems

The engineering efforts around crowdsourcing for data management have mostly been geared toward DBMSs. While this is a valid pursuit, the relatively modest commercial adoption of this model is a reminder of the performance that the crowd can provide in comparison to native operations. An alternative engineering effort would be to build crowdsourcing modules for batch-oriented data management systems, where the faulty and late execution of some units is tolerable by design. One could leverage the ManReduce programming model proposed in [10] to extend the MapReduce implementation of Hadoop. In this model, HITs would be initiated and scheduled (see Chapter 7) like any other execution unit, with the difference that HITs are sent to a crowdsourcing platform to be examined by crowd workers. Once each HIT is submitted, its results are collected and integrated with the rest of the execution pipeline.
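The sketch below illustrates the general idea of inserting a crowd step into a batch pipeline; it is a schematic illustration under our own naming and does not reflect the actual ManReduce or Hadoop APIs.

from collections import Counter

def crowd_map(records, publish_hit, collect_answer):
    """Treat each record as a HIT: publish it to the crowdsourcing platform,
    then collect the workers' answers and feed them back into the pipeline.

    `publish_hit` and `collect_answer` are placeholders for the platform
    integration (e.g., posting tasks and polling for submitted assignments)."""
    hit_ids = [publish_hit(record) for record in records]   # human "map" step
    return [collect_answer(hit_id) for hit_id in hit_ids]   # gather results

def machine_reduce(answers):
    """A regular (machine) reduce step, e.g., taking the majority crowd label."""
    return Counter(answers).most_common(1)[0][0] if answers else None

# Toy usage with stub functions standing in for the platform integration.
answers = crowd_map(["img1.jpg", "img2.jpg"],
                    publish_hit=lambda rec: f"hit-{rec}",
                    collect_answer=lambda hid: "cat")
print(machine_reduce(answers))  # cat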

8.1.5 Social and Mobile Crowdsourcing

Some tasks can only be completed by a very limited group of people, e.g., translating a dialect into English, finding a missing person, or recognizing the geographical location of a place shown in a photograph. Assuming that this target group of people could be incentivized to perform the task, the question is how to find them quickly. A possible solution is to build crowdsourcing platforms with social connections [173]. Here, the workers are no longer isolated – already, many communicate and share thoughts on specialized forums – and can choose to be solicited on the go. We can then introduce the notion of a referral task, where a worker gets paid for referring the right person or for contributing to a successful referral chain.

8.2 Outlook

Crowdsourcing offers a new and unique form of income to web users. This has remarkable social implications, like breaking geographical barriers and opening new opportunities for the less favored parts of the world and for unskilled people. For companies, having an elastic workforce through crowdsourcing facilitates agile processes and helps solve complex tasks at scale and on demand, without long-term commitments. Crowdsourcing, however, raises several concerns regarding global employment, fair wages, and social security for crowd workers. Likewise, companies and requesters have less flexibility in terms of the data to be exposed, and are often worried about low-quality results obtained from the crowd.

While the future of crowdsourcing is yet to be shaped – both from a technology and a legal-framework perspective – it is clear that there is strong market potential. We can foresee that the platform of the future will: (1) offer Service Level Agreements to requesters such that they can use crowdsourcing in mission-critical applications; (2) propose suitable tasks to workers in order to maximize their revenue and productivity. The platform could provide training programs for workers to learn new skills and earn a degree or a certificate. Workers could even develop platform-specific skills, e.g., managing complex client jobs, decomposing larger tasks into smaller ones, and facilitating collaborative tasks.


Bibliography

[1] Amazon mechanical turk. http://www.mturk.com. Last accessed: 2014-12-30.

[2] Clickworker. http://www.clickworker.com. Last accessed: 2014-12-30.

[3] CloudFactory: making data valuable in a hyper-efficient way. http://www.cloudfactory.com.

Last accessed: 2014-12-30.

[4] CrowdFlower people-powered data enrichment platform. http://www.crowdflower.com. Last

accessed: 2014-12-30.

[5] Facebook. http://www.facebook.com/. Last accessed: 2014-12-30.

[6] Mobileworks. http://www.mobileworks.com. Not accessible as of: 2015-01-08.

[7] Openturk. http://www.openturk.com/. Last accessed: 2014-12-30.

[8] psiTurk: crowdsource your research. https://psiturk.org/. Last accessed: 2014-12-30.

[9] M. Agrawal, M. Karimzadehgan, and C. Zhai. An online news recommender system for social

networks. In Proceedings of ACM SIGIR workshop on Search in Social Media, 2009.

[10] S. Ahmad, A. Battle, Z. Malkani, and S. Kamvar. The jabberwocky programming environment

for structured social computing. In Proceedings of the 24th annual ACM symposium on User

interface software and technology, pages 53–64. ACM, 2011.

[11] S. Allan and E. Thorsen. Citizen journalism: Global perspectives, volume 1. Peter Lang, 2009.

[12] O. Alonso and R. A. Baeza-Yates. Design and Implementation of Relevance Assessments Using

Crowdsourcing. In ECIR, pages 153–164, 2011.

[13] Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In Proceedings of the 2013

international conference on Management of data, pages 241–252. ACM, 2013.

[14] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Power in unity: Forming

teams in large-scale community systems. In Proceedings of the 19th ACM International Conference

on Information and Knowledge Management, CIKM ’10, pages 599–608, New York, NY, USA, 2010.

ACM.

[15] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Online team formation

in social networks. In Proceedings of the 21st International Conference on World Wide Web, WWW

’12, pages 839–848, New York, NY, USA, 2012. ACM.

[16] D. Arthur. The employee recruitment and retention handbook. AMACOM Div American Mgmt

Assn, 2001.


[17] P. Bailey, A. P. de Vries, N. Craswell, and I. Soboroff. Overview of the TREC 2007 Enterprise Track.

In TREC, 2007.

[18] K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si. Expertise retrieval. Foundations and Trends

in Information Retrieval, 6(2-3):127–256, 2012.

[19] K. Balog, P. Serdyukov, and A. P. de Vries. Overview of the TREC 2010 Entity Track. In TREC, 2010.

[20] K. Balog, P. Thomas, N. Craswell, I. Soboroff, P. Bailey, and A. De Vries. Overview of the trec 2008

enterprise track. Technical report, DTIC Document, 2008.

[21] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open Information

Extraction from the Web. In IJCAI, pages 2670–2676, 2007.

[22] C. Bartlett and S. Ghoshal. Building competitive advantage through people. Sloan Mgmt. Rev,

43(2), 2013.

[23] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering.

In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and

data mining, pages 59–68. ACM, 2004.

[24] M. S. Bernstein, J. Brandt, R. C. Miller, and D. R. Karger. Crowds in two seconds: enabling realtime

crowd-powered interfaces. In UIST ’11, pages 33–42. ACM, 2011.

[25] M. S. Bernstein, J. Teevan, S. Dumais, D. Liebling, and E. Horvitz. Direct answers for search

queries in the long tail. In CHI ’12, pages 237–246. ACM, 2012.

[26] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White,

S. White, et al. Vizwiz: nearly real-time answers to visual questions. In UIST, pages 333–342.

ACM, 2010.

[27] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity

measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge

discovery and data mining, KDD ’03, pages 39–48, New York, NY, USA, 2003. ACM.

[28] R. Blanco, H. Halpin, D. Herzig, P. Mika, J. Pound, H. S. Thompson, and D. T. Tran. Repeatable

and reliable search system evaluation using crowdsourcing. In SIGIR, pages 923–932, 2011.

[29] R. Blanco, P. Mika, and S. Vigna. Effective and Efficient Entity Search in RDF Data. In International

Semantic Web Conference (ISWC), pages 83–97, 2011.

[30] P. Bouquet, H. Stoermer, C. Niederee, and A. Mana. Entity Name System: The Backbone of an

Open and Scalable Web of Data. In Proceedings of the IEEE International Conference on Semantic

Computing, ICSC 2008, pages 554–561.

[31] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with CrowdSearcher. In WWW,

pages 1009–1018, New York, NY, USA, 2012. ACM.

[32] A. Bozzon, M. Brambilla, S. Ceri, and A. Mauri. Extending search to crowds: A model-driven

approach. In SeCO Book, pages 207–222. 2012.

[33] A. Bozzon, M. Brambilla, and A. Mauri. A model-driven approach for crowdsourcing search. In

CrowdSearch, pages 31–35, 2012.

[34] A. Bozzon, I. Catallo, E. Ciceri, P. Fraternali, D. Martinenghi, and M. Tagliasacchi. A framework

for crowdsourced multimedia processing and querying. In CrowdSearch, pages 42–47, 2012.


[35] L. Breiman and A. Cutler. Random Forests. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. Last accessed: 2015-03-04.

[36] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation.

In EACL, 2006.

[37] M. Catasta, A. Tonon, D. E. Difallah, G. Demartini, K. Aberer, and P. Cudré-Mauroux. Transac-

tivedb: Tapping into collective human memories. Proceedings of the VLDB Endowment, 7(14),

2014.

[38] D. Chandler and J. J. Horton. Labor Allocation in Paid Crowdsourcing: Experimental Evidence

on Positioning, Nudges and Prices. In Human Computation, 2011.

[39] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE

Trans. on Knowl. and Data Eng., 24(9):1537–1555, Sept. 2012.

[40] M. Ciaramita and Y. Altun. Broad-coverage sense disambiguation and information extraction

with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods

in Natural Language Processing, EMNLP ’06, pages 594–602, Stroudsburg, PA, USA, 2006. ACL.

[41] M. J. Crump, J. V. McDonnell, and T. M. Gureckis. Evaluating amazon’s mechanical turk as a tool

for experimental behavioral research. PloS one, 8(3):e57410, 2013.

[42] M. Csikszentmihalyi and M. Csikzentmihaly. Flow: The psychology of optimal experience, vol-

ume 41. HarperPerennial New York, 1991.

[43] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings

of EMNLP-CoNLL, volume 2007, pages 708–716, 2007.

[44] P. Cudré-Mauroux, K. Aberer, and A. Feher. Probabilistic Message Passing in Peer Data Manage-

ment Systems. In International Conference on Data Engineering (ICDE), 2006.

[45] P. Cudré-Mauroux, P. Haghani, M. Jost, K. Aberer, and H. De Meer. idMesh: graph-based disam-

biguation of linked data. In WWW ’09, pages 591–600, New York, NY, USA, 2009. ACM.

[46] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical

development environment for robust NLP tools and applications. In Proceedings of the 40th

Anniversary Meeting of the ACL, 2002.

[47] S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the Crowd for Top-k and Group-by Queries.

In Proceedings of the 16th International Conference on Database Theory, ICDT ’13, pages 225–236,

New York, NY, USA, 2013. ACM.

[48] J. Davis, J. Arderiu, H. Lin, Z. Nevins, S. Schuon, O. Gallo, and M.-H. Yang. The HPU. In Computer

Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on,

pages 9–16. IEEE, 2010.

[49] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the

em algorithm. Applied statistics, pages 20–28, 1979.

[50] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: leveraging probabilistic reason-

ing and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469–478, New

York, NY, USA, 2012.

[51] G. Demartini, B. Trushkowsky, T. Kraska, M. J. Franklin, and U. Berkeley. Crowdq: Crowdsourced

query understanding. In CIDR, 2013.


[52] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM

algorithm. Journal of the Royal Statistical Society, 39, 1977.

[53] D. Deng, C. Shahabi, and U. Demiryurek. Maximizing the number of worker’s self-selected tasks

in spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference

on Advances in Geographic Information Systems, SIGSPATIAL’13, pages 324–333, New York, NY,

USA, 2013. ACM.

[54] E. Diaz-Aviles and R. Kawase. Exploiting twitter as a social channel for human computation. In

CrowdSearch, pages 15–19, 2012.

[55] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux. Mechanical cheat: Spamming schemes and

adversarial techniques on crowdsourcing platforms. In CrowdSearch, pages 26–30, 2012.

[56] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux. Pick-a-crowd: tell me what you like, and

i’ll tell you what to do. In Proceedings of the 22nd international conference on World Wide Web,

pages 367–374. International World Wide Web Conferences Steering Committee, 2013.

[57] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces.

In SIGMOD, pages 85–96. ACM, 2005.

[58] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources

for selective sampling. In Proceedings of the 15th ACM SIGKDD international conference on

Knowledge discovery and data mining, pages 259–268. ACM, 2009.

[59] P. Donmez, J. G. Carbonell, and J. G. Schneider. A probabilistic framework to learn from multiple

annotators with time-varying accuracy. In SDM, volume 2, page 1. SIAM, 2010.

[60] J. S. Downs, M. B. Holbrook, S. Sheng, and L. F. Cranor. Are your participants gaming the system?:

screening mechanical turk workers. In Proceedings of the SIGCHI Conference on Human Factors

in Computing Systems, pages 2399–2402. ACM, 2010.

[61] C. Eickhoff and A. P. de Vries. Increasing cheat robustness of crowdsourcing tasks. Information

retrieval, 16(2):121–137, 2013.

[62] S. Faradani, B. Hartmann, and P. G. Ipeirotis. What’s the right price? pricing tasks for finishing on

time. In Human Computation, 2011.

[63] D. G. Feitelson and L. Rudolph. Gang scheduling performance benefits for fine-grain synchro-

nization. Journal of Parallel and Distributed Computing, 16(4):306 – 318, 1992.

[64] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries

with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on

Management of data, SIGMOD ’11, pages 61–72, New York, NY, USA, 2011. ACM.

[65] U. Gadiraju, R. Kawase, and S. Dietze. A taxonomy of microtasks on the web. In Proceedings of

the 25th ACM Conference on Hypertext and Social Media, HT ’14, pages 218–223, New York, NY,

USA, 2014. ACM.

[66] L. Getoor and A. Machanavajjhala. Entity Resolution: Tutorial. In VLDB, 2012.

[67] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource

fairness: fair allocation of multiple resource types. In NSDI’11, pages 24–24. USENIX Association,

2011.

[68] J. A. Golbeck. Computing and applying trust in web-based social networks. PhD thesis, College

Park, MD, USA, 2005. AAI3178583.


[69] S. Guo, A. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the

crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of

Data, pages 385–396. ACM, 2012.

[70] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web search. In SIGIR, pages

725–734, 2011.

[71] X. Han, L. Sun, and J. Zhao. Collective entity linking in web text: a graph-based method. In SIGIR,

pages 765–774, New York, NY, USA, 2011. ACM.

[72] X. Han and J. Zhao. Named entity disambiguation by leveraging wikipedia semantic knowledge.

In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM ’09,

pages 215–224, New York, NY, USA, 2009. ACM.

[73] M. Hirth, T. Hoßfeld, and P. Tran-Gia. Cost-optimal validation mechanisms and cheat-detection

for crowdsourcing platforms. In Innovative Mobile and Internet Services in Ubiquitous Computing

(IMIS), 2011 Fifth International Conference on, pages 316–321. IEEE, 2011.

[74] J. Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.

[75] M. A. Huselid. The impact of human resource management practices on turnover, productivity,

and corporate financial performance. Academy of management journal, 38(3):635–672, 1995.

[76] P. G. Ipeirotis. Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads, The ACM

Magazine for Students, 17(2):16–21, 2010.

[77] P. G. Ipeirotis and E. Gabrilovich. Quizz: targeted crowdsourcing with a billion (potential) users. In

Proceedings of the 23rd international conference on World wide web, pages 143–154. International

World Wide Web Conferences Steering Committee, 2014.

[78] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers.

Data Mining and Knowledge Discovery, 28(2):402–441, 2014.

[79] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In

Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67. ACM, 2010.

[80] L. C. Irani and M. S. Silberman. Turkopticon: Interrupting Worker Invisibility in Amazon Me-

chanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,

CHI ’13, pages 611–620, New York, NY, USA, 2013. ACM.

[81] M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of

Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[82] S. R. Jeffery, L. Sun, M. DeLand, N. Pendar, R. Barber, and A. Galdi. Arnold: Declarative crowd-

machine data integration. In CIDR, 2013.

[83] R. Jurca and B. Faltings. Mechanisms for making crowds truthful. J. Artif. Intell. Res. (JAIR),

34:209–253, 2009.

[84] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing

systems. Operations Research, 62(1):1–24, 2014.

[85] G. Kazai. In Search of Quality in Crowdsourcing for Search Engine Evaluation. In ECIR, pages

165–176, 2011.

[86] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling. Crowdsourcing for book search evaluation:

impact of hit design on comparative system ranking. In SIGIR, pages 205–214, 2011.


[87] R. Khazankin, H. Psaier, D. Schall, and S. Dustdar. QoS-Based Task Scheduling in Crowdsourcing

Environments. In Proceedings of the 9th International Conference on Service-Oriented Computing,

ICSOC’11, pages 297–311, Berlin, Heidelberg, 2011. Springer-Verlag.

[88] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with mechanical turk. In Proceedings

of the SIGCHI conference on human factors in computing systems, pages 453–456. ACM, 2008.

[89] A. Kittur, S. Khamkar, P. André, and R. Kraut. Crowdweaver: visually managing complex crowd

work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work,

pages 1033–1036. ACM, 2012.

[90] A. Kittur, J. V. Nickerson, M. Bernstein, E. Gerber, A. Shaw, J. Zimmerman, M. Lease, and J. Hor-

ton. The future of crowd work. In Proceedings of the 2013 Conference on Computer Supported

Cooperative Work, CSCW ’13, pages 1301–1318, New York, NY, USA, 2013.

[91] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut. Crowdforge: Crowdsourcing complex work. In

Proceedings of the 24th annual ACM symposium on User interface software and technology, pages

43–52. ACM, 2011.

[92] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual

Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for

Computational Linguistics, 2003.

[93] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In

WSDM, pages 441–450, 2010.

[94] S. Konomi, W. Ohno, T. Sasao, and K. Shoji. A context-aware approach to microtasking in a public

transport environment. In Communications and Electronics (ICCE), 2014 IEEE Fifth International

Conference on, pages 498–503. IEEE, 2014.

[95] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE

Transactions on Information Theory, 47(2), 2001.

[96] A. Kulkarni, M. Can, and B. Hartmann. Collaboratively crowdsourcing workflows with turkomatic.

In CSCW ’12, pages 1003–1012. ACM, 2012.

[97] A. Kulkarni, P. Gutheim, P. Narula, D. Rolnitzky, T. Parikh, and B. Hartmann. Mobileworks:

designing for quality in a managed crowdsourcing architecture. Internet Computing, IEEE,

16(5):28–35, 2012.

[98] R. S. Kushalnagar, W. S. Lasecki, and J. P. Bigham. A readability evaluation of real-time crowd

captions in the classroom. In Proceedings of the 14th international ACM SIGACCESS conference

on Computers and accessibility, ASSETS ’12, pages 71–78, New York, NY, USA, 2012. ACM.

[99] T. Lambert and A. Schwienbacher. An empirical analysis of crowdfunding. Social Science Research

Network, 1578175, 2010.

[100] W. S. Lasecki, C. Homan, and J. P. Bigham. Architecting real-time crowd-powered systems.

[101] W. S. Lasecki, A. Marcus, J. M. Tzeszotarski, and J. P. Bigham. Using Microtask Continuity to

Improve Crowdsourcing. In Carnegie Mellon University Human-Computer Interaction Institute -

Technical Reports - CMU-HCII-14-100, 2014.

[102] W. S. Lasecki, R. Wesley, J. Nichols, A. Kulkarni, J. F. Allen, and J. P. Bigham. Chorus: A Crowd-

powered Conversational Assistant. In Proceedings of the 26th Annual ACM Symposium on User

Interface Software and Technology, UIST ’13, pages 151–162. ACM, 2013.


[103] S. Lazebnik, C. Schmid, J. Ponce, et al. Semi-local affine parts for object recognition. In British

Machine Vision Conference (BMVC’04), pages 779–788, 2004.

[104] J. Le, A. Edmonds, V. Hester, and L. Biewald. Ensuring quality in crowdsourced search rele-

vance evaluation: The effects of training question distribution. In SIGIR 2010 workshop on

crowdsourcing for search evaluation, pages 21–26, 2010.

[105] G. Lee, B.-G. Chun, and H. Katz. Heterogeneity-aware resource allocation and scheduling

in the cloud. In Proceedings of the 3rd USENIX conference on Hot topics in cloud computing,

HotCloud’11, pages 4–4, Berkeley, CA, USA, 2011. USENIX Association.

[106] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet

Physics Doklady, volume 10, pages 707–710, 1966.

[107] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integra-

tion. In Data Engineering, 1993. Proceedings. Ninth International Conference on, pages 294–301.

IEEE, 1993.

[108] G. Little, L. B. Chilton, M. Goldman, and R. C. Miller. Turkit: tools for iterative tasks on mechanical

turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 29–30. ACM,

2009.

[109] C. Lofi, K. El Maarry, and W.-T. Balke. Skyline queries in crowd-enabled databases. In Proceedings

of the 16th International Conference on Extending Database Technology, EDBT ’13, pages 465–476,

New York, NY, USA, 2013. ACM.

[110] C. Macdonald and I. Ounis. Voting techniques for expert search. Knowl. Inf. Syst., 16(3):259–280,

2008.

[111] A. Mahmood, W. G. Aref, E. Dragut, and S. Basalamah. The palm-tree index: Indexing with the

crowd. 2013.

[112] A. Mao, E. Kamar, Y. Chen, E. Horvitz, M. E. Schwamb, C. J. Lintott, and A. M. Smith. Volunteering

Versus Work for Pay: Incentives and Tradeoffs in Crowdsourcing. In HCOMP, 2013.

[113] A. Mao, E. Kamar, and E. Horvitz. Why Stop Now? Predicting Worker Engagement in Online

Crowdsourcing. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.

[114] A. Marcus et al. Optimization techniques for human computation-enabled data processing

systems. PhD thesis, Massachusetts Institute of Technology, 2012.

[115] A. Marcus, D. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. Proceedings of

the VLDB Endowment, 6(2):109–120, 2012.

[116] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. Proceed-

ings of the VLDB Endowment, 5(1):13–24, 2011.

[117] A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Crowdsourced databases: Query

processing with people. CIDR, 2011.

[118] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding Light on

the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems

(I-Semantics), 2011.

[119] E. Michaels, H. Handfield-Jones, and B. Axelrod. The war for talent. Harvard Business Press,

2001.


[120] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In Proceed-

ings of the sixteenth ACM conference on Conference on information and knowledge management,

CIKM ’07, pages 233–242, New York, NY, USA, 2007. ACM.

[121] P. Minder and A. Bernstein. Crowdlang: a programming language for the systematic exploration

of human computation systems. In Proceedings of the 4th international conference on Social

Informatics, SocInfo’12, pages 124–137, Berlin, Heidelberg, 2012. Springer-Verlag.

[122] J. Mortensen, M. A. Musen, and N. F. Noy. Crowdsourcing the verification of relationships in

biomedical ontologies. In AMIA, 2013.

[123] B. Mozafari, P. Sarkar, M. Franklin, M. Jordan, and S. Madden. Scaling up crowd-sourcing to very

large datasets: A case for active learning. Proceedings of the VLDB Endowment, 8(2), 2014.

[124] C. Nieke, U. Güntzer, and W.-T. Balke. Topcrowd. In Conceptual Modeling, pages 122–135.

Springer, 2014.

[125] V. Nunia, B. Kakadiya, C. Hota, and M. Rajarajan. Adaptive Task Scheduling in Service Oriented

Crowd Using SLURM. In ICDCIT, pages 373–385, 2013.

[126] B. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In Data Engineering, 2007. ICDE

2007. IEEE 23rd International Conference on, pages 496–505. IEEE, 2007.

[127] G. Papadakis, E. Ioannou, C. Niederée, T. Palpanas, and W. Nejdl. Beyond 100 million entities:

large-scale blocking-based resolution for heterogeneous data. In Proceedings of the fifth ACM

international conference on Web search and data mining, WSDM ’12, pages 53–62, New York, NY,

USA, 2012. ACM.

[128] A. Parameswaran and N. Polyzotis. Answering queries using databases, humans and algorithms.

In Conference on Innovative Data Systems Research, volume 160, 2011.

[129] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowd-

screen: Algorithms for filtering data with humans. In Proceedings of the 2012 ACM SIGMOD

International Conference on Management of Data, pages 361–372. ACM, 2012.

[130] S. Perugini, M. A. Gonçalves, and E. A. Fox. Recommender systems research: A connection-

centric survey. J. Intell. Inf. Syst., 23(2):107–143, Sept. 2004.

[131] V. Polychronopoulos, L. de Alfaro, J. Davis, H. Garcia-Molina, and N. Polyzotis. Human-powered

top-k lists. In WebDB, pages 25–30, 2013.

[132] J. Pöschko, M. Strohmaier, T. Tudorache, N. F. Noy, and M. A. Musen. Pragmatic analysis of

crowd-based knowledge production systems with icat analytics: Visualizing changes to the

icd-11 ontology. In AAAI Spring Symposium: Wisdom of the Crowd, 2012.

[133] J. Pound, P. Mika, and H. Zaragoza. Ad-hoc object retrieval in the web of data. In WWW, pages

771–780, 2010.

[134] V. Rajan, S. Bhattacharya, L. E. Celis, D. Chander, K. Dasgupta, and S. Karanam. Crowdcontrol:

An online learning approach for optimal task scheduling in a dynamic crowd platform. In ICML

Workshop on ’Machine Learning meets Crowdsourcing’, 2013.

[135] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy.

Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings

of the 26th Annual international conference on machine learning, pages 889–896. ACM, 2009.


[136] J. Ross, L. Irani, M. Silberman, A. Zaldivar, and B. Tomlinson. Who are the crowdworkers? Shifting demographics in Mechanical Turk. In CHI ’10 Extended Abstracts on Human Factors in Computing Systems, pages 2863–2872. ACM, 2010.

[137] S. B. Roy, I. Lykourentzou, S. Thirumuruganathan, S. Amer-Yahia, and G. Das. Optimization in knowledge-intensive crowdsourcing. CoRR, abs/1401.1302, 2014.

[138] J. M. Rzeszotarski, E. Chi, P. Paritosh, and P. Dai. Inserting micro-breaks into crowdsourcing workflows. In HCOMP (Works in Progress / Demos), volume WS-13-18 of AAAI Workshops. AAAI, 2013.

[139] C. Sarasua, E. Simperl, and N. F. Noy. CrowdMap: Crowdsourcing ontology alignment with microtasks. In ISWC, pages 525–541, 2012.

[140] N. Seemakurty, J. Chu, L. von Ahn, and A. Tomasic. Word sense disambiguation via human computation. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP ’10, pages 60–63. ACM, 2010.

[141] J. Selke, C. Lofi, and W.-T. Balke. Pushing the boundaries of crowd-enabled databases with query-driven schema expansion. Proc. VLDB Endow., 5(6):538–549, Feb. 2012.

[142] A. D. Shaw, J. J. Horton, and D. L. Chen. Designing incentives for inexpert human raters. In Proceedings of the ACM 2011 conference on Computer supported cooperative work, pages 275–284. ACM, 2011.

[143] W. Shen, J. Wang, P. Luo, and M. Wang. Liege: Link entities in web lists with knowledge base. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pages 1424–1432, New York, NY, USA, 2012. ACM.

[144] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614–622. ACM, 2008.

[145] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), 2013.

[146] M. S. Silberman, L. Irani, and J. Ross. Ethics and tactics of professional crowdwork. XRDS, 17(2):39–43, Dec. 2010.

[147] Y. Singer and M. Mittal. Pricing mechanisms for crowdsourcing markets. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 1157–1166, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.

[148] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. Advances in neural information processing systems, pages 1085–1092, 1995.

[149] M. Stonebraker. What does ‘big data’ mean? Communications of the ACM, BLOG@ACM, 2012.

[150] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64–71, 2010.

[151] A. Tonon, G. Demartini, and P. Cudre-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval. In SIGIR, pages 125–134, 2012.

[152] B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 673–684. IEEE, 2013.

[153] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In SOCC ’13, pages 5:1–5:16. ACM, 2013.

[154] P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing environments. In Proceedings of the 21st international conference on World Wide Web, pages 989–998. ACM, 2012.

[155] K. Vertanen and P. O. Kristensson. A versatile dataset for text entry evaluations based on genuine mobile emails. In Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, pages 295–298. ACM, 2011.

[156] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.

[157] L. von Ahn. Human computation. In Design Automation Conference, 2009. DAC ’09. 46th ACM/IEEE, pages 418–419. IEEE, 2009.

[158] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI ’04, pages 319–326. ACM, 2004.

[159] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.

[160] M. Vukovic and A. Natarajan. Operational excellence in IT services using enterprise crowdsourcing. In IEEE SCC, pages 494–501, 2013.

[161] J. Wang, S. Faridani, and P. Ipeirotis. Estimating the completion time of crowdsourced tasks using survival analysis models. Crowdsourcing for search and data mining (CSDM 2011), 31, 2011.

[162] J. Wang, P. G. Ipeirotis, and F. Provost. Managing crowdsourcing workers. In The 2011 Winter Conference on Business Intelligence, pages 10–12, 2011.

[163] J. Wang, P. G. Ipeirotis, and F. Provost. Quality-based pricing for crowdsourced workers. In NYU Stern Research Working Paper CBA-13-06, 2013.

[164] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012.

[165] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In Proceedings of the 2013 international conference on Management of data, pages 229–240. ACM, 2013.

[166] P. Welinder and P. Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 25–32. IEEE, 2010.

[167] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD ’09, pages 219–232, New York, NY, USA, 2009. ACM.

[168] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.

[169] W. Winkler. The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau, 1999.

[170] H. Yannakoudakis, T. Briscoe, and B. Medlock. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 180–189. Association for Computational Linguistics, 2011.

[171] M. Yin, Y. Chen, and Y.-A. Sun. Monetary interventions in crowdsourcing task switching. In Proceedings of the 2nd AAAI Conference on Human Computation (HCOMP), 2014.

[172] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys ’10, pages 265–278. ACM, 2010.

[173] H. Zhang, E. Horvitz, Y. Chen, and D. C. Parkes. Task routing for prediction tasks. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 889–896. International Foundation for Autonomous Agents and Multiagent Systems, 2012.


Djellel Eddine Difallah
Address: Bd Perolles 90, Fribourg 1700, Switzerland. Email: [email protected], Phone: +41 76 822 0296

Research and Interests

My research focuses on combining the intelligence of humans in solving complex problems with the scalability of machines in processing large amounts of data. In particular, I bridge the two worlds by creating solutions that efficiently manage crowd workers so that they deliver timely inputs to machine requests. My work is supported by the Swiss National Science Foundation.
Other interests: data management, distributed systems, big data challenges.

Education

2011–today PhD candidate @ University of Fribourg, Switzerland.
– Dissertation on “Quality of Service in Crowd-Powered Systems”.

2009–2011 MSc in Computer Science, University of Louisiana – Lafayette, USA.
– Fulbright Foreign Student Scholarship.
– Received honors for maintaining a GPA of 4.0 for four semesters.

1999–2004 Diploma of Engineer in Informatics, USTHB, Algeria.

Professional Experience

2011–today Research Assistant at the eXascale InfoLab.
– Mainly focus on my dissertation-related projects (human computation).
– Contribute to other ongoing projects in the lab: semantic web (RDF/graph storage), memory-based information systems (MEM0R1ES), smart cities (stream processing), array processing (SciDB).
– Teaching assistant for the social computing class.
– Supervise master students working on smart cities projects in collaboration with IBM Dublin.

Summer of 2013 Research Intern at Microsoft’s Cloud and Information Services Lab.
Project: Reservation-based scheduling with Hadoop YARN (YARN-1051). The work resulted in a paper to appear in the fifth ACM Symposium on Cloud Computing, 2014.

Summer of 2010 Student Developer in the Google Summer of Code program.
Project: A query cache plugin for the Drizzle DBMS based on memcached.

2006–2009 Information Management Engineer at Schlumberger.
– On-site client support for the data management software provided by the company.
– Reporting and SQL tuning.

2005–2006 Engineer at EEPAD Internet Services Provider.
– In charge of the authentication platform (RADIUS) for both line and wireless clients.
– Developed an automatic provisioning solution to synchronize the information system and the deployed ADSL hardware using SNMP.

Opensource Projects

Lead: OLTPBench, Openturk Chrome extension.
Contributor: Apache Hadoop YARN, Apache Mahout, SciDB, Drizzle.

Relevant Computer Skills

Programming: Java, C++, Python, JavaScript.
DBMS: MySQL, Postgres.


Languages

French, English, Arabic.

Publications

2015 Djellel E. Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G. Ipeirotis, and Philippe Cudre-Mauroux. The dynamics of micro-task crowdsourcing: the case of Amazon MTurk. In WWW, Florence, Italy, 2015.

Dana Van Aken, Djellel E. Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. BenchPress: Dynamic Workload Control in the OLTP-Bench Testbed. In SIGMOD, Melbourne, Australia, 2015. ACM.

2014 Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, and Philippe Cudre-Mauroux. Scaling-up the crowd: Micro-task pricing schemes for worker retention and latency improvement. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.

Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudre-Mauroux. TransactiveDB: Tapping into collective human memories. Proceedings of the VLDB Endowment, 2013.

Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudre-Mauroux. Hippocampus: Answering memory queries using transactive search. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 535–540. International World Wide Web Conferences Steering Committee, 2014.

2013 Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudre-Mauroux. Pick-a-crowd: Tell me what you like, and I’ll tell you what to do. In Proceedings of the 22nd international conference on World Wide Web, pages 367–374. International World Wide Web Conferences Steering Committee, 2013.

Djellel Eddine Difallah, Philippe Cudre-Mauroux, and S. McKenna. Scalable anomaly detection for smart city infrastructure networks. 2013.

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudre-Mauroux. Large-scale linked data integration using probabilistic reasoning and crowdsourcing. The VLDB Journal, 22(5):665–687, 2013.

Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. OLTP-Bench: An extensible testbed for benchmarking relational databases. Proceedings of the VLDB Endowment, 7(4), 2013.

2012 G. Demartini, D.E. Difallah, and P. Cudre-Mauroux. ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pages 469–478. ACM, 2012.

D.E. Difallah, G. Demartini, and P. Cudre-Mauroux. Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms. CrowdSearch 2012 workshop at WWW, pages 26–30, 2012.

2011 P. Cudre-Mauroux, G. Demartini, D.E. Difallah, A.E. Mostafa, V. Russo, and M. Thomas. A Demonstration of DNS3: a Semantic-Aware DNS Service. ISWC, 2011.

D.E. Difallah, R.G. Benton, V. Raghavan, and T. Johnsten. FAARM: Frequent association action rules mining using FP-tree. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 398–404. IEEE, 2011.