

Active Learning for Crowd-Sourced Databases*

Barzan Mozafari
University of Michigan, Ann Arbor
[email protected]

Purna Sarkar
UC Berkeley
[email protected]

Michael Franklin
UC Berkeley
[email protected]

Michael Jordan
UC Berkeley
[email protected]

Samuel Madden
MIT
[email protected]

ABSTRACT

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, or sentiment analysis. However, due to the time and cost of human labor, solutions that solely rely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs).

Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy-to-use for a broad range of practitioners, even those who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.

Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1–2 orders of magnitude fewer questions than the baseline, and 4.5–44× fewer than existing active learning algorithms.

1. INTRODUCTION

Crowd-sourcing marketplaces, such as Amazon's Mechanical Turk, have made it easy to recruit a crowd of people to perform tasks that are difficult for computers, such as entity resolution [5, 7, 35, 42, 53], image annotation [51], and sentiment analysis [37]. Many of these tasks can be modeled as database problems, where each item is represented as a row with some missing attributes (labels) that the crowd workers supply. This has given rise to a new generation of database systems, called crowd-sourced databases [23, 26, 35, 39], that enable users to issue more powerful queries by combining human-intensive tasks with traditional query processing techniques. Figure 1 provides a few examples of such queries, where part of the query is processed by the machine (e.g., whether the word "iPad" appears in a tweet) while the human-intensive part is sent to the crowd for labeling (e.g., to decide if the tweet has a positive sentiment).

*A shorter version of this manuscript has been published in Proceedings of Very Large Data Bases [36].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

Figure 1: Examples of Labeling Queries in a Crowd-sourced DB.

While query optimization techniques [23, 34, 38, 39] can reduce the number of items that need to be labeled, any crowd-sourced database that solely relies on human-provided labels will eventually suffer from scalability issues when faced with web-scale datasets and problems (e.g., daily tweets or images). This is because labeling each item by humans can cost several cents and take several minutes. For instance, given the example of Figure 1, even if we filter out tweets that do not contain "iPad", there could still be millions of tweets with "iPad" that require sentiment labels ('positive' or 'non-positive').

To enable crowd-sourced databases to scale up to large datasets, we advocate combining humans and machine learning algorithms (e.g., classifiers), where (i) the crowd labels items that are either inherently difficult for the algorithm or that, if labeled, will form the best training data for the algorithm, and (ii) the (trained) algorithm is used to label the remaining items much more quickly and cheaply. In this paper, we focus on labeling algorithms (classifiers) that assign one of several discrete values to each item, as opposed to predicting numeric values (i.e., regression) or finding missing items [48], leaving these other settings for future work.

Specifically, given a large, unlabeled dataset (say, millions of images) and a classifier that can attach a label to each unlabeled item (after sufficient training), our goal is to determine which questions to ask the crowd in order to (1) achieve the best training data and overall accuracy, given time or budget constraints, or (2) minimize the number of questions, given a desired level of accuracy.

Our problem is closely related to the classical problem of active learning (AL), where the objective is to select statistically optimal training data [15]. However, in order for an AL algorithm to be a practical optimization strategy for labeling tasks in a crowd-sourced database, it must satisfy a number of systems challenges and criteria that have not been a focal concern in traditional AL literature, as described next.

arXiv:1209.3686v4 [cs.LG] 20 Dec 2014


1.1 Design Criteria

An AL algorithm must meet the following criteria to be used as the default optimization strategy in a crowd-sourced database:

1. Generality. Our system must come with a few built-in AL algorithms that are applicable to arbitrary classification and labeling tasks, as crowd-sourced systems are used in a wide range of different domains. In Figure 1, for example, one query involves sentiment analysis while another seeks images containing a dog. Clearly, these tasks require drastically different classifiers. Although our system allows expert users to provide their own custom-designed AL that works for their classification algorithm, most users may only have a classifier. Thus, to support general queries, our AL algorithm should make minimal or no assumptions about the classifiers that users provide with their labeling tasks.

2. Black-box treatment of the classifier. Many AL algorithms that provide theoretical guarantees need to access and modify the internal logic of the given classifier (e.g., adding constraints to the classifier's internal loss-minimization step [8]). Such modifications are acceptable in theoretical settings but not in real-world applications, as state-of-the-art classifiers used in science and industry are rarely a straightforward implementation of textbook algorithms. Rather, these finely-tuned classifiers typically use thousands of lines of code to implement many intertwined steps (e.g., data cleaning, feature selection, parameter tuning, heuristics, etc.). In some cases, moreover, these codebases use proprietary libraries that cannot be modified. Thus, to make our crowd-sourcing system useful to a wide range of practitioners and scientists (who are not necessarily machine learning experts), we need an AL algorithm that treats the classifier as a black-box, i.e., one that requires no modifications to the internals of the classifier.

3. Batching. Many (theoretical) AL algorithms are designed for online (a.k.a. streaming) scenarios in which items are revealed one at a time. This means that the AL algorithm decides whether to request a label for the current item, and if so, awaits the label before proceeding to the next item. While these settings are attractive for theoretical reasons, they are unrealistic in practice. First, we often have access to a large pool of unlabeled data to choose from (not just the next item), which should allow us to make better choices. Second, online AL settings typically perform an (expensive) analysis for each item [8, 9, 16], rendering them computationally prohibitive. Thus, for efficiency and practicality, the AL algorithm must support batching,1 so that (i) the analysis is done only once per batch of multiple items, and (ii) items are sent to the crowd in batches (to be labeled in parallel).

4. Parallelism. We aim to achieve human-scalability (i.e., asking the crowd fewer questions) through AL. However, we are also concerned with machine-scalability, because AL often involves repeatedly training a classifier and can thus be computationally expensive. While AL has been historically applied to small datasets, increasingly massive datasets (such as those motivating this paper) pose new computational challenges. Thus, a design criterion in our system is that our AL algorithm must be amenable to parallel execution on modern many-core processors and distributed clusters.

5. Noise management. AL has traditionally dealt with expert-provided labels that are often taken as ground truth (notable exceptions are agnostic AL approaches [16]). In contrast, crowd-provided labels are subject to a greater degree of noise, e.g., innocent errors, typos, lack of domain knowledge, and even deliberate spamming.

1Batch support is challenging because AL usually involves case-analysis for different combinations of labels and items. These combinations grow exponentially in the number of items in the batch, unless most of the analysis can be shared among different cases.

AL has a rich literature in machine learning [43]. However, the focus has been largely theoretical, with concepts from learning theory used to establish bounds on sample complexity, but leaving a significant gap between theory and practice. Rather than aiming at such theoretical bounds, this paper focuses on a set of practical design criteria and provides sound heuristics for the AL problem; the origin of these criteria is in real-world and systems considerations (in particular, issues of scale and ease-of-use).

To the best of our knowledge, no existing AL algorithm satisfies all of the aforementioned requirements. For example, those AL algorithms that are general [8, 9, 16] do not support batching or parallelism and often require modifications to the classifier. (See Section 7 for a detailed literature review.) In this paper, we design the first AL algorithms that meet all these design criteria, paving the way towards a scalable and generic crowd-sourcing system that can be used by a wide range of practitioners.

1.2 Our Contributions

Our main contributions are two AL algorithms, called MinExpError and Uncertainty, along with a noise-management technique, called partitioning-based allocation (PBA). The Uncertainty algorithm requests labels for the items that the classifier is most uncertain about. We also design a more sophisticated algorithm, called MinExpError, that combines the current quality (say, accuracy) of the classifier with its uncertainty in a mathematically sound way, in order to choose the best questions to ask. Uncertainty is faster than MinExpError, but it also has lower overall accuracy, especially in the upfront scenario, i.e., where we request all labels in a single batch. We also study the iterative scenario, i.e., where we request labels in multiple batches and refine our decisions after receiving each batch.

A major novelty of our AL algorithms is in their use of bootstrap theory,2 which yields several key advantages. First, bootstrap can deliver consistent estimates for a large class of estimators,3 making our AL algorithms general and applicable to nearly any classification task. Second, bootstrap-based estimates can be obtained while treating the classifier as a complete black-box. Finally, the required bootstrap computations can be performed independently from each other, hence allowing for an embarrassingly parallel execution.

Once MinExpError or Uncertainty decides which items to send to the crowd, dealing with the inherent noise in crowd-provided labels is the next challenge. A common practice is to use redundancy, i.e., to ask each question to multiple workers. However, instead of applying the same degree of redundancy to all items, we have developed a novel technique based on integer linear programming, called PBA, which dynamically partitions the unlabeled items based on their degree of difficulty for the crowd and determines the required degree of redundancy for each partition.

Thus, with a careful use of bootstrap theory as well as our batching and noise-management techniques, our AL algorithms meet all our requirements for building a practical crowd-sourced system, namely generality, scalability (batching and parallelism), and ease-of-use (black-box view and automatic noise-management).

2A short background on bootstrap theory is provided in Section 3.1.
3Formally, this class includes any function that is Hadamard differentiable, including M-estimators (which include most machine learning algorithms and maximum likelihood estimators [30]).

We have evaluated the effectiveness of our algorithms on 18 real-world datasets (3 crowd-sourced using Amazon Mechanical Turk, and 15 well-known datasets from the UCI KDD repository). Experiments show that our AL algorithms achieve the same quality as several existing approaches while significantly reducing the number of questions asked of the crowd. Specifically, on average, we reduce the number of questions asked by:

• 100× (7×) in the upfront (iterative) scenario, compared to passive learning, and

• 44× (4.5×) in the upfront (iterative) scenario, compared to IWAL [2, 8, 9], which is the state-of-the-art general-purpose AL algorithm.

Interestingly, we also find that our algorithms (which are general-purpose) are still competitive with, and sometimes even superior to, some of the state-of-the-art domain-specific (i.e., less general) AL techniques. For example, our algorithms ask:

• 7× fewer questions than CrowdER [53] and an order of magnitude fewer than CVHull [7], which are among the most recent AL algorithms for entity resolution in the database literature,

• 2–8× fewer questions than Bootstrap-LV [41] and MarginDistance [49], and

• 5–7× fewer questions for SVM classifiers than AL techniques that are specifically designed for SVMs [49].

2. OVERVIEW OF APPROACH

Our approach in this paper is as follows. The user provides (i) a pool of unlabeled items (possibly with some labeled items as initial training data), (ii) a classifier (or "learner") that improves with more and better training data, and a specification as to whether learning should be upfront or iterative, and (iii) a budget or goal, in terms of time, quality, or cost (and a pricing scheme for paying the crowd).

Our system can operate in two different scenarios, upfront or iterative, to train the classifier and label the items (Section 2.2). Our proposed AL algorithms, Uncertainty and MinExpError, can be used in either scenario (Section 3). Acquiring labels from a crowd raises interesting issues, such as how best to employ redundancy to minimize error, and how many questions to ask the crowd (Sections 4 and 5). Finally, we empirically evaluate our algorithms (Section 6).

2.1 Active Learning Notation

An active learning algorithm is typically composed of (i) a ranker R, (ii) a selection strategy S, and (iii) a budget allocation strategy Γ. The ranker R takes as input a classification algorithm4 θ, a set of labeled items L, and a set of unlabeled items U, and returns as output an "effectiveness" score wi for each unlabeled item ui ∈ U. Our proposed algorithms in Section 3 are essentially ranking algorithms that produce these scores. A selection strategy then uses the scores returned from the ranker to choose a subset U′ ⊆ U which will be sent for human labeling. For instance, one selection strategy is picking the top k items with the largest (or smallest) scores, where k is determined by the budget or quality requirements. In this paper, we use weighted sampling to choose k unlabeled items, where the probability of choosing each item is proportional to its score. Finally, once U′ is chosen, a budget allocation strategy Γ decides how to best acquire labels for all the items in U′: Γ(U′, B) for finding the most accurate labels given a fixed budget B, or Γ(U′, Q) for the cheapest labels given a minimum quality requirement Q. For instance, to reduce the crowd noise, a common strategy is to ask each question to multiple labelers and take the majority vote. In Section 4, we introduce our Partitioning Based Allocation (PBA) algorithm, which will be our choice of Γ in this paper.

4For ease of presentation, in this paper we assume binary classification (i.e., {0, 1}), but our work applies to arbitrary classifiers.
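To make the selection step concrete, the following is a minimal NumPy sketch (our illustration, not the paper's code) of the weighted-sampling selection strategy S described above, where each unlabeled item is chosen with probability proportional to its effectiveness score; the function name and the NumPy dependency are our own assumptions.

import numpy as np

def select_by_weighted_sampling(scores, k, rng=None):
    """Choose k unlabeled items with probability proportional to their
    effectiveness scores (a sketch of the selection strategy S)."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()          # normalize scores into a distribution
    # sample k distinct indices, weighted by the ranker's scores
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Example: pick 3 of 10 items to send to the crowd for labeling
chosen = select_by_weighted_sampling(scores=np.random.rand(10), k=3)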

2.2 Active Learning Scenarios

This section describes how learning works in the upfront and the iterative scenarios. Suppose we have a given budget B for asking questions (e.g., in terms of money, time, or number of questions) or a quality requirement Q (e.g., required accuracy or F1-measure5) that our classifier must achieve.

The choice of the scenario depends on the user's preference and needs. Figure 2 shows the pseudocode of the upfront scenario. In this scenario, the ranker computes effectiveness scores solely based on the initial labeled data6, L0. Then, a subset U′ ⊆ U is chosen and sent to the crowd (based on S and either B or Q). While waiting for the crowd to label U′ (based on Γ and either B or Q), we train our classifier θ on L0 to label the remaining items, namely U − U′, and immediately send their labels back to the user. Once crowd-provided labels arrive, they are also sent to the user. Thus, the final result consists of the union of these two labeled sets.

Figure 3 shows the pseudocode of the iterative scenario. In this scenario, we ask for labels in several iterations. We ask the crowd to label a few items, add those labels to the existing training set, and retrain. Then, we choose a new set of unlabeled items and iterate until we have exhausted our budget B or met our quality goal Q. At each iteration, our allocation strategy (Γ) seeks the cheapest or most accurate labels for the chosen items (U′); then our ranker uses the original training data L0 as well as the crowd labels CL collected thus far to decide how to score the remaining unlabeled items.

Note that the upfront scenario is not an iterative scenario with a single iteration, because the former does not use the crowd-sourced labels in training the classifier. This difference is important as different applications may call for different scenarios. When early answers are strictly preferred, say in an interactive search interface, the upfront scenario can immediately feed users with model-provided labels until the crowd's answers arrive for the remaining items. The upfront scenario is also preferred when the project has a stringent accuracy requirement that only gold data (here, L0) be used for training the classifier, say to avoid the potential noise of crowd labels in the training phase. In contrast, the iterative scenario is computationally slower, as it has to repeatedly retrain a classifier and wait for crowd-sourced labels. However, it can adaptively adjust its scores in each iteration, thus achieving a smaller error for the same budget than the upfront one. This is because the upfront scenario must choose all the items it wants labeled at once, based only on a limited set of initial labels.

5F1-measure is the harmonic mean of precision and recall and is frequently used to assess the quality of a classifier.
6In the absence of initially labeled data, we can first spend part of the budget to label a small, random sample of the data.

Upfront Active Learning (B or Q, L0, U, θ, R, S, Γ)
Input: B is the total budget (money, time, or number of questions),
    Q is the quality requirement (e.g., minimum accuracy, F1-measure),
    L0 is the initial labeled data,
    U is the unlabeled data,
    θ is a classification algorithm to (imperfectly) label the data,
    R is a ranker that gives "effectiveness" scores to unlabeled items,
    S is a selection strategy (specifying which items should be labeled by the crowd, given their effectiveness scores),
    Γ is a budget allocation strategy for acquiring labels from the crowd
Output: L is the labeled version of U
1: W ← R(θ, L0, U)  // wi ∈ W is the effectiveness score for ui ∈ U
2: Choose U′ ⊆ U based on S(U, W) such that U′ can be labeled with
3:    budget B or θL0(U − U′) satisfies Q
4: ML ← θL0(U − U′)  // train θ on L0 to automatically label U − U′
5: Immediately display ML to the user  // Early query results
6: CL ← Γ(U′, B or Q)  // Ask (and wait for) the crowd to label U′
7: L ← CL ∪ ML  // combine crowd- and machine-provided labels
Return L

Figure 2: The upfront scenario in active learning.

Iterative Active Learning (B or Q, L0, U, θ, R, S, Γ)
Input: Same as those in Figure 2
Output: L is the labeled version of U
1: CL ← ∅  // labeled data acquired from the crowd
2: L ← θL0(U)  // train θ on L0 & invoke it to label U
3: While our budget B is not exhausted or L's quality does not meet Q:
4:    W ← R(θ, L0 ∪ CL, U)  // wi ∈ W is the effectiveness score for ui ∈ U
5:    Choose U′ ⊆ U based on S(U, W) (subject to B or Q)
6:    L′ ← Γ(U′, B or Q)  // Ask (and wait for) the crowd to label U′
7:    CL ← CL ∪ L′, U ← U − U′  // remove crowd-labeled items from U
8:    L ← CL ∪ θL0∪CL(U)  // train θ on L0 ∪ CL to label remaining U
Return L

Figure 3: The iterative scenario in active learning.
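For readers who prefer running code to pseudocode, here is a minimal Python sketch of the iterative loop in Figure 3. It is our illustration, not the authors' implementation: it assumes a scikit-learn-style classifier and hypothetical helpers `ranker`, `select`, and `crowd_label` standing in for R, S, and Γ.

def iterative_active_learning(clf, L0, U, budget, batch_size, ranker, select, crowd_label):
    """Sketch of the iterative scenario: retrain, score, ask the crowd, repeat."""
    labeled = list(L0)            # (item, label) pairs; starts as the initial gold data
    crowd = []                    # labels acquired from the crowd so far (CL)
    unlabeled = list(U)
    while budget > 0 and unlabeled:
        X, y = zip(*(labeled + crowd))
        clf.fit(X, y)                                      # retrain on L0 ∪ CL
        scores = ranker(clf, labeled + crowd, unlabeled)   # effectiveness scores W
        batch = select(unlabeled, scores, min(batch_size, budget))   # U′
        answers = crowd_label(batch)                       # Γ: ask (and wait for) the crowd
        crowd += list(zip(batch, answers))
        unlabeled = [u for u in unlabeled if u not in batch]
        budget -= len(batch)
    X, y = zip(*(labeled + crowd))
    clf.fit(X, y)
    return crowd + list(zip(unlabeled, clf.predict(unlabeled)))   # CL ∪ machine labels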

3. RANKING ALGORITHMS

This paper proposes two novel AL algorithms, Uncertainty and MinExpError. AL algorithms consist of (i) a ranker R that assigns scores to unlabeled items, (ii) a selection strategy S that uses these scores to choose which items to label, and (iii) a budget allocation strategy Γ to decide how to acquire crowd labels for those chosen items. As explained in Section 2.1, our AL algorithms use weighted sampling and PBA (introduced in Section 4) as their selection and budget allocation strategies, respectively. Thus, for simplicity, we use Uncertainty and MinExpError to refer to both our AL algorithms and their corresponding rankers. Both rankers can be used in either the upfront or the iterative scenario. Section 3.1 provides brief background on nonparametric bootstrap theory, which we use in our rankers. Our rankers are introduced in Sections 3.2 and 3.3.

3.1 Background: Nonparametric Bootstrap

Our ranking algorithms rely on nonparametric bootstrap (or simply the bootstrap) to assess the benefit of acquiring labels for different unlabeled items. Bootstrap [20] is a powerful statistical technique traditionally designed for estimating the uncertainty of estimators. Consider an estimator θ (say, a classifier) that can be learned from data L (say, some training data) to estimate some value of interest for a data point u (say, the class label of u). This estimated value, denoted as θL(u), is a point-estimate (i.e., a single value), and hence, reveals little information about how this value would change if we used a different dataset L′. This information is critical as most real-world datasets are noisy, subject to change, or incomplete. For example, in our active learning context, missing data means that we can only access part of our training data. Thus, L should be treated as a random variable drawn from some (unknown) underlying distribution D. Consequently, statisticians are often interested in measuring distributional information about θL(u), such as variance, bias, etc. Ideally, one could measure such statistics by (i) drawing many new datasets, say L1, · · · , Lk for some large k, from the same distribution D that generated the original L,7 (ii) computing θL1(u), · · · , θLk(u), and finally (iii) inducing a distribution for θ(u) based on the observed values of θLi(u). We call this true distribution Dθ(u) or simply D(u) when θ is understood. Figure 4(a) illustrates this computation. For example, when θ is a binary classifier, D(u) is simply a histogram with two bins (0 and 1), where the value of the i'th bin for i ∈ {0, 1} is Pr[θL(u) = i] when L is drawn from D. Given D(u), any distributional information (e.g., variance) can be obtained.

7A common assumption in nonparametric bootstrap is that L and the Li datasets are independently and identically drawn (I.I.D.) from D.

Unfortunately, in practice the underlying distribution D is often unknown, and hence, direct computation of D(u) using the procedure of Figure 4(a) is impossible. This is where bootstrap [20] becomes useful. The main idea of bootstrap is simple: treat L as a proxy for its underlying distribution D. In other words, instead of drawing Li's directly from D, generate new datasets S1, · · · , Sk by resampling from L itself. Each Si is called a (bootstrap) replicate or simply a bootstrap. Each Si is generated by drawing n = |L| I.I.D. samples with replacement from L, and hence, some elements of L might be repeated or missing in Si. Note that all bootstraps have the same cardinality as L, i.e., |Si| = |L| for all i. By computing θ on these bootstraps, namely θS1(u), · · · , θSk(u), we can create an empirical distribution D̂(u). This is the bootstrap computation, which is visualized in Figure 4(b).
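To make the plug-in principle tangible, here is a minimal Python sketch (ours, not the paper's code) of the bootstrap computation in Figure 4(b). The helper name and the `train_clf`/`predict` interface are assumptions about how a black-box classifier might be wrapped.

import numpy as np

def bootstrap_label_distribution(train_clf, L_X, L_y, u, k=100, rng=None):
    """Approximate D(u) for an unlabeled point u by training the black-box
    classifier on k bootstrap replicates of the labeled data L = (L_X, L_y).
    `train_clf` is any function that takes (X, y) and returns a fitted model
    with a .predict() method."""
    rng = rng or np.random.default_rng()
    n = len(L_y)
    labels = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)       # draw n rows with replacement
        model = train_clf(L_X[idx], L_y[idx])  # train on the bootstrap replicate S_i
        labels.append(model.predict([u])[0])   # record θ_{S_i}(u)
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / k).tolist()))  # empirical distribution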

The theory of bootstrap guarantees that for a large class of estimators θ and sufficiently large k, we can use D̂(u) as a consistent approximation of D(u). The intuition is that, by resampling from L, we emulate the original distribution D that generated L. Here, it is sufficient (but not necessary) that θ be relatively smooth (i.e., Hadamard differentiable [20]), which holds for a large class of machine learning algorithms [30] such as M-estimators, themselves including maximum likelihood estimators and most classification techniques. In our experiments (Section 6), k=100 or even k=10 has yielded reasonable accuracy (k can also be tuned automatically; see [20]).

Both of our AL algorithms use bootstrap to estimate the classifier's uncertainty in its predictions (say, to stop asking the crowd once we are confident enough). Employing bootstrap has several key advantages. First, as noted, bootstrap delivers consistent estimates for a large class of estimators, making our AL algorithms general and applicable to nearly any classification algorithm.8 Second, the bootstrap computation uses a "plug-in" principle; that is, we simply need to invoke our estimator θ with Si instead of L. Thus, we can treat θ (here, our classifier) as a complete black-box since its internal implementation does not need to be modified. Finally, individual bootstrap computations θS1(u), · · · , θSk(u) are independent from each other, and hence can be executed in parallel. This embarrassingly parallel execution model enables scalability by taking full advantage of modern many-core and distributed systems.

Thus, by exploiting powerful theoretical results from classical nonparametric statistics, we can estimate the uncertainty of complex estimators and also scale up the computation to large volumes of data. Aside from Provost et al. [41] (which is limited to probabilistic classifiers and is less effective than our algorithms; see Sections 6 and 7), no one has exploited the power of bootstrap in AL, perhaps due to bootstrap's computational overhead. However, with recent advances in parallelizing and optimizing bootstrap computation [3, 27, 55, 56] and increases in RAM sizes and the number of CPU cores, bootstrap is now a computationally viable approach, motivating our use of it in this paper.

8The only known exceptions are lasso classifiers (i.e., L1 regularization). Interestingly, even for lasso classifiers, there is a modified version of bootstrap that can produce consistent estimates [14].

3.2 Uncertainty Algorithm

Our Uncertainty algorithm aims to ask the crowd the questions that are hardest for the classifier. Specifically, we (i) find out how uncertain (or certain) our given classifier θ is in its label predictions for different unlabeled items, and (ii) ask the crowd to label items for which the classifier is least certain. The intuition is that the more uncertain the classifier, the more likely it will mislabel the item.

Note that focusing on the most uncertain items is one of the oldest ideas in the AL literature [15, 47]. The novelty and power of our Uncertainty algorithm is in its use of bootstrap for obtaining an unbiased estimate of uncertainty, while making almost no assumptions about the classification algorithm. Previous proposals that capture uncertainty either (i) require a probabilistic classifier that can produce highly accurate class probability estimates along with its label predictions [41, 51], or (ii) are limited to a particular classifier. For example, the standard way of performing uncertainty-based sampling is by using the entropy of the class distribution in the case of probabilistic classifiers [40, 57]. Although entropy-based AL can be effective in some situations [21, 40, 57], in many other situations, when the classifiers do not produce accurate probabilities, entropy does not guarantee an unbiased estimate of the uncertainty (see Section 6.3). For non-probabilistic classifiers, other heuristics, such as the distance from the separator, are taken as a measure of uncertainty (e.g., for SVM classifiers [49, 52]). However, these heuristics cannot be applied to arbitrary classifiers. In contrast, our Uncertainty algorithm applies to both probabilistic and non-probabilistic classifiers, and is also guaranteed by bootstrap theory to produce unbiased estimates. (In Section 6, we also empirically show that our algorithm is more effective.) Next, we describe how Uncertainty uses bootstrap to estimate the classifier's uncertainty.

Let l be the predicted label for item u when we train θ on our labeled data L, i.e., θL(u) = l. As explained in Section 3.1, L is often a random variable, and hence, θL(u) has a distribution (and variance). We use the variance of θ in its prediction, namely Var[θL(u)], as our formal notion of uncertainty. Our intuition behind this choice is as follows. A well-established result from Kohavi and Wolpert [28] has shown that the classification error for item u, say eu, can be decomposed into a sum of three terms:

eu = Var[θL(u)] + bias²[θL(u)] + σ²(u)

where bias[.] is the bias of the classifier and σ²(u) is a noise term.9 Our ultimate goal in AL is to reduce the sum of eu over all u. The σ²(u) term is an error inherent to the data collection process, and thus cannot be eliminated through AL. Thus, by requesting labels for u's that have a large variance, we indirectly reduce eu for u's that have a large classification error.10 Hence, our Uncertainty algorithm assigns Var[θL(u)] as the score for each unlabeled item u, to ensure that items with larger variance are sent to the crowd for labels. Thus, in the Uncertainty algorithm, our goal is to measure Var[θL(u)].

Since the underlying distribution of the training data (D in Figure 4) is unknown to us, we use bootstrap. In other words, we bootstrap our current set of labeled data L, say k times, to obtain k different classifiers that are then invoked to generate labels for each item u. This is shown in Figure 4(b). The outputs of these classifiers form an empirical distribution D̂(u) that approximates the true distribution of θL(u). We can then estimate Var[θL(u)] using D̂(u), which is guaranteed, by bootstrap theory [20], to quickly converge to the true value of Var[θL(u)] as we increase k.

9Squared bias bias²[θL(u)] is defined as [f(u) − E[θL(u)]]², where f(u) = E[lu|u], i.e., the expected value of the true label given u [28].
10Note that we could try to choose items based on the bias of the classifier. In Section 3.3, we present an algorithm that indirectly reduces both variance and bias in a mathematically sound way.

Let Si denote the i'th bootstrap, and θSi(u) = l_u^i be the prediction of our classifier for u when trained on this bootstrap. Define X(u) := Σ_{i=1}^{k} l_u^i / k, i.e., the fraction of classifiers in Figure 4(b) that predict a label of 1 for u. Since l_u^i ∈ {0, 1}, the uncertainty score for instance u is given by its variance, which can be computed as:

Uncertainty(u) = Var[θL(u)] = X(u)(1 − X(u))     (1)

We evaluate our Uncertainty algorithm in Section 6.
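For illustration, here is a minimal Python sketch of Equation (1) under the assumption of binary 0/1 labels; it reuses the hypothetical `train_clf` black-box wrapper from the earlier bootstrap sketch and is not the authors' implementation.

import numpy as np

def uncertainty_scores(train_clf, L_X, L_y, U_X, k=100, rng=None):
    """Score each unlabeled item by the bootstrap variance of its predicted label:
    Uncertainty(u) = X(u) * (1 - X(u)), where X(u) is the fraction of the k
    bootstrap-trained classifiers that predict label 1 for u."""
    rng = rng or np.random.default_rng()
    n = len(L_y)
    votes = np.zeros(len(U_X))
    for _ in range(k):
        idx = rng.integers(0, n, size=n)            # bootstrap replicate of L
        model = train_clf(L_X[idx], L_y[idx])       # black-box training
        votes += (np.asarray(model.predict(U_X)) == 1)
    X_u = votes / k
    return X_u * (1.0 - X_u)                        # higher score = more uncertain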

3.3 MinExpError Algorithm

Consider the toy dataset of Figure 5(a). Initially, only a few labels (+ or −) are revealed to us, as shown in Figure 5(b). With these initial labels we train a classifier, say a linear separator (shown as a solid line). The Uncertainty algorithm would ask the crowd to label the items with the most uncertainty, here being those closer to the separator. The intuition behind Uncertainty is that, by measuring the uncertainty (and requesting labels for the most uncertain items), the crowd essentially handles items that are hardest (or ambiguous) for the classifier. However, this might not always be the best strategy. By acquiring a label for the item closest to the separator and training a new classifier, as shown in Figure 5(c), our overall accuracy does not change much: despite acquiring a new label, the new classifier still misclassifies three of the items at the lower-left corner of Figure 5(c). This observation shows that labeling the items with the most uncertainty (i.e., asking the hardest questions) may not have the largest impact on the classifier's prediction power for other data points. In other words, another great strategy would be to acquire human labels for items that, if their labels differ from what the current classifier thinks, would have a huge impact on the classifier's future decisions. In Figure 5(d), the lower-left points exemplify such items. We say that such items have the potential for the largest impact on the classifier's accuracy.

Note that we cannot completely ignore uncertainty and choose items only based on their potential impact on the classifier. When the classifier is highly confident of its predicted label, no matter how much impact an opposite label could have on the classifier, acquiring a label for that item wastes resources because the crowd label will most likely agree with that of the classifier anyway. Thus, our MinExpError algorithm combines these two strategies in a mathematically sound way, described next.

Let l = θL(u) be the current classifier's predicted label for u. If we magically knew that l was the correct label, we could simply add 〈u, l〉 to L and retrain the classifier. Let eright be this new classifier's error. On the other hand, if we somehow knew that l was the incorrect label, we would instead add 〈u, 1 − l〉 to L and retrain the classifier accordingly. Let ewrong denote the error of this new classifier. The problem is that (i) we do not know what the true label is, and (ii) we do not know the classifier's error in either case.

Solving (ii) is relatively easy: in each case we assume those labels and use cross validation on L to estimate both errors, say eright and ewrong. To solve problem (i), we can again use bootstrap to estimate the probability of our prediction l being correct (or incorrect), say p(u) := Pr[l = lu | u], where lu is u's true label. Since we do not know lu (or its distribution), we bootstrap L to train k different classifiers (following Figure 4's notation). Let l1, · · · , lk be the labels predicted by these classifiers for u. Then, p(u) can be approximated as

p(u) = Σ_{i=1}^{k} 1(li = l) / k     (2)


(a) Ideal computation of D(u)   (b) Bootstrap computation
Figure 4: Bootstrap approximation D̂(u) of the true distribution D(u).

Figure 5: (a) Fully labeled dataset, (b) initial labels, (c) asking hardest questions, and (d) asking questions with high impact. (In-figure annotations mark the initial classifier, the most uncertain item, the largest-impact items, the acquired label, the misclassified items, and the new classifier.)

Here, 1(c) is the decision function, which evaluates to 1 when condition c holds and to zero otherwise. Intuitively, equation (2) says that the probability of the classifier's prediction being correct can be estimated by the fraction of classifiers that agree on that prediction, if those classifiers are each trained on a bootstrap of the training set L.

Our MinExpError algorithm aims to compute the classifier's expected error if we use the classifier itself to label each item, and ask the crowd to label those items for which this expected error is larger. This way, the overall expected error will be minimized. To compute the classifier's expected error, we can average over different label choices:

MinExpError(u) = p(u) · eright + (1 − p(u)) · ewrong     (3)

We can break down equation (3) as:

MinExpError(u) = ewrong − p(u) · (ewrong − eright)     (4)

Assume that ewrong − eright ≥ 0 (an analogous decomposition is possible when it is negative). Eq. (4) tells us that if the question is too hard (small p(u)), we may still ask for a crowd label to avoid a high risk of misclassification on u. On the other hand, we may ask a question for which our model is fairly confident (large p(u)), but having its true label can still make a big difference in classifying other items (ewrong is too large). This means that, however unlikely, if our classifier happens to be wrong, we will have a higher overall error if we do not ask for the true label of u. Thus, our MinExpError scores naturally combine both the difficulty of the question and how much knowing its answer can improve our classifier.
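The following scikit-learn sketch ties Equations (2) and (3) together. It is our illustration under the assumption of binary 0/1 labels; the bootstrap count k, the number of cross-validation folds, and the helper name are arbitrary choices, and error handling (e.g., folds with too few examples of a class) is omitted.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def min_exp_error_scores(clf, L_X, L_y, U_X, k=100, cv=5, rng=None):
    """MinExpError(u) = p(u) * e_right + (1 - p(u)) * e_wrong, with p(u) estimated
    by bootstrap agreement (Eq. 2) and the two errors by cross validation after
    adding (u, l) or (u, 1 - l) to the labeled set L."""
    rng = rng or np.random.default_rng()
    n = len(L_y)
    base = clone(clf).fit(L_X, L_y)
    preds = base.predict(U_X)                        # current predictions l = θ_L(u)
    votes = np.zeros(len(U_X))
    for _ in range(k):                               # bootstrap to estimate p(u)
        idx = rng.integers(0, n, size=n)
        votes += (clone(clf).fit(L_X[idx], L_y[idx]).predict(U_X) == preds)
    p = votes / k
    scores = np.empty(len(U_X))
    for j, (u, l) in enumerate(zip(U_X, preds)):
        X_plus = np.vstack([L_X, u])
        e_right = 1 - cross_val_score(clone(clf), X_plus, np.append(L_y, l), cv=cv).mean()
        e_wrong = 1 - cross_val_score(clone(clf), X_plus, np.append(L_y, 1 - l), cv=cv).mean()
        scores[j] = p[j] * e_right + (1 - p[j]) * e_wrong
    return scores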

3.4 Complexity and Scalability

Besides its generality, a major benefit of bootstrap is that each replicate can be shipped to a different node or processor, performing training in parallel. The time complexity of each iteration of Uncertainty is O(k · T(|U|)), where |U| is the number of unlabeled items in that iteration, T(.) is the classifier's training time (e.g., this is cubic in the input size for SVMs), and k is the number of bootstraps. Thus, we only need k nodes to achieve the same run-time as training a single classifier.

MinExpError is more expensive than Uncertainty as it requires a case analysis for each unlabeled item. The time complexity for each iteration of MinExpError is O((k + |U|) · T(|U|)). Since unlabeled items can be independently analyzed, the algorithm is still parallelizable. However, MinExpError requires more nodes (i.e., O(|U|) nodes) to achieve the same performance as Uncertainty. As we show in Section 6, this additional overhead is justified in the upfront scenario, given MinExpError's superior performance at deciding which questions to ask based on a limited set of initially labeled data.
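Because the k bootstrap trainings are mutually independent, the embarrassingly parallel execution discussed above can be realized with standard tooling. The sketch below uses Python's concurrent.futures and is purely illustrative (the paper does not prescribe an implementation); it assumes the `train_clf` wrapper and its inputs are picklable.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def _one_bootstrap(args):
    """Train on one bootstrap replicate and return predictions for U (runs in a worker)."""
    train_clf, L_X, L_y, U_X, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(L_y), size=len(L_y))
    return train_clf(L_X[idx], L_y[idx]).predict(U_X)

def parallel_bootstrap_predictions(train_clf, L_X, L_y, U_X, k=100, workers=8):
    """Run the k independent bootstrap trainings in parallel; returns a (k, |U|) array."""
    jobs = [(train_clf, L_X, L_y, U_X, seed) for seed in range(k)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return np.array(list(pool.map(_one_bootstrap, jobs)))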

4. HANDLING CROWD UNCERTAINTY

Crowd-provided labels are subject to a great degree of uncertainty: humans may give incorrect answers due to ambiguity of the question and innocent (or deliberate) errors. This section proposes an algorithm called Partitioning Based Allocation (PBA) that manages and reduces this uncertainty by strategically allocating different degrees of redundancy to different subgroups of the unlabeled items. PBA is our proposed instantiation of Γ in the upfront and iterative scenarios (Figures 2 and 3), which, given a fixed budget B, maximizes the labels' accuracy or, given a required accuracy, minimizes cost.

Optimizing Redundancy for Subgroups. Most previous AL approaches assume that labels are provided by domain experts and are thus perfectly correct (see Section 7). In contrast, incorrect labels are common in a crowd database, an issue conventionally handled by using redundancy, e.g., asking each question to multiple workers and combining their answers for the best overall result. Standard techniques, such as asking for multiple answers and using majority vote or the techniques of Dawid and Skene (DS) [17], can improve answer quality when the crowd is mostly correct, but will not help much if users do not converge to the right answer or converge too slowly. In our experience, crowd workers can be quite imprecise for certain classification tasks. For example, we removed the labels from 1000 tweets with hand-labeled ("gold data") sentiment (dataset details in Section 6.1.3), and asked Amazon Mechanical Turk workers to label them again, then measured the workers' agreement. We used different redundancy ratios (1, 3, 5) and different voting schemes (majority and DS) and computed the crowd's ability to agree with the hand-produced labels. The results are shown in Table 1.

Voting Scheme      Majority Vote   Dawid & Skene
1 worker/label     67%             51%
3 workers/label    70%             69%
5 workers/label    70%             70%

Table 1: The effect of redundancy (using both majority voting and Dawid and Skene voting) on the accuracy of crowd labels.

In this case, increasing redundancy from 3 to 5 labels does not significantly increase the crowd's accuracy. Secondly, we have noticed that crowd accuracy varies for different subgroups of the unlabeled data. For example, in a different experiment, we asked Mechanical Turk workers to label facial expressions in the CMU Facial Expression dataset,11 and measured agreement with hand-supplied labels. This dataset consists of 585 head-shots of 20 users, each in 32 different combinations of head positions (straight, left, right, and up), sunglasses (with and without), and facial expressions (neutral, happy, sad, and angry). The crowd's accuracy was significantly worse when the faces were looking up versus other positions:

11http://kdd.ics.uci.edu/databases/faces/faces.data.html


Facial orientation   Avg. accuracy
straight             0.6335
left                 0.6216
right                0.6049
up                   0.4805

Similar patterns appear in several other datasets, where crowd accuracy is considerably lower for certain subgroups. To exploit these two observations, we developed our PBA algorithm, which computes the optimal number of questions to ask about each subgroup by estimating the probability pg with which the crowd correctly classifies items of a given subgroup g, and then solves an integer linear program (ILP) to choose the optimal number of questions (i.e., degree of redundancy) for labeling each item from that subgroup, given these probabilities.

Before introducing our algorithm, we make the following observation. To combine answers using majority voting and an odd number of votes, say 2v + 1, for an unlabeled item u with a true label l, the probability of the crowd's combined answer l* being correct is the probability that v or fewer workers get the answer wrong. Denoting this probability with P_{g,(2v+1)}, we have:

P_{g,(2v+1)} = Pr(l = l* | 2v + 1 votes) = Σ_{i=0}^{v} (2v+1 choose i) · pg^(2v+1−i) · (1 − pg)^i     (5)

where pg is the probability that a crowd worker will correctly label an item in group g.
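Equation (5) is just the lower tail of a binomial distribution; a minimal Python sketch (our illustration, assuming binary questions and an odd number of votes) makes this explicit:

from math import comb

def p_majority_correct(p_g, votes):
    """Probability that a majority vote of `votes` (odd) workers is correct (Eq. 5),
    when each worker in group g answers correctly with probability p_g."""
    assert votes % 2 == 1, "majority voting assumes an odd number of votes"
    v = (votes - 1) // 2
    return sum(comb(votes, i) * p_g ** (votes - i) * (1 - p_g) ** i
               for i in range(v + 1))

# Example: with p_g = 0.7, five votes lift the chance of a correct combined answer
# from 0.70 (one vote) to roughly 0.84.
print(p_majority_correct(0.7, 1), p_majority_correct(0.7, 5))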

Next, we describe our PBA algorithm, which partitions the items into subgroups and optimally allocates the budget to different subgroups by computing the optimal number of votes per item, Vg, for each subgroup g. PBA consists of three steps:

Step 1. Partition the dataset into G subgroups. This can be done either by partitioning on some low-cardinality field that is already present in the dataset to be labeled (for example, in an image recognition dataset, we might partition by photographer ID or the time of day when the picture was shot), or by using an unsupervised clustering algorithm such as k-means. For instance, in the CMU facial expression dataset, we partitioned the images based on user IDs, leading to G = 20 subgroups, each with roughly 32 images.

Step 2. Randomly pick n0 > 1 different data items from each subgroup, and obtain v0 labels for each one of them. Estimate pg for each subgroup g, either by choosing data items for which the label is known and computing the fraction of labels that are correct, or by taking the majority vote for each of the n0 items, assuming it is correct, and then computing the fraction of labels that agree with the majority vote. For example, for the CMU dataset, we asked for v0 = 9 labels for n0 = 2 random images12 from each subgroup, and hand-labeled those n0 · G = 40 images to estimate pg for g = 1, . . . , 20.

Step 3. Solve an ILP to find the optimal Vg for every group g. We use bg to denote the budget allocated to subgroup g, and create a binary indicator variable xgb whose value is 1 iff subgroup g is allocated a budget of b. Also, let fg be the number of items that our learner has chosen to label from subgroup g. Our ILP formulation depends on the user's goal:

12With a similar cost, we could ask for v0 = 5 labels for n0 = 4 images per group. Using larger v0 and n0 (and accounting for each worker's accuracy) will yield more reliable estimates for the pg's. For the CMU dataset, we use a small budget to show that even with rough estimates we can easily improve on uniform allocations. We study the effect of n0 on PBA's performance in Section 6.2.

Goal 1. Suppose we are given a budget B (in terms of the number of questions) and our goal is to acquire the most accurate labels for the items requested by the learner. We can then formulate an ILP to minimize the following objective function:

Σ_{g=1}^{G} Σ_{b=1}^{bmax} xgb · (1 − Pg,b) · fg     (6)

where bmax is the maximum number of votes that we are willing to ask per item. This goal function captures the expected weighted error of the crowd, i.e., it has a lower value when we allocate a larger budget (xgb = 1 for a large b when pg > 0.5) to subgroups whose questions are harder for the crowd (Pg,b is small) or the learner has chosen more items from that group (fg is large). This optimization is subject to the following constraints:

∀ 1 ≤ g ≤ G:  Σ_{b=1}^{bmax} xgb = 1     (7)

Σ_{g=1}^{G} Σ_{b=1}^{bmax} xgb · b · fg ≤ B − v0 · n0 · G     (8)

Here, constraint (7) ensures that we pick exactly one b value for each subgroup, and (8) ensures that we stay within our labeling budget (we subtract the initial cost of estimating the pg's from B).

Goal 2. If we are given a minimum required accuracy Q, and our goal is to minimize the total number of questions asked, we modify the formulation above by turning (8) into a goal and (6) into a constraint, i.e., minimizing Σ_{g=1}^{G} Σ_{b=1}^{bmax} xgb · b · fg while ensuring that Σ_{g=1}^{G} Σ_{b=1}^{bmax} xgb · (1 − Pg,b) · fg ≤ 1 − Q.

Note that one could further improve the pg estimates and the expected error estimate in (6) by incorporating the past accuracy of individual workers [6]. Such extensions are omitted here due to lack of space and left to future work. We evaluate PBA in Section 6.2.
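As a concrete illustration of the Goal-1 formulation, the sketch below encodes objective (6) and constraints (7)–(8) with the PuLP modeling library. The paper does not specify a solver or implementation, so PuLP, the function name, and the assumption that P[g][b] already holds the Equation-(5) probabilities are our own choices.

import pulp

def solve_pba_goal1(P, f, B, v0, n0, b_max):
    """Minimize the expected weighted crowd error (6) subject to (7) and (8).
    P[g][b]: prob. that b votes yield a correct combined answer for group g (Eq. 5);
    f[g]: number of items the learner chose from group g; B: total budget."""
    groups = list(f.keys())
    budgets = range(1, b_max + 1)
    prob = pulp.LpProblem("PBA_goal1", pulp.LpMinimize)
    x = {(g, b): pulp.LpVariable(f"x_{g}_{b}", cat="Binary")
         for g in groups for b in budgets}
    # objective (6): expected weighted error
    prob += pulp.lpSum(x[g, b] * (1 - P[g][b]) * f[g] for g in groups for b in budgets)
    # constraint (7): exactly one redundancy level per subgroup
    for g in groups:
        prob += pulp.lpSum(x[g, b] for b in budgets) == 1
    # constraint (8): stay within the budget remaining after estimating the pg's
    prob += pulp.lpSum(x[g, b] * b * f[g] for g in groups for b in budgets) <= B - v0 * n0 * len(groups)
    prob.solve()
    return {g: next(b for b in budgets if x[g, b].varValue > 0.5) for g in groups}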

Balancing Classes. Our final observation about the crowd's accuracy is that crowd workers perform better at classification tasks when the number of instances from each class is relatively balanced. For example, given a face labeling task asking the crowd to tag each face as "man" or "woman" where only 0.1% of images are of men, crowd workers will have a higher error rate when labeling men (i.e., the rarer class). Perhaps workers become conditioned to answering "woman". (Psychological studies report the same effect [54].)

Interestingly, both our Uncertainty and MinExpError algorithms naturally tend to increase the fraction of labels they obtain for rare classes. Since our algorithms tend to have more uncertainty about items with rare labels (due to insufficient examples in their training set), they are more likely to ask users to label those items. Thus, our algorithms naturally improve the "balance" in the questions they ask about different classes, which in turn improves crowd labeling performance. We show this effect in Section 6.2 as well.

5. OPTIMIZING FOR THE CROWD

The previous section described our algorithm for handling noisy labels. There are other optimization questions that arise in practice. How should we decide when our accuracy is "good enough"? (Section 5.1) Given that a crowd can label multiple items in parallel, what is the effect of batch size (number of simultaneous questions) on our learning performance? (Section 5.2)

5.1 When To Stop Asking

As mentioned in Section 2.2, users may either provide a fixed budget B or a minimum quality requirement Q (e.g., F1-measure). Given a fixed budget, we can ask questions until the budget is exhausted. However, to achieve a quality level Q, we must estimate the current error of the trained classifier. The easiest way to do this is to measure the trained classifier's ability to accurately classify the gold data according to the desired quality metric. We can then continue to ask questions until a specific accuracy on the gold data is achieved (or until the rate of improvement of accuracy levels off).

In the absence of (sufficient) gold data, we adopt the standard k-fold cross validation technique, randomly partitioning the crowd-labeled data into test and training sets, and measuring the ability of a model learned on training data to predict test values. We repeat this procedure k times and take the average as an overall assessment of the model's quality. Section 6.2 shows that this method provides more reliable estimates of the model's current quality than relying on a small amount of gold data.
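A minimal scikit-learn sketch of this stopping criterion (our illustration, assuming F1-measure as the quality metric Q and a hypothetical helper name):

from sklearn.model_selection import cross_val_score

def quality_estimate(clf, crowd_X, crowd_y, k=5):
    """Estimate the classifier's current quality from crowd-labeled data alone,
    via k-fold cross validation (used here in place of scarce gold data)."""
    return cross_val_score(clf, crowd_X, crowd_y, cv=k, scoring="f1").mean()

# Stop asking once the estimate meets the requirement Q, e.g.:
# while quality_estimate(clf, crowd_X, crowd_y) < Q and budget > 0: ...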

5.2 Effect of Batch Sizes

At each iteration of the iterative scenario, we must choose a subset of the unlabeled items according to their effectiveness scores, and send them to the crowd for labeling. We call this subset a "batch" (denoted as U′ in Line 5 of Figure 3). An interesting question is how to set this batch size, say β.

Intuitively, a smaller β increases opportunities to improve the AL algorithm's effectiveness by incorporating previously requested labels before deciding which labels to request next. For instance, the best results are achieved when β=1. However, larger batch sizes reduce the overall run-time substantially by (i) allowing several workers to label items in parallel, and (ii) reducing the number of iterations.13 This is confirmed by our experiments in Section 6.2, which show that the impact of increasing β on the effectiveness of our algorithms is not as dramatic as its impact on the overall run-time. Thus, to find the optimal β, a reasonable choice is to start from a smaller batch size and continuously increase it (say, double it) until the run-time becomes reasonable, or the quality metric falls below the minimum requirement.

13As shown in Section 3.4, the time complexity of each iteration is proportional to T(|U|), where T(.) is the training time of the classification algorithm. For instance, consider the training of an SVM classifier, which is typically cubic in the size of its training set. Assuming a fixed β and an overall budget of B questions, the overall complexity of our Uncertainty algorithm becomes O((B/β)|U|³).

6. EXPERIMENTAL RESULTSThis section evaluates the effectiveness of our AL algorithms in

practice by comparing the speed, cost, and accuracy with which ourAL algorithms can label a dataset compared to state-of-the-art ALalgorithms.Overview of the Results. Overall, our experiments show the fol-lowing: (i) our AL algorithms require several orders of magnitudefewer questions to achieve the same quality than the random base-line, and substantially fewer questions (4.5×–44×) than the bestgeneral-purpose AL algorithm (IWAL [2, 8, 9]), (ii) our MinExpEr-ror algorithm works better than Uncertainty in the upfront setting,but the two are comparable in the iterative setting, (iii) Uncertaintyhas a much lower computational overhead than MinExpError, and(iv) surprisingly, even though our AL algorithms are generic andwidely applicable, they still perform comparably to and sometimes

13 As shown in Section 3.4, the time complexity of each iteration is proportional to T(|U|), where T(·) is the training time of the classification algorithm. For instance, consider training an SVM classifier, which is typically cubic in the size of its training set. Assuming a fixed β and an overall budget of B questions, the overall complexity of our Uncertainty algorithm becomes O((B/β) · |U|³).

Experimental Setup. All algorithms were tested on a Linux server with dual quad-core Intel Xeon 2.4 GHz processors and 24 GB of RAM. Throughout this section, unless stated otherwise, we repeated each experiment 20 times and report the average result; every task cost 1¢; and the sizes of the initial training set and the batch were 0.03% and 10% of the unlabeled set, respectively.

Methods Compared. We ran experiments on the following learning algorithms in both the upfront and iterative scenarios:

1. Uncertainty: Our method from Section 3.2.
2. MinExpError: Our method from Section 3.3.
3. IWAL: A popular AL algorithm [9] that follows Importance Weighted Active Learning [8], recently extended with batching [2].
4. Bootstrap-LV: Another bootstrap-based AL algorithm that uses the model's class probability estimates to measure uncertainty [41]. This method only works for probabilistic classifiers, so we exclude it from experiments with SVMs.
5. CrowdER: One of the most recent AL techniques specifically designed for entity resolution tasks [53].
6. CVHull: Another state-of-the-art AL algorithm specifically designed for entity resolution [7].
7. MarginDistance: An AL algorithm specifically designed for SVM classifiers, which picks items that are closer to the margin [49].
8. Entropy: A common AL strategy [43] that picks items for which the entropy of the class probabilities is higher, i.e., the more uncertain the classifier, the more similar the probabilities of the different classes, and the higher the entropy.
9. Brew et al. [13]: A domain-specific AL algorithm designed for sentiment analysis, which uses clustering to select an appropriate subset of articles (or tweets) to be tagged by users.
10. Baseline: A passive learner that randomly selects unlabeled items to send to the crowd.

In the plots, we prepend the scenario name to the algorithm names, e.g., UpfrontMinExpError or IterativeBaseline. We have repeated our experiments with different classifiers as the underlying learner, including SVMs, Naïve Bayes classifiers, neural networks, and decision trees. For lack of space, we only report each experiment for one type of classifier. When not specified, we used a linear SVM.

Evaluation Metrics. AL algorithms are usually evaluated based on their learning curve, which plots the quality measure of interest (e.g., accuracy or F1-measure) as a function of the number of data items that are labeled [43]. To compare different learning curves quantitatively, the following metrics are typically used:

1. Area under curve (AUC) of the learning curve.
2. AUCLOG, which is the AUC of the learning curve when the X-axis is in log-scale.
3. Questions saved, which is the ratio of the number of questions asked by the baseline to the number asked by the active learner to achieve the same quality.

Higher AUCs indicate that the learner achieves a higher quality for the same cost/number of questions. Due to the diminishing returns of learning curves, the average quality improvement is usually in the 0–16% range.

AUCLOG favors algorithms that improve the metric of interest early on (e.g., with few examples). Due to the logarithm, the improvement of this measure is typically in the 0–6% range.


To compute question savings, we average over all the quality levels that are achievable by both the AL and baseline curves. For competent active learners, this measure should (greatly) exceed 1, as a ratio < 1 indicates performance worse than that of the random baseline.
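As a concrete illustration, the following sketch computes the three metrics from two learning curves; the array names, the 20-level quality grid, and the assumption that both curves are sampled at the same (positive) question counts are ours, not the paper's evaluation code.

```python
# Sketch: computing AUC, AUCLOG, and questions-saved from two learning curves.
# `questions` holds the x-axis values (number of labels, assumed > 0),
# `f1_al` / `f1_base` the corresponding quality values; both curves are
# assumed to be sampled at the same question counts and to be roughly monotone.
import numpy as np

def auc(questions, f1):
    return np.trapz(f1, questions)

def auclog(questions, f1):
    # Same area, but with the x-axis on a log scale, rewarding early gains.
    return np.trapz(f1, np.log10(questions))

def questions_saved(questions, f1_al, f1_base):
    # Average, over quality levels reachable by both curves, of
    # (#questions the baseline needs) / (#questions the AL method needs).
    lo = max(f1_al.min(), f1_base.min())
    hi = min(f1_al.max(), f1_base.max())
    ratios = []
    for q in np.linspace(lo, hi, 20):
        n_al = questions[np.argmax(f1_al >= q)]    # first point reaching quality q
        n_base = questions[np.argmax(f1_base >= q)]
        ratios.append(n_base / n_al)
    return np.mean(ratios)
```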

6.1 Crowd-sourced Datasets

We experiment with several datasets labeled using Amazon Mechanical Turk. In this section, we report the performance of our algorithms on each of them.

6.1.1 Entity Resolution

Entity resolution (ER) involves finding different records that refer to the same entity, and is an essential step in data integration/cleaning [5, 7, 35, 42, 53]. Humans are typically more accurate at ER than classifiers, but also slower and more expensive [53].

We used the Product dataset (http://dbs.uni-leipzig.de/file/Abt-Buy.zip), which contains product attributes (name, description, and price) of items listed on the abt.com and buy.com websites. The task is to detect pairs of items that are identical but listed under different descriptions on the two websites (e.g., "iPhone White 16 GB" vs. "Apple 16GB White iPhone 4"). We used the same dataset as [53], where the crowd was asked to label 8,315 pairs of items as either identical or non-identical. This dataset consists of 12% identical pairs and 88% non-identical pairs. Each pair was labeled by 3 different workers, with an average accuracy of 89% and an F1-measure of 56%. We also used the same classifier as [53], namely a linear SVM in which each pair of items is represented by its Levenshtein and Cosine similarities. When trained on 3% of the data, this classifier has an average accuracy of 80% and an F1-measure of 40%.

Figure 8 shows the results of using different AL algorithms. As expected, while all methods eventually improve with more questions, their overall F1-measures improve at different rates. MarginDistance, MinExpError, and CrowdER are all comparable, while Uncertainty improves much more quickly than the others. Here, Uncertainty can identify the items about which the model is most uncertain and get the crowd to label those earlier on. Interestingly, IWAL, which is a generic state-of-the-art AL algorithm, performs extremely poorly in practice. CVHull performs equally poorly, as it internally relies on IWAL as its AL subroutine. This suggests opportunities for extending CVHull to rely on other AL algorithms in future work.
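To make the pair-featurization concrete, here is a minimal sketch of turning each candidate pair into a two-dimensional similarity vector for a linear SVM. It is not the code from [53] or from our system: difflib's ratio() stands in for Levenshtein similarity, and all names are illustrative.

```python
# Sketch: represent each candidate product pair by two similarity scores
# (an edit-distance-style similarity and a cosine similarity) and train a
# linear SVM on the crowd-labeled pairs. Illustrative only.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

def pair_features(pairs):
    """pairs: list of (record_a, record_b) strings -> [[edit_sim, cos_sim], ...]"""
    texts = [t for pair in pairs for t in pair]
    tfidf = TfidfVectorizer().fit_transform(texts)
    feats = []
    for i, (a, b) in enumerate(pairs):
        edit_sim = SequenceMatcher(None, a, b).ratio()       # stand-in for Levenshtein similarity
        cos_sim = cosine_similarity(tfidf[2 * i], tfidf[2 * i + 1])[0, 0]
        feats.append([edit_sim, cos_sim])
    return feats

# Usage: clf = LinearSVC().fit(pair_features(labeled_pairs), crowd_labels)
```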

This result is highly encouraging: even though CrowdER and CVHull are recent AL algorithms highly specialized for improving the recall (and, indirectly, the F1-measure) of ER, our general-purpose AL algorithms are still quite competitive. In fact, Uncertainty uses 6.6× fewer questions than CrowdER and an order of magnitude fewer questions than CVHull to achieve the same F1-measure.

6.1.2 Image Search

Vision-related problems also rely heavily on crowd-sourcing, e.g., for tagging pictures, finding objects, and identifying bounding boxes [51]. In all of our vision experiments, we employed a relatively simple classifier in which the PHOW features (a variant of dense SIFT descriptors commonly used in vision tasks [11]) of a set of images are first extracted as a bag of words, and a linear SVM is then used for classification. Even though this is not the state-of-the-art image detection algorithm, we show that our AL algorithms still greatly reduce the cost of many challenging vision tasks.

Gender Detection. We used the faces from the Caltech101 dataset [22] and manually labeled each image with its gender (266 males, 169 females) as our ground truth. We also gathered crowd labels by asking 5 different workers for the gender of each image. We started by training the model on a random 11% of the data.
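For concreteness, a minimal bag-of-visual-words sketch is shown below; the dense descriptor extraction is a hypothetical placeholder for PHOW/dense SIFT (we do not reproduce the actual feature extractor here), and the vocabulary size and classifier are illustrative choices.

```python
# Sketch of the bag-of-visual-words pipeline described above: extract dense
# descriptors per image (placeholder for PHOW/dense SIFT), build a visual
# vocabulary with k-means, histogram each image over the vocabulary, and
# train a linear SVM on the histograms. `extract_dense_descriptors` is
# hypothetical; the rest uses scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histograms(images, extract_dense_descriptors, n_words=300):
    descs_per_image = [extract_dense_descriptors(img) for img in images]
    vocab = KMeans(n_clusters=n_words).fit(np.vstack(descs_per_image))
    hists = []
    for descs in descs_per_image:
        words = vocab.predict(descs)                      # assign descriptors to visual words
        hist, _ = np.histogram(words, bins=np.arange(n_words + 1))
        hists.append(hist / max(hist.sum(), 1))           # normalized histogram per image
    return np.array(hists)

# Usage: clf = LinearSVC().fit(bow_histograms(train_imgs, extract_dense_descriptors), labels)
```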

Figure 7: Estimating quality with k-fold cross validation (actual vs. projected F1-measure as a function of the total number of questions asked).

Figure 8: Comparison of different AL algorithms for entity resolution: the overall F1-measure in the iterative scenario.

In Figure 6, we show the accuracy of the crowd, the accuracy of our machine learning model, and the overall accuracy of the model plus crowd data. For instance, when a fraction x of the labels were obtained from the crowd, the other 1−x labels were determined by the model, so the overall accuracy was x · ac + (1−x) · am, where ac and am are the crowd's and the model's accuracy, respectively. As in our entity resolution experiments, our algorithms improve the quality of the labels provided by the crowd, i.e., by asking questions for which the crowd tends to be more reliable. Here, though, the crowd produces higher overall quality than in the entity resolution case, and therefore its accuracy improves only from 98.5% to 100%. Figure 6 shows that both MinExpError and Uncertainty perform well in the upfront scenario, respectively improving the baseline accuracy by 4% and 2% on average, and improving its AUCLOG by 2–3%. Here, due to the upfront scenario, MinExpError saves the most questions: the baseline has to ask 4.7× (3.7×) more questions than MinExpError (Uncertainty) to achieve the same accuracy. Again, MarginDistance, although specifically designed for SVMs, achieves little improvement over the baseline.

Object Containment. We again mixed 50 human faces and 50 background images from Caltech101 [22]. Because differentiating human faces from background clutter is easy for humans, we used the crowd labels as ground truth in this experiment. Figure 9 shows the upfront scenario with an initial set of 10 labeled images, where both Uncertainty and MinExpError lift the baseline's F1-measure by 16%, while MarginDistance provides a lift of 13%. All three algorithms increase the baseline's AUCLOG by 5–6%. Note that the baseline's F1-measure degrades slightly as it reaches higher budgets, since the baseline is forced to give answers to hard-to-classify questions, while the AL algorithms avoid such questions, leaving them to the last batch (which is answered by the crowd).

6.1.3 Sentiment Analysis


Figure 6: The object detection task (detecting the gender of the person in an image): accuracy of the (a) crowd, (b) model, (c) overall.

Figure 9: Image search task (whether a scene contains a human): F1-measure of the model.

Figure 10: Sentiment analysis task: F1-measure of the model for 100K tweets.

Figure 11: Sentiment analysis task: F1-measure of the model for 10K tweets.

Microblogging sites such as Twitter provide rich datasets for sentiment analysis [37], where analysts can ask questions such as "how many of the tweets mentioning iPhone have a positive or negative sentiment?" Training accurate classifiers requires a sufficient amount of accurately labeled data, and with millions of daily tweets, it is too costly to ask the crowd to label all of them. In this experiment, we show that our AL algorithms, with as few as 1K–3K crowd-labeled tweets, can achieve very high accuracy and F1-measure on a corpus of 10K–100K unlabeled tweets. We randomly chose these tweets from an online corpus14 that provides ground-truth labels, with equal numbers of positive and negative-sentiment tweets. We obtained crowd labels (positive, negative, neutral, or vague/unknown) for each tweet from 3 different workers.

Figure 10 shows the results of using 1K initially labeled data points with the 100K dataset in the upfront setting. The results confirm that the upfront scenario is best handled by our MinExpError algorithm. Here, the MinExpError, Uncertainty, and MarginDistance algorithms improve the average F1-measure of the baseline model by 11%, 9%, and 5%, respectively. Also, MinExpError increases the baseline's AUCLOG by 4%. All three AL algorithms dramatically reduce the number of questions required to achieve a given accuracy or F1-measure: in comparison to the baseline, MinExpError, Uncertainty, and MarginDistance reduce the number of questions by factors of 46×, 32×, and 27×, respectively.

Figure 11 shows similar results for using 1K initially labeled tweets with the 10K dataset, but in the iterative setting. The results confirm that the iterative scenario is best handled by our Uncertainty algorithm, which even improves on the Brew et al. algorithm [13], a domain-specific AL method designed for sentiment analysis.

14 http://twittersentiment.appspot.com

Here, the Uncertainty, MinExpError, and Brew et al. algorithms improve the average F1-measure of the baseline model by 9%, 4%, and 4%, respectively. Also, Uncertainty increases the baseline's AUCLOG by 4%. In comparison to the baseline, Uncertainty, MinExpError, and Brew et al. reduce the number of questions by factors of 5.38×, 1.76×, and 4.87×, respectively. As expected, the savings are more modest than in the upfront scenario.

6.2 PBA and Other Crowd Optimizations

In this section, we present results for our crowd-specific optimizations described in Section 5.

PBA Algorithm: We first report experiments on the PBA algorithm. Recall that this algorithm partitions the items into subgroups and optimally allocates the budget amongst them. In the CMU facial expressions dataset, the crowd had a particularly hard time telling the facial expression of certain individuals, so we created subgroups based on the user column of the dataset and asked the crowd to label the expression on each face. Choosing v0=9 and bmax=9, we varied n0 = 2, 4, 8, 16, 32 and measured the point-wise error of our Pg,b estimates. Figure 16 shows that the average error of our estimates drops quickly as the sample size n0 increases: the error is 0.22 for n0=2 and goes down to 0.13 with only n0=8. Also, curve fitting shows that the error is proportional to O(1/n0) (like most aggregates).

Choosing v0 = 9, bmax = 9, and n0 = 2, we also compared PBA against a uniform budget allocation scheme, in which the same number of questions is asked about every item, as done in previous research (see Section 7). The results are shown in Figure 17.


Figure 15: Effect of parallelism on processing time: 100K tweets.

Figure 16: The effect of sample size on the accuracy of Pg,b estimates in the PBA algorithm.

In Figure 17, the X-axis shows the normalized budget, e.g., a value of 2 means the budget was twice the total number of unlabeled items. The Y-axis shows the overall (classification) error of the crowd using majority voting under different allocations. The solid lines show the actual error achieved under both strategies, while the blue and green dotted lines show our estimates of their performance before running the algorithms. From Figure 17, we see that although our estimates of the actual error are not highly accurate, since we only use them to solve an ILP that favors harder subgroups, our PBA algorithm (solid green) still reduces the overall crowd error by about 10% (from 45% to 35%). We also show how PBA would perform if it had an oracle that provided access to the exact values of Pg,b (red line).

Balancing Classes: Recall from Section 4 that our algorithms tend to ask more questions about rare classes than common classes, which can improve the overall performance. Figure 12 reports the crowd's F1-measure in an entity resolution task under different AL algorithms. The dataset used in this experiment has many fewer positive instances (11%) than negative ones (89%). The main observation here is that although the crowd's average F1-measure for the entire dataset is 56% (this is what the baseline achieves), our Uncertainty algorithm lifts it to 62%, mainly because the questions posed to the crowd have a balanced mixture of positive and negative labels.

k-Fold Cross Validation for Estimating Accuracy: We use k-fold cross validation to estimate the current quality of our model. Figure 7 shows our estimated F1-measure for an SVM classifier on UCI's cancer dataset. Our estimates are reasonably close to the true F1 values, especially as more labels are obtained from the crowd. This suggests that k-fold cross validation allows us to effectively estimate current model accuracy, and to stop acquiring more data once model accuracy has reached a reasonable level.

The Effect of Batch Size: We now study the effect of batch size on result quality, based on the observations in Section 5.2. The effect is typically moderate (and often linear), as shown in Figure 13: the F1-measure gains can be in the 8–10% range (see Section 5.2).

Figure 17: Reducing the crowd's noise (majority-vote error) under different budget allocation schemes (uniform vs. PBA, actual and projected, plus PBA with perfect information), as a function of the total number of questions divided by the total number of unlabeled items.

However, larger batch sizes reduce runtime substantially, as Figure 14 shows: going from a batch size of 1 to 200 reduces the time to train a model by about two orders of magnitude (from thousands of seconds to tens of seconds).

6.3 UCI Classification Datasets

In Section 6.1, we validated our algorithms on crowd-sourced datasets. This section also compares our algorithms on datasets from the UCI KDD archive [1], where labels are provided by experts; that is, ground truth and crowd labels are the same. Thus, by excluding the effect of noisy labels, we can compare different AL strategies in isolation. We have chosen 15 well-known datasets, as shown in Figures 18 and 19. To avoid bias, we avoided any dataset-specific tuning or preprocessing, and applied the same classifier with the same settings to all datasets. In each case, we experimented with 10 different budgets of 10%, 20%, ..., 100% (of the total number of labels), each repeated 10 times, and report the average. Also, to compute the F1-measure for datasets with more than 2 classes, we either grouped all non-majority classes into a single class or arbitrarily partitioned all the classes into two new ones.
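For concreteness, a minimal sketch of the first of these two conversions (grouping all non-majority classes into a single class) is shown below; the function name is illustrative.

```python
# Sketch: convert a multi-class UCI label set to binary labels for computing
# the F1-measure, by keeping the majority class as the positive class and
# grouping every other class into a single negative class.
from collections import Counter

def to_binary_labels(labels):
    majority_class, _ = Counter(labels).most_common(1)[0]
    return [1 if y == majority_class else 0 for y in labels]
```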

Here, besides the random baseline, we compare Uncertainty and MinExpError against four other AL techniques, namely IWAL, MarginDistance, Bootstrap-LV, and Entropy. IWAL is as general as our algorithms, MarginDistance only applies to SVM classification, and Bootstrap-LV and Entropy are only applicable to probabilistic classifiers. For all methods (except MarginDistance) we used MATLAB's decision trees as the classifier, with default parameters except for the following: no pruning, no leaf merging, and a 'minparent' of 1 (impure nodes with 1 or more items can be split).

Figures 18 and 19 show the reduction in the number of questions under both the upfront and iterative settings for Entropy, Bootstrap-LV, Uncertainty, and MinExpError, while Table 2 shows the average AUCLOG, F1, and reduction in the number of questions asked across all 15 datasets for all AL methods. The two figures omit detailed results for MarginDistance and IWAL, as they performed poorly (as indicated in Table 2). We report all measures of the different AL algorithms in terms of their improvement relative to the baseline (so higher numbers are better). For instance, consider Figure 18: on the yeast dataset, MinExpError reduces the number of questions asked by 84×, Uncertainty and Bootstrap-LV reduce it by about 38×, and Entropy does not improve on the baseline.

In summary, these results are consistent with those observed on the crowd-sourced datasets. In the upfront setting, MinExpError significantly outperforms the other AL techniques, with more than 104× savings in the total number of questions on average.


Figure 12: Improving the crowd's F1-measure for entity resolution in the iterative scenario.

Figure 13: Effect of batch size on our algorithms' F1-measure (vehicle dataset, with a budget of 400 questions).

Figure 14: Effect of batch size on our algorithms' processing times (vehicle dataset, with a budget of 400 questions).

MinExpError also improves the AUCLOG and average F1-measure of the baseline by 5% and 15% on average, respectively. After MinExpError, Uncertainty and Bootstrap-LV are the most effective, with comparable performance: 55–69× savings in questions, a 3% improvement in AUCLOG, and an 11–12% lift in average F1-measure. Bootstrap-LV performs well here, which we expect is due to its use of bootstrap (similar to our algorithms); however, recall that Bootstrap-LV only works for probabilistic classifiers (e.g., decision trees). MarginDistance is only moderately effective, providing around 13× savings. Finally, the least effective algorithms are IWAL and Entropy, which perform quite poorly across almost all datasets. IWAL uses learning theory to establish worst-case bounds on sample complexity (based on VC-dimensions), but these bounds are known to leave a significant gap between theory and practice, as seen here. Entropy relies on the classifier's own class probability estimates [43], and thus can be quite ineffective when these estimates are highly inaccurate. To confirm this, we used bootstrap to estimate the class probabilities more accurately (similarly to our Uncertainty algorithm), and then computed the entropy of these estimates. The modified version, denoted Uncertainty (Entropy), is significantly more effective than the baseline (73× savings), which shows that our idea of using bootstrap in AL not only achieves generality (beyond probabilistic classifiers) but can also improve traditional AL strategies by providing more accurate probability estimates.
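To illustrate the Uncertainty (Entropy) variant, here is a minimal sketch that estimates class probabilities from bootstrap replicas (as vote fractions) and ranks unlabeled items by the entropy of those estimates. It is illustrative only, not the paper's implementation, and it assumes NumPy arrays and a scikit-learn decision tree as the underlying classifier.

```python
# Sketch of the "Uncertainty (Entropy)" variant: train the classifier on
# several bootstrap resamples of the labeled pool, estimate each unlabeled
# item's class probabilities as vote fractions across the replicas, and
# score items by the entropy of those estimates (higher = more uncertain).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_entropy_scores(X_labeled, y_labeled, X_unlabeled, n_boot=10, seed=0):
    classes = np.unique(y_labeled)
    votes = np.zeros((len(X_unlabeled), len(classes)))
    rng = np.random.default_rng(seed)
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_labeled), len(X_labeled))  # bootstrap resample
        clf = DecisionTreeClassifier().fit(X_labeled[idx], y_labeled[idx])
        preds = clf.predict(X_unlabeled)
        for j, c in enumerate(classes):
            votes[:, j] += (preds == c)
    probs = votes / n_boot                                     # vote-fraction probability estimates
    logp = np.log(np.where(probs > 0, probs, 1.0))             # log(1)=0 avoids log(0)
    return -np.sum(probs * logp, axis=1)                       # entropy per unlabeled item
```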

For the iterative scenario, Uncertainty actually works better than MinExpError, with an average saving of 7× over the baseline in questions asked and increases in AUCLOG and average F1-measure of 1% and 3%, respectively. Note that the savings are generally more modest than in the upfront case because the baseline receives much more labeled data in the iterative setting and, therefore, its average performance is much higher, leaving less room for improvement. However, given the comparable (and even slightly better) performance of Uncertainty relative to MinExpError in the iterative scenario, Uncertainty becomes the preferable choice for this scenario due to its considerably smaller processing overhead (see Section 6.4).

6.4 Run-time and Scalability

To measure algorithm runtime, we experimented with multiple datasets; here, we only report the results for the vehicle dataset. Figure 14 shows that training runtimes range from about 5,000 seconds down to a few seconds and depend heavily on the batch size, which determines how many times the model is re-trained.

We also studied the effect of parallelism on our algorithms' run-times by comparing different AL algorithms in the upfront scenario on the Twitter dataset (10K tweets) as we enabled additional cores on a multicore machine.

Figure 18: The ratio of the number of questions asked by the random baseline to those asked by different AL algorithms in the Upfront scenario.

Figure 19: The ratio of the number of questions asked by the random baseline to those asked by different AL algorithms in the Iterative scenario.

The results are shown in Figure 15. For Uncertainty, the run-time only improves until we have as many cores as bootstrap replicas (here, 10); after that, the improvement is marginal. In contrast, MinExpError scales extremely well, achieving nearly linear speedup, because it re-runs the model once for every training point.

Finally, we perform a monetary comparison between our AL algorithms and two different baselines. Figure 20 shows the combined monetary cost (crowd + machines) of achieving different levels of quality (i.e., the model's F1-measure for the Twitter dataset from Section 6.1.3). The crowd cost is (0.01 + 0.005) × 3 dollars per labeled item, which includes 3× redundancy and Amazon's commission. The machine cost for the baseline (passive learner) consists only of training a classifier, while for our algorithms we also include the cost of computing the AL scores.


Upfront
Method                 AUCLOG(F1)  Avg(Q's Saved)  Avg(F1)
Uncertainty            1.03x       55.14x          1.11x
MinExpError            1.05x       104.52x         1.15x
IWAL                   1.05x       2.34x           1.07x
MarginDistance         1.00x       12.97x          1.05x
Bootstrap-LV           1.03x       69.31x          1.12x
Entropy                1.00x       1.05x           1.00x
Uncertainty (Entropy)  1.03x       72.92x          1.13x

Iterative
Method                 AUCLOG(F1)  Avg(Q's Saved)  Avg(F1)
Uncertainty            1.01x       6.99x           1.03x
MinExpError            1.01x       6.95x           1.03x
IWAL                   1.01x       1.53x           1.01x
MarginDistance         1.01x       1.47x           1.00x
Bootstrap-LV           1.01x       4.19x           1.03x
Entropy                1.01x       1.46x           1.01x
Uncertainty (Entropy)  1.01x       1.48x           1.00x

Table 2: Average improvement in AUCLOG, Questions Saved, and Average F1, across all 15 UCI datasets, by different AL algorithms.

Figure 20: The combined monetary cost of crowd and machines for the Twitter dataset.

To compute the machine cost, we measured the running time in core-hours using c3.8xlarge instances of the Amazon EC2 cloud, which currently cost $1.68/hour. The overall cost is clearly dominated by the crowd cost, which is why our AL learners can achieve the same quality at a much lower cost (since they ask far fewer questions of the crowd). We also compared against a second baseline where all the items are labeled by the crowd (i.e., no classifiers). As expected, this 'Crowd Only' approach is significantly more expensive than our AL algorithms. Figure 21 shows that the crowd can label all the items for $363 with an accuracy of 88.1–89.8%, while we can achieve a close accuracy of 87.9% for only $36 (labeling 807 items and spending less than $0.00014 on machine computation). This order-of-magnitude saving in dollars will only become more dramatic over time, as we expect machine costs to continue dropping according to Moore's law, while human worker costs will presumably remain the same or even increase.
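As a worked example of this cost model, using only the numbers quoted above, a short sketch:

```python
# Worked example of the cost model described above: crowd cost is
# (1 cent + 0.5 cent commission) x 3 workers per labeled item, plus machine
# time billed at the EC2 rate quoted in the text. Illustrative only.
CROWD_COST_PER_ITEM = (0.01 + 0.005) * 3   # = $0.045 per labeled item
EC2_RATE_PER_HOUR = 1.68                   # c3.8xlarge rate quoted in the text

def total_cost(items_labeled_by_crowd, machine_hours):
    return items_labeled_by_crowd * CROWD_COST_PER_ITEM + machine_hours * EC2_RATE_PER_HOUR

# e.g., labeling 807 items with negligible machine time costs roughly
# 807 * 0.045 ~= $36, versus ~$363 (from the text) for crowd-labeling everything.
```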

Figure 21: The monetary cost of AL vs. using the crowd for all labels.

7. RELATED WORK

Crowd-sourced Databases. These systems [23, 26, 32, 34, 35, 39, 48] use optimization techniques to reduce the number of unnecessary questions asked of humans (e.g., the number of pair-wise comparisons in a join or sort query). However, the crowd must still provide at least as many labels as there are unlabeled items directly requested by the user, and it is simply infeasible to label millions of items in this fashion. To scale up to large datasets, we use machine learning to avoid obtaining crowd labels for a significant portion of the data.

Active Learning. There is a rich literature on AL in machine learning (see [43]). However, to the best of our knowledge, no existing AL algorithm satisfies all of the desiderata required for a practical crowd-sourced system, namely generality, a black-box approach, batching, parallelism, and label-noise management. For example, many AL algorithms are designed for a specific classifier (e.g., neural networks [15] or SVMs [49]) or a specific domain (e.g., entity resolution [5, 7, 42, 53], vision [51], or medical imaging [24]). In contrast, our algorithms work for general classification tasks and do not require any domain knowledge. The popular IWAL algorithm [8] is generic (except for hinge-loss classifiers such as SVM), but does not support batching or parallelism, and requires adding new constraints to the classifier's internal loss-minimization step. In fact, most AL proposals that provide theoretical guarantees (i) are not black-box, as they need to know and shrink the classifier's hypothesis space at each step, and (ii) do not support batching, as they rely on IID-based analysis. Notable exceptions are [16] and its IWAL variant [9]; they are black-box but do not support batching or parallelism. Bootstrap-LV [41] and ParaActive [2] support parallelism and batching, but both [2, 9] are based on VC-dimension bounds [10], which are known to be too loose in practice: they cause the model to request many more labels than needed, leading to negligible savings over passive learning (as shown in Section 6). Bootstrap-LV also uses bootstrap, but unlike our algorithms, it is not general. Noisy labelers are handled in [19, 34, 38, 44], but [34, 38, 44] assume the same quality for all labelers and [19] assumes that each labeler's quality is the same across all items. Moreover, in Section 6, we empirically showed that our algorithms are superior to generic AL algorithms (IWAL + ParaActive [2, 8, 9] and Bootstrap-LV [41]).

AL has been applied to many specific domains (some using crowd-sourcing): machine translation [4], entity resolution [7, 12, 18, 25, 29, 53], and SVM classification [49]. Also, [31] assumes a probabilistic classifier and [13] assumes access to a clustering algorithm for selecting subsets of articles. Our algorithms can handle arbitrary classifiers and do not make any of these assumptions. Also, surprisingly, we are still competitive with (and sometimes even superior to) some of these domain-specific algorithms (e.g., we compared against MarginDistance [49], CrowdER [53], CVHull [7], and Brew et al. [13]).

Semi-supervised Learning. AL and semi-supervised learning (SSL) [33, 46, 50] are closely related. SSL exploits the latent structure of unlabeled data; for example, [46] combines labeled and unlabeled examples to infer more accurate labels from the crowd. To achieve generality, our PBA algorithm does not assume any prior knowledge of the unlabeled data. Another common SSL technique is


to train multiple (ensemble) models with the labeled data to classify the unlabeled data independently, and then use each model's most confident predictions to train the rest [33, 50]. This is similar to our Uncertainty algorithm, but we use bootstrap, which provides a generic way of obtaining multiple predictions and unbiased estimates of uncertainty. Also, we treat the classifier as a black-box, whereas some of these methods do not. However, SSL and AL can be complementary [45], and combining them to further improve the scalability of crowd-sourced systems is an interesting direction for future work.

8. CONCLUSIONS

In this paper, we proposed two AL algorithms, Uncertainty and MinExpError, to enable crowd-sourced databases to scale up to large datasets. To broaden their applicability to different classification tasks, we designed these algorithms based on the theory of nonparametric bootstrap, and we evaluated them in two different settings. In the upfront setting, we ask all questions of the crowd in one go. In the iterative setting, the questions are adaptively picked and added to the labeled pool; we then retrain the model and repeat this process. While iterative retraining is more expensive, it also has a higher chance of learning a better model. Additionally, we proposed algorithms for choosing the number of questions to ask different crowd workers, based on the characteristics of the data being labeled. We also studied the effect of batching on the overall runtime and quality of our AL algorithms. Our results, on three datasets collected with Amazon's Mechanical Turk and on 15 datasets from the UCI KDD archive, show that our algorithms make substantially fewer label requests than state-of-the-art AL techniques. We believe that these algorithms will prove immensely useful in crowd-sourced database systems.

9. REFERENCES

[1] D. N. A. Asuncion. UCI machine learning repository, 2007.
[2] A. Agarwal, L. Bottou, M. Dudík, and J. Langford. Para-active learning. CoRR, abs/1310.8243, 2013.
[3] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: Building fast and reliable approximate query processing systems. In SIGMOD Conference, 2014.
[4] V. Ambati, S. Vogel, and J. Carbonell. Active learning and crowd-sourcing for machine translation. In LREC, volume 1, 2010.
[5] A. Arasu, M. Götz, and R. Kaushik. On active learning of record matching packages. In SIGMOD, 2010.
[6] Y. Bachrach, T. Graepel, G. Kasneci, M. Kosinski, and J. Van Gael. Crowd IQ: aggregating opinions to boost performance. In AAMS, 2012.
[7] K. Bellare, S. Iyengar, A. G. Parameswaran, and V. Rastogi. Active sampling for entity matching. In KDD, 2012.
[8] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[9] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
[10] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[11] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV, 2007.
[12] K. Braunschweig, M. Thiele, J. Eberius, and W. Lehner. Enhancing named entity extraction by effectively incorporating the crowd. In BTW Workshops, 2013.
[13] A. Brew, D. Greene, and P. Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In ECAI, 2010.
[14] A. Chatterjee and S. N. Lahiri. Bootstrapping lasso estimators. Journal of the American Statistical Association, 106(494):608–625, 2011.
[15] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. J. Artif. Int. Res., 4, 1996.
[16] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In ISAIM, 2008.
[17] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 1979.
[18] G. Demartini, D. Difallah, and P. Cudré-Mauroux. Large-scale linked data integration using probabilistic crowdsourcing. VLDBJ, 22, 2013.
[19] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD, 2009.
[20] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[21] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, 2014.
[22] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples. In WGMBV, 2004.
[23] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: answering queries with crowdsourcing. In SIGMOD, 2011.
[24] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Batch mode active learning and its application to medical image classification. In ICML, 2006.
[25] L. Jiang, Y. Wang, J. Hoffart, and G. Weikum. Crowdsourced entity markup. In CrowdSem, 2013.
[26] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut. CrowdForge: crowdsourcing complex work. In UIST, 2011.
[27] A. Kleiner et al. The big data bootstrap. In ICML, 2012.
[28] R. Kohavi and D. H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In ICML, 1996.
[29] H. Kopcke, A. Thor, and E. Rahm. Learning-based approaches for matching web data entities. Internet Computing, 14(4), 2010.
[30] S. Lahiri. On bootstrapping M-estimators. Sankhya, Series A: Methods and Techniques, 54(2), 1992.
[31] F. Laws, C. Scheible, and H. Schütze. Active learning with Amazon Mechanical Turk. In EMNLP, 2011.
[32] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: a crowdsourcing data analytics system. PVLDB, 2012.
[33] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu. SemiBoost: Boosting for semi-supervised learning. PAMI, 31, 2009.
[34] A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 2012.
[35] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. PVLDB, 5, 2011.
[36] B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, and S. Madden. Scaling up crowd-sourcing to very large datasets: A case for active learning. PVLDB, 8(2):125–136, 2014.
[37] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC, 2010.
[38] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. CrowdScreen: algorithms for filtering data with humans. In SIGMOD, 2012.
[39] H. Park et al. Deco: A system for declarative crowdsourcing. PVLDB, 5, 2012.
[40] V. C. Raykar and S. Yu. An entropic score to rank annotators for crowdsourced labeling tasks. In NCVPRIPG, 2011.
[41] M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Mach. Learn., 54, 2004.
[42] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002.
[43] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
[44] V. Sheng, F. Provost, and P. Ipeirotis. Get another label? In KDD, 2008.
[45] M. Song, H. Yu, and W.-S. Han. Combining active learning and semi-supervised learning techniques to extract protein interaction sentences. BMC Bioinformatics, 12(Suppl 12), 2011.
[46] W. Tang and M. Lease. Semi-supervised consensus labeling for crowdsourcing. In CIR, 2011.
[47] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2, 2002.
[48] B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, 2013.
[49] M.-H. Tsai, C.-H. Ho, and C.-J. Lin. Active learning strategies using SVMs. In IJCNN, 2010.
[50] H. Valizadegan, R. Jin, and A. K. Jain. Semi-supervised boosting for multi-class classification. In ECML, 2008.
[51] S. Vijayanarasimhan and K. Grauman. Cost-sensitive active visual category learning. Int. J. Comput. Vision, 91, 2011.
[52] A. Vlachos. A stopping criterion for active learning. Comput. Speech Lang., 22(3), 2008.
[53] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5, 2012.
[54] J. Wolfe et al. Low target prevalence is a stubborn source of errors in visual search tasks. Experimental Psychology, 136(4), 2007.
[55] K. Zeng, S. Gao, J. Gu, B. Mozafari, and C. Zaniolo. ABS: a system for scalable approximate queries with accuracy guarantees. In SIGMOD Conference, 2014.
[56] K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD Conference, 2014.
[57] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, 2012.