
Active Task Selection for Lifelong Machine Learning

Paul Ruvolo and Eric Eaton
Bryn Mawr College, Computer Science Department
101 North Merion Avenue, Bryn Mawr, PA 19010

{pruvolo,eeaton}@cs.brynmawr.edu

Abstract

In a lifelong learning framework, an agent acquires knowledge incrementally over consecutive learning tasks, continually building upon its experience. Recent lifelong learning algorithms have achieved nearly identical performance to batch multi-task learning methods while requiring over three orders of magnitude less time to learn. In this paper, we further improve the scalability of lifelong learning by developing curriculum selection methods that enable an agent to actively select the next task to learn in order to maximize performance on future learning tasks. We demonstrate that active task selection is highly reliable and effective, allowing an agent to learn high performance models using up to 50% fewer tasks than when the agent has no control over the task order. We also explore a variant of transfer learning in the lifelong learning setting in which the agent can focus knowledge acquisition toward a particular target task, enabling the agent to quickly learn the target task.

Introduction

Current multi-task learning algorithms improve performance by sharing learned knowledge between multiple task models. With few exceptions, multi-task learning has focused primarily on learning in a batch framework, where all tasks are given at once. However, versatile learning agents need to be able to learn tasks consecutively, continually building upon their knowledge over a lifetime of experience.

In such a lifelong learning framework, tasks arrive sequentially and the learning agent's goal is to maximize its performance across all tasks. Recently developed lifelong learning methods (Ruvolo and Eaton 2013; Saha et al. 2011) have demonstrated the ability to learn multiple tasks online, yielding nearly identical model performance to batch multi-task learning while exhibiting dramatic speedups of 1,000x or more. These methods do not control the order in which they learn the tasks; in the basic lifelong learning setting the task order is outside the agent's control.

However, consider a versatile personal assistant agent that supports lifelong learning. In some situations the agent may not have control over the task order (which could be specified by its boss), but in other situations it may have knowledge of the next several tasks it needs to learn. In these cases, it can intelligently choose the next task to learn in order to maximize its overall performance. Previous research on curriculum selection (Bengio et al. 2009) has affirmed that proper task ordering can improve learning performance.

In this paper, we explore the use of active curriculum selection to improve the scalability of lifelong learning over a long sequence of tasks. Specifically, we focus on a setting in which a lifelong learner can choose the next task to learn from a pool of candidate tasks in order to maximize performance while requiring the learning of as few tasks as possible. Learning in a task-efficient manner is crucial not only for interactive settings where learning each task requires extensive input from a human, but also in autonomous learning settings where learning each task may be time consuming.

We develop two general approaches to active task selection, one based on information maximization and the other on model performance, and apply them to the Efficient Lifelong Learning Algorithm (ELLA) (Ruvolo and Eaton 2013). We show that active curriculum selection enables ELLA to achieve large gains in learning efficiency compared to when the task order is outside its control, thereby reducing the number of tasks required to achieve a particular level of performance and improving scalability. We also demonstrate a form of transfer learning in the lifelong learning setting, in which the agent actively seeks knowledge to optimize performance on a specific target task. We show that this targeted active task selection enables ELLA to focus its learning and achieve better performance on the target task than other methods.

Related Work

The model framework used in ELLA is closely related to a number of current multi-task learning (MTL) algorithms (Rai and Daumé III 2010; Zhang et al. 2008; Kumar and Daumé III 2012) that represent each individual task's model as a linear combination of a common shared basis L. In most cases, the basis is learned in tandem with all task models in an expensive batch training process. Adding even a single task may require re-optimization of all models, making these batch MTL approaches unsuitable for lifelong learning. Saha et al. (2011) proposed an algorithm for online multi-task classification, OMTL, that is based on perceptron learning and supports incremental learning. However, OMTL exhibits worse performance than ELLA, both in terms of accuracy and speed.

Task selection is closely related to both active learning and curriculum learning methods, which order training instances in an attempt to maximize model performance. Active learning criteria typically focus on the most uncertain (Tong and Koller 2001; Cohn et al. 1996) or the most informative instances (MacKay 1992). In contrast, curriculum learning (Bengio et al. 2009; Kumar et al. 2010; Lee and Grauman 2011) advocates learning the easiest instances first and progressing toward increasingly harder instances, effectively the opposite of active learning criteria.

Active learning has been evaluated in a number of MTL algorithms (Saha et al. 2010; Zhang 2010; Saha et al. 2011), and has been shown to generally reduce the amount of training data necessary to achieve a particular level of performance. This paper builds on this earlier work by extending these ideas to lifelong learning, and introducing an efficient and highly effective mechanism for selecting the next task to learn: the Diversity heuristic. In contrast to these active MTL methods, which focus primarily on optimizing model performance, we focus on learning the basis L efficiently, which will provide maximal benefit to learning future tasks.

The Lifelong Learning Problem

This paper uses the following notational conventions: matrices are denoted by bold uppercase letters, vectors are denoted by bold lowercase letters, scalars are denoted by normal lowercase letters, and sets are denoted using script typeface (e.g., A). Quantities related to a particular task are denoted using parenthetical superscripts (e.g., matrix A(t) and vector v(t) are each related to task t). We use the shorthand A(1:t) to refer to the list of variables A(1), . . . , A(t).

We employ a lifelong learning framework in which the agent faces a series of supervised learning tasks Z(1), Z(2), . . . , Z(Tmax). Each learning task Z(t) = (f(t), X(t), y(t)) is specified by a (hidden) function f(t): X(t) → Y(t) from an instance space X(t) ⊆ Rd to a set of labels Y(t) (typically Y(t) = {−1, +1} for classification tasks and Y(t) = R for regression tasks). To learn f(t), the agent is given nt training instances X(t) ∈ Rd×nt with corresponding labels y(t) ∈ Y(t)^nt given by f(t). For brevity, we use (x_i(t), y_i(t)) to represent the ith labeled training instance for task t. The agent does not know a priori the total number of tasks Tmax, their order, or their distribution.

Each time step, the agent receives a batch of labeled training data for some task t, either for a new task or as additional data for a previous task. Let T denote the number of tasks the agent has encountered so far. At any time, the agent may be asked to make predictions on data from any previous task. Its goal is to construct task models f̂(1), . . . , f̂(T), where each f̂(t): Rd → Y(t), such that each f̂(t) approximates the true f(t) to enable the accurate labeling of new data. Additionally, the agent must be able to rapidly update each model f̂(t) in response to new training data for a previously learned task, and efficiently add new f̂(t)'s to model new tasks.

[Figure 1: An illustration of the lifelong learning process. (1) Tasks are learned consecutively; the agent receives labeled data X(t), y(t) for the current task t. (2) Knowledge is transferred from previously learned tasks. (3) New knowledge is stored for future use. (4) Existing knowledge is refined. (5) Via active task selection, the agent chooses the next task to learn from the candidate task pool.]

Background on ELLA

This section provides an overview of the Efficient Lifelong Learning Algorithm (ELLA) (Ruvolo and Eaton 2013) that forms the foundation for our proposed enhancements. For further details on ELLA, see the original paper.

ELLA learns and maintains a repository of k latent model components L ∈ Rd×k, which forms a basis for all task models and serves as the mechanism for knowledge transfer between tasks. For each task t, ELLA learns a model f̂(t)(x) = f(x; θ(t)) that is parameterized by a d-dimensional task-specific vector θ(t) = Ls(t), represented as a sparse linear combination of the columns of L using the weight vector s(t) ∈ Rk. The specific form of f̂(t)(x) is dependent on the base learning algorithm, as described below.
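To make the factored representation concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) of how a task model is recovered from the shared basis for linear and logistic regression base learners, the two cases discussed later in this section:

```python
import numpy as np

def task_parameters(L, s_t):
    """Recover the task-specific parameter vector theta^(t) = L s^(t)."""
    return L @ s_t

def predict_linear(x, L, s_t):
    """Linear-regression prediction f(x; L s^(t)) = x^T L s^(t)."""
    return x @ task_parameters(L, s_t)

def predict_logistic(x, L, s_t):
    """Logistic-regression prediction: probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-x @ task_parameters(L, s_t)))

# Toy usage: d = 5 features, k = 3 latent components.
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 3))        # shared basis
s_t = np.array([0.7, 0.0, -1.2])       # sparse task-specific weights
x = rng.standard_normal(5)             # a single instance
print(predict_linear(x, L, s_t))
```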

Given the labeled training data for each task, ELLA optimizes the predictive loss of each model while encouraging shared structure through L by minimizing:

\[
e_T(\mathbf{L}) = \frac{1}{T} \sum_{t=1}^{T} \min_{\mathbf{s}^{(t)}} \left\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\left(f\left(\mathbf{x}_i^{(t)}; \mathbf{L}\mathbf{s}^{(t)}\right),\, y_i^{(t)}\right) + \mu \left\|\mathbf{s}^{(t)}\right\|_1 \right\} + \lambda \|\mathbf{L}\|_F^2 \; , \quad (1)
\]

where L is a known loss function for fitting the model (e.g., squared loss). This optimization problem is closely related to a number of batch MTL algorithms, such as GO-MTL (Kumar and Daumé III 2012). Equation 1 is not jointly convex in L and the s(t)'s, and so most batch MTL methods yield a local optimum using an alternating optimization procedure that is prohibitively expensive for lifelong learning.
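For intuition, the sketch below evaluates the objective of Equation 1 for the squared-loss (linear regression) case, solving the inner minimization over each s(t) with scikit-learn's Lasso; since Lasso minimizes (1/(2n))‖y − As‖² + α‖s‖₁, setting α = µ/2 yields the same minimizer as the inner term of Equation 1. This is an illustrative re-implementation under those assumptions, not the optimization procedure used by the batch MTL methods cited above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ella_objective(L, tasks, mu=1.0, lam=1e-5):
    """Evaluate e_T(L) from Equation 1 for squared loss.

    `tasks` is a list of (X_t, y_t) pairs, where X_t has shape (d, n_t)
    with instances as columns, following the paper's conventions.
    """
    T = len(tasks)
    total = 0.0
    for X_t, y_t in tasks:
        A = X_t.T @ L                                   # n_t x k latent-space design
        lasso = Lasso(alpha=mu / 2.0, fit_intercept=False, max_iter=10000)
        lasso.fit(A, y_t)                               # inner min over s^(t)
        s_t = lasso.coef_
        total += np.mean((A @ s_t - y_t) ** 2) + mu * np.abs(s_t).sum()
    return total / T + lam * np.linalg.norm(L, "fro") ** 2
```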

To enable its efficient solution, ELLA approximates Equation 1 using two simplifications:

1. To eliminate the explicit dependence on all previous training data through the inner summation, ELLA approximates Equation 1 using the second-order Taylor expansion of $\frac{1}{n_t}\sum_{i=1}^{n_t}\mathcal{L}\left(f(\mathbf{x}_i^{(t)};\boldsymbol{\theta}),\, y_i^{(t)}\right)$ around $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$, where $\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta}} \frac{1}{n_t}\sum_{i=1}^{n_t}\mathcal{L}\left(f(\mathbf{x}_i^{(t)};\boldsymbol{\theta}),\, y_i^{(t)}\right)$ is an optimal predictor learned on only the training data for task t. Crucially, this step removes the optimization's dependence on the number of data instances n1, . . . , nT.


2. ELLA computes each s(t) only when training on task t, and does not update them when training on other tasks. This choice improves computational efficiency without significantly affecting performance; as T grows large, the performance penalty for this simplification becomes vanishingly small (Ruvolo and Eaton 2013).

These simplifications yield the following efficient update equations that approximate the result of Equation 1:

\[
\mathbf{s}^{(t)} \leftarrow \arg\min_{\mathbf{s}^{(t)}} \ell\left(\mathbf{L}_m, \mathbf{s}^{(t)}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \quad (2)
\]
\[
\mathbf{L}_{m+1} \leftarrow \arg\min_{\mathbf{L}} g_m(\mathbf{L}) \quad (3)
\]
\[
g_m(\mathbf{L}) = \lambda \|\mathbf{L}\|_F^2 + \frac{1}{T} \sum_{t=1}^{T} \ell\left(\mathbf{L}, \mathbf{s}^{(t)}, \boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) \quad (4)
\]
where $\ell(\mathbf{L}, \mathbf{s}, \boldsymbol{\theta}, \mathbf{D}) = \mu \|\mathbf{s}\|_1 + \|\boldsymbol{\theta} - \mathbf{L}\mathbf{s}\|_{\mathbf{D}}^2$, $\mathbf{L}_m$ refers to the value of the latent model components at the start of the $m$th iteration, $\mathbf{D}^{(t)}$ is the Hessian of the loss function $\mathcal{L}$ evaluated at $\boldsymbol{\theta}^{(t)}$ multiplied by $1/2$, and $t$ is the current task.

Each iteration m, ELLA receives training data (X(t), y(t)) for some task t. First, ELLA computes an optimal model θ(t) and Hessian D(t) from only this training data for task t using a single-task learner:
\[
\left(\boldsymbol{\theta}^{(t)}, \mathbf{D}^{(t)}\right) = \text{singleTaskLearner}\left(\mathbf{X}^{(t)}, \mathbf{y}^{(t)}\right) . \quad (5)
\]
The form of this step depends on the base learning algorithm; Ruvolo and Eaton (2013) provide details for learning linear and logistic regression models using ELLA. Once the task has been modeled, ELLA then updates the basis L to incorporate new knowledge by nulling the gradient of Equation 4 and solving for L via an efficient incremental update.
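Below is a rough sketch of a single (simplified) ELLA iteration for the linear regression case. The ridge-regularized single-task learner, the Hessian computation (half the Hessian of the average squared loss is (1/n_t) X(t) X(t)ᵀ), and the proximal-gradient (ISTA) solver for Equation 2 are our own illustrative choices consistent with the description above; the efficient incremental basis update of Equations 3–4 is only indicated, not reproduced.

```python
import numpy as np

def single_task_learner(X_t, y_t, ridge=1e-1):
    """Linear-regression base learner: returns (theta, D) as in Equation 5.

    X_t has shape (d, n_t) with instances as columns. D is half the Hessian
    of the average squared loss at theta, i.e. (1/n_t) X_t X_t^T.
    """
    d, n_t = X_t.shape
    theta = np.linalg.solve(X_t @ X_t.T + ridge * np.eye(d), X_t @ y_t)
    D = (X_t @ X_t.T) / n_t
    return theta, D

def solve_s(L, theta, D, mu=1.0, n_iter=500):
    """Equation 2: argmin_s  mu*||s||_1 + ||theta - L s||_D^2  via ISTA."""
    A = L.T @ D @ L                       # quadratic term
    b = L.T @ D @ theta
    s = np.zeros(L.shape[1])
    step = 1.0 / (2 * np.linalg.norm(A, 2) + 1e-12)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2 * (A @ s - b)            # gradient of ||theta - L s||_D^2
        z = s - step * grad
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)  # soft-threshold
    return s

# One (simplified) ELLA iteration on new data for task t:
rng = np.random.default_rng(1)
d, k, n_t = 8, 4, 50
L = rng.standard_normal((d, k))
X_t = rng.standard_normal((d, n_t))
y_t = X_t.T @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n_t)
theta_t, D_t = single_task_learner(X_t, y_t)
s_t = solve_s(L, theta_t, D_t)
# The basis update (Equations 3-4) would follow, using the paper's
# incremental solver; it is omitted in this sketch.
```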

Active Curriculum Selection

Until now, we have framed lifelong learning as a passive process, i.e., one in which the agent has no control over the order in which learning tasks are presented. We consider a lifelong learning agent that may actively choose the next task to learn from a set of candidate training tasks (see Figure 1).

We formalize the problem of active curriculum selection in the following manner. Consider a learning agent that has access to training data from a pool of unlearned tasks X(T+1), . . . , X(Tpool), where T + 1 ≤ Tpool ≤ Tmax and only a subset of the training instances for each of these tasks is labeled. Let (X(t), y(t)) denote this subset of initially labeled data for task t. The agent must select the index of the next task to learn, tnext ∈ {T + 1, . . . , Tpool}.[1] The value of Tpool could be a fixed value or it could be set dynamically during learning (for example, we could set Tpool = T + z so that the agent has knowledge of the next z upcoming tasks). Once the agent decides to learn a particular task, all of the training labels y(tnext) are revealed to the agent, allowing it to model the new task.

[1] Note that after the index is chosen, T ← T + 1 and the tasks are reordered such that the remaining unlearned tasks in the pool are given as Z(T+1), . . . , Z(Tpool).

In this section, we develop task selection methods that allow agents to maximize their ability to learn across a wide range of future tasks. Thus, we will focus on encouraging the repository of latent model components to be able to flexibly and rapidly adapt to any future task the agent might be required to solve. Later we will develop an approach that focuses the selection of tasks toward learning a known target task. We consider three approaches for choosing the next task to learn that encourage general knowledge acquisition:

• choose tnext to maximize expected information gain about the basis L (Information Maximization),
• choose tnext to minimize the worst-case fit of L to each task in the pool (Diversity methods), and
• choose tnext randomly (this corresponds to the original lifelong learning setting in which the curriculum order is not under the learner's control).

Next, we describe both the Information Maximization and Diversity methods for selecting the next task to learn. These methods are inspired by active learning criteria (see Related Work). We also evaluated an approach based on curriculum learning (Bengio et al. 2009) that chooses to learn the task with the best performance given the current L, but found that this method performed significantly worse than random task selection in all cases, and so we do not consider it further.

Information Maximization

Suppose a learning agent's goal is to infer the value of a random vector h from noisy observations that carry potentially imperfect information about h's value. For lifelong learning, the random variable h represents the set of latent model components L used in ELLA. Let Xm be a set of random vectors that the learner can choose to observe at iteration m. Here, we use a myopic information maximization criterion (InfoMax for short) where the learner chooses to observe the random vector from the set Xm that gives the most expected information gain about the random vector h:

\[
\mathbf{x}_{\text{next}} = \arg\min_{\mathbf{x} \in \mathcal{X}_m} H\left[\mathbf{h} \mid \mathbf{x}, \mathcal{I}_m\right] = \arg\min_{\mathbf{x} \in \mathcal{X}_m} \int p(\mathbf{x} = \mathbf{v} \mid \mathcal{I}_m)\, H\left[\mathbf{h} \mid \mathbf{x} = \mathbf{v}, \mathcal{I}_m\right] d\mathbf{v} \; , \quad (6)
\]
where H is the differential entropy, Im represents all information available at iteration m, x is a random vector, and x = v is the event that the random vector x has value v.

Our goal is to learn a flexible and complete set of latent model components for ELLA, and so we maximize information about the basis L, where the set of random variables that the agent may observe at the mth iteration, Xm, corresponds to the tuples (θ(t), D(t)) returned by the single-task learner (Equation 5) for t ∈ {T + 1, . . . , Tpool}. This formulation yields the following objective function:

\[
t_{\text{next}} = \arg\min_{t} \iint p\left(\boldsymbol{\theta}^{(t)} = \mathbf{u}, \mathbf{D}^{(t)} = \mathbf{V} \mid \mathcal{I}_m\right) H\left[\mathbf{L} \mid \boldsymbol{\theta}^{(t)} = \mathbf{u}, \mathbf{D}^{(t)} = \mathbf{V}, \mathcal{I}_m\right] d\mathbf{u}\, d\mathbf{V} \; . \quad (7)
\]

Unfortunately, this objective is very computationally expensive in the general case, and so it does not have the efficiency or scalability necessary for lifelong learning in its current form. To overcome this issue, we describe several modifications that allow us to efficiently approximate Equation 7.


The first challenge is that the probability density p(θ(t) = u, D(t) = V | Im) will depend on the particular base learner used by ELLA. To provide a general solution, we approximate the density using a Dirac delta function at (θ̂(t), D̂(t)) = singleTaskLearner(X(t), y(t)); that is, we use the tuple returned by running the single-task learner on the subset of labeled training data for task t.

Next, we approximate the entropy term using a Laplace approximation of the density of L by constructing a Gaussian density for L about L*, which is a mode of Equation 4. Specifically, given L*, we can approximate the density of vec(L) as a multivariate Gaussian:

\[
\mathbb{E}\left[\operatorname{vec}(\mathbf{L}) \mid \boldsymbol{\theta}^{(t)} = \hat{\boldsymbol{\theta}}^{(t)}, \mathbf{D}^{(t)} = \hat{\mathbf{D}}^{(t)}, \mathcal{I}_m\right] = \operatorname{vec}(\mathbf{L}^{\star})
\]
\[
\operatorname{Cov}\left[\operatorname{vec}(\mathbf{L}) \mid \boldsymbol{\theta}^{(t)} = \hat{\boldsymbol{\theta}}^{(t)}, \mathbf{D}^{(t)} = \hat{\mathbf{D}}^{(t)}, \mathcal{I}_m\right] = \frac{1}{2}\left(\lambda \mathbf{I}_{k \cdot d,\, k \cdot d} + \frac{1}{T+1}\left(\frac{1}{n_t}\left(\mathbf{s}^{(t)}\mathbf{s}^{(t)\top}\right) \otimes \hat{\mathbf{D}}^{(t)} + \sum_{t'=1}^{T} \frac{1}{n_{t'}}\left(\mathbf{s}^{(t')}\mathbf{s}^{(t')\top}\right) \otimes \mathbf{D}^{(t')}\right)\right)^{-1}.
\]

The previous equation does not incorporate uncertainty over the s(t)'s in computing the entropy. This simplification avoids additional computational overhead, which would preclude the use of this active task selection method for lifelong learning, where efficiency is paramount. We leave further exploration of this simplification and its effect on the quality of the selected tasks to future work. To this end, we approximate the densities of s(t) for t ∈ {T + 1, . . . , Tpool} given θ̂(t), D̂(t), and Im using a Dirac delta function at s(t) = arg min_s ℓ(Lm, s, θ̂(t), D̂(t)). Similarly, the densities for s(t') with t' ∈ {1, . . . , T} given θ̂(t), D̂(t), and Im are approximated using Dirac delta functions at s(t') = arg min_s ℓ(L_iter(t'), s, θ(t'), D(t')), where iter(t') is the iteration that ELLA last received training data for task t'.

To arrive at a final task selection criterion, we combine our approximations with the well-known differential entropy $H[\mathbf{x}] = \frac{1}{2} \ln\left((2\pi e)^d |\boldsymbol{\Sigma}|\right)$ of a multivariate Gaussian-distributed random variable x ∈ Rd with covariance Σ, yielding the following InfoMax task selection rule:

\[
t_{\text{next}} = \arg\max_{t \in \{T+1, \ldots, T_{\text{pool}}\}} \ln \left|\boldsymbol{\Sigma}^{(t)}\right| \; , \quad (8)
\]
where $\boldsymbol{\Sigma}^{(t)} = \operatorname{Cov}\left[\operatorname{vec}(\mathbf{L}) \mid \boldsymbol{\theta}^{(t)} = \hat{\boldsymbol{\theta}}^{(t)}, \mathbf{D}^{(t)} = \hat{\mathbf{D}}^{(t)}, \mathcal{I}_m\right]$.
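A minimal NumPy sketch of this score is given below, assuming the Dirac and Laplace approximations above. The caller supplies the s(t), D(t), and n_t quantities for the learned tasks and for the candidate (estimated from its labeled subset), and selects the candidate with the largest score; the explicit Kronecker products and matrix inverse are written out for clarity rather than efficiency.

```python
import numpy as np

def infomax_covariance(s_learned, D_learned, n_learned,
                       s_cand, D_cand, n_cand, lam=1e-5):
    """Laplace-approximation covariance of vec(L) if the candidate were learned.

    s_learned, D_learned, n_learned are lists describing the T tasks learned so
    far; the *_cand arguments come from the single-task learner run on the
    candidate task's initially labeled subset.
    """
    k, d = len(s_cand), D_cand.shape[0]
    T = len(s_learned)
    terms = np.kron(np.outer(s_cand, s_cand), D_cand) / n_cand
    for s_t, D_t, n_t in zip(s_learned, D_learned, n_learned):
        terms += np.kron(np.outer(s_t, s_t), D_t) / n_t
    M = lam * np.eye(k * d) + terms / (T + 1)
    return 0.5 * np.linalg.inv(M)

def infomax_score(*cov_args, **cov_kwargs):
    """InfoMax selection score ln|Sigma^(t)| (Equation 8); pick the argmax."""
    _, logdet = np.linalg.slogdet(infomax_covariance(*cov_args, **cov_kwargs))
    return logdet
```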

Diversity Methods

The previous section approached task selection from an information-theoretic perspective. In this section, we propose a simple and efficient Diversity heuristic that encourages L to serve as an effective basis for a wide variety of tasks, enabling ELLA to rapidly learn any new task. To encourage ELLA to be able to solve the widest range of tasks, we select the next task as the one that the current basis L does the worst job of solving, relative to how well that particular task could be solved using a single-task learning model:

\[
t_{\text{next}} = \arg\max_{t \in \{T+1, \ldots, T_{\text{pool}}\}} \min_{\mathbf{s}} \ell\left(\mathbf{L}_m, \mathbf{s}, \hat{\boldsymbol{\theta}}^{(t)}, \hat{\mathbf{D}}^{(t)}\right) , \quad (9)
\]
where, as in the previous section, (θ̂(t), D̂(t)) = singleTaskLearner(X(t), y(t)). Effectively, Equation 9 focuses on selecting tasks that are encoded poorly with the current basis set (i.e., tasks that are unlike anything that the learner has seen so far), thereby encouraging diversity.
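The Diversity score only requires solving the sparse coding problem of Equation 2 for each candidate; the sketch below (our own ISTA-based illustration, with `mu` and the iteration count as assumed defaults) computes it and picks the worst-encoded task:

```python
import numpy as np

def min_ell(L, theta, D, mu=1.0, n_iter=500):
    """min_s  mu*||s||_1 + ||theta - L s||_D^2, solved with a simple ISTA loop."""
    A, b = L.T @ D @ L, L.T @ D @ theta
    s = np.zeros(L.shape[1])
    step = 1.0 / (2 * np.linalg.norm(A, 2) + 1e-12)
    for _ in range(n_iter):
        z = s - step * 2 * (A @ s - b)
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
    resid = theta - L @ s
    return mu * np.abs(s).sum() + resid @ D @ resid

def diversity_select(L, candidates, mu=1.0):
    """Equation 9: pick the candidate task the current basis encodes worst.

    `candidates` maps task index -> (theta_hat, D_hat) from the single-task
    learner run on that task's labeled subset.
    """
    scores = {t: min_ell(L, th, D, mu) for t, (th, D) in candidates.items()}
    return max(scores, key=scores.get), scores
```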

The Diversity heuristic has connections to the work of Arthur and Vassilvitskii (2007) on choosing points to initialize the k-means algorithm. Their algorithm, k-means++, stochastically chooses to add a center to a candidate set of cluster centers with probability proportional to the squared distance between the candidate center and the closest current center. This selection rule encourages instances to be selected as centers that are poorly represented by the current set. We view this strategy as analogous to the Diversity heuristic in the sense that we measure squared distance based on the error of the current basis on coding the training data for a candidate task. Inspired by the k-means++ algorithm, we test a stochastic version of the Diversity heuristic that we call Diversity++, in which the probability of selecting a particular task for learning is:

\[
p(t_{\text{next}} = t) \propto \left(\min_{\mathbf{s}} \ell\left(\mathbf{L}_m, \mathbf{s}, \hat{\boldsymbol{\theta}}^{(t)}, \hat{\mathbf{D}}^{(t)}\right)\right)^2 . \quad (10)
\]
Note that Diversity is equivalent to arg max_t p(tnext = t).
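Reusing the Diversity scores computed above (here a hypothetical dict mapping task index to score), Diversity++ simply replaces the argmax with sampling proportional to the squared score:

```python
import numpy as np

def diversity_pp_select(scores, rng=None):
    """Equation 10: sample the next task with probability proportional to the
    squared Diversity score (analogous to k-means++ seeding)."""
    rng = np.random.default_rng() if rng is None else rng
    tasks = list(scores)
    p = np.array([scores[t] for t in tasks], dtype=float) ** 2
    p = p / p.sum()
    return tasks[rng.choice(len(tasks), p=p)]
```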

Targeted Curriculum Selection

So far, we have focused on active task selection to support the learning of an unknown set of future tasks. In this section, we explore a form of transfer learning (Pan and Yang 2010) in which the lifelong learner focuses on acquiring knowledge to support a specific target task, for which it has access to training data (X(target), y(target)). Its goal is to select learning tasks in order to quickly and effectively optimize those model components needed for the target task, thereby maximizing its performance on the target task (rather than on any arbitrary future task as before).

We derive the Targeted InfoMax method in a nearly identical fashion to general InfoMax, with the exception that instead of setting h = L, we let h = Ls(target), where s(target) is learned each iteration m via Equations 2 and 5. This formulation corresponds to maximizing the information about the model parameters for the target task. The objective for selecting the next task becomes:

\[
t_{\text{next}} = \arg\min_{t} \iint p\left(\boldsymbol{\theta}^{(t)} = \mathbf{u}, \mathbf{D}^{(t)} = \mathbf{V} \mid \mathcal{I}_m\right) H\left[\mathbf{L}\mathbf{s}^{(\text{target})} \mid \boldsymbol{\theta}^{(t)} = \mathbf{u}, \mathbf{D}^{(t)} = \mathbf{V}, \mathcal{I}_m\right] d\mathbf{u}\, d\mathbf{V} \; . \quad (11)
\]

Using similar approximations as for Equation 7, we arrive at the following rule for Targeted InfoMax task selection:

\[
t_{\text{next}} = \arg\max_{t \in \{T+1, \ldots, T_{\text{pool}}\}} \ln \left|\boldsymbol{\Psi}^{\top} \boldsymbol{\Sigma}^{(t)} \boldsymbol{\Psi}\right| \; , \quad (12)
\]
where $\boldsymbol{\Psi} = \mathbf{s}^{(\text{target})} \otimes \mathbf{I}_d$.
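A small sketch of this rule, reusing the covariance Σ(t) from the InfoMax sketch, is shown below; the reshaping of s(target) into a column vector for the Kronecker product is our own implementation detail:

```python
import numpy as np

def targeted_infomax_score(Sigma_t, s_target, d):
    """Equation 12: ln |Psi^T Sigma^(t) Psi| with Psi = s^(target) kron I_d.

    Sigma_t is the (k*d) x (k*d) covariance of vec(L) for candidate task t
    (see the InfoMax sketch); the next task is the argmax of this score.
    """
    Psi = np.kron(s_target.reshape(-1, 1), np.eye(d))   # shape (k*d, d)
    _, logdet = np.linalg.slogdet(Psi.T @ Sigma_t @ Psi)
    return logdet
```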


Evaluation

We evaluated the following methods of active task selection against random task selection: InfoMax, Targeted InfoMax, Diversity, and Diversity++. We assessed performance in two problem settings: (1) general knowledge acquisition, in which performance was evaluated against a pool of unknown tasks, and (2) targeted knowledge acquisition, in which performance was evaluated against a known target task. Each method of task selection was tested on four data sets:

Synthetic Regression Tasks: We created a set of Tmax = 100 random tasks with d = 13 features and nt = 100 instances per task. The task parameter vectors θ(t) were generated using a linear combination of k = 6 randomly generated latent components in R^12. The vectors s(t) had a sparsity level of 0.5 (i.e., only half the latent components were used to construct each θ(t)). Each instance in X(t) was generated from a standard normal distribution. The training labels for each task were given as y(t) = X(t)ᵀθ(t) + ε, where each element of ε is independent univariate Gaussian noise. A bias term was added as the 13th feature prior to learning.

London Schools Data: The London Schools data set consists of exam scores from 15,362 students in 139 schools. We treat the data from each school as a separate task. The goal is to predict the exam score of each student. We use the same feature encoding as used by Kumar and Daumé III (2012), where four school-specific and three student-specific categorical variables are encoded as a collection of binary features. We use the exam year and a bias term as additional features, giving each data instance d = 27 features.

Land Mine Detection: In the land mine data set (Xue et al. 2007), the goal is to detect whether or not a land mine is present in an area based on radar images. The 10 input features (plus a bias term) are extracted from radar data; see Xue et al. (2007) for details. The data set consists of 14,820 data instances in total, divided into 29 different geographical regions. We treat each region as a different task.

Facial Expression Recognition: This data set is from a recent facial expression recognition challenge (Valstar et al. 2011). Each task involves recognizing one of three different facial action units (#5: upper lid raiser, #10: upper lip raiser, and #12: lip corner pull) from an image of one of seven subjects' faces. There are a total of 21 tasks, each with 450–999 images. We use the same feature encoding as Ruvolo and Eaton (2013), using a multi-scale Gabor pyramid to extract 2,880 Gabor features for each image, then reducing them to 100 dimensions using PCA, and adding a bias term.
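Returning to the synthetic regression tasks, the generator below sketches the process described above; the noise scale and the way the half-sparse support of each s(t) is sampled are our assumptions, since those details are not fully specified:

```python
import numpy as np

def make_synthetic_tasks(T_max=100, d_latent=12, k=6, n_t=100,
                         noise_std=1.0, seed=0):
    """Generate synthetic regression tasks roughly as described above."""
    rng = np.random.default_rng(seed)
    basis = rng.standard_normal((d_latent, k))     # true latent components in R^12
    tasks = []
    for _ in range(T_max):
        active = rng.choice(k, size=k // 2, replace=False)   # half the components
        s = np.zeros(k)
        s[active] = rng.standard_normal(k // 2)
        theta = basis @ s                          # true task parameters
        X = rng.standard_normal((d_latent, n_t))
        y = X.T @ theta + noise_std * rng.standard_normal(n_t)
        X = np.vstack([X, np.ones((1, n_t))])      # bias term as the 13th feature
        tasks.append((X, y))
    return tasks
```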

Evaluation Procedure

For each task, we created random 50%/50% splits between training and hold-out test data. Additionally, we divided the set of tasks into two subsets: one set of training tasks that serves as the pool for active task selection and is used to learn L, and one set of evaluation tasks on which we measure the performance of the learned L. Therefore, Tpool ≈ (1/2)Tmax in these experiments. The agent never has access to the evaluation tasks, and so must select the curriculum order from the training tasks alone. For each training task, 25% of its training data was revealed as the initial sample of labeled data (recall that each task selection method requires a small amount of labeled data to measure the utility of learning a particular task). Each experiment was repeated 1,000 times to smooth out variability due to the factors discussed above.

The general knowledge acquisition experiments followed the procedure below to evaluate each task selection method:

1. Begin with no tasks learned; L is initialized randomly.
2. Select the next training task to learn using some method.
3. Learn the new task using Equations 2–4 to arrive at a new set of latent components L.
4. Compute models for each evaluation task using Equation 2 for the current L (no modification of L is allowed).
5. Record the performance of the evaluation task models on the hold-out test data, yielding an estimate of how well the current L captures knowledge for new tasks.
6. Go to step 2 while additional training tasks remain.

The experiments on targeted knowledge acquisition first chose a single target task from the set of evaluation tasks and then followed the same procedure, but instead measured performance only against that target task in Step 5.

The values of the parameter k and a ridge term Γ for the single-task regressors for ELLA were selected using a grid search over the values k ∈ {1, . . . , 10} and Γ ∈ {e−5, e−3, e−1, e1} to maximize performance on the evaluation tasks averaged over all of the task selection methods. The value of µ was set to 1 and the value of λ was set to e−5.

We are primarily concerned with the performance level achieved by each task selection method as a function of the number of tasks learned. Specifically, we measure how many fewer tasks (expressed as a percentage) are required for active task selection to achieve and maintain a particular performance level as compared to the number of tasks required for random task selection to achieve and maintain that same performance level.[2] We measured performance using the area under the ROC curve (AROC) for classification tasks (chosen due to the highly biased class distributions for the data sets) and -rMSE (negative root mean squared error) for regression tasks. A score of 0 on % Fewer Tasks Required indicates that the active task selection method is no more efficient than random selection, a negative score indicates that it is less efficient, and a positive score indicates that it is more efficient.
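The sketch below is a simplified, single-run illustration of this measure for one performance level; the paper additionally averages over many runs and performance levels, and applies the fallback described in footnote [2] when a level is never reached.

```python
import numpy as np

def tasks_to_maintain(curve, level):
    """Smallest number of tasks after which performance stays at or above `level`.

    `curve[i]` is the performance after learning i+1 tasks. A run that never
    maintains the level counts as one plus the total number of training tasks.
    """
    n = len(curve)
    for i in range(n):
        if all(p >= level for p in curve[i:]):
            return i + 1
    return n + 1

def pct_fewer_tasks(active_curve, random_curve, level):
    """% fewer tasks required by active selection to reach and maintain `level`,
    relative to random task selection (positive = more efficient)."""
    a = tasks_to_maintain(active_curve, level)
    r = tasks_to_maintain(random_curve, level)
    return 100.0 * (r - a) / r

# Toy example with made-up learning curves:
active = [0.60, 0.68, 0.74, 0.76, 0.78, 0.79]
random_ = [0.58, 0.62, 0.66, 0.70, 0.74, 0.77]
print(pct_fewer_tasks(active, random_, level=0.74))   # positive => fewer tasks
```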

Results

Figure 2a and Table 1a depict our results on general knowledge acquisition, showing that active task selection is almost always more efficient than random selection. Most encouragingly, Diversity achieves large gains in efficiency over random selection on all data sets. In some cases, Diversity achieves equivalent performance to random selection using approximately one-half the number of tasks.

InfoMax also shows strong gains in efficiency over random selection. However, the results for this method are not as consistent as those for Diversity, since we occasionally observe inefficient task selection (e.g., on the two classification data sets). Nonetheless, InfoMax achieves the best efficiency of all approaches on the London Schools data set.

[2] If a particular run did not achieve a specific level of performance, then the number of tasks required to achieve that level of performance was set to one plus the total number of training tasks.


[Figure 2: The results of active task selection for (a) general and (b) targeted knowledge acquisition. Panel (a) shows results for the Synthetic, London Schools, Land Mine, and Facial Expression data sets comparing Diversity, Diversity++, and InfoMax; panel (b) shows results for the Land Mine and Facial Expression data sets, additionally including Targeted InfoMax. Each plot shows the performance level achieved (−rMSE for regression, AROC for classification) versus the relative efficiency (% fewer tasks required) as compared to random task selection.]

Table 1: The results for (a) general and (b) targeted knowledge acquisition, as measured by the percent fewer tasks required by active task selection, averaged across all performance levels. The mean and standard deviation are reported.

(a) General Knowledge Acquisition

Data Set       InfoMax     Diversity    Diversity++
Land Mine      5.1±3.7     29.4±4.1     18.1±3.0
Facial Expr.   0.5±2.6     14.6±5.1     9.9±4.0
Syn. Data      10.2±7.9    20.2±6.7     17.0±5.9
London Sch.    29.8±6.8    21.0±3.1     26.2±3.1

(b) Targeted Knowledge Acquisition

Data Set       Targeted InfoMax   InfoMax     Diversity    Diversity++
Land Mine      17.9±2.7           -1.7±3.0    14.9±3.2     8.5±2.5
Facial Expr.   7.8±0.7            2.6±0.8     10.0±2.5     2.7±1.3
Syn. Data      38.4±7.5           11.4±5.6    19.9±4.9     16.6±5.0
London Sch.    26.9±1.8           20.1±2.8    22.3±1.1     16.4±2.7

Recall that Diversity++ is a stochastic version of Diversity, and therefore does not always select the task with the worst performance to learn next. The results show that Diversity++ was dominated on three of the four data sets by Diversity, affirming the benefits of selecting the task with the worst performance to learn next. However, both Diversity++ and InfoMax perform better than Diversity on the London Schools data set. Further investigation is needed to determine whether particular characteristics of the data set, such as the degree of sparsity of its input features, are important for determining when each method should be used.

Table 1b shows our results on targeted knowledge acquisition, with Figure 2b depicting extended results for selected data sets. These results show that targeted task selection is highly effective, revealing that Targeted InfoMax is the most efficient method for three out of the four data sets. Therefore, it appears that incorporating knowledge of the target task into InfoMax can lead to large gains in performance, not only over general InfoMax but also over Diversity, the best general task selection method.

Conclusion

We have considered the setting of active curriculum selection, in which a lifelong learner can select which task to learn next for either general or targeted knowledge acquisition. Our proposed Diversity heuristic is efficient and effective for general knowledge acquisition, achieving significant reductions in the number of tasks (up to 50%) required to obtain a particular performance level as compared to random selection. Our InfoMax method did not work as well as Diversity for general knowledge acquisition on three of four data sets. However, when applied to targeted knowledge acquisition, this approach achieved the best performance on gathering knowledge to solve a given target task. In future work, we intend to connect active task selection to higher-level learning goals and incorporate external guidance from a teacher.


References

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. SIAM.

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, 41–48. ACM.

Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research 4:129–145.

Kumar, A., and Daumé III, H. 2012. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, 1383–1390. Omnipress.

Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, 1189–1197.

Lee, Y. J., and Grauman, K. 2011. Learning the easy things first: Self-paced visual category discovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1721–1728. IEEE.

MacKay, D. J. 1992. Information-based objective functions for active data selection. Neural Computation 4:590–604.

Pan, S., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Rai, P., and Daumé III, H. 2010. Infinite predictor subspace models for multitask learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 613–620.

Ruvolo, P., and Eaton, E. 2013. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning.

Saha, A.; Rai, P.; Daumé III, H.; and Venkatasubramanian, S. 2010. Active online multitask learning. In ICML 2010 Workshop on Budget Learning.

Saha, A.; Rai, P.; Daumé III, H.; and Venkatasubramanian, S. 2011. Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 643–651.

Tong, S., and Koller, D. 2001. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2:45–66.

Valstar, M.; Jiang, B.; Mehu, M.; Pantic, M.; and Scherer, K. 2011. The first facial expression recognition and analysis challenge. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, 921–926. IEEE.

Xue, Y.; Liao, X.; Carin, L.; and Krishnapuram, B. 2007. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8:35–63.

Zhang, J.; Ghahramani, Z.; and Yang, Y. 2008. Flexible latent variable models for multi-task learning. Machine Learning 73(3):221–242.

Zhang, Y. 2010. Multi-task active learning with output constraints. In Proceedings of the 24th AAAI Conference on Artificial Intelligence.