An Introduction to Intertask Transfer for Reinforcement Learning

Matthew E. Taylor, Peter Stone

Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

Transfer learning has recently gained popularity due to the development of algorithms that can successfully generalize information across multiple tasks. This article focuses on transfer in the context of reinforcement learning domains, a general learning framework where an agent acts in an environment to maximize a reward signal. The goals of this article are to (1) familiarize readers with the transfer learning problem in reinforcement learning domains, (2) explain why the problem is both interesting and difficult, (3) present a selection of existing techniques that demonstrate different solutions, and (4) provide representative open problems in the hope of encouraging additional research in this exciting area.

Agents, physical and virtual entities that interact with their environment, are becoming increasingly prevalent. However, if agents are to behave intelligently in complex, dynamic, and noisy environments, we believe that they must be able to learn and adapt. The reinforcement learning (RL) paradigm is a popular way for such agents to learn from experience with minimal feedback. One of the central questions in RL is how best to generalize knowledge to successfully learn and adapt.

In reinforcement learning problems, agents sequentially observe their state and execute actions. The goal is to maximize a real-valued reward signal, which may be time delayed. For example, an agent could learn to play a game by being told what the state of the board is, what the legal actions are, and then whether it wins or loses at the end of the game. However, unlike in supervised learning scenarios, the agent is never provided the "correct" action. Instead, the agent can only gather data by interacting with an environment, receiving information about the results, its actions, and the reward signal. RL is often used because of the framework's flexibility and due to the development of increasingly data-efficient algorithms.

RL agents learn by interacting with the environment, gathering data. If the agent is virtual and acts in a simulated environment, training data can be collected at the expense of computer time. However, if the agent is physical, or the agent must act on a "real-world" problem where the online reward is critical, such data can be expensive. For instance, a physical robot will degrade over time and must be replaced, and an agent learning to automate a company's operations may lose money while training. When RL agents begin learning tabula rasa, mastering difficult tasks may be infeasible, as they require significant amounts of data even when using state-of-the-art RL approaches. There are many contemporary approaches to speed up "vanilla" RL methods. Transfer learning (TL) is one such technique.

Transfer learning is an umbrella term used when knowledge is generalized not just across a single distribution of input data but when knowledge can be generalized across data drawn from multiple distributions or even different domains, potentially reducing the amount of data required to learn a new task. For instance, one may train a classifier on one set of newsgroup articles (for example, rec.*) and then want the classifier to generalize to newsgroups with different subject material (for example, transfer to sci.*) (Dai et al. 2007). An example of human-level transfer was discussed in a recent issue of AI Magazine (Bushinsky 2009):

World [chess] champion [Mikhail] Tal once told how his chess game was inspired from watching ice hockey! He noticed that many of the hockey players had the habit of hanging around the goal even when the puck was far away from it … He decided to try transferring the idea to his chess game by positioning his pieces around the enemy king, still posing no concrete threats.

The insight that knowledge may be generalized across different tasks has been studied for years in humans (see Skinner [1953]) and in classification settings (see Thrun [1996]) but has only recently gained popularity in the context of reinforcement learning.

As discussed later in this article, there are many possible goals for transfer learning. In the most common scenario, an agent learns in a source task, and then this knowledge is used to bias learning in a different target task. The agent may have different knowledge representations and learning algorithms in the two tasks, and transfer acts as a conduit for transforming source-task knowledge so that it is useful in the target task. It is also reasonable to frame TL as one agent learning in a source task and a different agent learning in the target task, where transfer is used to share knowledge between the two agents. In most cases the two are algorithmically equivalent, as we assume we are able to directly access the source-task agent's knowledge so that direct copying into a "new" agent is equivalent to having the "old" agent switch tasks.

This article provides an introduction to the problem of reusing agents' knowledge so that it can be used to solve a different task or (equivalently) to transfer knowledge between agents solving different tasks. This article aims to (1) introduce the transfer problem in RL domains, (2) provide a sense of what makes the problem difficult, (3) present a sample of existing transfer techniques, and (4) discuss some of the shortcomings in current work to help encourage additional research in this exciting area. Readers interested in a significantly more detailed treatment of existing works are directed to our survey article (Taylor and Stone 2009).

In the following section we provide background on the reinforcement learning problem that should be sufficient to understand the remainder of the article. Readers interested in a more in-depth treatment of RL are referred elsewhere (Kaelbling, Littman, and Moore 1996; Sutton and Barto 1998; Szepesvári 2009). The next section introduces a pair of motivating domains to be used throughout the article. Following that, the Dimensions of Transfer section discusses the many ways in which transfer algorithms can differ in terms of their goals and abilities. A Selection of Transfer Learning Algorithms presents a selection of transfer algorithms in the context of the example domains, providing the reader with a brief introduction to existing solution techniques. Finally, we conclude with a discussion of open questions.

    Reinforcement Learning Background and Notation

Reinforcement learning problems are typically framed as Markov decision processes (MDPs) defined by the 4-tuple {S, A, T, R}, where the goal of the agent is to maximize a single real-valued reward signal. In this article, we define a domain to be a setting for multiple related tasks, where MDP and task are used interchangeably. An agent perceives the current state of the world s ∈ S (possibly with noise). The agent begins in a start state (drawn from a subset of states, S0). If the task is episodic, the agent executes actions in the environment until it reaches a terminal state (a state in the set Sfinal, which may be referred to as a goal state if the reward there is relatively high), at which point the agent is returned to a start state.

The set A describes the actions available to the agent, although not every action may be possible in every state. The transition function, T : S × A → S, takes a state and an action and returns the state of the environment after the action is performed. Transitions may be nondeterministic, making the transition function a probability distribution function. The reward function, R : S → ℝ, maps every state to a real-valued number, representing the reward achieved for reaching the state. An agent senses the current state, s, and typically knows A and what state variables compose S; however, it is generally not given R or T.

A policy, π : S → A, fully defines how an agent interacts with the environment by mapping perceived environmental states to actions. The success of an agent is determined by how well it maximizes the total reward it receives in the long run while acting under some policy π. There are many possible approaches to learning such a policy, but this article will focus on temporal difference (TD) methods, such as Q-learning (Sutton 1988; Watkins 1989) and Sarsa (Rummery and Niranjan 1994; Singh and Sutton 1996), which learn to approximate Q : S × A → ℝ. At any time during learning, an agent can choose to exploit, by selecting the action that maximizes Q(s, a) for the current state, or to explore, by taking a random action. TD methods for RL are among the most simple and well understood, causing them to be used in many different transfer learning works; some transfer learning methods can be explained using TD algorithms but then be combined with more complex RL algorithms in practice.
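To make these TD updates concrete, the following sketch shows tabular Q-learning and Sarsa with ε-greedy action selection. It is a minimal illustration rather than code from the article: the environment interface (reset and step returning a next state, reward, and done flag) and all parameter values are assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch of tabular TD control (Q-learning / Sarsa), assuming an
# environment with env.reset() -> s and env.step(a) -> (s', r, done).
def td_control(env, actions, episodes=500, alpha=0.1, gamma=0.95,
               epsilon=0.1, method="q_learning"):
    Q = defaultdict(float)                      # Q[(state, action)], initialized to 0

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    def epsilon_greedy(s):
        # Explore with probability epsilon, otherwise exploit the current Q.
        return random.choice(actions) if random.random() < epsilon else greedy(s)

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(s2)
            if method == "sarsa":               # on-policy target: action actually taken next
                target = r + (0 if done else gamma * Q[(s2, a2)])
            else:                               # Q-learning target: greedy action in s'
                target = r + (0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```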

In tasks with small, discrete state spaces, Q and π can be fully represented in a table. As the state space grows, using a table becomes impractical, or impossible if the state space is continuous. TL methods are particularly relevant in such large MDPs, as these are precisely the problems that require generalization across states and may be particularly slow to learn. Agents in such tasks typically factor the state using state variables (or features), so that s = (x1, x2, …, xn). In such cases, RL methods use function approximators, such as artificial neural networks and tile coding, which rely on concise, parameterized functions and use supervised learning methods to set these parameters. Parameters and biases in the approximator define the state-space abstraction, allowing observed data to influence a region of state-action values, rather than just a single state-action pair, and can thus substantially increase the speed of learning.
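The sketch below illustrates this kind of generalization with a linear value function over binary features, a simplified stand-in for tile coding. The caller is assumed to map each raw state to the indices of its active features; nothing here is taken from a particular library.

```python
import numpy as np

# Sketch of Q-value function approximation with a linear model over binary
# features (a simplified stand-in for tile coding). The caller is assumed to
# convert a raw (possibly continuous) state into a list of active feature indices.
class LinearQ:
    def __init__(self, n_features, actions, alpha=0.1):
        self.w = {a: np.zeros(n_features) for a in actions}   # one weight vector per action
        self.alpha = alpha

    def value(self, active_features, a):
        # Q(s, a) is the sum of the weights of the features active in state s.
        return self.w[a][active_features].sum()

    def update(self, active_features, a, td_target):
        # Move every active feature's weight toward the TD target, so the
        # update generalizes to nearby states that share those features.
        error = td_target - self.value(active_features, a)
        self.w[a][active_features] += self.alpha * error
```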

    Two additional concepts used to improve theperformance of RL algorithms will be referencedlater in this article. The first, options (Sutton, Pre-cup, and Singh 1999), are a type of temporalabstraction. Such macroactions group a series ofactions together and may be hand-specified orlearned, allowing the agent to select actions thatexecute for multiple time steps. The second,reward shaping (Mataric 1994), allows agents touse an artificial reward signal (R) rather than theMDP’s true reward (R). For instance, a humancould modify the reward of an agent in a maze sothat it received additional reward as it approachedthe goal state, potentially allowing the agent tolearn faster.
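As a hedged illustration of the maze example just given, the sketch below wraps an environment so the learner sees a shaped reward R′ that adds a small bonus for moving closer to the goal. The distance helper, environment interface, and bonus size are illustrative assumptions, not the specific method of Mataric (1994).

```python
# Sketch of reward shaping for the maze example above: the agent learns from a
# modified reward R' that adds a bonus for moving closer to the goal. The
# distance_to_goal helper and the env interface are assumptions for illustration.
class ShapedMaze:
    def __init__(self, env, distance_to_goal, bonus=0.1):
        self.env, self.dist, self.bonus = env, distance_to_goal, bonus

    def reset(self):
        self.s = self.env.reset()
        return self.s

    def step(self, a):
        s2, r, done = self.env.step(a)
        if self.dist(s2) < self.dist(self.s):   # moved closer to the goal
            r += self.bonus                      # shaping term added to the true reward
        self.s = s2
        return s2, r, done
```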

Motivating Domains

This section provides two example RL domains in which transfer has been successful and that will be used to discuss particular transfer methods later in this article. The first, maze navigation, contains many possible tasks, such as having different wall locations or goal states. In the second, robot soccer Keepaway, tasks typically differ in the number of players on each team, which results in different state representations and actions.

These two domains are selected because they have been used in multiple works on TL and because their properties complement each other.

Maze navigation tasks are typically single-agent tasks situated in a fully observable, discrete state space. These relatively simple tasks allow for easy analysis of TL methods and represent a simplification of a number of different types of agent-navigation tasks. Keepaway, in contrast, is a multiagent task that models realistic noise in sensors and actuators, requiring significantly more data (that is, time) to learn successful policies. Additionally, the number of state variables and number of actions change between tasks (as described below), introducing a significant challenge not necessarily seen in mazelike domains.

Maze Navigation

Learning to navigate areas or mazes is a popular class of RL problems, in part because learned policies are easy to visualize and to understand. Although the formulation differs from paper to paper, in general the agent operates in a discrete state space and can select from four movement actions {MoveNorth, MoveSouth, MoveEast, MoveWest}. The agent begins in a start state, or set of start states, and receives a positive reward upon reaching a goal state based on how many actions it has executed. The transition function is initially unknown and is determined by the location of the maze's walls.

As discussed later, multiple researchers have experimented with speeding up learning with TL by focusing on the structure of navigation tasks. For instance, a policy learned on a source-task maze could be useful when learning a target-task maze with slightly different walls (for example, transfer from figure 1a to figure 1b). Alternatively, it is often the case that navigating within rooms is similar, in that the goal is to traverse a relatively empty area efficiently in order to reach a doorway or corridor (for example, transfer from figure 1b to figure 1c).

There are many ways in which source and target tasks can differ, which dictate what abilities a transfer algorithm would need in order to be applicable. In the maze domain, for instance, differences could include the following (an illustrative sketch of two such task variants follows the list):

T: wall placement could change or the actions could be made stochastic;
S: the size of the maze could change;
S0: the agent could start in different locations;
Sfinal: the goal state could move;
State variables: the agent's state could be an index into a discrete state space or could be composed of x, y coordinates;
R: the agent could receive a reward of –1 on every step or the reward could be a function of the distance from the goal state;
A: additional actions, such as higher-level macro actions, could be enabled or disabled.
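To make these differences concrete, here is a small illustrative sketch of a source and a target maze that share S and A but differ in T (wall placement) and Sfinal (goal location). All coordinates and sizes are invented for illustration.

```python
# Illustrative sketch: two maze tasks that share S and A but differ in wall
# placement (and therefore T) and goal location (S_final). Values are made up.
ACTIONS = ["MoveNorth", "MoveSouth", "MoveEast", "MoveWest"]

source_task = {
    "size": (10, 10),
    "walls": {(3, y) for y in range(1, 9)},    # a vertical wall
    "start": (0, 0),
    "goal": (9, 9),
}

target_task = {
    "size": (10, 10),                           # same state space
    "walls": {(x, 5) for x in range(2, 10)},    # different wall placement -> different T
    "start": (0, 0),                            # same start state
    "goal": (9, 0),                             # different goal -> different S_final
}
```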


Keepaway

The game of Keepaway is a subproblem in the RoboCup Soccer Server (Noda et al. 1998). A team of independently controlled keepers attempt to maintain possession of a ball in a square area while a team of takers try to gain possession of the ball or kick it out of bounds. The keeper with the ball chooses between macro actions, which depend on the number of keepers. The keepers receive a reward of +1 for every time step that the ball remains in play. The episode finishes when a taker gains control of the ball or the ball is kicked out of bounds. In the standard version of Keepaway, keepers attempt to learn to possess the ball as long as possible, increasing the average episode length, while the takers follow a fixed policy.

In 3 versus 2 Keepaway (see figure 2), there are three keepers and two takers. Rather than learning to execute low-level actuator movements, the keepers instead select from three hand-designed macro actions: {Hold, Pass1, Pass2}, corresponding to "maintain possession," "pass to closest teammate," and "pass to second closest teammate." Keepers that do not possess the ball follow a hand-coded policy that attempts to capture an open ball or moves to get open for a pass. Learning which of the three actions to select on a given time step is challenging because the actions are stochastic, the observations of the agents' state are noisy, and the 13-dimensional state (described in table 1) is continuous (necessitating function approximation). This task has been used by multiple researchers, in part due to freely available players and benchmark performances distributed by the University of Texas at Austin (Stone et al. 2006).1

Figure 1. Three Example Mazes to Motivate Transfer Learning (panels a, b, and c).

Figure 2. The Distances and Angles Used to Construct the 13 State Variables Used in 3 versus 2 Keepaway. Relevant objects are the five players, ordered by distance from the ball, and the center of the field.

If more players are added to the task, Keepaway becomes harder for the keepers because the field becomes more crowded. As more takers are added, there are more opponents to block passing lanes and chase down errant passes. As more keepers are added, the keeper with the ball has more passing options, but the average pass distance is shorter. This reduced distance forces more passes and often leads to more missed passes because of the noisy actuators and sensors. Additionally, because each keeper learns independently and can only select actions when it controls the ball, adding more keepers to the task means that keepers receive less experience per time step. For these reasons, keepers in 4 versus 3 Keepaway (that is, 4 keepers versus 3 takers) require more simulator time to learn a good control policy, compared to 3 versus 2 Keepaway.

In 4 versus 3 Keepaway (see figure 3), the reward function is very similar: there is a +1 to the keepers on every time step the ball remains in play. Likewise, the transition function is similar because the simulator is unchanged. Now A = {Hold, Pass1, Pass2, Pass3}, and each state is composed of 19 state variables due to the added players. It is also important to point out that the addition of an extra taker and keeper in 4 versus 3 results in a qualitative change to the keepers' task. In 3 versus 2, both takers must go toward the ball because two takers are needed to capture the ball from the keeper. In 4 versus 3, the third taker is now free to roam the field and attempt to intercept passes. This changes the optimal keeper behavior, as one teammate is often blocked from receiving a pass by a taker.

TL results in the Keepaway domain often focus on transferring between different instances of the simulated robot soccer domain with different numbers of players. Many authors have studied transfer in this domain because (1) it was one of the first RL domains to show successful transfer learning results, (2) learning is nontrivial, allowing transfer to significantly improve learning speeds, and (3) different Keepaway tasks have well-defined differences in terms of both state variables and action sets. When transferring from 3 versus 2 to 4 versus 3, there are a number of salient changes that any TL algorithm must account for, such as state variables (the agent's state variables must change to account for the new players); A (there are additional actions when the number of possible pass receivers increases); and S0 (adding agents to the field causes them to start in different positions on the field).

Table 1. All State Variables Used for Representing the State of 3 Versus 2 Keepaway. C is the center of the field, dist(X, Y) is the distance between X and Y, and ang(X, Y, Z) is the angle formed by X, Y, Z where X is the vertex. The state is egocentric for the keeper with the ball and rotationally invariant.

dist(K1, C): Distance from keeper with ball to center of field
dist(K1, K2): Distance from keeper with ball to closest teammate
dist(K1, K3): Distance from keeper with ball to second closest teammate
dist(K1, T1): Distance from keeper with ball to closest taker
dist(K1, T2): Distance from keeper with ball to second closest taker
dist(K2, C): Distance from closest teammate to center of field
dist(K3, C): Distance from second closest teammate to center of field
dist(T1, C): Distance from closest taker to center of field
dist(T2, C): Distance from second closest taker to center of field
Min(dist(K2, T1), dist(K2, T2)): Distance from nearest teammate to its nearest taker
Min(dist(K3, T1), dist(K3, T2)): Distance from second nearest teammate to its nearest taker
Min(ang(K2, K1, T1), ang(K2, K1, T2)): Angle of passing lane from keeper with ball to closest teammate
Min(ang(K3, K1, T1), ang(K3, K1, T2)): Angle of passing lane from keeper with ball to second closest teammate

Dimensions of Transfer

We now turn our attention to ways in which transfer learning methods can differ, providing examples in the maze and Keepaway domains. The following sections will enumerate and discuss the different dimensions, both providing an overview of what is possible with transfer in RL and describing how TL algorithms may differ. First, there are a number of different possible goals for transfer in RL domains. Second, the similarity of source and target tasks plays an important role in determining what capabilities a TL algorithm must have in order to be applied effectively (that is, how the source task is mapped to the target task in figure 4a). Third, transfer methods can differ in terms of what RL methods they are compatible with (that is, how the source task and target-task agents learn their policies). Fourth, different TL methods transfer different types of knowledge (that is, what is transferred between the source-task and target-task agents).

Possible Transfer Goals

Current transfer learning research in RL does not have a standard set of evaluation metrics, in part because there are many possible options. For instance, it is not always clear how to treat learning in the source task: whether to charge it to the TL algorithm or to consider it a "sunk cost." One possible goal of transfer is to reduce the overall time required to learn a complex task. In this scenario, a total time scenario, which explicitly includes the time needed to learn the source task, would be most appropriate. Another reasonable goal of transfer is to effectively reuse past knowledge in a novel task. In such a target-task time scenario, it is reasonable to account only for the time used to learn the target task. The majority of current research focuses on the second scenario, ignoring time spent learning source tasks by assuming that the source tasks have already been learned — the existing knowledge can either be used to learn a new task or simply ignored.

The degree of similarity between the source task and the target task will have markedly different effects on TL metrics, depending on whether the total time scenario or target-task time scenario applies. For instance, suppose that the source task and target task were identical (thus technically not an instance of TL). In the target-task time scenario, learning the source task well would earn the system high marks. However, in the total time scenario, there would be little performance change from learning on a single task. On the other hand, if the source and target task were unrelated, transfer would likely provide little benefit, or possibly even hurt the learner (that is, cause negative transfer, as discussed later as a current open question).

Many metrics to measure the benefits of transfer in the target task are possible, such as those shown in figure 4b (replicated from our past transfer learning work [Taylor and Stone 2007a]). The jump start is the difference in initial performance of an agent using transfer relative to learning without transfer. The asymptotic performance (that is, the agent's final learned performance) may be changed through transfer. The total reward accumulated by an agent (that is, the area under the learning curve) may differ if it uses transfer, compared to learning without transfer. The transfer ratio is defined as the total reward accumulated by the transfer learner divided by the total reward accumulated by the nontransfer learner. The time to threshold measures the learning time (or samples) needed by the agent to achieve a prespecified performance level.

Transfer may be considered a success if metric 1 is greater than zero, metrics 2–3 are increased with transfer, metric 4 is greater than one, or if metric 5 is reduced through transfer. While these metrics are domain independent, none are useful to directly evaluate the relative performance of transfer learning algorithms unless the tasks, learning methods, and representations are constant (for example, it is generally not useful to directly compare the jump start of two different transfer learning algorithms if the algorithms were tested on different pairs of source and target tasks).
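Assuming each learning curve is recorded as a list of per-episode performance values, the five metrics above can be computed directly, as in the following sketch; the averaging windows and the data format are illustrative assumptions.

```python
# Sketch of the five transfer metrics above, computed from two learning curves
# (lists of per-episode performance). Averaging windows are illustrative choices.
def transfer_metrics(transfer_curve, no_transfer_curve, threshold, window=10):
    jump_start = (sum(transfer_curve[:window]) / window
                  - sum(no_transfer_curve[:window]) / window)
    asymptotic_gain = (sum(transfer_curve[-window:]) / window
                       - sum(no_transfer_curve[-window:]) / window)
    total_transfer = sum(transfer_curve)          # area under the learning curve
    total_no_transfer = sum(no_transfer_curve)
    transfer_ratio = total_transfer / total_no_transfer

    def time_to_threshold(curve):
        # First episode at which the performance threshold is reached (None if never).
        return next((i for i, p in enumerate(curve) if p >= threshold), None)

    return {
        "jump start": jump_start,
        "asymptotic gain": asymptotic_gain,
        "total reward (transfer)": total_transfer,
        "transfer ratio": transfer_ratio,
        "time to threshold (transfer)": time_to_threshold(transfer_curve),
        "time to threshold (no transfer)": time_to_threshold(no_transfer_curve),
    }
```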

Improvement due to transfer from 3 versus 2 Keepaway into 4 versus 3 Keepaway could be measured, for example, by (1) improving the average episode time in 4 versus 3 at the very beginning of learning in the target task, (2) learning to maintain possession of the ball for longer episodes in 4 versus 3, (3) increasing the area under the 4 versus 3 Keepaway learning curve when measuring the average episode versus the simulator learning time, (4) increasing that same area under a 4 versus 3 Keepaway learning curve relative to learning without transfer, and (5) reducing the time required by keepers to learn to reach an average episode length of 21.0 simulator seconds in 4 versus 3 Keepaway.

Figure 3. 4 Versus 3 Keepaway. 4 Versus 3 Keepaway has seven players in total and the keepers' state is composed of 19 state variables, analogous to 3 versus 2 Keepaway.

Task Similarity

One way to categorize transfer methods is to consider how the source task and target tasks are related. For instance, are the two tasks similar enough for the learned source-task knowledge to be directly used in the target task, or must the source-task knowledge be modified before being used in the target task? Possible variants along this dimension are as follows.

Figure 4. Transfer in RL. In general, transfer in RL can be described as transferring information from a source-task agent to a target-task agent (figure 4a). Many different metrics to evaluate TL algorithms are possible. Figure 4b plots performance against training time (sample complexity) for transfer and no-transfer learners and shows benefits to the jump start, the asymptotic performance, the total reward (the area under a learning curve), the reward area ratio (the ratio of the area under the transfer to the area under the nontransfer learning curve), and the time to threshold.

Knowledge is "copied" from a source-task agent directly into a target-task agent. Such methods are typically simple, but require the source and target tasks to be very similar. Example: a policy learned in a source maze is used directly in a target-task maze.

Knowledge from a source-task agent is modified before being used by a target-task agent. This modification leverages a hand-coded or learned relationship between the source and target task. Such methods enable the source and target tasks to differ significantly but require knowledge about how the two tasks are similar. Example: a Q-value function is learned in 3 versus 2 Keepaway, transformed to account for novel state variables and actions, and then used to initialize keepers in 4 versus 3.

The TL method learns how a pair of tasks are similar and then uses this similarity to modify source-task agent knowledge for use by the target-task agent. Such methods are much more flexible but may be impractical in the general case. Example: rather than using human knowledge, an algorithm discovers how state variables and actions are similar in 3 versus 2 and 4 versus 3 Keepaway, and then transfers a state-action value function between the two tasks.

Relatedness of Agents

TL algorithms are often designed to be able to accommodate specific types of differences between the source-task agents and target-task agents, providing another dimension along which to classify TL algorithms. Do agents in the two tasks need to use the same learning method, the same class of learning methods, or the same knowledge representation? For example, a particular transfer algorithm may require one of the following relationships between source and target agents. Although each of the following four options has been explored in the literature, the first is the most common.

In option one, the source and target agents must use the same learning method. For example, keepers in 3 versus 2 use Sarsa with tile coding and keepers in 4 versus 3 use Sarsa with tile coding.

In option two, the source and target agents may use different learning methods, but they must be similar. Example: keepers in 3 versus 2 use Sarsa with tile coding and keepers in 4 versus 3 use Q(λ) with tile coding. (Both Sarsa and Q-learning are temporal difference methods.)

In option three, the source and target agents must use a similar learning method, and the representation can change. Example: an agent learns to navigate a source-task maze using Sarsa with a Q-table (that is, no function approximation) and then an agent learns to navigate a target-task maze using Q(λ) with tile coding.

In option four, the source and target agents may use different learning methods and representations; for example, keepers in 3 versus 2 use Sarsa with tile coding and keepers in 4 versus 3 use direct policy search with a neural network. As with task similarity, TL methods that are more flexible are often more complex and require more data to be effective.

What Can Be Transferred

Lastly, TL methods can be classified by the type of knowledge transferred and its specificity. Low-level knowledge, such as (s, a, r, s′) instances, an action-value function (Q), a policy (π), and a full model of the source task may all be transferred between tasks. Higher-level knowledge, such as partial policies (for example, options) and shaping rewards (where a modified reward signal helps guide the agent), may not be directly used by the algorithm to fully define an initial policy in a target task, but such information may help guide the agent during learning.

All of these types of knowledge have been successfully transferred in maze domains and Keepaway domains. The benefits and trade-offs of these different types of knowledge have not yet been examined, but we anticipate that different transfer situations will favor different types of transfer information.

    A Selection of Transfer Learning Algorithms

One good way to gain a sense of the state of the art is through examples of past successful results. This section provides a selection of such methods based on our two example domains. First, we investigate methods that act in maze or navigation domains. Second, we discuss methods that transfer between simulated robot soccer games. Third, we highlight two methods to learn relationships between tasks.

Table 2 gives a summary of how the transfer methods discussed in our two example domains differ. The first three methods copy knowledge directly between tasks without any modification. The second set of papers modify knowledge gained in a source task before using it in a target task. The third set learns how two tasks are related so that they may modify source-task knowledge for use in a target task.

Salient differences between source and target tasks include having different start states, goal states, problem spaces (described in the text), actions, reward functions, or state representations. When learning how tasks are similar, methods may rely on a qualitative description of the transition function, state variable groupings (groups), or experience gathered in the target task (exp), all of which are explained later in the text. Policies, reward functions, options, Q-value functions, rules, and (s, a, r, s′) instances may be transferred between the source task and target task. The metrics used are the total reward (tr), jump start (j), asymptotic performance (ap), and time to threshold (tt). All papers measure target-task learning time (that is, do not count time spent learning in the source task), and those marked with a † also measure the total learning time.

Maze Navigation

To begin our discussion of current transfer learning methods, we examine a set of algorithms that tackle transfer in navigation tasks, for example, by transferring a policy between tasks (see figure 5). Both transfer methods discussed in this section assume that agents in the source and target task use the same TD learning algorithm and function approximators, and they only consider the target-task time scenario.

Probabilistic Policy Reuse. Probabilistic policy reuse (Fernández and Veloso 2006) considers maze tasks in which only the goal state differs. In this work, the method allows a single goal state to differ between the tasks but requires that S, A, and T remain constant. Action selection is performed by the π-reuse exploration strategy. When the agent is placed in a novel task, on every time step, it can choose to exploit a learned source-task policy, randomly explore, or exploit the current learned policy for the target task. By setting these exploration parameters, and then decaying them over time, the agent balances exploiting past knowledge, improving upon the current task's policy by measuring its efficacy in the current task, or exploring in the hope of finding new actions that will further improve the current task's policy, eventually converging to a near-optimal policy. Results show that using the π-reuse exploration to transfer from a single source task performs favorably to learning without transfer, although performance may be decreased when the source task and target task have goal states in very different locations. Performance is measured by the total reward accumulated in a target task.
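A minimal sketch of π-reuse-style action selection follows, under the assumptions that the past policy is a callable, the current Q-estimates are a dictionary, and the reuse probability decays exponentially; the parameter values and decay schedule are illustrative rather than those of Fernández and Veloso (2006).

```python
import random

# Sketch of pi-reuse-style action selection: exploit a past (source-task) policy
# with probability psi, otherwise act epsilon-greedily on the target task's Q.
# psi decays over episodes so the agent gradually relies on its own policy.
def pi_reuse_action(state, past_policy, Q, actions, psi, epsilon=0.1):
    if random.random() < psi:
        return past_policy(state)                       # reuse source-task knowledge
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit current estimates

def psi_schedule(episode, psi0=1.0, decay=0.95):
    # Illustrative exponential decay of the reuse probability.
    return psi0 * (decay ** episode)
```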

In addition to transfer from a single source task, Fernández and Veloso's PRQ-Learning method is one of the few that can transfer from multiple source tasks (Fernández and Veloso 2006). PRQ-Learning extends π-reuse exploration so that more than one past policy can be exploited. As past policies are exploited, their usefulness in the current task is estimated, and over time a softmax selection strategy (Sutton and Barto 1998) makes the most useful past policies most likely to be used. Finally, the policy library through policy reuse algorithm allows learned policies to be selectively added to a policy library. Once a task is learned, the new policy can be added to the library if it is dissimilar from all policies existing in the library, according to a parameterized similarity metric. This metric can be tuned to change the number of policies admitted to the library. Increasing the number of policies in the library increases the chance that a past policy can assist learning a new task, at the expense of increased policy evaluation costs in PRQ-Learning. This approach is similar in spirit to that of case-based reasoning, where the similarity or reuse function is defined by the parameterized similarity metric. In later work Fernández, García, and Veloso (2010) extend the method so that it can be used in Keepaway as well.

Table 2. A High-Level Overview of TL Algorithms. Abbreviations are explained in the text. Columns: Paper; Task Differences; Transferred Knowledge; Knowledge to Learn Mapping; Metrics Used.

Knowledge Copied: Maze Navigation Tasks
Fernandez and Veloso 2006: start and goal states; π; N/A; tr
Konidaris and Barto 2006: problem-space; R; N/A; j, tr

Knowledge Modified: Simulated Robot Soccer
Taylor, Stone, and Liu 2007: A, s-vars; Q; N/A; tt†
Torrey et al. 2006: A, R, s-vars; rules; N/A; j, tr
Torrey et al. 2006: A, R, s-vars; options; N/A; j, tr
Taylor and Stone 2007a: A, R, s-vars; rules; N/A; j, tr, tt†

Learning Transfer Mappings
Liu and Stone 2006: A, s-vars; N/A; T (qualitative); N/A
Soni and Singh 2006: A, s-vars; options; groups, exp; ap, j, tr
Taylor, Jong, and Stone 2008: A, s-vars; instances; exp; ap, tr

Agent and Problem Space. Rather than transfer policies from one task to another, Konidaris and Barto (2006) instead transfer a learned shaping function. A shaping function provides higher-level knowledge, such as "get rewarded for moving toward a particular beacon," which may transfer well when the source task and target task are so dissimilar that direct policy transfer would not be useful. In order to learn the shaping reward in the source task, the standard problem is separated into agent-space and problem-space representations. The agent space is determined by the agent's capabilities, which remain fixed (for example, physical sensors and actuators), and the space may be non-Markovian.2 For instance, the agent may have a sensor that is always able to determine the direction of a beacon, and this sensor may be used in any task the agent faces. The problem space, on the other hand, may change between source and target problems and is assumed to be Markovian. The shaping reward is learned in agent space, since this will be unchanged in the target task.

For this method to work, we acknowledge that the transfer must be between reward-linked tasks, where "the reward function in each environment consistently allocates rewards to the same types of interactions across environments." For instance, an agent may learn to approach a beacon in the source task and transfer this knowledge as a shaping reward, but if the target rewards the agent for moving away from a beacon, the transferred knowledge will be useless or possibly even harmful. Determining whether or not a sequence of tasks meets this criterion is currently unaddressed, but it may be quite difficult in practice. This problem of determining when transfer will and will not be effective is currently one of the more pressing open problems, which may not have a simple solution.

In our opinion, the idea of agent and problem spaces should be further explored as they will likely yield additional benefits. For instance, discovering when tasks are "reward-linked" so that transfer will work well would make the methods more broadly applicable. Particularly in the case of physical agents, it is intuitive that agent sensors and actuators will be static, allowing information to be easily reused. Task-specific items, such as features and actions, may change, but should be faster to learn if the agent has already learned something about its unchanging agent space.

Figure 5. Many Methods Directly Use a Learned Source-Task Policy to Initialize an Agent's Policy in the Target Task, or to Bias Learning in the Target Task.

Keepaway

In this section we discuss a selection of methods that have been designed to transfer knowledge between Keepaway tasks. As discussed earlier, this can be a particularly difficult challenge, as the state variables and actions differ between different versions of Keepaway. All methods in this section require that the target task is learned with a TD learner.

We will first introduce intertask mappings, a construct to enable transfer between agents that act in MDPs with different state variables and actions. We then discuss a handful of methods that use intertask mappings to transfer in the simulated robot soccer domain.

Intertask Mapping. To enable TL methods to transfer between tasks that do have such differences, the agent must know how the tasks are related. This section provides a brief overview of intertask mappings (Taylor, Stone, and Liu 2007), one formulation of task mappings. Such mappings can be used by all methods discussed in this section and can be thought of as similar to how case-based reasoning decides to reuse past information.

To transfer effectively, when an agent is presented with a target task that has a set of actions, A, it must know how those actions are related to the action set in the source task. (For the sake of exposition we focus on actions, but an analogous argument holds for state variables.) If the TL method knows that the two action sets are identical, no action mapping is necessary. However, if this is not the case, the agent needs to be told, or learn, how the two tasks are related. One solution is to define an action mapping, χA, such that actions in the two tasks are mapped so that their effects are "similar," where similarity is loosely defined so that it depends on how the action affects the agent's state and what reward is received.3 Figure 6 depicts an action mapping as well as a state-variable mapping (χX) between two tasks. A second option is to define a partial mapping (Taylor, Whiteson, and Stone 2007), such that any novel actions or novel state variables in the target task are not mapped to anything in the source task (and are thus ignored). Because intertask mappings are not functions, they are typically assumed to be easily invertible (that is, mapping source-task actions into target-task actions, rather than target-task actions to source-task actions).

Figure 6. χA and χX Are Independent Mappings That Describe Similarities between Two MDPs. These mappings describe how actions in the target task are similar to actions in the source task and how state variables in the target task are similar to state variables in the source task, respectively.

For a given pair of tasks, there could be many ways to formulate intertask mappings. Much of the current TL work assumes that a human has provided a (correct) mapping to the learner. In the Keepaway domain, we are able to construct intertask mappings between states and actions in the two tasks based on our domain knowledge. The choice for the mappings is supported by empirical evidence showing that using these mappings does allow us to construct transfer functions that successfully reduce training time.

We define χA, the intertask action mapping for the two tasks, by identifying actions that have similar effects on the world state in both tasks (see table 3). For the 4 versus 3 and 3 versus 2 Keepaway tasks, the action "Hold ball" is equivalent because this action has a similar effect on the world in both tasks. Likewise, the first two pass actions are analogous in both tasks. We map the novel target action "Pass to third closest keeper" to "Pass to second closest keeper" in the source task. To define a partial action mapping between 4 versus 3 and 3 versus 2, the "Pass to third closest keeper" 4 versus 3 action would be ignored, as there is no direct correspondence in 3 versus 2. The state variable mapping for 4 versus 3 to 3 versus 2 can likewise be defined (Taylor, Stone, and Liu 2007).

Table 3. χA: 4 Versus 3 Keepaway to 3 Versus 2 Keepaway. This table describes the (full) action mapping from 4 versus 3 to 3 versus 2 (4 versus 3 action → 3 versus 2 action).
Hold Ball → Hold Ball
Pass1 → Pass1
Pass2 → Pass2
Pass3 → Pass2
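Written as code, the full mapping of table 3 (and the corresponding partial mapping) is simply a lookup from 4 versus 3 actions to 3 versus 2 actions; a state-variable mapping χX would be defined analogously.

```python
# The full intertask action mapping of table 3, written as a lookup from
# 4 versus 3 actions to 3 versus 2 actions. A partial mapping simply leaves the
# novel action out. (A state-variable mapping chi_X would be defined analogously.)
CHI_A_FULL = {
    "Hold Ball": "Hold Ball",
    "Pass1": "Pass1",
    "Pass2": "Pass2",
    "Pass3": "Pass2",      # novel target-task action mapped to the most similar source action
}

CHI_A_PARTIAL = {
    "Hold Ball": "Hold Ball",
    "Pass1": "Pass1",
    "Pass2": "Pass2",      # Pass3 is ignored: no direct correspondence in 3 versus 2
}
```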

Value Function Transfer. Value function transfer (Taylor, Stone, and Liu 2007) differs from the methods previously discussed as it makes use of intertask mappings, as well as transferring a different type of knowledge. Rather than transferring a policy or even a shaping reward, the authors extract Q-values learned in the source task and then use hand-coded intertask mappings to transfer learned action-value functions from one Keepaway task to another (see figure 7). It may seem counterintuitive that low-level action-value function information is able to speed up learning across different tasks — rather than using abstract domain knowledge, this method transfers very task-specific values. When this knowledge is used to initialize the action-value functions in a different Keepaway task, the agents receive a small jump start, if any, because the Q-values are not "correct" for the new task. However, as experiments show, the transferred knowledge biases the learners so that they are able to learn faster.
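The sketch below conveys the idea of value function transfer with a plain dictionary of Q-values; the published method operated on tile-coding function approximators, so this is only the flavor of the approach, and the helper names and mapping representations are assumptions.

```python
# Sketch of value function transfer: initialize the target task's Q-values from
# the source task's learned Q-values via the intertask mappings. chi_X maps
# target states to source states (a function), chi_A maps target actions to
# source actions (a dictionary such as CHI_A_FULL above); both are assumed given.
def transfer_q_values(q_source, target_states, target_actions, chi_X, chi_A):
    q_target = {}
    for s in target_states:
        for a in target_actions:
            # Look up the "similar" source state-action pair and copy its value.
            q_target[(s, a)] = q_source.get((chi_X(s), chi_A[a]), 0.0)
    return q_target   # used to initialize learning, then refined by TD updates
```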

Figure 7. In Value Function Transfer, Learned Q-Values from the Source Task Are Transformed through Intertask Mappings. These transformed Q-values are then used to initialize agents in the target task, biasing their learning with low-level knowledge from the source task.

Results from transfer from 3 versus 2 Keepaway to 4 versus 3 Keepaway are shown in figure 8, which compares learning without transfer to eight different transfer learning scenarios. The x-axis denotes the number of episodes spent training in the 3 versus 2 task, and the y-axis records the total simulator time needed to meet a specified performance threshold in the 4 versus 3 task (an average of 11.5 seconds per episode). The bar on the far left shows that the average time needed to achieve the threshold performance using only target-task learning is just over 30 simulator hours. On the far right, agents train in the source task for 6000 episodes (taking just over 20 hours of simulator training time), but then are able to train in the target task for just under 9 simulator hours, significantly reducing the total time required to reach the target-task threshold. In the Keepaway domain, spending increased amounts of time in the 3 versus 2 source task generally decreases the amount of time needed to train the 4 versus 3 target task. This result shows that if one's goal is to reduce target-task training time, it can indeed make sense to use past knowledge, even if from a task with different state variables and actions.

Finally, consider when keepers train for 250 episodes in 3 versus 2. In this scenario, the total training time is reduced relative to no transfer. This result shows that in some cases it may be beneficial to train on a simple task, transfer that knowledge, and then learn in a difficult task, rather than directly training on the difficult task. The authors suggest that in order to reduce the total training time, it may be advantageous to spend enough time to learn a rough value function, transfer into the target task, and then spend the majority of the training time refining the value function in the target task.

However, value function transfer is not a panacea; the authors also admit that transfer may fail to improve learning. For instance, consider the game of Giveaway, where 3 agents must try to lose the ball as fast as possible. The authors show that transfer from Giveaway to Keepaway can be harmful, as might be expected, as the transferred action-value function is not only far from optimal, but it also provides an incorrect bias.

The authors provide no guarantees about their method's effectiveness but do hypothesize conditions under which their TL method will and will not perform well. Specifically, they suggest that their method will work when one or both of the following conditions are true: (1) The best learned actions in the source task, for a given state, are mapped to the best action in the target task through the intertask mappings. (2) The average Q-values learned for states are of the correct magnitude in the trained target task's function approximator.

The first condition biases the learner to select the best actions more often in the target task, even if the Q-values are incorrect. For instance, Hold is often the best action to take in 3 versus 2 Keepaway and 4 versus 3 Keepaway if no taker is close to the keeper with the ball. The second improves learning by requiring fewer updates to reach accurate Q-values. The second condition is true when reward functions and learned performances are similar, as is true in different Keepaway tasks.

Figure 8. Some Results from Value Function Transfer Work. This bar chart shows the total number of simulator hours of training needed to achieve a threshold performance (11.5 simulator seconds/episode). The x-axis shows the number of episodes spent training in 3 versus 2, and the y-axis reports the total number of simulator hours spent training, separating out time spent in the 3 versus 2 source task and the 4 versus 3 target task.

Higher-Level Transfer Approaches. We now turn our attention to methods that transfer higher-level advice than an action-value function. Our intuition is that such domain-level information should transfer better than low-level, task-specific information, but no empirical or theoretical studies have examined this particular question.

Torrey et al. (2006) introduce the idea of transferring skills between different tasks, which are similar to options. Inductive logic programming is used to identify sequences of actions that are most useful to agents in a source task. These skills are then mapped by human-supplied intertask mappings into the target task, where they are used to bootstrap reinforcement learning. Torrey's subsequent work (Torrey et al. 2007) further generalizes the technique to transfer relational macros, or strategies, which may require composing several skills together. Inductive logic programming is again used to identify sequences of actions that result in high reward for the agents, but now the sequences of actions are more general (and thus, potentially, more useful for agents in a target task).

Once the strategies are remapped to the target task through a human-provided mapping, they are used to demonstrate a strategy to the target-task learner. Rather than explore randomly, the target-task learner always executes the transferred strategies for the first 100 episodes and thus learns to estimate the Q-values of the actions selected by the transferred strategies. After this demonstration phase, the strategies are discarded, having successfully seeded the agent's action-value function. For the remainder of the experiment, the agent selects from the MDP's base actions, successfully learning to improve on the performance of the transferred strategies.
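A sketch of this demonstration-style use of transferred knowledge follows, under an assumed learner interface (a greedy action lookup and a TD-style update) and illustrative episode counts: the agent follows the transferred strategy for an initial block of episodes while estimating Q, then switches to ordinary ε-greedy learning over the base actions.

```python
import random

# Sketch of using a transferred strategy as a demonstration: follow it for the
# first demo_episodes while estimating Q, then discard it and learn normally.
# The strategy callable, learner interface, and episode counts are illustrative.
def demonstration_then_learn(env, learner, strategy, actions,
                             demo_episodes=100, total_episodes=1000, epsilon=0.1):
    for episode in range(total_episodes):
        s, done = env.reset(), False
        while not done:
            if episode < demo_episodes:
                a = strategy(s)                       # execute the transferred strategy
            elif random.random() < epsilon:
                a = random.choice(actions)            # afterwards, explore...
            else:
                a = learner.greedy_action(s)          # ...or exploit the seeded Q-values
            s2, r, done = env.step(a)
            learner.update(s, a, r, s2, done)         # TD update in both phases
            s = s2
```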

One particularly interesting challenge the authors tackle in their papers is to consider source tasks and target tasks that not only differ in terms of state variables and actions, but also in reward structure. Positive transfer is shown between 3 versus 2 Keepaway, 3 versus 2 MoveDownfield, and different versions of BreakAway. MoveDownfield is similar to Keepaway, except that the boundary area shifts over time (similar to how players may want to maintain possession of a ball while advancing towards an opponent's goal). BreakAway is less similar, as the learning agents now can pass to a teammate or shoot at the goal — rather than a reward based on how long possession is maintained, the team reward is binary, based on whether a goal is scored. Figure 9 reports one set of results, showing that using relational macros significantly improves the jump start and total reward in the target task when transferring between different BreakAway tasks. The simpler method, skill transfer, also improves learning performance, but to a lesser degree, as the information transferred is less complex (and, in this case, less useful for learning the target task).

Figure 9. One Set of Results in a Simulated Soccer Goal Scoring Task. This graph compares relational macro transfer, skill transfer, and learning without transfer.

Similar in spirit to relational macro transfer, our past work introduced rule transfer (Taylor and Stone 2007a), which uses a simple propositional rule-learning algorithm to summarize a learned source-task policy (see figure 10). In particular, the learning algorithm examines (state, action) pairs from the source-task agent and generates a decision list to approximate the agent's policy (that is, the mapping from states to actions). In addition to again allowing high-level knowledge to be transferred between tasks, this work is novel because it looks at transferring knowledge between different domains, producing substantial speedups when the source task is orders of magnitude faster to learn than the target task.

The learned propositional rules are transformed through intertask mappings so that they apply to a target task with different state variables and actions. The target-task learner may then bootstrap learning by incorporating the rules as an extra action, essentially adding an ever-present option "take the action suggested by the source-task policy," resulting in an improved jump start, total reward, and time to threshold. By using rules as an intermediary between the two tasks, the source and target tasks can use different learning methods, effectively decoupling the two learners. Two different source tasks are hand constructed: Ringworld and Knight Joust. Both tasks can be learned optimally in a matter of seconds, but the knowledge gained by agents in these simple tasks could still significantly improve learning rates in 3 versus 2 Keepaway, reducing both the target-task learning time and total learning time, relative to learning 3 versus 2 Keepaway. For instance, high-level concepts like "perform a knight jump if the opponent gets close, otherwise move forward" can transfer well into Keepaway as "perform the pass action if a keeper gets close, otherwise hold the ball."
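The sketch below shows one way the transferred decision list can be exposed as an extra, ever-present action in the target task; the rule representation, helper names, and environment interface are illustrative assumptions rather than the exact mechanism of Taylor and Stone (2007a).

```python
# Sketch of rule transfer: the remapped decision list is exposed to the
# target-task learner as one extra, ever-present action ("follow the rules").
# decision_list is a sequence of (condition, action) pairs; names are illustrative.
FOLLOW_RULES = "FollowSourceTaskRules"

def rule_action(state, decision_list, default_action):
    # Return the action of the first rule whose condition matches the state.
    for condition, action in decision_list:
        if condition(state):
            return action
    return default_action

def execute(env, state, chosen_action, decision_list, default_action):
    # If the learner chose the extra action, act as the transferred rules suggest.
    if chosen_action == FOLLOW_RULES:
        chosen_action = rule_action(state, decision_list, default_action)
    return env.step(chosen_action)
```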

[Figure 10 diagram (rule transfer): a source task with Start and Goal positions, a Player and an Opponent, compass directions (N, S, E, W), and state features dist(P,O), ang(East), and ang(West); a rule learner produces rules that are mapped (χ) into target-task rules.]

Learning Intertask Mappings

The transfer algorithms considered thus far have assumed that a hand-coded mapping between tasks was provided, or that no mapping was needed. In this section we consider the less well explored question of how an intertask mapping can be learned, a critical capability if a human is unable, or unwilling, to provide a mapping. There are two high-level approaches to learning task similarity.

The first relies on the high-level structure of the tasks, and the second uses low-level data gathered from agents in both tasks (see figure 11 for an example data-driven approach, discussed later in this section).

The first category of techniques uses high-level information about tasks to reason about their similarity. If the full model of both MDPs were known (that is, agents could plan a policy without exploring), a careful analysis of the two MDPs could reveal similarities that may be exploited. More relevant to this article is the case when some information about the source task and the target task is known, but a full model is not available. For instance, Liu and Stone (2006) assume that a type of qualitative model is available for learning a mapping, but such a model is insufficient for planning a policy. An algorithm based on the Structure Mapping Engine (Falkenhainer, Forbus, and Gentner 1989) is used to learn similarities between the tasks' qualitative models. The Structure Mapping Engine is an algorithm that attempts to map knowledge from one domain into another using analogical matching; it is exactly these learned similarities that can be used as intertask mappings to enable transfer.

The second category, data-driven learners, does not require a high-level model of the tasks, allowing for additional flexibility and autonomy. Soni and Singh (2006) develop a TL method that requires much less background knowledge than the previous method, needing only information about how state variables are used to describe objects in a multiplayer task (see note 4). First, an agent is supplied with a series of possible state transformations and an intertask action mapping. Second, after learning the source task, the agent's goal is to learn the correct transformation: in each target-task state s, the agent can randomly explore the target-task actions, or it may choose to take the action recommended by the source-task policy under one of the candidate mappings, π_source(X(s)).


[Figure 11 diagram: a source agent interacts with the source environment (states, actions a_source, and rewards r_source) and a target agent interacts with the target environment (states, actions a_target, and rewards r_target); the recorded source-task data and target-task data feed a similarity learner, which outputs an intertask mapping.]

Figure 11. One Class of Methods to Learn Intertask Mappings.

This class of methods relies on collecting data from (that is, allowing an agent to interact with) source and target tasks. Similarities in the transitions and rewards in the two tasks are used to learn state variable and action mappings.

This method has a similar motivation to that of Fernández and Veloso (2006), but here the authors learn to select between possible mappings rather than between possible previous policies. Over time the agent uses Q-learning in Keepaway to select the best state-variable mapping (X), as well as to learn the action values for the target task (A). The jump start, total reward, and asymptotic performance are all slightly improved when using this method, but its efficacy will be heavily dependent on the number of possible mappings between any source and target task. As the number of possibilities grows, so too will the amount of data needed to determine which mapping is most correct.

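The selection mechanism can be pictured much like the rule-transfer sketch above, except that the learner's extra choices are the candidate mappings themselves. The sketch below is ours, with hypothetical names; Soni and Singh's actual formulation differs in detail.

    import random

    def choose_meta_action(q_values, state, target_actions, num_mappings, epsilon=0.1):
        """Pick either a primitive target-task action or 'use candidate mapping i'.

        q_values: dict keyed by (state, meta_action); a meta_action is either
                  ("primitive", action) or ("mapping", i).
        """
        meta_actions = [("primitive", a) for a in target_actions] + \
                       [("mapping", i) for i in range(num_mappings)]
        if random.random() < epsilon:
            return random.choice(meta_actions)
        return max(meta_actions, key=lambda m: q_values.get((state, m), 0.0))

    def resolve_meta_action(meta_action, state, mappings, source_policy, map_action):
        """Translate a meta-action into the target-task action actually executed."""
        kind, value = meta_action
        if kind == "primitive":
            return value
        chi = mappings[value]                         # candidate state-variable mapping
        return map_action(source_policy(chi(state)))  # follow the source policy under chi

Ordinary Q-learning updates over (state, meta-action) pairs then let the agent settle on the mapping whose suggestions earn the most reward, while the primitive actions remain available as a fallback.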

The MASTER algorithm (Taylor, Jong, and Stone 2008) further relaxes the knowledge requirements for learning an intertask mapping: no knowledge about how state variables are related to objects in the target task is required. This added flexibility, however, comes at the expense of additional computational complexity. The core idea of MASTER is to (1) record experienced source-task instances, (2) build an approximate transition model from a small set of experienced target-task instances, and then (3) test possible mappings offline by measuring the prediction error of the target-task models on source-task data (see figure 11). This approach is sample efficient, but it has a high computational complexity that grows exponentially as the number of state variables and actions increases.

The main benefit of such a method is that this high computational cost occurs "between tasks," after the source task has been learned but before the target-task agent begins learning, and thus it does not require any extra interactions with the environment. The current implementation uses an exhaustive search to find the intertask mappings that minimize the prediction error, but more sophisticated (for example, heuristic) search methods could also be incorporated. It is worth noting that MASTER is one of the few methods that transfers information when using a model-based learning method (rather than a model-free TD method). This method may also be used to assist with source-task selection, but MASTER's primary contribution is to demonstrate that fully autonomous transfer is possible.

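The offline evaluation step can be sketched as follows. This is our illustration of the idea, not MASTER's actual code; the model interface, the numeric-state assumption, and the squared-error measure are simplifications.

    def evaluate_mappings(source_transitions, target_model, candidate_mappings):
        """Score each candidate intertask mapping by how well the target-task model
        predicts source-task transitions once they are translated into the target task.

        source_transitions: list of (s, a, s_next) tuples recorded in the source task.
        target_model: object with predict(state, action) -> predicted next state,
                      fit beforehand on a small set of target-task transitions.
        candidate_mappings: list of (map_state, map_action) pairs translating
                            source-task states and actions into the target task.
        Returns the candidate with the lowest total prediction error.
        """
        def error(predicted, actual):
            # Squared error over state variables; states are assumed to be numeric tuples.
            return sum((p - a) ** 2 for p, a in zip(predicted, actual))

        best_mapping, best_score = None, float("inf")
        for map_state, map_action in candidate_mappings:
            score = 0.0
            for s, a, s_next in source_transitions:
                predicted = target_model.predict(map_state(s), map_action(a))
                score += error(predicted, map_state(s_next))
            if score < best_score:
                best_mapping, best_score = (map_state, map_action), score
        return best_mapping

Because each candidate is scored against data that has already been recorded, the search costs computation rather than additional environment interactions; it is the exhaustive enumeration of candidate mappings that makes the cost grow quickly with the number of state variables and actions.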

Open Questions

This section discusses a selection of current open questions in transfer for reinforcement learning domains. Having discussed a number of successful TL methods, this section should help the reader better understand where TL may be further improved, and we hope that it will inspire others to examine problems in this area. Many people have demonstrated empirical successes with transfer learning, but we discuss how (1) related paradigms may offer insight into improving transfer, (2) there are currently few theoretical results or guarantees for TL, (3) current TL methods are not powerful enough to completely remove a human from the loop, and (4) there are many ways the current TL methods could be made more efficient.

Related Paradigms

Thus far, we have focused on algorithms that use one or more source tasks to better learn in a different, but related, target task. There are many other paradigms that may also be combined with, or provide ideas for, transfer learning. This section provides a brief discussion of these paradigms, as well as their relationship to TL.

Lifelong Learning. Thrun (1996) suggested the notion of lifelong learning, where an agent may experience a sequence of temporally or spatially separated tasks. Transfer would be a key component of any such system, but the lifelong learning framework is more demanding than that of transfer. For instance, transfer algorithms are typically "told" when a new task has begun, whereas in lifelong learning, agents may reasonably be expected to identify new subtasks automatically within the global environment (that is, the real world). Transfer learning algorithms could, in theory, be adapted to automatically detect such task changes, similar to the idea of concept drift (Widmer and Kubat 1996).

Domain-Level Advice. There is a growing body of work integrating advice into RL learners; the advice is often provided by humans (see Maclin and Shavlik [1996] or Kuhlmann et al. [2004]). More relevant to this article are those methods that automatically extract domain advice, effectively speeding up future tasks within a given domain (Torrey et al. 2006).

Reward Shaping. Reward shaping in an RL context typically refers to allowing agents to train on an artificial reward signal (R′) rather than the MDP's true reward (R). While there have been notable cases of humans providing shaping rewards, most important are methods that can learn domain-level shaping rewards (Konidaris and Barto 2006), representing a bias that may transfer well across multiple tasks. Another option that has not been explored in the literature, to the best of our knowledge, would be to shape the source task so that (1) the agent learns faster, reducing the total learning time, or (2) the source task becomes more similar to the target task, improving how useful the transferred information is to the target-task agent.

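As a concrete example of such an artificial signal (a standard construction, not one specific to this article), potential-based shaping adds gamma * Phi(s') - Phi(s) to the environment reward, where Phi is a heuristic or learned potential over states; a potential learned in one task is one form of domain-level bias that might be carried into related tasks. A minimal sketch with hypothetical names:

    def shaped_reward(reward, state, next_state, potential, gamma=0.99, done=False):
        """Potential-based shaping: R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).

        potential: a heuristic (or learned) function mapping a state to a scalar.
        Using a potential of zero for terminal states keeps the shaping well behaved.
        """
        next_potential = 0.0 if done else potential(next_state)
        return reward + gamma * next_potential - potential(state)

    # Example: in Keepaway-like tasks, a potential that grows with the minimum
    # distance from the ball holder to a taker is one plausible hand-coded choice.
    def example_potential(state):
        # state is assumed to expose a distance feature; purely illustrative.
        return 0.1 * state["min_taker_distance"]

The learner then trains on the shaped reward while the underlying task, and hence the policies it admits, stays the same.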

Representation Transfer. Transfer learning problems are typically framed as leveraging knowledge learned in a source task to improve learning in a related, but different, target task. In other work, we (Taylor and Stone 2007b) examine the complementary task of transferring knowledge between agents with different internal representations (that is, the function approximator or learning algorithm) of the same task. Allowing for such shifts in representation gives additional flexibility to an agent designer, as well as potentially allowing agents to achieve higher performance by changing representations partway through learning. Selecting a representation is often key to solving a problem (see the mutilated checkerboard problem [McCarthy 1964], where humans' internal representations of a problem drastically change the problem's solvability), and different representations may make transfer more or less difficult. Representation selection is a difficult problem in RL in general, and we expect that the flexibility provided by representation transfer may enhance existing transfer learning algorithms.

Theoretical Advances

In addition to making progress in empirical inquiries, which attempt to better understand where and how TL will provide benefits, it is important to continue advancing theoretical understanding. For example, the majority of TL work in the literature has concentrated on showing that a particular transfer approach is plausible. None, to our knowledge, has a well-defined method for determining when an approach will succeed or fail according to one or more metrics. While we can say that it is possible to speed up learning in a target task through transfer, we cannot currently decide whether an arbitrary pair of tasks is appropriate for a given transfer method. Therefore, transfer may produce incorrect learning biases and result in negative transfer. While some methods can estimate task similarity (Taylor, Jong, and Stone 2008), these methods do not provide any theoretical guarantees about transfer's effectiveness.

One potential method for avoiding negative transfer is to leverage the ideas of bisimulation (Milner 1982). Ferns et al. (2006) point out that:

    In the context of MDPs, bisimulation can roughly be described as the largest equivalence relation on the state space of an MDP that relates two states precisely when for every action, they achieve the same immediate reward and have the same probability of transitioning to classes of equivalent states. This means that bisimilar states lead to essentially the same long-term behavior.

However, bisimulation may be too strict because states are either equivalent or not, and it may be slow to compute in practice. The work of Ferns and colleagues (Ferns, Panangaden, and Precup 2005; Ferns et al. 2006) relaxes the idea of bisimulation to that of a (pseudo)metric that can be computed much faster and gives a similarity measure rather than a boolean. It is possible, although not yet shown, that bisimulation approximations can be used to discover regions of state space that can be transferred from one task to another, or to determine how similar two tasks are in toto. Such an approach may in fact be similar to that of analogical reasoning (for example, the Structure Mapping Engine), but to our knowledge such a comparison has not been closely investigated.

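For concreteness, the (pseudo)metric just described can be written roughly as the fixed point of the following recursion. This is our paraphrase of the standard formulation rather than a definition taken from this article; c_R and c_T are nonnegative weights (typically summing to at most one), and T_K denotes the Kantorovich distance, following Ferns and colleagues:

    d(s, s') = \max_{a \in A} \Big( c_R \, \big| R(s, a) - R(s', a) \big| \; + \; c_T \, T_K(d)\big( P(\cdot \mid s, a), \, P(\cdot \mid s', a) \big) \Big)

Here T_K(d) is the Kantorovich (earth mover's) distance between the two next-state distributions, measured with respect to d itself; states at distance zero are exactly the bisimilar ones, and small distances indicate states whose rewards and transition behavior nearly match.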

Homomorphisms (Ravindran and Barto 2002) are a different abstraction that can define transformations between MDPs based on transition and reward dynamics, similar in spirit to intertask mappings, and they have been used successfully for transfer (Soni and Singh 2006). However, discovering homomorphisms is NP-hard (Ravindran and Barto 2003), and homomorphisms are generally supplied to a learner by an oracle. While these two theoretical frameworks may be able to help avoid negative transfer or to determine when two tasks are "transfer compatible," significant work needs to be done to determine whether such approaches are feasible in practice, particularly if the agent is fully autonomous (that is, it is not provided domain knowledge by a human) and is not provided a full model of the MDP.

With a handful of exceptions (see Bowling and Veloso [1999]), there has been relatively little work on the theoretical properties of transfer in RL. For example, there is considerable need for analysis that could potentially (1) provide guarantees about whether a particular source task can improve learning in a target task (given a particular type of knowledge transfer); (2) correlate the amount of knowledge transferred (for example, the number of samples) with the improvement in the target task; and (3) define what an optimal intertask mapping is and demonstrate how transfer efficacy is affected by the intertask mapping used.

Autonomous Transfer

To be fully autonomous, an RL transfer agent would have to perform all of the following steps: (1) given a target task, select an appropriate source task or set of tasks from which to transfer; (2) learn how the source task(s) and target task are related; and (3) effectively transfer knowledge from the source task(s) to the target task.

While the mechanisms used for these steps will necessarily be interdependent, TL research has focused on each independently, and no TL methods are currently capable of robustly accomplishing all three goals. In particular, successfully performing all three steps in an arbitrary setting may prove to be "AI-Complete," but achieving the goal of autonomous transfer in at least some limited settings will improve the ability of agents to quickly learn in novel or unanticipated situations.

Optimizing Transfer

Recall that in value function transfer (figure 8), the amount of source-task training was tuned empirically for the two competing goals of reducing the target-task training time and the total training time. No work that we are aware of is able to determine the optimal amount of source-task training to minimize the target-task training time or the total training time. It is likely that a calculation or heuristic for determining the optimal amount of source-task training time will have to consider the structure of the two tasks, their relationship, and what transfer method is used. This optimization becomes even more difficult in the case of multistep transfer, as there are two or more tasks that can be trained for different amounts of time.

Another question not (yet) addressed is how best to explore in a source task when the explicit purpose of the agent is to speed up learning in a target task. For instance, it may be better to explore more of the source task's state space than to learn an accurate action-value function for only part of the state space. While no current TL algorithms take such an approach, there has been some work on the question of learning a policy that is exploitable (without attempting to maximize the online reward accrued while learning) in nontransfer contexts (Simsek and Barto 2006).

Finally, while there has been success creating source tasks by hand (Taylor and Stone 2007a), there are currently no methods for automatically constructing a source task given a target task. Furthermore, while multistep transfer has proven successful, there are currently no guidelines on how to select or construct a single source task, let alone a series of source tasks. In cases where minimizing the total training time is critical, it may well be worth the effort to carefully design a curriculum (Taylor 2009) or training regimen (Zang et al. 2010) of tasks for the agent to learn sequentially, much as Skinner trained dogs to perform tricks through a series of tasks with increasing difficulty (Skinner 1953).



Likewise, if a fully autonomous agent discovers a new task that is particularly hard, the agent may decide to first train on an artificial task, or a series of tasks, before attempting to master the full task.

Conclusion

In the coming years, deployed agents will become increasingly common and gain more capabilities. Transfer learning, paired with reinforcement learning, is an appropriate paradigm if the agent must take sequential actions in the world and the designers do not know optimal solutions to the agent's tasks at design time (or even if they do not know what the agent's tasks will be). There are many possible settings and goals for transfer learning, but RL-related approaches generally focus on increasing the performance of learning, potentially making RL more appropriate for complex, real-world problems.

Significant progress on transfer for RL domains has been made in the last few years, but there are also many open questions. We expect that many of the above questions will be addressed in the near future so that TL algorithms become more powerful and more broadly applicable. Additionally, we hope to see more physical and virtual agents utilizing transfer learning, further encouraging growth in an exciting and active subfield of the AI community.

Through this article, we have tried to provide an introduction to transfer learning in reinforcement learning domains. We hope that this article has served to better formulate the TL problem within the community and to highlight some of the more prominent approaches. Additionally, given the number of open questions currently facing the field, we hope that others will be motivated to join in research in this exciting area.

Acknowledgements

We thank the editor and reviewers for useful feedback and suggestions, as well as Lisa Torrey for sharing her data for figure 9. This work has taken place in part at the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, the University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (CNS-0615104 and IIS-0917122), ONR (N00014-09-1-0658), DARPA (FA8650-08-C-7812), and the Federal Highway Administration (DTFH61-07-H-00030).

Notes

1. For details and videos of play, see cs.utexas.edu/~AustinVilla/sim/keepaway.

2. A standard assumption is that a task is Markovian, meaning that, given the current state and action, the probability distribution over next states is independent of the agent's earlier state and action history. Thus, saving a history would not assist the agent when selecting actions, and it can consider each state in isolation.

3. An intertask mapping often maps multiple entities in the target task to single entities in the source task because the target task is more complex than the source, but in general the mappings may be one-to-many, one-to-one, or many-to-many.

4. For instance, an agent may know that a pair of state variables describe "distance to teammate" and "distance from teammate to marker," but the agent is not told which teammate the state variables describe.

References

Bowling, M. H., and Veloso, M. M. 1999. Bounding the Suboptimality of Reusing Subproblems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1340–1347. San Francisco: Morgan Kaufmann Publishers.
Bushinsky, S. 2009. Deus Ex Machina — A Higher Creative Species in the Game of Chess. AI Magazine 30(3): 63–70.
Dai, W.; Yang, Q.; Xue, G. R.; and Yu, Y. 2007. Boosting for Transfer Learning. In Proceedings of the 24th International Conference on Machine Learning. Princeton, NJ: International Machine Learning Society.
Falkenhainer, B.; Forbus, K. D.; and Gentner, D. 1989. The Structure Mapping Engine: Algorithm and Examples. Artificial Intelligence 41(1): 1–63.
Fernández, F., and Veloso, M. 2006. Probabilistic Policy Reuse in a Reinforcement Learning Agent. In Proceedings of the Fifth International Conference on Autonomous Agents and Multiagent Systems. New York: Association for Computing Machinery.
Fernández, F.; García, J.; and Veloso, M. 2010. Probabilistic Policy Reuse for Inter-Task Transfer Learning. Journal of Robotics and Autonomous Systems 58(7): 866–871.
Ferns, N.; Castro, P.; Panangaden, P.; and Precup, D. 2006. Methods for Computing State Similarity in Markov Decision Processes. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 174–181. Seattle: AUAI Press.
Ferns, N.; Panangaden, P.; and Precup, D. 2005. Metrics for Markov Decision Processes with Infinite State Spaces. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 201–208. Seattle: AUAI Press.
Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4: 237–285.
Konidaris, G., and Barto, A. 2006. Autonomous Shaping: Knowledge Transfer in Reinforcement Learning. In Proceedings of the 23rd International Conference on Machine Learning, 489–496. New York: Association for Computing Machinery.
Kuhlmann, G.; Stone, P.; Mooney, R.; and Shavlik, J. 2004. Guiding a Reinforcement Learner with Natural Language Advice: Initial Results in RoboCup Soccer. In Supervisory Control of Learning and Adaptive Systems: Papers from the 2004 AAAI Workshop. Technical Report WS-04-10. Menlo Park, CA: AAAI Press.
Liu, Y., and Stone, P. 2006. Value-Function-Based Transfer for Reinforcement Learning Using Structure Mapping. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 415–420. Menlo Park, CA: AAAI Press.
Maclin, R., and Shavlik, J. W. 1996. Creating Advice-Taking Reinforcement Learners. Machine Learning 22(1–3): 251–281.
Mataric, M. J. 1994. Reward Functions for Accelerated Learning. In Proceedings of the International Conference on Machine Learning, 181–189. San Francisco: Morgan Kaufmann Publishers.
McCarthy, J. 1964. A Tough Nut for Proof Procedures. Technical Report Sail AI Memo 16, Computer Science Department, Stanford University.
Milner, R. 1982. A Calculus of Communicating Systems. Berlin: Springer-Verlag.
Noda, I.; Matsubara, H.; Hiraki, K.; and Frank, I. 1998. Soccer Server: A Tool for Research on Multiagent Systems. Applied Artificial Intelligence 12(2–3): 233–250.
Price, B., and Boutilier, C. 2003. Accelerating Reinforcement Learning through Implicit Imitation. Journal of Artificial Intelligence Research 19: 569–629.
Ravindran, B., and Barto, A. 2002. Model Minimization in Hierarchical Reinforcement Learning. In Proceedings of the Fifth Symposium on Abstraction, Reformulation, and Approximation. Lecture Notes in Artificial Intelligence 2371. Berlin: Springer.
Ravindran, B., and Barto, A. G. 2003. An Algebraic Approach to Abstraction in Reinforcement Learning. In Proceedings of the Twelfth Yale Workshop on Adaptive and Learning Systems, 109–114. New Haven, CT: Yale University.
Rummery, G., and Niranjan, M. 1994. On-line Q-Learning Using Connectionist Systems. Technical Report CUED/F-INFENG-RT116, Engineering Department, Cambridge University.
Simsek, O., and Barto, A. G. 2006. An Intrinsic Reward Mechanism for Efficient Exploration. In Proceedings of the 23rd International Conference on Machine Learning. Princeton, NJ: International Machine Learning Society.
Singh, S., and Sutton, R. S. 1996. Reinforcement Learning with Replacing Eligibility Traces. Machine Learning 22(1–3): 123–158.
Skinner, B. F. 1953. Science and Human Behavior. London: Collier-Macmillan.
Soni, V., and Singh, S. 2006. Using Homomorphisms to Transfer Options across Continuous Reinforcement Learning Domains. In Proceedings of the Twenty-First National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.
Stone, P.; Kuhlmann, G.; Taylor, M. E.; and Liu, Y. 2006. Keepaway Soccer: From Machine Learning Testbed to Benchmark. In RoboCup-2005: Robot Soccer World Cup IX, volume 4020, 93–105, ed. I. Noda, A. Jacoff, A. Bredenfeld, and Y. Takahashi. Berlin: Springer-Verlag.
Sutton, R. S. 1988. Learning to Predict by the Methods of Temporal Differences. Machine Learning 3(1): 9–44.
Sutton, R. S., and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA: The MIT Press.
Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112(1–2): 181–211.
Szepesvári, C. 2009. Reinforcement Learning Algorithms for MDPs — A Survey. Technical Report TR09-13, Department of Computing Science, University of Alberta.
Taylor, M. E. 2009. Assisting Transfer-Enabled Machine Learning Algorithms: Leveraging Human Knowledge for Curriculum Design. In Agents That Learn from Human Teachers: Papers from the AAAI Spring Symposium. Technical Report SS-09-01. Menlo Park, CA: AAAI Press.
Taylor, M. E., and Chernova, S. 2010. Integrating Human Demonstration and Reinforcement Learning: Initial Results in Human-Agent Transfer. Paper presented at the Agents Learning Interactively with Human Teachers AAMAS Workshop, Toronto, Ontario, Canada, 11 May.
Taylor, M. E., and Stone, P. 2007a. Cross-Domain Transfer for Reinforcement Learning. In Proceedings of the 24th International Conference on Machine Learning. Princeton, NJ: International Machine Learning Society.
Taylor, M. E., and Stone, P. 2007b. Representation Transfer for Reinforcement Learning. In Computational Approaches to Representation Change during Learning and Development: Papers from the 2007 AAAI Fall Symposium. Technical Report FS-07-03. Menlo Park, CA: AAAI Press.
Taylor, M. E., and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(1): 1633–1685.
Taylor, M. E.; Jong, N. K.; and Stone, P. 2008. Transferring Instances for Model-Based Reinforcement Learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Lecture Notes in Computer Science 5211, 488–505. Berlin: Springer.
Taylor, M. E.; Stone, P.; and Liu, Y. 2007. Transfer Learning via Inter-Task Mappings for Temporal Difference Learning. Journal of Machine Learning Research 8(1): 2125–2167.
Taylor, M. E.; Whiteson, S.; and Stone, P. 2007. Transfer via Inter-Task Mappings in Policy Search Reinforcement Learning. In Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multiagent Systems. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
Thrun, S. 1996. Is Learning the n-th Thing Any Easier Than Learning the First? In Advances in Neural Information Processing Systems 8, 640–646. Cambridge, MA: The MIT Press.
Torrey, L.; Shavlik, J. W.; Walker, T.; and Maclin, R. 2006. Skill Acquisition via Transfer Learning and Advice Taking. In Proceedings of the Seventeenth European Conference on Machine Learning, Lecture Notes in Computer Science 4212, 425–436. Berlin: Springer.
Torrey, L.; Shavlik, J. W.; Walker, T.; and Maclin, R. 2007. Relational Macros for Transfer in Reinforcement Learning. In Proceedings of the Seventeenth Conference on Inductive Logic Programming. Lecture Notes in Computer Science 4894. Berlin: Springer.
Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. dissertation, King's College, Cambridge, UK.
Widmer, G., and Kubat, M. 1996. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning 23(1): 69–101.
Zang, P.; Irani, A.; Zhou, P.; Isbell, C.; and Thomaz, A. 2010. Combining Function Approximation, Human Teachers, and Training Regimens for Real-World RL. In The Ninth International Joint Conference on Autonomous Agents and Multiagent Systems. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.

Matthew E. Taylor is an assistant professor in the Computer Science Department at Lafayette College. He received his doctorate from the University of Texas at Austin in 2008, supervised by Peter Stone. After graduation, Taylor worked for two years as a postdoctoral research assistant at the University of Southern California with Milind Tambe. Current research interests include intelligent agents, multiagent systems, reinforcement learning, and transfer learning.

Peter Stone is an associate professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in computer science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a senior technical staff member in the Artificial Intelligence Principles Research Department at AT&T Labs – Research. Stone's research interests include machine learning, multiagent systems, and robotics.
