

Interactive Teaching Strategies for Agent Training

Ofra Amir
Harvard University
[email protected]

Ece Kamar
Microsoft Research
[email protected]

Andrey Kolobov
Microsoft Research
[email protected]

Barbara J. Grosz
Harvard University
[email protected]

Abstract

Agents learning how to act in new environments can benefit from input from more experienced agents or humans. This paper studies interactive teaching strategies for identifying when a student can benefit from teacher advice in a reinforcement learning framework. In student-teacher learning, a teacher agent can advise the student on which action to take. Prior work has considered heuristics for the teacher to choose advising opportunities. While these approaches effectively accelerate agent training, they assume that the teacher constantly monitors the student. This assumption may not be satisfied with human teachers, as people incur cognitive costs of monitoring and might not always pay attention. We propose strategies for a teacher and a student to jointly identify advising opportunities so that the teacher is not required to constantly monitor the student. Experimental results show that these approaches reduce the amount of attention required from the teacher compared to teacher-initiated strategies, while maintaining similar learning gains. The empirical evaluation also investigates the effect of the information communicated to the teacher and the quality of the student's initial policy on teaching outcomes.

1 Introduction

When agents face the task of learning how to act in a new environment, they can benefit from the input of more experienced agents and humans. Multiple lines of work have focused on incorporating different types of input into agent learning. Learning from demonstration approaches aim at inferring a policy from expert demonstrations [Argall et al., 2009; Abbeel and Ng, 2004; Chernova and Veloso, 2007]. In reward shaping, an agent learns from positive or negative signals provided by an expert [Knox and Stone, 2010; Loftin et al., 2015]. In action critiquing, an agent practices by interacting with the environment and an expert evaluates its actions at the end of each session [Judah et al., 2010].

We focus on a paradigm for giving feedback to an agent in real time, called student-teacher training. In this framework, an experienced "teacher" agent helps accelerate the "student" agent's learning by providing advice on which action to take next [Clouse, 1996; Torrey and Taylor, 2013]. The student updates its policy based on reward signals from the environment as in typical reinforcement learning, but its exploration is guided by the teacher's advice.

Prior work has considered two modes of advice-giving in this framework: student-initiated [Clouse, 1996] and teacher-initiated [Torrey and Taylor, 2013]. Torrey and Taylor [2013] considered a setting with a limited advice budget and developed heuristics that guide the teacher's choice of advising opportunities. They demonstrated significant learning gains when using these heuristics in empirical studies. While the amount of advice that the teacher can provide was limited, their formulation assumed that the student's current state is always communicated to the teacher, and that the teacher continuously monitors the student's decisions until the advice budget runs out. These assumptions have significant drawbacks. For human teachers, constantly paying attention diminishes the value of automation, imposes cognitive costs [Miller et al., 2015] and can be simply unrealistic. Even if the teacher is a computer agent, transmitting the student's every state to the teacher can have a prohibitive communication cost.

In this paper, we propose interactive student-teacher training, in which the student and the teacher jointly decide when advice should be given. In these jointly-initiated teaching strategies, the student determines whether to ask for the teacher's attention, and the teacher, if asked to pay attention to the student's state, decides whether to use this opportunity to give advice, given a limited advice budget. We begin by comparing the teacher-initiated and student-initiated approaches experimentally, showing that heuristics for teacher-initiated training are more effective at improving the student agent's policy than student-initiated ones, but they require more teacher attention. Then we demonstrate that the jointly-initiated teaching strategies can reduce the amount of attention required of the teacher compared to teacher-initiated strategies, while maintaining similar learning gains. Thus, our work integrates the teacher-initiated and student-initiated approaches, alleviating their disadvantages.

Collaborative approaches for assisting agents are particularly important for semi-autonomous agents [Zilberstein, 2015] (e.g., self-driving cars), as such agents will have long-term interactions with people and will have opportunities to continuously improve their policies based on these interactions. Therefore, in addition to comparing the effectiveness of different interactive training strategies, we investigate the effect on learning performance of factors that may vary across agent-human settings. In particular, our empirical evaluations analyze the effect of the information communicated to the teacher and the quality of the initial policy of the student on teaching outcomes.

The contributions of the paper are threefold. First, it extends prior work on the teacher-student reinforcement learning framework by considering both communication and attention requirements, motivated by settings in which a person will assist agents' learning. Second, it proposes jointly-initiated advising approaches to reduce attention demands for the teacher. Third, it empirically evaluates the proposed approaches in the Pac-Man domain, and explores the effect of various aspects of teaching strategies on student learning gains and teacher attention requirements.

2 Related Work

Our paper builds and expands on prior work studying the student-teacher reinforcement learning framework [Clouse, 1996; Torrey and Taylor, 2013], discussed in Section 3.

Chernova & Veloso [2007] proposed a confidence-based approach in which a learning agent asks for demonstrations when it is uncertain of its actions. In their approach, in contrast to the teacher-student framework, the agent only learns from the expert demonstration without receiving a signal from the environment. Judah et al. [2014] proposed a framework for active imitation learning, in which an agent can query an expert for its policy for a given state. They also assumed that the learning agent does not receive a reward signal from the environment. Furthermore, they assumed that the agent can simulate trajectories and does not query for demonstration during execution. Rosenstein & Barto [2004] proposed a supervised actor-critic RL framework in which a supervisor's action is integrated with a learning agent's action. In contrast to our work, they assume a continuous action space. Griffith et al. [2013] proposed a Bayesian approach for integrating human feedback on the correctness of actions to shape an agent's policy. Rosman and Ramamoorthy [2014] developed methods for deciding when to advise an agent, but assume teachers have access to a knowledge base of common agent trajectories.

Our motivation for aiming to reduce the attention required from the teacher is rooted in prior research on human-agent interaction and human attention. These works have developed methods for detecting human attention and for incorporating it into agent decision making, taking into consideration the limited attention resources available to people as well as the costs of interruptions [Horvitz et al., 1999; 2003]. Adjustable autonomy approaches take into account the user's focus of attention when deciding whether to act autonomously or transfer autonomy to the user [Tambe et al., 2002; Goodrich et al., 2001]. Models of human attention are also key in developing approaches for supporting humans supervising autonomous systems [Cummings and Mitchell, 2008; Fong et al., 2002].

3 Student-Teacher Reinforcement Learning

The student-teacher framework [Clouse, 1996] includes two agents: a student and a teacher. We assume that the teacher has already established a fixed policy for acting in the environment, denoted π_teacher, whereas the student uses a reinforcement learning algorithm to learn its policy, denoted π_student. At any state s, the teacher can give advice to the student by sharing π_teacher(s). This formulation requires that the teacher and the student share the same action space, but does not assume they share the same state representation. When the student receives advice from the teacher, it takes the suggested action. That action is then treated as any other action chosen by the student during the learning period, and Q-values are updated using the same learning algorithm.
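To make the protocol concrete, the following minimal Python sketch (with hypothetical interface names such as choose_action, wants_to_advise, policy, and update, which are not taken from the paper's implementation) shows a single training step: when the teacher advises, the advised action simply replaces the student's own choice, and the learning update is applied exactly as for any self-chosen action.

# Sketch of one student-teacher interaction step (interface names are hypothetical).
def student_teacher_step(env, student, teacher, state, budget):
    action = student.choose_action(state)              # student's own (exploratory) choice
    if budget > 0 and teacher.wants_to_advise(state, action):
        action = teacher.policy(state)                 # pi_teacher(s) overrides the choice
        budget -= 1
    next_state, reward, done = env.step(action)
    # The advised action is treated like any other action: the same RL update applies.
    student.update(state, action, reward, next_state, done)
    return next_state, done, budget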

Similarly to Torrey & Taylor [2013], we specify a limited advice budget for the teacher, but in contrast, we do not assume constant monitoring of the student by the teacher. Rather than specifying an attention budget, we consider the attention required of the teacher as an additional metric by which we evaluate the different teaching strategies. We consider two different metrics for attention: (1) the number of states in which the teacher had to assess the student's state (to decide whether to give advice), and (2) the overall duration of the teaching period (i.e., the last time step in which the teacher had to assess the student's state).

Instead of fixing an attention budget, we choose to analyze the amount and duration of attention required by training strategies, because considerations about attention vary among different settings. For example, if a person is helping an autonomous car to improve its policy, the overall duration of the teaching period does not matter because the person is always present when the car drives. However, the person might not pay attention to the road at all times and therefore there is a cost associated with monitoring the car's actions to decide whether to intervene. Moreover, if teaching in this setting requires the human to take control over the car, then providing advice incurs an additional cost beyond monitoring the agent's behavior (i.e., deciding whether to intervene requires less effort than actually intervening). Thus, in this setting, we would like to minimize the number of states in which we require the teacher's attention as well as the number of times the teacher is required to give advice. In contrast, if an expert is brought to a lab to help train a robot, teaching is done during a dedicated time period in which the teacher watches a robot student. Here, minimizing the overall duration of the teaching period will be more important than minimizing the number of states in which attention is required.

3.1 Teacher-Initiated Advising

Torrey & Taylor [2013] proposed several heuristics for the teacher to decide when to give advice, using the notion of state importance. Intuitively, a state is considered important if taking a wrong action in that state can lead to a significant decrease in future rewards, as determined by the teacher's Q-values. Formally, the importance of a state, denoted I(s), is defined as:

I(s) = max_a Q_teacher(s, a) − min_a Q_teacher(s, a)    (1)


Three variations of this heuristic were suggested: (1) Advise Important: giving advice when I(s) > t_ti (where t_ti is a predetermined threshold); (2) Correct Important: giving advice if I(s) > t_ti and π_student(s) ≠ π_teacher(s). This heuristic assumes that the teacher has access to the student's chosen action for the current state; (3) Predictive Advising: giving advice if I(s) > t_ti and the teacher predicts that the student will take a sub-optimal action. This approach assumes that the teacher does not know the student's intended action and instead develops a predictive model of the student's actions over time. We do not include Predictive Advising in our study to avoid the assumption that a person would develop a predictive model of the student's actions. Moreover, even with agent teachers, the ability to predict an action will greatly depend on the size of the action space.
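As an illustration, the two importance-based teacher heuristics we use can be written in a few lines. This is a sketch rather than the original implementation; it assumes a teacher object exposing q_values(state) (the Q-values of all actions in that state) and policy(state), both hypothetical names.

def importance(q_values):
    # Equation 1: I(s) = max_a Q(s, a) - min_a Q(s, a)
    return max(q_values) - min(q_values)

def advise_important(teacher, state, t_ti):
    # Advise whenever the state looks important to the teacher.
    return importance(teacher.q_values(state)) > t_ti

def correct_important(teacher, state, student_action, t_ti):
    # Advise only if the state is important AND the student's intended action
    # (which must be communicated) differs from the teacher's policy.
    return (importance(teacher.q_values(state)) > t_ti
            and student_action != teacher.policy(state))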

We also evaluate the baseline heuristics used by Torrey & Taylor. With the Early Advising heuristic, the teacher gives advice in all states until the entire advice budget is spent. Similarly, with Early Correcting, the teacher advises the student in any state in which π_student(s) ≠ π_teacher(s) until exhausting the advice budget (like Correct Important, this heuristic assumes that the student's intended action is communicated to the teacher).

3.2 Student-Initiated Advising

We consider several heuristics for the student agent to determine when to ask the teacher for advice. Similarly to Correct Important, the Ask Important heuristic uses the notion of state importance to decide whether the student should ask for advice. It uses the student's Q-values when computing Equation 1 and asks for advice when I(s) > t_si, where t_si is a threshold set for the student agent.

The Ask Uncertain heuristic [Clouse, 1996] also considers Q-value differences (Equation 1) to decide whether to ask for advice, but differs from Ask Important in that it asks for advice when the difference is smaller than a given threshold t_unc (Equation 2). Intuitively, a low Q-value difference signals that the student is uncertain about which action to take. This heuristic asks for advice when:

max_a Q_student(s, a) − min_a Q_student(s, a) < t_unc,    (2)

where t_unc is the student's uncertainty threshold. Chernova and Veloso [2007] used the distance from a state to its previously visited nearest-neighbor state as a measure of confidence based on the agent's familiarity with the state. We implement this approach in the Ask Unfamiliar heuristic. In our setting, states are described by a feature vector and we use the Euclidean distance between feature vectors to determine the nearest neighbor. The student then asks for advice when:

distance(s, NN(s)) > t_unf,    (3)

where NN(s) is the nearest neighbor of state s.
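The student-side heuristics can be sketched analogously. Again, the interfaces are assumptions: q_values(state) returns the student's own Q-values, and the visited states for Ask Unfamiliar are kept as a plain list of feature vectors.

import numpy as np

def ask_important(student, state, t_si):
    # Student's own estimate of Equation 1, using its (noisy) Q-values.
    q = student.q_values(state)
    return max(q) - min(q) > t_si

def ask_uncertain(student, state, t_unc):
    # Equation 2: a small Q-value range is read as uncertainty.
    q = student.q_values(state)
    return max(q) - min(q) < t_unc

def ask_unfamiliar(state_features, visited_features, t_unf):
    # Equation 3: Euclidean distance to the nearest previously visited state.
    if not visited_features:
        return True
    dists = [np.linalg.norm(np.asarray(state_features) - np.asarray(v))
             for v in visited_features]
    return min(dists) > t_unf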

3.3 Jointly-Initiated Advising

The student-initiated and teacher-initiated advising approaches both have shortcomings. The teacher-initiated approach requires the teacher to always pay attention. On the other hand, the advising decisions of the student are likely to be weak, since they are guided by the student's noisy Q-value estimates. We design jointly-initiated advising approaches to address these shortcomings through an appropriate division of tasks between the teacher and the student. These approaches do not require the teacher to pay continuous attention while still utilizing the teacher's more informed signal about whether advice is beneficial in a given state.

In jointly-initiated advising, the student decides whether to ask for the teacher's attention based on one of the student-initiated approaches for asking for advice. Then, the teacher decides whether to provide advice based on one of the teacher-initiated advising approaches. We denote a jointly-initiated heuristic by [X–Y], where X is a student heuristic for asking for the teacher's attention and Y is a teacher heuristic for determining whether to give advice in the current state. For instance, [Ask Important–Correct Important] means that the student asks for the teacher's attention when I_student(s) > t_si. The teacher will then assess the state and will give advice if I_teacher(s) > t_ti.

Once the teacher decides to give advice, it will continue monitoring the student's actions until advice is no longer needed, and will later resume monitoring only when the student next asks for the teacher's attention.¹ The motivation for this approach is that once the teacher is already paying attention, it will be better able to judge whether additional advice is required in subsequent states, until the student takes the right course of action. In addition, it requires less context-switching by the teacher.
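Putting the two sides together, a rough sketch of one step of the [Ask Important–Correct Important] protocol with this continued-monitoring rule is shown below; it reuses the hypothetical helpers from the earlier sketches, with the thresholds t_si and t_ti passed in explicitly.

def jointly_initiated_step(student, teacher, state, budget, monitoring, t_si, t_ti):
    action = student.choose_action(state)
    # Student side: request the teacher's attention only when its own
    # importance estimate exceeds the student threshold.
    if not monitoring and budget > 0 and ask_important(student, state, t_si):
        monitoring = True
    advised = False
    if monitoring and budget > 0:
        # Teacher side: advise only if the state is truly important and the
        # student's intended action is wrong (Correct Important).
        if correct_important(teacher, state, action, t_ti):
            action = teacher.policy(state)
            budget -= 1
            advised = True
    # The teacher keeps watching after giving advice and stops monitoring
    # once it judges that no further advice is needed.
    if monitoring and not advised:
        monitoring = False
    return action, budget, monitoring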

4 Empirical Evaluation

Our experiments have four objectives: (1) Comparing student-initiated and teacher-initiated strategies: we assess the relative strengths and weaknesses of the existing approaches; (2) Evaluating the proposed jointly-initiated approaches: we compare the performance of the joint heuristics to that of the best-performing prior heuristics, as determined by the first experiment; (3) Exploring the effect of the student's initial policy quality on performance: in real-world settings, autonomous agents will likely start with some pre-programmed basic policy rather than learn "from scratch". Therefore, we evaluate the benefits of the teaching sessions when varying the quality of the student's initial policy. The quality of the initial policy is varied in two ways: by varying the length of the student's independent training (without access to teacher advice) prior to the teaching session, and by pre-training the student in limited settings that do not include some important features of the game, so that the student cannot learn certain skills; (4) Exploring the effect of sharing the student's intended action with the teacher: while sharing the student's intended action can reduce the use of the teacher's advice budget, sharing the student's action might be infeasible in some domains, and also incurs additional communication costs. Therefore, we explore the extent to which sharing the intended student action benefits learning.

¹ We evaluated the student-initiated approaches using this continued monitoring, but it did not lead to significant differences.


Figure 1: The Pac-Man Game.

4.1 Experimental Setup

We used the Pac-Man vs. Ghosts League competition [Rohlfshagen and Lucas, 2011] as our experimental domain. Figure 1 shows the game maze used in our experiments. This game configuration includes two types of food pellets: regular pellets (small dots) are worth 10 points each, and power pellets (larger dots) are worth 50 points each. In addition, after eating a power pellet, ghosts become edible for a limited time period. Pac-Man receives 200 points for each eaten ghost. A game episode ends when Pac-Man is eaten by a ghost, or after 2000 time steps. Ghosts chase Pac-Man with 80% probability and otherwise move randomly. In each state, Pac-Man has at most four moves (right, left, up or down).

Due to the large size of the state space, we use a high-level feature representation for state-action pairs. Specifically, we use the 7-feature representation from Torrey & Taylor's [2013] implementation. Q-values are defined as a weighted function of the feature values f_i(s, a):

Q(s, a) = ω_0 + Σ_i ω_i · f_i(s, a)    (4)

The student agent employed the Sarsa(λ) algorithm to learn the weights in Equation 4. We used the same parameter configuration as Torrey & Taylor [2013]: ε = 0.05, α = 0.001, γ = 0.999, λ = 0.9. The teacher agent was trained with the same learning configuration until its performance converged. Table 1 summarizes the heuristics for teacher-initiated and student-initiated advising strategies studied in our experiments, together with the thresholds we used for each of them. The thresholds were determined empirically.
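For reference, a minimal sketch of the linear Q-function of Equation 4 and a standard Sarsa(λ) update with accumulating eligibility traces is shown below. The constant bias feature (so that weights[0] plays the role of ω_0) and the numpy representation are assumptions, not details of the original implementation.

import numpy as np

ALPHA, GAMMA, LAM = 0.001, 0.999, 0.9   # learning parameters from the text

def q_value(weights, features):
    # Equation 4, with features[0] == 1.0 acting as the bias term for omega_0.
    return float(np.dot(weights, features))

def sarsa_lambda_update(weights, traces, f_sa, reward, f_next_sa, done):
    # Standard Sarsa(lambda) with linear function approximation.
    delta = reward - q_value(weights, f_sa)
    if not done:
        delta += GAMMA * q_value(weights, f_next_sa)
    traces = GAMMA * LAM * traces + f_sa          # accumulating eligibility traces
    weights = weights + ALPHA * delta * traces
    return weights, traces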

4.2 Evaluation Metrics

We evaluate the student's learning rate by assessing the student's performance at different time points during training. Specifically, in each trial, we paused training after every 100 game episodes and evaluated the student's policy at that time point by averaging 30 evaluation episodes (in which the student uses its current policy without exploring or updating its Q-values). Because trials have high variance depending on the student's exploration, we generate a learning curve by aggregating 30 separate trials. For example, Figure 2 (left) shows the learning curves comparing the performance of teacher-initiated and student-initiated approaches. The x-axis represents training episodes, and y-axis values show the average episode reward at that point in the training session.
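A schematic version of this evaluation protocol, with run_episode, make_env, and make_student as hypothetical stand-ins for the actual training and evaluation code, might look as follows.

import numpy as np

def evaluate_policy(env, student, run_episode, episodes=30):
    # Average reward of the current policy, with exploration and learning frozen.
    return np.mean([run_episode(env, student, explore=False, learn=False)
                    for _ in range(episodes)])

def learning_curve(make_env, make_student, run_episode,
                   trials=30, checkpoints=20, block=100):
    # Pause training every `block` episodes, evaluate, and average over trials.
    curve = np.zeros(checkpoints)
    for _ in range(trials):
        env, student = make_env(), make_student()
        for c in range(checkpoints):
            for _ in range(block):
                run_episode(env, student, explore=True, learn=True)
            curve[c] += evaluate_policy(env, student, run_episode)
    return curve / trials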

For the teacher, cumulative attention is evaluated by averaging the number of states in which the teacher was asked to monitor the student. We assume that the teacher completely stops monitoring once the advice budget is fully used. Cumulative attention curves are generated by recording the total number of states in which the teacher's attention was required after every 100 game episodes and averaging these values over 30 trials. The right plot of Figure 2 shows a cumulative attention curve. As in the learning curve, the x-axis corresponds to training episodes. The y-axis values show the number of states in which the teacher's attention was required up to a given time point.

The overall attention duration required from the teacher is the average number of states it takes to use the entire advice budget. This metric can also be assessed by looking at the x-value of the point at which the cumulative attention (y-value) flattens in the cumulative attention curves. For example, in Figure 2 (right), when using the Correct Important heuristic, the overall duration of required attention is 90 episodes, while Early Correcting only requires an attention duration of 10 episodes, as the advice budget gets used quickly.

To assess the statistical significance of differences in average rewards and cumulative attention, we ran paired t-tests comparing the averages after each 100 training episodes.

Heuristic            Initiator   Threshold                            Shared Action
Early Correcting     teacher     None                                 Yes
Early Advising       teacher     None                                 No
Correct Important    teacher     200                                  Yes
Advise Important     teacher     200                                  No
Ask Important        student     50                                   Yes
Ask Uncertain        student     30                                   Yes
Ask Unfamiliar       student     Avg. distance to nearest neighbor    Yes

Table 1: Student-initiated and teacher-initiated heuristics.

4.3 Teacher Vs. Student Heuristics

Figure 2 (left) shows the student's learning rate when using heuristics for teacher-initiated and student-initiated advising. When using the teacher-initiated approaches, the teacher constantly monitors the state of the student until the advice budget runs out. When student-initiated advising approaches are used, the teacher only monitors the student when it is asked to advise. In all cases, the student's intended action is available to the teacher when giving advice.

Substantially and significantly higher learning gains were obtained when using the teacher-initiated Correct Important heuristic compared to all other heuristics (p < 10^-16). This can be seen in Figure 2 (left). For example, after 200 training episodes with the teacher, an average reward of 3055.64 is obtained when using Correct Important, compared to an average reward of 2688.03 when using the next best heuristic. All other heuristics led to higher learning rates compared to the no-advice (green dashed line) condition (p < 10^-10). There were no statistically significant differences between any other pairs of advising strategies.

The higher learning gains obtained when using teacher-initiated advising are expected; the teacher has more knowledge than the student about the domain and has a good policy for making decisions in it, which allows it to choose effective teaching opportunities. Consider the game state shown in Figure 1 as an illustrative example. Intuitively, this is an important state: if Pac-Man (the student) makes a wrong move, it might be eaten by a ghost; if, however, it proceeds towards the power pellet, it will have an opportunity to earn a high reward for eating a ghost. The teacher, which already knows the environment, can identify that this state is important based on its Q-values, while the student might not yet have enough information to come to this conclusion.

While the Correct Important heuristic results in the highest learning gains, it requires significantly more teacher attention than the other approaches. This can be seen in Figure 2 (right), which shows the average cumulative attention required of the teacher, i.e., the total number of states in which the teacher's attention was required up to a given episode. Teacher attention is required in significantly more states when using Correct Important compared to all other heuristics (72382.06 states compared to only 6358.86 states for Ask Unfamiliar, which is the most attention-demanding student-initiated heuristic, p < 0.0001). In addition, the overall duration of the teacher's attention (i.e., the number of episodes until the advice budget is fully used) is larger for Correct Important (90 episodes compared to less than 40 episodes required when using any of the other heuristics).

4.4 Jointly-Initiated Teaching Strategies

The results reported so far show that the teacher-initiated advising strategies outperform the student-initiated ones in terms of learning gains, but require more attention. In this subsection, we present results from an evaluation of the jointly-initiated teaching strategies, which aim to reduce the attention required from the teacher while maintaining the learning benefits of teacher-initiated advising. We thus compare their performance with that of the top-performing teacher-initiated teaching strategy, Correct Important.

As Figure 3 (left) shows, when using the heuristic [Ask Important–Correct Important], the student obtains rewards similar to those obtained when using Correct Important. The rewards at any given time point were on average slightly higher when using [Ask Important–Correct Important], but while this difference was statistically significant (p = 0.008), it was not substantial (an average difference of 18.5 points). Figure 3 (right) shows that the [Ask Important–Correct Important] heuristic required the teacher's attention in fewer states (64,711.47 states compared to 72,382.07 states). This difference was statistically significant (p < 10^-5) and substantial. However, the overall duration of required teacher attention when using [Ask Important–Correct Important] is 140 episodes (indicated by the x-value corresponding to the maximal total attention), compared to 50 episodes when using the Correct Important heuristic. That is, while the jointly-initiated teaching strategy requires the teacher's attention in fewer states, the duration of the training session, and thus the teacher's needed attention span, is longer.

To ensure that the performance is achieved as a result of the student's choice of states in which to ask for advice, we also evaluate a random baseline in which the student asks for the teacher's attention with 0.5 probability (the average rate of asking for advice by the Ask Important heuristic until the advice budget runs out). As shown in Figure 3, this random baseline ([Ask Random–Correct Important], dashed purple) does not perform as well as [Ask Important–Correct Important]. Moreover, it requires significantly more cumulative teacher attention, as well as a longer teaching period (both the attention and learning-gain differences were statistically significant, p < 10^-5). This shows that while the student's perception of importance is not as accurate as that of the teacher, it is still useful for identifying advising opportunities.

The strength of the [Ask Important–Correct Important] heuristic is its recall for important states. While the student heuristic Ask Important has many false positives when trying to identify important states, due to the student's inaccurate Q-values, combining it with the teacher's Correct Important heuristic, which assesses whether the state is truly important, mitigates this weakness.

The other jointly-initiated teaching strategies, [Ask Uncertain–Correct Important] and [Ask Unfamiliar–Correct Important], lead to some improvement in learning rate compared to the No Advice baseline, but perform significantly worse than Ask Random and require more cumulative attention, because they rarely capture important states. That is, they suffer from a high false negative rate when trying to identify important states, and therefore when the teacher uses Correct Important in combination with these approaches, it typically decides not to give advice (as the state is not important). This is evident from the long duration it takes until the advice budget runs out when using these heuristics (Figure 3).

[Ask Uncertain–Correct Important] suffers from a higher false negative rate because, in essence, Ask Uncertain captures states with a small Q-value range rather than those with a large one. While in some of these states the student might be uncertain of its actions, a small range might also mean that none of the actions will lead to significantly decreased performance. [Ask Unfamiliar–Correct Important] also suffers from possibly missing important states and using the advice when its impact is smaller, because unfamiliar states might not be important ones. In addition, appropriately identifying unfamiliar states likely requires more sophisticated, domain-dependent similarity methods.

4.5 The Effect of the Student's Initial Policy

The quality of the student's initial policy may affect the effectiveness of different advising strategies. To gain insight into this relationship, we experimented with student agents that differ in the quality of their initial policies.

We observe similar trends and relative performance of the different teaching strategies when varying the length of the student's independent training prior to the teaching session. As the quality of the student's initial policy improves (i.e., the initial policy is based on more independent learning episodes in the same game settings), the jointly-initiated and student-initiated teaching strategies that are based on state importance can better identify important states in which the teacher's attention is required. However, in general, the overall benefit of advising the student decreases, as there is less room for improvement of higher-quality initial policies.


Figure 2: Average reward (left) and average attention (right) obtained by student-initiated and teacher-initiated approaches. The student was trained for 100 episodes prior to teaching, and actions were shared with the teacher.

Figure 3: Average reward (left) and cumulative attention (right) obtained by jointly-initiated and teacher-initiated advising. The student was trained for 100 episodes prior to teaching, and actions were shared with the teacher.

Figure 4: Average reward (left) and cumulative attention (right) obtained by jointly-initiated and teacher-initiated advising. The student was trained for 300 episodes prior to teaching.

Figure 5: Average reward (left) and cumulative attention (right) obtained by jointly-initiated and teacher-initiated advising. The student was trained for 150 episodes prior to teaching in a game without power pellets, and actions were shared with the teacher.


Figure 6: Average reward (left) and cumulative attention (right) obtained by jointly-initiated and teacher-initiated advising when the action is shared vs. when it is not shared. The student was trained for 100 episodes prior to teaching.

Figure 4 shows the performance of the jointly-initiated approaches and the Correct Important heuristic when the student's initial policy was established after 300 episodes of individual training. While the overall trends are similar to those obtained when the initial policy was determined after only 100 episodes (Figure 3), the reduction in the required teacher attention when using [Ask Important–Correct Important] compared to that required when using Correct Important increases when the student starts with a better initial policy. For example, when the student was independently trained for 100 episodes, advising based on [Ask Important–Correct Important] required attention in 7670.6 fewer states than advising based on Correct Important; the difference in required attention increased to 92162.6 states when the student was trained independently for 300 episodes. In addition, when the student had longer independent training, the overall attention duration when using [Ask Important–Correct Important] was only 10 episodes longer than when using Correct Important (compared to a 50-episode gap when the independent training lasted only 100 episodes).

Teacher advice is especially beneficial when the student learns its initial policy in a limited setting that does not allow the student to explore the complete state space. Figure 5 shows the performance of the different teaching strategies when applied to a student that developed an initial policy in settings without power pellets. Without any advising (green bottom line), the student's policy does not improve, as it does not manage to learn about the positive rewards of eating power pellets and consequently does not learn to eat ghosts. Since the student has already established low weights for features that correspond to the possibility of eating power pellets, without guidance it is not able to learn this new skill. However, with teaching, it quickly improves its policy.

While [Ask Important–Correct Important] is still the best-performing heuristic for jointly-initiated advising, the [Ask Unfamiliar–Correct Important] heuristic does relatively better compared to settings in which the student was trained on the same game instance prior to teaching. Although it is outperformed by the [Ask Uncertain–Correct Important] heuristic in terms of student performance (after 400 episodes), it requires significantly less teacher attention. The relative improvement of the Ask Unfamiliar approach for requesting the teacher's attention can be explained by the fact that states involving high proximity to power pellets may appear less familiar (as they were not included in the initial independent training) and also correspond to important states.

4.6 Sharing the Student's Intended Action

Correct Important and [Ask Important–Correct Important] both assume that the intended action is shared with the teacher so that the teacher can correct the student if the state is considered important and the student's intended action is sub-optimal. In contrast, Advise Important and [Ask Important–Advise Important] give advice when the state is considered important, regardless of the student's intended action. As Figure 6 (left) shows, sharing the intended action significantly improves performance for all heuristics, as it allows the teacher to save the teaching budget for preventing student mistakes. However, it also requires more attention from the teacher (Figure 6, right). The effect of sharing the action is similar for the teacher-initiated and jointly-initiated approaches.

5 Conclusion and Future Work

This paper investigates interactive teaching strategies in a student-teacher learning framework. We show that using a joint decision-making approach, the amount of attention required from the teacher can be reduced compared to teacher-initiated approaches, while providing similar learning benefits.

Our empirical evaluation used the Pac-Man game, which is an example of a dynamic domain in which making mistakes in some states is particularly harmful (e.g., being eaten by a ghost). Real-world dynamic settings such as semi-autonomous driving will also likely have these types of important states (e.g., causing an accident). However, the importance heuristic will likely not generalize well to domains where "unrecoverable" mistakes are rare (e.g., mapping a new territory, where each action can be immediately reversed). In such settings, unfamiliarity- and uncertainty-based heuristics are likely to be more useful.

There are several directions for future work. Methods for generalizing the teacher's advice beyond a specific state could further accelerate learning. For instance, the student could try to learn a model of important states based on the teacher's responses to advice requests, including the cases in which the teacher decided not to give advice. From the human-agent interaction perspective, we plan to study people's attention while interacting with semi-autonomous systems and to incorporate attention models into the student's strategies for asking for advice. As AI systems continue seeking input from humans in critical domains such as driving, we see value in developing approaches that reason about considerations specific to humans and benefit from the complementary abilities of humans and machines.


Acknowledgments

We thank Eric Horvitz for helpful comments and suggestions.

References

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[Argall et al., 2009] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[Chernova and Veloso, 2007] Sonia Chernova and Manuela Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 233. ACM, 2007.

[Clouse, 1996] Jeffery Allen Clouse. On integrating apprentice learning and reinforcement learning. 1996.

[Cummings and Mitchell, 2008] Mary L. Cummings and Paul J. Mitchell. Predicting controller capacity in supervisory control of multiple UAVs. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 38(2):451–460, 2008.

[Fong et al., 2002] Terrence Fong, Charles Thorpe, and Charles Baur. Robot as partner: Vehicle teleoperation with collaborative control. In Multi-Robot Systems: From Swarms to Intelligent Automata, pages 195–202. Springer, 2002.

[Goodrich et al., 2001] Michael A. Goodrich, Dan R. Olsen, Jacob Crandall, and Thomas J. Palmer. Experiments in adjustable autonomy. In Proceedings of the IJCAI Workshop on Autonomy, Delegation and Control: Interacting with Intelligent Agents, pages 1624–1629, 2001.

[Griffith et al., 2013] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles Isbell, and Andrea L. Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pages 2625–2633, 2013.

[Horvitz et al., 1999] Eric Horvitz, Andy Jacobs, and David Hovel. Attention-sensitive alerting. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 305–313. Morgan Kaufmann Publishers Inc., 1999.

[Horvitz et al., 2003] Eric Horvitz, Carl Kadie, Tim Paek, and David Hovel. Models of attention in computing and communication: From principles to applications. Communications of the ACM, 46(3):52–59, 2003.

[Judah et al., 2010] Kshitij Judah, Saikat Roy, Alan Fern, and Thomas G. Dietterich. Reinforcement learning via practice and critique advice. In AAAI, 2010.

[Judah et al., 2014] Kshitij Judah, Alan P. Fern, Thomas G. Dietterich, et al. Active imitation learning: Formal and practical reductions to i.i.d. learning. The Journal of Machine Learning Research, 15(1):3925–3963, 2014.

[Knox and Stone, 2010] W. Bradley Knox and Peter Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, pages 5–12. International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[Loftin et al., 2015] Robert Loftin, Bei Peng, James MacGlashan, Michael L. Littman, Matthew E. Taylor, Jeff Huang, and David L. Roberts. Learning behaviors via human-delivered discrete feedback: Modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems, pages 1–30, 2015.

[Miller et al., 2015] David Miller, Annabel Sun, Mishel Johns, Hillary Ive, David Sirkin, Sudipto Aich, and Wendy Ju. Distraction becomes engagement in automated driving. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 59, pages 1676–1680. SAGE Publications, 2015.

[Rohlfshagen and Lucas, 2011] Philipp Rohlfshagen and Simon M. Lucas. Ms Pac-Man versus Ghost Team CEC 2011 competition. In 2011 IEEE Congress on Evolutionary Computation (CEC), pages 70–77. IEEE, 2011.

[Rosenstein et al., 2004] Michael T. Rosenstein, Andrew G. Barto, Jennie Si, Andy Barto, Warren Powell, and Donald Wunsch. Supervised actor-critic reinforcement learning. In Handbook of Learning and Approximate Dynamic Programming, pages 359–380, 2004.

[Rosman and Ramamoorthy, 2014] Benjamin Rosman and Subramanian Ramamoorthy. Giving advice to agents with hidden goals. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1959–1964. IEEE, 2014.

[Tambe et al., 2002] Milind Tambe, Paul Scerri, and David V. Pynadath. Adjustable autonomy for the real world. Journal of Artificial Intelligence Research, 17(1):171–228, 2002.

[Torrey and Taylor, 2013] Lisa Torrey and Matthew Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[Zilberstein, 2015] Shlomo Zilberstein. Building strong semi-autonomous systems. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 4088–4092, 2015.