
Integrating Human Instructions and Reinforcement Learners: An SRL Approach

Pradyot Korupolu V N, Computer Science Dept., IIT Madras, India
S S Manimaran, Computer Science Dept., IIT Madras, India
Balaraman Ravindran, Computer Science Dept., IIT Madras, India
Sriraam Natarajan, Translational Science Institute, Wake Forest University, USA

    Abstract

As robots and other intelligent systems move out of tailored environments and enter the real world, they will have to adapt quickly and learn from their surroundings. Like humans, they will have to learn from various sources: experience, peers, teachers, trial and error, etc. In earlier work, we proposed a framework that facilitates a teacher-student relationship between humans and robots, wherein the human-robot interactions were restricted to certain classes of simple instructions. In this work we extend this framework to handle multiple interpretations of instructions. Instead of the human teacher taking pains to carefully instruct the learning agent (robot) in a manner that it can understand, the teacher instructs in his own natural style and the agent interprets the instructions appropriately. The agent generalizes these instructions by building relational models of tasks using Statistical Relational Learning (SRL) techniques. In addition to learning from teachers, the agent learns on its own using Reinforcement Learning (RL). The agent chooses the most appropriate among these knowledge models using certain confidence measures. The generalization power of SRL makes learning and transfer of knowledge to related objects easy. The trial-and-error nature of RL enables the agent to improve upon what it has been taught and, in some cases, unlearn human teaching.

    1 Introduction

One of the long-term goals of AI is to build adaptive systems that can interact with and learn from humans. Reinforcement Learning (RL) [25] provides a natural framework for adaptive learning: the agent receives feedback from the environment in the form of rewards, based on which it learns an action preference for each state that it visits. Standard RL methods use an atomic, propositional, or propositional-function representation to capture the current state and possible actions of the learner. Although this type of representation suffices for many applications, as demonstrated by successful RL systems [27, 6], many real-world problems such as robotics, real-time strategy games, logistics, and a variety of planning domains call for relational representations. In such domains, achieving abstraction or generalization with standard function approximators poses significant representational difficulties and requires a large number of training examples.

On the other hand, these domains are naturally described by relations among an indefinite number of objects. Recently, algorithms have been proposed that operate directly on such relational domains [19, 20, 23, 30]. Sanner and Boutilier [23] used situation calculus to capture the dynamics and proposed a linear programming formulation for solving the action selection problem. Wang et al. [30], on the other hand, employed First Order Decision Diagrams (FODDs) to capture the domain dynamics and the reward and value functions; action selection was then posed as manipulation of these diagrams using several arithmetic operators on FODDs that they defined. While these methods are attractive, they still suffer from the need to explore the environment extensively in order to collect examples for learning. In many domains a human expert is naturally available who can provide guidance or instructions to the learner. In such relational RL (RRL) systems, the expert can be utilized to design the reward functions and even provide models of the environment. This is the approach taken by Thomaz et al. [28], where human teachers provide rewards for actions chosen by the learner. In large domains this is a cumbersome process compared to the possibility of the expert providing examples and direct instructions to the learner.

Recently, a policy-gradient approach to learning in relational domains has been proposed that can use a small number of expert trajectories to initialize the policy [11]. While this method can use the initial trajectories, there is no interaction with the human beyond the initial model.

AI has a long history of methods that use expert trajectories for learning to act (imitation learning). Imitation learning has been studied under a variety of names, including learning by observation [24], learning from demonstrations [2], programming by demonstration [7], programming by example [15], apprenticeship learning [18], behavioral cloning [22], learning to act [12], and others. One feature distinguishing imitation learning from ordinary supervised learning is that the examples are not i.i.d., but follow a trajectory. Nevertheless, techniques from supervised learning have been successful for imitation learning [21]. Recently, imitation learning has been posed in highly stochastic yet relational domains using a statistical relational learning (SRL) [10] formulation and solved using functional-gradient boosting [16]. Lately, there has been a surge of interest in developing apprenticeship learning methods (also called inverse reinforcement learning) [1, 21, 17, 26]. Here one assumes that the observations are generated by a near-optimal policy of a Markov Decision Process (MDP) with an unknown reward function. The approaches usually take the form of learning the reward function of the agent and solving the corresponding MDP. All these methods assume that the expert is optimal (or at least near-optimal), and their aim is to generalize from the example trajectories and emulate the expert across several scenarios.

We take a different approach to interacting with and learning from humans. In our framework, the human expert provides instructions: suggestions for action selection or specifications of the features relevant for learning a policy. This reduces the effort on the expert, who need not provide examples for every possible scenario. Instead, the expert can interact with the system and provide suggestions of actions or features periodically. As mentioned earlier, we are interested in relational domains where states are described using objects and relations. More precisely, we consider a sorter robotics setting in which there are balls and baskets in a room. In such a setting, the advice provided by the expert concerns the ground objects or actions. For example, the expert could gesture that the red ball in the hand of the robot should be dropped into a red basket. While the expert gestures at the red basket, there is an implicit generalization in that he or she refers to dropping the ball in a basket of the same color. Hence, there is a need to generalize the gestures of the expert to objects in the domain. Also, the domain is noisy due to the sensory noise of the robot, stochastic effects of actions, etc. To generalize in this stochastic environment, we propose the use of SRL to capture and reason with the instructions of the user. A related approach has been taken by Branavan et al. [5], where natural language instructions are mapped into rewards for the learner. We, on the other hand, do not treat instructions as rewards but as direct relevance statements on states and actions.

The chief contribution of our work is the ability to handle multiple interpretations of instructions. To see why such a framework is necessary, consider a person searching for the misplaced key to his cupboard while a friend points to a heavy paperweight on a nearby table. The person will interpret the friend's instruction either as break the lock with the paperweight or as search for the key near the paperweight. As can be seen, the interpretation of an instruction greatly changes the outcome of the task at hand. In this paper, we propose a novel framework that enables a learner to handle such instructions and choose the best possible interpretation based on specific utility measures. An additional contribution of our work is treating the problem of generalization in sequential decision problems as a teacher-student setup in which the student retains the capability to unlearn what the teacher has taught.

In the next section, we introduce the Sorter Domain, a ball-sorting domain on which we test our approach. In the later sections we present the background to the problem, explain the different types of instructions we use, and describe the algorithm in detail.

    2 Sorter Domain

The Sorter Domain (Fig 1) consists of 3 objects and 3 baskets. The task of a sorter robot is to drop each object into the basket with the same color, i.e., a red object should be dropped into a red basket. The colors of the objects and baskets are chosen randomly such that every object has at least one basket of the same color. An episode is completed when every object has been dropped into a basket. Once an object is dropped into a basket it cannot be picked up again. Dropping an object into the correct basket is rewarded +50, a wrong match is rewarded -50, and any other action is rewarded -1. Each action takes a finite time to complete execution. An object or basket occupies one of 6 fixed positions.
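For concreteness, the reward structure above can be written down as a small sketch (a minimal Python illustration under our own naming conventions; it is not taken from any released implementation):

    # Minimal sketch of the Sorter Domain rewards described above (names are ours).
    CORRECT_DROP, WRONG_DROP, STEP_COST = 50, -50, -1

    def reward(state, action):
        """Immediate reward: +50 correct drop, -50 wrong drop, -1 otherwise.
        'state' is a dict with keys 'carrying', 'at' and 'color' (a dict)."""
        if action == "Drop" and state["carrying"] is not None and state["at"].startswith("B"):
            same = state["color"][state["carrying"]] == state["color"][state["at"]]
            return CORRECT_DROP if same else WRONG_DROP
        return STEP_COST

    # e.g. reward({"carrying": "O1", "at": "B2", "color": {"O1": "red", "B2": "red"}}, "Drop") -> 50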

Figure 1: The Sorter Domain

    3 Markov Decision Process

Sequential problems are generally modeled as Markov Decision Processes (MDPs) in Reinforcement Learning. The MDP framework forms the basis of our definition of instructions.

An MDP is a tuple 〈S, A, ψ, P, R〉, where S is a finite set of states, A is a finite set of actions, and ψ ⊆ S × A is the set of admissible state-action pairs. P : ψ × S → [0, 1] is the transition probability function, with P(s, a, s′) being the probability of a transition from state s to state s′ on performing action a. R : ψ → ℝ is the expected reward function, with R(s, a) being the expected reward for performing action a in state s. As = {a | (s, a) ∈ ψ} ⊆ A is the set of actions admissible in state s; we assume that As is non-empty for all s ∈ S. A stochastic policy is a map π : ψ → [0, 1] such that for all s ∈ S, ∑_{a ∈ As} π(s, a) = 1; π(s, a) gives the probability of executing action a in state s.

The value of a state-action pair under policy π, Qπ(s, a), is the expected sum of discounted future rewards (the return) obtained by taking action a starting from state s and following policy π thereafter. The optimal value function assigns to each state-action pair the highest expected return achievable by any policy. A policy whose value function is optimal is an optimal policy π*. Conversely, for any stationary MDP, any policy that is greedy with respect to the optimal value function is an optimal policy:

π*(s) = argmax_a Q*(s, a)  ∀s ∈ S,

where Q*(s, a) is the optimal value function. If the RL agent knew the MDP, it could compute the optimal value function and extract the optimal policy from it. However, in the standard setting the agent only knows ψ, the state-action space, and must learn Q* by exploring. The Q-learning algorithm learns the optimal value function by updating its current estimate, Qk(s, a), of Q*(s, a) using the simple update [32]

Qk+1(s, a) = Qk(s, a) + α [r + γ max_a′ Qk(s′, a′) − Qk(s, a)]      (1)

where α is the learning rate, γ ∈ [0, 1] is the discount factor, and a′ is the greedy action in s′.
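Update (1) corresponds to the usual one-line tabular rule, sketched below in Python under the assumption of a dictionary-based Q table (the function and variable names are ours):

    def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
        """One tabular Q-learning step, Eq. (1): move Q(s,a) towards the TD target."""
        q_sa = Q.get((s, a), 0.0)
        best_next = max((Q.get((s_next, b), 0.0) for b in actions_next), default=0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)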

An option o is a temporally extended action [3] or a policy fragment. Q-learning applies to options too and is then referred to as SMDP (Semi-Markov Decision Process) Q-learning [3]:

Qk+1(s, o) = (1 − α) Qk(s, o) + α [R + γ^τ max_o′ Qk(s′, o′)]      (2)

where R is the sum of the time-discounted rewards accumulated while executing the option and τ is the time taken for its execution.
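Update (2) differs from (1) only in using the option-level return R and discounting by the option's duration τ; a corresponding sketch (same dictionary-based Q table as above, names ours):

    def smdp_q_update(Q, s, o, R, tau, s_next, options_next, alpha=0.1, gamma=0.9):
        """SMDP Q-learning step over options, Eq. (2); R is the time-discounted
        reward accumulated while the option ran and tau is its duration."""
        q_so = Q.get((s, o), 0.0)
        best_next = max((Q.get((s_next, o2), 0.0) for o2 in options_next), default=0.0)
        Q[(s, o)] = (1 - alpha) * q_so + alpha * (R + (gamma ** tau) * best_next)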

A state s in the Sorter Domain is given by the features {isCarrying, Carrying(O), Color(O1), inBasket(O1), Color(O2), inBasket(O2), Color(O3), inBasket(O3), Color(B1), Color(B2), Color(B3), Botat(X)}. isCarrying is an indicator variable that is true if the robot is carrying an object, Carrying(O) is the object in the robot's arm, Color(.) is the color of an object or basket, inBasket(O) indicates whether object O has been dropped into a basket, and Botat(X) is the robot's current location, where X ∈ {O1, O2, O3, B1, B2, B3}. The set of options is O = {Pickup, Drop, Goto(O1), Goto(O2), Goto(O3), Goto(B1), Goto(B2), Goto(B3)}.
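The feature set and option set above can be written down roughly as follows (an illustrative Python sketch; the field names simply mirror the features listed above and are not taken from any released code):

    from dataclasses import dataclass
    from typing import Dict, Optional

    OBJECTS = ["O1", "O2", "O3"]
    BASKETS = ["B1", "B2", "B3"]
    OPTIONS = ["Pickup", "Drop"] + ["Goto(%s)" % x for x in OBJECTS + BASKETS]

    @dataclass
    class SorterState:
        is_carrying: bool            # isCarrying
        carrying: Optional[str]      # Carrying(O); None when the arm is empty
        color: Dict[str, str]        # Color(.) for every object and basket
        in_basket: Dict[str, bool]   # inBasket(O) for every object
        bot_at: str                  # Botat(X), X in OBJECTS + BASKETS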

On real-world robots, options such as Goto(.) are generally implemented using planners, making them deterministic in terms of the next state reached on executing the option. However, the time taken to execute the plan (option) can vary depending on the time taken to navigate to the target, making the options stochastic in τ. The option Goto(O2) will take different times to execute depending on the robot's current position. Sensor noise is a chief contributor to this stochasticity, as it greatly affects the robot's ability to observe the world and localize itself. SMDP Q-learning provides a learning framework that can accommodate the stochasticity mentioned above. By abstracting out options as plans, we also free ourselves from the difficulties posed by a continuous domain; in essence, we now work with a discrete state space.

    4 Instructions

We define instructions as inviolate external inputs to the RL agent which affect its decision making, direct its exploration, or alter its belief about a policy. For example, an agent learning to wash cups can be instructed to use the scrubber.

Instructions have also been used to specify regions of the state space or objects of interest. In [8], a human directs Sonja, a game player, using instructions. For example, he instructs Sonja to get the topmost amulet from that room. This instruction binds the region of search and the object of interest while withholding direct information about solving the task.

    4.1 Mathematical Formulation

This section introduces a mathematical formulation for instructions. Representing the policy π(s, a) as shown below makes it easy to understand the two types of instructions that we use in this paper:

π(s, a) = G(Φ(s), a)      (3)

where Φ(s) models operations on the state space and G(.) is a mapping from (Φ(s), a) to the real interval [0, 1]. Φ(s) can either model mappings to a subspace of the current state space or model projections of the state s onto a subset of features.

    4.1.1 π-Instructions

Instructions of this type take the form of an action or option to be performed at the current state: Iπ(s) = a, where a ∈ As. For example, in the Sorter Domain, the instruction "Goto(O3)" is a π-Instruction. If policy models are built over such instructions, their effect on the policy would be π(s, a) ≈ 1.

    4.1.2 Φ-Instructions

Instructions of this type are given as structured functions [33], denoted IΦ. In this paper, we restrict ourselves to projections, a class of structured functions. Such an instruction IΦ is captured by Φ(s) as ρD′(s), D′ ⊆ D, where D is the set of features representing the state set S and ρD′ is the projection operation. D′ ⊆ D captures the possibility that some features in a state representation are inconsequential for learning the optimal policy. For example, in the Sorter Domain, the instruction "Object 3" is a Φ-Instruction. The effect of this instruction is to set D′ to {isCarrying, Carrying(O), Color(O3), inBasket(O3), Botat(X)}. The other objects and baskets are ignored.

    5 Proposed Approach

In earlier work, we have shown in independent experiments that using each of the instruction types mentioned above is very effective in speeding up RL. Choosing the best instruction type to use, however, proved a challenge. As the "paperweight" example in the introduction illustrates, the interpretation of instructions is crucial. In this paper, we attempt to overcome this dilemma by proposing an algorithm that can handle multiple interpretations. In this section, we explain our approach and the heuristics used. We assume a human instructs the agent in a manner that enables the agent to interpret the instruction both as a π-Instruction and as a Φ-Instruction. Pointing gestures are one such instructing mechanism, where pointing to an object can be interpreted either as "Goto(Object)" or as D′ = {Object features}.
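The two readings of a single pointing gesture can be made explicit (an illustrative sketch; the helper names and the convention for selecting the retained features are ours):

    def interpret_as_pi(obj):
        """pi-interpretation: the gesture names the next option, Goto(obj)."""
        return "Goto(%s)" % obj

    def interpret_as_phi(obj):
        """Phi-interpretation: the gesture selects the reduced feature set D'."""
        return ["isCarrying", "Carrying", "Botat",
                "Color(%s)" % obj, "inBasket(%s)" % obj]

    # A pointing gesture towards O3 yields both candidates:
    # interpret_as_pi("O3")  -> "Goto(O3)"
    # interpret_as_phi("O3") -> ["isCarrying", "Carrying", "Botat", "Color(O3)", "inBasket(O3)"]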

Although this difference in interpretation does not affect the instructor, it greatly influences the learner. For instance, a Φ-Instruction results in projecting the state space S onto a reduced feature space. The projected state space S′ is smaller, and usually the applicable set of actions AS′ is also smaller; learning the optimal policy on S′ is thus quicker. π-Instructions, on the other hand, give the optimal action at a state, which can be generalized over similar states. Although both types aid learning, their effects are very different. Hence it is very important for the learner to choose carefully.

In the following approach, we build both models in parallel (Algorithm 1). The algorithm is split into two phases. During the training phase, the learner accumulates instructions, when available, to be used later to train an instruction model. This model is used to select actions in the post-training phase. The learner maintains a separate set of instructions for the π- and Φ-Instructions, Dπ and DΦ. Every instruction I(s) is interpreted both as an action (π-Ins) and as a binding on the state space (Φ-Ins). For instance, I(s) = a pointing gesture towards an object is interpreted as an action Iπ(s) = Goto(Object) as well as a projection operation IΦ(s) = ρD′(s), where D′ = {Object features}. Although both Dπ and DΦ are updated, the agent executes Iπ(s). This is to make it easy to use the framework on real-world agents. Ideally, each instruction model should suggest an action and both should be performed, subject to starting in the exact same initial state, using the same random seeds, etc. This is not easy to set up in a real-world application, and hence we choose to perform Iπ(s) at a state. This arrangement holds only during the training phase. If no instruction is available at a step, the learner chooses the action suggested by a simple SMDP Q-Learner [4] that is independent of the instruction models.

Once the training period is over, the models Π̂π and Π̂Φ are learned from the training sets. Learning these models requires us to represent the probabilistic dependencies among the attributes of the related objects. We use Markov Logic Networks (MLNs) [9] to perform the generalization, as they can succinctly represent these dependencies, resulting in sample-efficient learning and inference. Combining MLNs and RL is not new and has been done successfully in the past. Torrey et al. [29] used MLNs and RL to transfer knowledge from a simple 2-on-1 Breakaway task to a 3-on-2 Breakaway task in the RoboCup simulated-soccer domain. Wang et al. [31] approximate an RL agent's policy using MLNs, where they update the weights of the MLN using Q-values. Along similar lines, we use MLNs to model a human instructor's preferences, where the inputs to the MLN are either actions or attention pointers.

    Algorithm 1 LearnWithInstructions

while Training Period do
    s is the current state
    if I(s) available then
        a ← Iπ(s)
        Dπ ← Dπ ∪ {s, Iπ(s)}
        DΦ ← DΦ ∪ {s, IΦ(s)}
    else
        a ← Π̂Q(s)
    end if
    Update Q(s, a)
end while
Train(Π̂πIns, Dπ)
Train(Π̂ΦIns, DΦ)
while Episode not terminated do
    Π̂(s) ← max conf(Π̂πIns(s), Π̂ΦIns(s), Π̂Q(s))
    a ← Π̂(s)
    Perform option a
    Update Q(s, a)
end while
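A rough Python rendering of the training phase of Algorithm 1 is sketched below, under assumed interfaces for the environment, the human input channel, and the SMDP Q-learner (none of these names come from the paper's implementation):

    def training_phase(env, q_learner, get_instruction, num_episodes):
        """Phase one: act on the pi-reading of each instruction while logging
        both readings; fall back to the SMDP Q-learner when no instruction is given."""
        D_pi, D_phi = [], []
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                instr = get_instruction(s)          # None if the human stays silent
                if instr is not None:
                    a = instr.as_pi()               # execute the pi-interpretation
                    D_pi.append((s, instr.as_pi()))
                    D_phi.append((s, instr.as_phi()))
                else:
                    a = q_learner.select(s)
                s_next, R, tau, done = env.step(a)
                q_learner.update(s, a, R, tau, s_next)   # Eq. (2)
                s = s_next
        return D_pi, D_phi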

In the second phase, instructions are absent and the trained models are used to choose actions. The available action models are Π̂πIns(s), Π̂ΦIns(s) and Π̂Q(s) (the π-Ins, Φ-Ins and Q-Learner policies). The action suggested by the most confident model is used. The confidence of a model is measured by

conf = max_a Π̂(s, a) − max_{b ≠ a*} Π̂(s, b)      (4)

where a* = argmax_a Π̂(s, a). The selected action is performed and the Q-Learner is updated accordingly.
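Measure (4) is simply the gap between the best and second-best action probabilities of a model; a sketch of the resulting model selection follows (the action_probs interface is an assumption of ours):

    def confidence(action_probs):
        """Eq. (4): difference between the top two action probabilities at s."""
        ranked = sorted(action_probs.values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

    def select_action(models, s):
        """Act according to the most confident of the candidate policies."""
        best = max(models, key=lambda m: confidence(m.action_probs(s)))
        probs = best.action_probs(s)
        return max(probs, key=probs.get)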

    5.1 Using the Π̂ΦIns model

Φ-Instructions result in a projection ρD′ of the state space onto a reduced space. By learning the optimal policy in the reduced space, the optimal policy in the original space can be derived by lifting actions suitably. Since in this paper we work with simple projections, the lifting of actions is trivial: an action in the reduced space is lifted to be the same action in the original state space. A detailed explanation of learning using Φ-Ins is given in our earlier work [14].

    5.2 Learning the models

We transform the state features into a set of predicates and ground them using the feature values. The instructions at a state are also transformed into predicates. For example, the π-Instruction Go to object is transformed into a predicate Go-to-object that is set to true along with the set of ground predicates describing the state.

We use the techniques proposed in [9] to learn the structure of the MLNs and for inference. Every time a new state is encountered, an action (π(s)) and a correct binding (Φ(s)) are inferred by treating the ground predicates representing the state as the evidence. In our experiments, we use the Alchemy package [13] for these tasks.

The state features are represented as predicates {isCarrying, Carrying(O), Color(O1,c), inBasket(O1), Color(O2,c), inBasket(O2), Color(O3,c), inBasket(O3), Color(B1,c), Color(B2,c), Color(B3,c), Botat(X)}. isCarrying is true if the robot is carrying an object, Carrying(O) is true if object O is in the robot's arm, Color(., c) is true if the color of the corresponding object/basket is c ∈ {c1, c2, c3, c4, ...}, inBasket(O) is true if object O has been dropped into a basket, and Botat(x) is true if the robot's current location is x, where x ∈ {O1, O2, O3, B1, B2, B3}. The set of actions is A = {Pickup, Drop, Goto(O1), Goto(O2), Goto(O3), Goto(B1), Goto(B2), Goto(B3)}. The colorequal(a,b) predicate is true if a and b have the same color. We assume background knowledge such as ∀a, isBasket(a) ⇒ ¬isObject(a) and isCarrying ⇒ ¬Pickup.
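As an illustration of the grounding step, a state can be turned into the set of ground predicates listed above (a sketch using the SorterState fields from Section 3; the plain string syntax is generic and is not meant to reproduce the exact Alchemy input format):

    def state_to_predicates(state):
        """Emit the ground predicates that serve as evidence for MLN inference."""
        preds = []
        if state.is_carrying:
            preds += ["isCarrying", "Carrying(%s)" % state.carrying]
        for thing, col in state.color.items():        # objects and baskets
            preds.append("Color(%s,%s)" % (thing, col))
        preds += ["inBasket(%s)" % o for o, dropped in state.in_basket.items() if dropped]
        preds.append("Botat(%s)" % state.bot_at)
        return preds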

The π-Instruction model (Π̂πIns(s)) is a set of MLNs, where each MLN represents one of the options Pickup, Drop, Goto(basket), Goto(object). In addition to the π-Ins model, we also learn a Φ-Ins model and pit the two against an SMDP Q-Learner. In other words, we choose between three candidate policies using the confidence measure given earlier. The learner acts according to the chosen policy at the current state. We normalize the output probabilities of the instruction models to get their action policies and use ε-greedy action selection for the Q-Learner. We similarly choose a policy for the agent at each state that it visits. Some of the weights learned by Π̂πIns(s) for the action Goto(basket) are shown in Table 1.

The table shows that the MLN has learned that one of the important features of the task is that the agent has to "go to the basket that is of the same color as the object it is carrying". In addition, the MLN has also learned aspects of the system dynamics, such as "a basket is not an object" and "the robot cannot be near an object that is already in a basket".

Weight      Formula
17.015      ¬isBasket(a1) ∨ ¬isObject(a1)
16.4008     botat(a1)
-15.5892    ¬Carrying(a1) ∨ ¬colorequal(a1,a2) ∨ ¬Goto(a2)
7.9959      isObject(a1)
8.28407     isBasket(a1)
-8.25381    Goto(a1)
-9.01076    colorequal(a1,a2)
10.1116     ¬inBasket(a1) ∨ ¬botat(a1)

Table 1: Some formulas learned and their weights.

    6 Results

In this section, we discuss the performance of our approach on the Sorter Domain.

All experiments involved a training period of six episodes, during which all actions performed were as instructed by a human. Our framework accumulates the training data and simultaneously updates its Q-Learners (one standard SMDP Q-Learner and one for the Π̂ΦIns model's reduced space). Note that we do not employ any kind of function approximator for representing value functions; the instruction models, however, serve as policy approximators. The standard SMDP Q-Learner does not use any relational features. We refer to our approach as the Instruction Framework (IF) henceforth.

    6.1 Experiment 1

In this experiment, we compare the performance of IF with standard SMDP Q-Learning. The purpose of this experiment is to gauge the generalization ability of IF. The experiment is split into blocks of 200 episodes. The learners are initialized with the same starting state for each of the 200 episodes. At the end of every block, one of the colors is replaced with a new one, resulting in a different set of states for the learners.

The SMDP Q-Learner's performance drops drastically (Fig 2) whereas IF generalizes very well. This is due to the introduction of new states for which the SMDP Q-Learner has to learn a policy from scratch. IF generalizes well due to the availability of the instruction models. Although there are no human inputs after the training phase, the performance of IF improves as the experiment proceeds. This is due to the increasing confidence of IF's Q-learner: as more states are visited, the Q-learner's estimates of the actions' returns improve, and control slowly shifts from the instruction models to the Q-learner. Since the training examples were insufficient, the instruction models learned were incomplete, resulting in low confidence at certain states. By combining the Q-learner with these models we overcome this insufficiency in the training data.

Figure 2: Comparison of IF with SMDP Q-Learning

Figure 3: Comparison of Π̂πIns, Π̂ΦIns and IF.

    6.2 Experiment 2

This experiment demonstrates the importance of interpreting instructions properly and how IF helps by choosing between possible interpretations.

In this experiment we use three different learners: one that interprets instructions as π-Instructions alone (A), one that interprets them as Φ-Instructions alone (B), and one that interprets them as both (IF). They are plotted as "Pi+Q", "Phi+Q" and "Pi+Phi+Q" respectively in Fig 3. Each of these learners is provided with a random starting state for each episode, i.e., the colors of the baskets and objects are chosen randomly while ensuring that a solution exists. All three learners are given the same set of instructions during the training phase.

B performs the worst, suggesting that interpreting the instructions as Φ-Ins was not entirely beneficial in this domain. A performs much better than B, implying that interpreting the instructions as π-Ins was overall more beneficial in this domain. The performance of IF is very similar to that of A, implying that our approach selects the better interpretation. It should be noted that IF chooses an interpretation at each state independently of its choices at other states.

    6.3 Importance of Confidence Measures

On carefully analyzing the working of IF, we noticed that although the combined framework seems to overcome the weaker interpretation, it does not completely eliminate it. There are certain states where IF chooses the weaker interpretation.

Figure 4: Comparison of the percentage of times the action recommended by the Π̂πIns, Π̂ΦIns and SMDP Q-Learner models is taken. The proposed framework's performance (number of steps taken to solve the task) is also plotted. The plots have been Bezier-smoothed for visibility; the trends are evident in the unsmoothed data.

In Fig 4, we have plotted the performance of IF (in terms of episode length, i.e., the number of steps taken to solve the task) and the fraction of states in which each of the models Π̂πIns, Π̂ΦIns and Π̂Q is preferred. As learning progresses, although the performance of IF improves, the fraction of times Π̂ΦIns is preferred increases slightly. Although the Sorter Domain tasks can be solved in 11 steps, IF solves them in 13 (at the end of 1000 episodes). We suspect that this loss in performance is due to preferring Π̂ΦIns at a few states. One possible reason for this could be the confidence measure that we are using: the current measure is naive and does not capture the confidence of the instruction models effectively. We are currently testing our approach with better confidence measures and have some initial results showing that performance improves with them.

We have also been running a few simple experiments with noisy instruction models. We notice that, with better confidence measures, our approach seems to be able to overcome the incorrect models quite well, i.e., it retains the ability to unlearn incorrect instruction models. These observations are from initial experiments only and are not presented here.

    7 Conclusion

We have shown that handling multiple interpretations is advantageous and have proposed a novel framework that handles this effectively. We have tested the proposed framework on a reasonably challenging domain, highlighting the generalization capabilities achieved by using MLNs and the importance of properly interpreting instructions.

As mentioned earlier, a future direction of research would be to use better measures of confidence and to integrate the earned returns into the confidence calculation.

    References

[1] P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of ICML, 2004.

[2] B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469-483, 2009.

[3] Steven J. Bradtke and Michael O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In NIPS, pages 393-400, 1994.

[4] Steven J. Bradtke and Michael O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Proceedings of NIPS, 1994.

[5] S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of ACL '09, pages 82-90, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[6] Mark Brodie and Gerald DeJong. Learning to ride a bicycle using iterated phantom induction. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99), pages 57-66, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[7] S. Calinon. Robot Programming by Demonstration: A Probabilistic Approach. EPFL Press, 2009.

[8] David Chapman. Vision, Instruction, and Action. MIT Press, 1991.

[9] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical AI. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI '06), Volume 1.

[10] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT Press, 2007.

[11] Kristian Kersting and Kurt Driessens. Non-parametric policy gradients: A unified treatment of propositional and relational domains. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 456-463, New York, NY, USA, 2008. ACM.

[12] R. Khardon. Learning action strategies for planning domains. Artificial Intelligence, 113:125-148, 1999.

[13] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, D. Lowd, and P. Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington.

[14] Pradyot Korupolu V N, Manimaran Sivamurugan S, and Balaraman Ravindran. Instructing a reinforcement learner. In Proceedings of the 25th FLAIRS Conference (FLAIRS '12).

[15] H. Lieberman. Programming by example (introduction). Communications of the ACM, 43:72-74, 2000.

[16] Sriraam Natarajan, Saket Joshi, Prasad Tadepalli, Kristian Kersting, and Jude W. Shavlik. Imitation learning in relational domains: A functional-gradient boosting approach. In Proceedings of IJCAI '11.

[17] G. Neu and C. Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of UAI, pages 295-302, 2007.

[18] A. Ng and S. Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.

[19] Martijn van Otterlo. A survey of reinforcement learning in relational domains. Technical report, CTIT Technical Report Series, 2005.

[20] Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In Proceedings of the ICML '04 Workshop on Relational Reinforcement Learning, 2004.

[21] N. Ratliff, A. Bagnell, and M. Zinkevich. Maximum margin planning. In ICML, 2006.

[22] C. Sammut, S. Hurst, D. Kedzier, and D. Michie. Learning to fly. In ICML, 1992.

[23] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. In Proceedings of UAI '05, pages 509-517, 2005.

[24] A. Segre and G. DeJong. Explanation-based manipulator learning: Acquisition of planning ability through observation. In Conference on Robotics and Automation, 1985.

[25] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[26] U. Syed and R. Schapire. A game-theoretic approach to apprenticeship learning. In NIPS, 2007.

[27] Gerald Tesauro. Temporal difference learning of backgammon strategy. In ML, pages 451-457, 1992.

[28] A. L. Thomaz, G. Hoffman, and C. Breazeal. Reinforcement learning with human teachers: Understanding how people want to teach robots. In The 15th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2006), pages 352-357, September 2006.

[29] Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, and Trevor Walker. Transfer in reinforcement learning via Markov logic networks. In Proceedings of the AAAI Workshop on Transfer Learning for Complex Tasks, 2008.

[30] Chenggang Wang, Saket Joshi, and Roni Khardon. First order decision diagrams for relational MDPs. JAIR, 31(1):431-472, March 2008.

[31] Weiwei Wang, Yang Gao, Xingguo Chen, and Shen Ge. Reinforcement learning with Markov logic networks. In Proceedings of the European Workshop on Reinforcement Learning, 2008.

[32] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 1992.

[33] Bernard P. Zeigler. Toward a formal theory of modeling and simulation: Structure preserving morphisms. J. ACM, 19, 1972.