
Mapping the Design Space of Reinforcement Learning Problems – a Case Study

Katharina Tluk v. Toschanowitz
Neuroinformatics Group, University of Bielefeld

Barbara Hammer
AG LNM, University of …

Helge Ritter
Neuroinformatics Group, University of Bielefeld

Abstract

This paper reports on a case study motivated by a typical reinforcement learning problem in robotics: an overall goal which decomposes into several subgoals has to be reached in a large, discrete state space. For simplicity, we model this problem in a standard gridworld setting and perform an extensive comparison of different parameter and design choices. Throughout, we focus on the central role of the representation of the state space. We examine three fundamentally different representations with counterparts in "real-life" robotics. We investigate their behaviour with respect to (i) the size and properties of the state space, (ii) different exploration strategies including the recent proposal of multi-step actions and (iii) the type and parameters of the reward function.

1 Introduction

Reinforcement learning (RL) provides an elegant framework for modeling biological and technical reward-based learning [7, 18]. In contrast to supervised learning, which uses explicit teacher information, RL algorithms only require a scalar reinforcement signal. They are therefore ideally suited for complex real-world applications like self-exploring robots where the optimal solution is not known beforehand. Numerous applications of RL to various complex tasks in the area of robotics over the past years demonstrate this, including applications to robot soccer and the synthesis of non-trivial motor behaviour like standing up [1, 9, 12, 17]. The experimental results can be put on a mathematical foundation by linking these learning strategies to dynamic programming paradigms [3].

The construction of reinforcement learning algorithms and their application to real-world problems involves many non-trivial design choices, ranging from the representation of the problem itself to the choice of the optimal parameter values. Since real-world problems – especially those encountered in human-inspired robotic systems – tend to be very complex and high dimensional, these design choices are crucial for the successful operation and fast performance of the learning algorithm.


RL is well-founded and guaranteed to converge if the underlying observation space is discrete and full information about the process, which must fulfil the Markov property, is available [3]. Real-world processes, however, usually rely on very large or even continuous state spaces, so that full exploration is no longer feasible. As an example, consider the task of learning different grasping strategies with a human-like pneumatic five-fingered hand, which will be our ultimate target application for RL: this process is characterised by twelve degrees of freedom. The question we have to answer is how to explore and map this high-dimensional space of possibilities in order to create and apply an efficient RL system to learn optimal control strategies. On the one hand, we could use function approximation such as neural networks to deal directly with the real-valued state space (see e.g. [6, 14]). In that case, the convergence of the standard RL algorithms is no longer guaranteed and the learning process might diverge [2, 3]. On the other hand, we could approximate the process by a finite number of discrete values and learn a strategy in terms of the discrete state space. However, even a small number of intervals per variable yields a high-dimensional state space. Therefore, additional information must be incorporated to shape the state space and the search process. This information might include the decomposition of the task into subgoals (for grasping, possible subgoals could be reaching a contact point or the sequential closing of the fingers around the object) or a specific exploration strategy (e.g. imitating a human). In addition, prior knowledge can be used to map the state space into lower dimensions by ignoring information which is irrelevant for an optimal strategy: in robot grasping, the exact position of an object is irrelevant, whereas the direction in which the effector should move would be sufficient information to find an optimal strategy.

In this paper, we are interested in the possibilities of shaping a high-dimensional discrete state space for reinforcement learning. Our main focus lies on the particularly crucial choice of the representation of the state space. In a simple case study – an artificial gridworld problem loosely motivated by the aforementioned task of robot grasping – we examine the effects of three different representation schemes on the efficiency and convergence properties of RL. Note that a compressed state space representation introduces perceptual aliasing [23]: during mapping, states of the original problem become indistinguishable and the problem loses its Markov property. Consequently, the state space representation has a considerable effect on learning: on the one hand, the structure of the state space affects the exploration of the learner and might thus require a specific exploration strategy for optimal convergence. On the other hand, the learned control strategy is formulated in terms of the current state space, with the result that the choice of its representation determines the existence, uniqueness, and generalisation abilities of the resulting strategy. During our experiments, we put a special focus on the following aspects of the problem: different modes of representation of the state-action space are investigated with respect to (i) their size and specific properties, (ii) exploration versus exploitation, in particular speeding up the exploration by incorporating multi-step actions, and (iii) the choice of the reward function, in particular the incorporation of subgoals. A quite compact representation in combination with advanced exploration strategies will allow us to achieve robust control strategies which are also potential candidates for alternative settings.

We first introduce the general scenario and briefly recall Q-learning. Afterwards, we introduce and discuss three different representation schemes for the state space. We investigate the behaviour of these representations under different exploration strategies and reward types. We conclude with a discussion and a set of open questions for further research.

2 The Scenario

Our scenario is a simple artificial gridworld problem which can be seen as a slightly modified version of the rooms gridworld [19] with just one room and several (sub-)goal states that have to be reached in a certain order. The state space consists of a two-dimensional grid with n distinct goal states {g_1, . . . , g_n}. The objective of the agent is to successively reach all of the goal states in the specified order 1, 2, . . . , n with a minimum total number of steps.

Page 3: Katharina Tluk v. Toschanowitz, Barbara Hammer and Helge Ritter- Mapping the Design Space of Reinforcement Learning Problems – a Case Study

Figure 1. Left: a 10×10 grid with one actor and a set of goal states that have to be reached in a certain order. The agent can perform four different movements. Right: the ad hoc representation using the (x,y)-values (3,3) in the compressed version of the state space (see text), the distance representation (3,3,2) and the direction representation (right, below, left). The grid shown here is smaller than the one used for the experiments.

If the agent reaches the final goal state g_n after having visited the goal states g_i in the correct order, the attempt is counted as a success; otherwise it is regarded as a failure. The possible actions are the primitive deterministic one-square movements within the von Neumann neighbourhood (see figure 1 (left)).
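To make the transition dynamics concrete, the following minimal Python sketch models the scenario described above. It is an illustration only, not the authors' implementation: the class and function names are invented, the goals are placed randomly, and periodic boundary conditions are assumed here because they are used in the experiments of section 3.

```python
import random

# Hypothetical model of the scenario: an agent on a dx x dy grid must visit
# the goal states g_1, ..., g_n in the given order.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

class OrderedGoalsGridworld:
    def __init__(self, dx=10, dy=10, n_goals=3, seed=0):
        rng = random.Random(seed)
        self.dx, self.dy = dx, dy
        cells = [(x, y) for x in range(dx) for y in range(dy)]
        picks = rng.sample(cells, n_goals + 1)
        self.start, self.goals = picks[0], picks[1:]
        self.reset()

    def reset(self):
        self.pos = self.start
        self.reached = 0                     # success counter s: goals visited so far
        return self.pos, self.reached

    def step(self, action):
        mx, my = ACTIONS[action]
        # periodic boundary conditions (torus), as used in the experiments of section 3
        self.pos = ((self.pos[0] + mx) % self.dx, (self.pos[1] + my) % self.dy)
        if self.pos == self.goals[self.reached]:
            self.reached += 1                # only the next goal in the order counts
        done = self.reached == len(self.goals)
        return self.pos, self.reached, done
```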

The selection of this scenario as a testing problem was mainly motivated by two important aspects: on the one hand, it is a simple scenario, so that we can focus on the aspect of representation and avoid additional difficulties like noise or imprecision of the computation. On the other hand, this setting already incorporates important aspects of real-life applications like grasping with a five-fingered robotic hand: we can scale the problem by increasing the grid size, which mirrors the possibility of discretising real-valued scenarios using differently sized meshes. The task of sequentially visiting the states g_1 to g_n causes a high dimensionality and complexity of the problem. In addition, this decomposition into several subgoals loosely mimics various sequential aspects of grasping such as the successive closing of the different fingers around an object or the transitions between the different phases of a grasping attempt, e.g. the closing behaviour of the fingers before and after the first contact with the object [13, 16].

We set the immediate reward to −0.1 for each step that does not end on one of the goal states, 1 for reaching one of the (sub-)goal states {g_1, . . . , g_{n−1}} in the correct order, and 10 for reaching the final goal state g_n. The aim is to learn a strategy which maximises the overall discounted reward at time point t: Σ_i γ^i r_{t+i}, where r_{t+i} is the reward at time t + i and γ < 1 is the discount factor. Standard one-step Q-learning [20, 21] finds an optimal strategy by means of the following iterative update:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where s_t and a_t are the state and action at time t, α is the learning rate and γ is the discount factor. In the limit, Q estimates the expected overall reward when taking action a_t in state s_t. An optimal strategy always chooses action a_t = argmax_a Q(s_t, a). Q-learning is guaranteed to converge in a Markovian setting if all state-action pairs are adapted infinitely often with an appropriate learning rate [3], but it is unclear what happens in scenarios that do not possess the Markov property.
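As a minimal illustration of this update rule, the sketch below implements tabular one-step Q-learning with the reward values stated above and the parameters used in section 3 (α = 0.1, γ = 0.9). The dictionary-based Q-table and the helper names are assumptions made for illustration, not the authors' code.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1            # discount factor and learning rate from section 3
Q = defaultdict(float)             # tabular Q, indexed by (state, action), zero-initialised

def reward(reached_before, reached_after, n_goals):
    """Immediate reward as defined above: -0.1 per ordinary step, 1 for a sub-goal, 10 for the final goal."""
    if reached_after == reached_before:
        return -0.1
    return 10.0 if reached_after == n_goals else 1.0

def q_update(s, a, r, s_next, actions):
    """One-step Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```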

3 Experiments and Results

At the beginning of the first learning trial, the starting state and the n distinct goal states are selected randomly. The agent is then positioned in the starting state and starts choosing and performing actions, thereby changing its position in the state space. The trial terminates when the agent reaches the final goal state after having visited all the (sub-)goal states in the correct order, or when the total number of steps exceeds a specified maximum value. After one success (or failure due to the step limit), the agent starts a new trial with the same starting point and the same goal states, but it remembers the already partially adapted Q-matrix from the last trial.


This procedure is repeated 100 times with the same starting point and goal states and a successively enhanced Q-matrix, so that the performance of the agent typically improves drastically with the increasing number of trials (see figure 2).

We use n = 3 goal states in a 10 × 10 grid and a maximum number of 10,000 steps per trial as default values for all further experiments. In addition, we use periodic boundary conditions to avoid any impact of edge effects on our results. Unless otherwise mentioned, the action selection is performed using an ε-greedy strategy with ε = 0.1. We use a discount rate of γ = 0.9, a learning rate of α = 0.1 and a zero-initialised Q-matrix. After 100 trials with one set of goal states and starting point and one Q-matrix, a new set of starting conditions is chosen randomly, the Q-matrix is initialised with zeroes and the agent begins another 100-trial period. In total, 500 trial periods of 100 trials each are performed for each parameter set. As an evaluation measure, we report the average number of steps needed to reach the final goal state over the course of 100 trials.
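Putting the pieces together, the following sketch outlines this experimental protocol, reusing the hypothetical environment and Q-learning helpers from the sketches above; the parameter values are those stated in the text, while the loop structure and names are illustrative assumptions.

```python
import random

EPSILON, MAX_STEPS, TRIALS, PERIODS = 0.1, 10_000, 100, 500

def epsilon_greedy(state, actions, rng):
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def run_period(env, actions, rng):
    """100 trials with the same start and goals, reusing the Q-matrix between trials."""
    steps_per_trial = []
    for _ in range(TRIALS):
        (pos, reached), done, steps = env.reset(), False, 0
        while not done and steps < MAX_STEPS:
            s = (pos, reached)                       # state under the ad hoc representation
            a = epsilon_greedy(s, actions, rng)
            pos, reached_new, done = env.step(a)
            r = reward(reached, reached_new, len(env.goals))
            q_update(s, a, r, (pos, reached_new), actions)
            reached, steps = reached_new, steps + 1
        steps_per_trial.append(steps)
    return steps_per_trial

# Full experiment: 500 periods, each with fresh goals and a zero-initialised Q-matrix.
rng = random.Random(0)
for period in range(PERIODS):
    Q.clear()
    env = OrderedGoalsGridworld(seed=period)
    trial_lengths = run_period(env, list(ACTIONS), rng)
```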

3.1 Representation of the State-Action Space

Most of the scenarios where reinforcement learning is applied – and especially those in robotics – have an underlying continuous state-action space rather than the grid-based state space of the current example. If these problems are to be solved with an algorithm working on a discrete state-action space, like simple Q-learning, the discrete representation of the state-action space plays a pivotal role for the success and the speed of the learning algorithm. One of the main objectives when choosing a representation is a small resulting state-action space, in order to reduce the necessary amount of exploration and thus to shorten the learning time. In addition, the algorithm should be able to generalise: the learned optimal policy should be applicable to a range of different settings, including a scaled version of the basic problem. The main characteristics of the three representation possibilities discussed in this section are summarised in table 1.

A direct representation of our scenario is achieved through the use of the x- and y-coordinates of the agent and of each of the n goal states plus a success counter s ranging from 0 to n. Since this yields a deterministic Markovian scenario, the convergence of the single-step Q-learning algorithm is guaranteed for appropriate parameter values [3]. A disadvantage of this representation is the exponential increase of the size of the state space with the number of subgoals n to be visited. This size can be reduced considerably if the x- and y-coordinates of the n goal states are omitted and just the x- and y-coordinates of the agent and the success counter s are used (see figure 1 (right)). This representation still provides a unique characterisation of each state if one fixed setting is considered. However, the resulting strategy generalises neither to different positions of the subgoals nor to different grid sizes. We refer to this setting, a characterisation of the state by the x- and y-coordinates of the agent and a counter for the already reached subgoals, as the ad hoc representation. It might be interesting, e.g., for modelling the situation where a fixed grasp of one fixed object in a well-defined position is to be learned.

As an alternative, we consider representations that generalise to different positions of the subgoals or even to different grid sizes. Naturally, they should result in a small state space comparable to the ad hoc representation. An interesting representation in the context of robotics is to maintain the distances of the agent to all (sub-)goals which are still to be visited and a success counter (see figure 1 (right)).

             situation      perceptual    implemen-    size of the    generalisation
             information    aliasing      tability     state space    capability

ad hoc       ++             0             -            ++             0
distance     +              +             0            0              +
direction    0              ++            +            +              ++

Table 1. Comparison of the three different representations, including the implementability on a real robot (not discussed in the text).


                     ad hoc            distance                      direction
scaling behaviour    dx · dy · n       (⌊dx/2⌋ + ⌊dy/2⌋)^n · n       5^n · n
grid size            10×10    20×20    10×10      20×20              any

n = 1                100      400      10         20                 5
n = 2                200      800      200        800                50
n = 3                300      1200     3,000      24,000             375
n = 4                400      1600     40,000     640,000            2,500
n = 5                500      2000     500,000    16,000,000         15,625

Table 2. The dimension of the state space for the different representations. n is the number of goal states to be visited, dx × dy is the dimension of the grid. The state space for an ad hoc representation including the positions of all n subgoals scales with (dx · dy)^n × n.

This encoding mirrors the assumption that a robot only needs to know whether it moves closer to or further away from a goal. A representation in terms of distances can be expected to generalise to different goal positions within the grid and also, partially, to differently sized grids. The state space scales exponentially with the number of states to be visited; however, it is considerably smaller than a direct representation of all goal positions and, in addition, it can be expected to offer better generalisation capabilities. Unfortunately, this representation does not differentiate between all the states of the underlying system and thus need not yield a Markovian process. In addition, it is unclear whether an optimal strategy can be formulated in terms of this representation: if the current position of the agent is (5,5), the different goal positions (2,3) and (9,6) result in the same distance from the agent to the goal (5) but require different moves.

A third representation that conserves only a small portion of the original information is the use of the directions from the agent to each of the n goal states plus a success counter s (see figure 1 (right)). Speaking in terms of robotics, the agent only needs to know the next grasping direction in order to find a successful strategy. For simplification, the number of different directions is limited to five (above, below, right, left, "right here") in the current scenario. The direction representation results in a small state space which is independent of the underlying grid size. It can be expected to facilitate generalisation not only between scenarios with the same grid size but also between differently sized grids. Another advantage is the fact that an optimal strategy can clearly be formulated in terms of this state space ('move in the direction of the current goal until you reach it'). However, this representation introduces severe perceptual aliasing and need not maintain the Markov property of the process. If the current position of the agent is (5,5), for example, the different goal positions (7,7) and (8,3) result in the same direction from the agent to the goal (right) and are thus indistinguishable.
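To illustrate the three encodings side by side, the sketch below maps a full underlying state (agent position, goal positions, success counter) to each representation. It is a hedged reconstruction: the paper does not specify how distances and directions are computed on the periodic grid, so wrapped Manhattan distances and a dominant-axis rule for the five directions are assumptions made here for illustration.

```python
def wrapped_delta(a, b, size):
    """Signed shortest displacement from a to b along one periodic axis."""
    d = (b - a) % size
    return d - size if d > size // 2 else d

def ad_hoc_state(agent, goals, reached, dx, dy):
    # agent coordinates plus success counter (goal positions omitted)
    return (agent[0], agent[1], reached)

def distance_state(agent, goals, reached, dx, dy):
    # distances to all goals still to be visited, plus success counter
    dists = tuple(abs(wrapped_delta(agent[0], g[0], dx)) +
                  abs(wrapped_delta(agent[1], g[1], dy))
                  for g in goals[reached:])
    return dists + (reached,)

def direction_state(agent, goals, reached, dx, dy):
    # one of five directions (above, below, left, right, "right here") per goal
    def direction(g):
        ddx = wrapped_delta(agent[0], g[0], dx)
        ddy = wrapped_delta(agent[1], g[1], dy)
        if ddx == 0 and ddy == 0:
            return "right here"
        if abs(ddx) >= abs(ddy):                 # dominant axis decides
            return "right" if ddx > 0 else "left"
        return "below" if ddy > 0 else "above"
    return tuple(direction(g) for g in goals) + (reached,)
```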

As mentioned previously, a central aspect of the different representations that has a great influence on the learning speed and behaviour of the algorithm is the size of the state space, which is given in table 2. As seen in this section, the use of the direction and the distance representations can cause problems because the resulting state space is only partially observable. This phenomenon is known as perceptual aliasing [22, 23]. Consequently, the question arises whether the optimal strategy can be found in these representations using the classical learning algorithm, whose convergence is only guaranteed when the Markov property holds.
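The scaling formulas printed in table 2 can be reproduced directly; the following short sketch (with an invented function name) computes the state-space sizes for the three representations and checks two of the table entries.

```python
def state_space_size(representation, dx, dy, n):
    """State-space sizes according to the scaling formulas of table 2."""
    if representation == "ad hoc":        # agent position times success counter
        return dx * dy * n
    if representation == "distance":      # one distance value per goal, plus counter
        return (dx // 2 + dy // 2) ** n * n
    if representation == "direction":     # five possible directions per goal, plus counter
        return 5 ** n * n
    raise ValueError(representation)

assert state_space_size("distance", 10, 10, 3) == 3_000
assert state_space_size("direction", 20, 20, 5) == 15_625
```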

             convergence speed    solution quality    need for exploration

ad hoc       ++                   ++                  +
distance     0                    0                   ++
direction    ++                   ++                  ++

Table 3. Results for the different representations.


Figure 2. First results: two exemplary learning curves using the ad hoc (left) and the direction (right) representation. A linear decay of ε with the parameter k is used (see section 3.2). Depending on the choice of parameter values, the direction representation either leads to a slower convergence of the learning algorithm or to a high number of outliers (trials with a very high number of steps until the final goal is reached).

3.1.1 First Results

Table 3 provides a condensed "map" of the overall characteristics of the different experiments performed so far. These qualitative results mirror the quantitative behaviour of the algorithm. As expected, the distance representation needs a long learning time, which is mainly due to the size of the state space, which is about 10 times larger than in the other two representations (see table 2). In addition, it did not even succeed in learning the optimal policy in every set of trials, which supports our suspicion that an optimal policy cannot be represented in terms of this reduced state-action space. We therefore discarded this representation and use only the ad hoc and the direction representation during our further, more detailed, experiments.

The (simplified) ad hoc representation and the direction representation both produce good results. Using the ad hoc representation, the learning of a near-optimal policy usually took place within about 40 trials and the resulting policy was very close to the global optimum (see figure 2). Depending on the choice of the learning parameters, the direction representation either needed a greater number of trials to learn an optimal policy or the learning speed was almost the same but with a greater number of outliers (trials with a very large number of steps until success or even with no success at all, see figure 2). This effect can be explained by perceptual aliasing and the resulting special structure of the Q-matrix: since many different state-action pairs are mapped to the same entry, the algorithm needs a greater amount of random exploration until it has reached the goal state often enough to backpropagate its positive value. However, since the direction representation offers much more potential regarding generalisation ability (as explained in section 3.1), it is certainly worth further investigation even if the first results seem to show some advantages of the (simplified) ad hoc representation.

3.2 Exploration vs. Exploitation

As seen in the last section, exploration is extremely important for the performance of the learning algorithm due to the phenomenon of perceptual aliasing. During learning, the amount of exploration and exploitation has to be carefully balanced. Since the learning algorithm does not know anything about its surroundings at the beginning, the first goal must be to explore a high number of possible state-action combinations in order to gather as much information as possible about the system. Once a sufficient amount of information has been accumulated, it is more advantageous to at least partly exploit this information and to restrict the exploration to those areas of the state-action space that seem to be relevant for the solution of the current problem; this is known as the exploration-exploitation dilemma (see e.g. [18]).


Figure 3. Exploration results: ad hoc (left) and direction representation (right)


One way of achieving the desired behaviour is to use an ε-greedy strategy: the best action according to the current Q-matrix is chosen with probability (1 − ε), and a random action is chosen with probability ε. We experimented with several different ε-greedy strategies with a decaying ε, where τ ∈ {0, 1, . . . , 100} denotes the number of the current trial: (i) a linear decrease with decay factor k ∈ [0.01, 0.1]: ε = max(0.9 − k · τ, 0), (ii) an exponential decay with parameter l ∈ [1, 10]: ε = 0.9 · l^(−τ), and (iii) a sigmoidal decay with p, q ∈ N: ε = 1/(1 + e^(−(q−τ)/p)).
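The three decay schedules translate directly into code; the sketch below is a plain transcription of the formulas above (the default values for p and q are illustrative, as the paper does not report which values were tested).

```python
import math

def epsilon_linear(tau, k=0.03):
    """Linear decrease: eps = max(0.9 - k*tau, 0)."""
    return max(0.9 - k * tau, 0.0)

def epsilon_exponential(tau, l=2.0):
    """Exponential decay: eps = 0.9 * l**(-tau)."""
    return 0.9 * l ** (-tau)

def epsilon_sigmoidal(tau, p=5, q=50):
    """Sigmoidal decay: eps = 1 / (1 + exp(-(q - tau) / p))."""
    return 1.0 / (1.0 + math.exp(-(q - tau) / p))
```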

3.2.1 Results

During the following experiments, we first determined the optimal parameter values for the decay of ε and afterwards compared the behaviour of the different scenarios using these optimal values. The use of sigmoidal decay generally resulted in considerably slower learning than the other exploration-exploitation strategies. Consequently, this approach was soon abandoned and all further experiments focused on the linear and the exponential decay of ε. In the ad hoc representation scenario, k = 0.03 was optimal for a linear decay of ε and l = 2 was optimal for an exponential decay (see figure 3); the latter shows a somewhat faster learning behaviour. In the direction representation scenario, k = 0.02 was optimal for a linear decay and l = 1.05 was optimal for an exponential decay; the latter again shows a slightly faster learning behaviour. A comparison of the optimal parameters shows a much greater need for exploration in the direction representation than in the ad hoc representation, which tallies with the results from section 3.1.1. This is caused by the high amount of perceptual aliasing and the resulting special structure of the Q-matrix. The amount of exploration therefore proved to be crucial for the performance of the learning algorithm.

3.3 Speeding up Exploration

In order to speed up the exploration process, we experimented with multi-step actions (MSAs) of a fixed length [15]. An MSA corresponds to p successive executions of the same primitive action. The motivation behind MSAs is the fact that the immediate repetition of the same action is often part of an optimal strategy in many technical or biological processes. In grasping, for example, it is reasonable to move the gripper more than one step in the same direction before making a new decision, especially if the object is still far away. Thus, the incorporation of MSAs shapes the search space in these settings so that promising action sequences are explored first. In our case, a set of MSAs consists of four multi-step actions corresponding to the four primitive actions. The parameters that need to be chosen are the number m of MSA sets to be added to the regular action set and the repeat count p_i for each of the sets.


Figure 4. Speeding up exploration via multi-step actions: in addition to the four primitive actions, there are now four multi-step actions of length 3 in the action set. The ad hoc representation is shown on the left, the direction representation on the right.

The use of MSAs requires an expansion of the reinforcement function: the value of a multi-step action is set to the total (properly discounted) value of the respective single-step actions, r_{MSA_i} = Σ_{j=0}^{p_i} γ^j · r_i, where r_i is the reward for the primitive action i. In addition, we used the MSA-Q-learning algorithm [15], which propagates the rewards received for performing an MSA to the corresponding primitive actions in order to best exploit the gained information.
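As a sketch of how such a multi-step action could be executed and valued, the snippet below repeats one primitive action p times and accumulates the discounted return; it reuses the hypothetical environment and reward helper from the earlier sketches and does not reproduce the MSA-Q-learning backup to primitive actions described in [15].

```python
def msa_return(primitive_rewards, gamma=0.9):
    """Value of a multi-step action: discounted sum of its single-step rewards."""
    return sum(gamma ** j * r for j, r in enumerate(primitive_rewards))

def execute_msa(env, action, p=3):
    """Execute the same primitive action p times in a row, collecting one reward per step."""
    rewards, done = [], False
    for _ in range(p):
        before = env.reached
        pos, reached, done = env.step(action)
        rewards.append(reward(before, reached, len(env.goals)))
        if done:                                  # stop early once the final goal is reached
            break
    return (pos, reached), rewards, done
```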

3.3.1 Results

The best results were achieved by adding just one set of MSAs of length 3 to the action set. As in section 3.2.1, we first determined the optimal parameter values for the decay of ε within each scenario. The general shape of the resulting learning curves is very similar to the one seen in section 3.2.1, but there are two notable differences (see figure 4): firstly, the speed of learning has improved, so that the algorithm needs about 10 trials fewer to find a near-optimal policy. Secondly, the optimal parameters for the decay of ε have changed drastically: from k = 0.03 (linear decay) and l = 2 (exponential decay) to k = 0.07 and l = 10 in the ad hoc representation, and from k = 0.02 and l = 1.05 to k = 0.05 and l = 5 in the direction representation.

This shows that the algorithm now needs a substantially smaller amount of random exploration to learn the optimal policy, because the exploration is biased into promising directions by the use of MSAs. This effect is especially pronounced in the direction representation, which concurs with the results from sections 3.1.1 and 3.2.1. Consequently, the use of multi-step actions to speed up exploration can be seen as a good first step in bringing the reinforcement learning algorithm closer to applicability in real-world problems where exploration is expensive.

3.4 Choice of Rewards

A second important aspect we investigated is the optimal choice of the reward function. In the default scenario, we gave a reward of 1 for reaching a sub-goal state, a reward of 10 for reaching the final goal state and a reward of -0.1 for all other steps (called many rewards in this section). We tested two additional scenarios: one with just the rewards for the sub-goal states and the final goal state but no negative reward (called each success), and one where the reward is given only for reaching the final goal state (called end reward). The results can be seen in figure 5.
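A compact way to express the three reward schemes is a single reward function switching on the variant; the sketch below (with invented names, following the helper conventions of the earlier sketches) encodes the values given above.

```python
def immediate_reward(variant, reached_before, reached_after, n_goals):
    """The three reward schemes compared in section 3.4."""
    hit_subgoal = reached_after > reached_before
    hit_final = hit_subgoal and reached_after == n_goals
    if variant == "many rewards":        # default: -0.1 per step, 1 per sub-goal, 10 at the end
        if hit_final:
            return 10.0
        return 1.0 if hit_subgoal else -0.1
    if variant == "each success":        # sub-goal and final rewards only, no step penalty
        if hit_final:
            return 10.0
        return 1.0 if hit_subgoal else 0.0
    if variant == "end reward":          # only the final goal state is rewarded
        return 10.0 if hit_final else 0.0
    raise ValueError(variant)
```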

As expected, the strategy each success does not encourage short solutions and thus yields only suboptimal strategies. However, since no negative rewards are added to the Q-values, the Q-matrix and therefore also the graphs are very smooth.


Figure 5. Choice of rewards: ad hoc representation; linear decay of ε with k = 0.07 is shown on the left, exponential decay of ε with l = 10 on the right.

When using only end rewards, convergence takes twice as long as with many rewards using the ad hoc representation (see figure 5), because no positive feedback is given until the final goal state has been reached. Surprisingly, it eventually yields a nearly optimal strategy even though short solutions are not encouraged by punishing intermediate steps. This can be explained by the fact that, since exploration is so difficult for this task, only short solutions – which can be explored in a short time – survive. The many rewards approach offers the highest amount of information and thus leads to the fastest learning behaviour. This difference is even more pronounced if the compressed direction representation is used: Q-learning now does not succeed at all if combined with end rewards or each success. This is caused by the lack of sufficient structure of the state space, which inhibits successful exploration of the complex task without guidance by intermediate rewards. Consequently, the division of the task into subgoals and the punishment of a large step number turn out to be crucial for the direction representation.

3.5 Scaling Behaviour

An aspect that is of particular importance, especially with regard to a possibly continuous underlying state space, is the behaviour of the algorithm in a scaled version of the basic problem. We conduct several experiments with differently detailed discrete approximations of the same state space and measure the number of steps necessary for 100 successes (see table 4). We use only the primitive one-step actions and no multi-step actions in this case, because the latter would blur the differences between the different grid sizes. The step numbers are adjusted so that they refer to steps of equal length throughout the differently fine grids. As expected, a larger state space leads to a considerably higher step number due to the need for more exploration with an increasing grid size. There seems to be no significant difference between the ad hoc and the direction representation. Note, however, that the direction representation offers a huge advantage: the resulting Q-matrix is independent of the grid size. Thus, the Q-matrix learned in a coarse discretisation (e.g. 10×10) can be directly applied to a finer discretisation (e.g. 30×30). Further experiments to determine the different qualities of Q-matrices learned with differently sized grids and their applicability to different settings are currently being conducted.

             ad hoc (k = 0.03)                 direction (k = 0.02)
             n = 1     n = 2     n = 3         n = 1     n = 2     n = 3

10×10        2,623     5,736     8,969         3,429     7,559     11,473
20×20        8,793     20,607    31,491        7,366     23,610    33,953
30×30        17,394    36,852    55,858        15,304    29,689    60,324

Table 4. Number of steps needed for 100 successes using different grid sizes.


This property would allow us to design efficient iterative learning schemes for the direction representation, moving from a very coarse discretisation to a fine approximation of the continuous state space while saving a considerable amount of learning time.
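A sketch of why this transfer is possible under the direction representation: because the encoded states contain no grid coordinates, a Q-table trained on a coarse grid can be queried unchanged on a finer one. The example below (reusing the hypothetical environment, Q-table and direction_state helper from the earlier sketches) only illustrates the reuse; it is not the iterative scheme itself.

```python
def greedy_policy(Q, actions):
    """Wrap a learned Q-table as a greedy policy over direction states."""
    def policy(state):
        return max(actions, key=lambda a: Q[(state, a)])
    return policy

# The very same policy object can drive the agent on a coarse and on a fine grid,
# because direction states look identical regardless of the grid size.
coarse_env = OrderedGoalsGridworld(dx=10, dy=10, n_goals=3, seed=1)
fine_env = OrderedGoalsGridworld(dx=30, dy=30, n_goals=3, seed=1)
policy = greedy_policy(Q, list(ACTIONS))
state = direction_state(fine_env.pos, fine_env.goals, fine_env.reached, fine_env.dx, fine_env.dy)
action = policy(state)
```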

4 Discussion

In this paper, we investigated different design choices for a reinforcement learning application to an artificial gridworld problem which serves as an idea filter for real-life target applications. As demonstrated, different representations cause major differences with respect to the basic aspects of the problem, notably the size of the state space and the mathematical properties of the process due to perceptual aliasing; consequently, they show a different robustness and need for shaping and exploration. We demonstrated that exploration is particularly relevant if the state space representation hides a part of the structure of the problem, as the direction representation does. In this setting, advanced exploration strategies such as multi-step actions prove to be particularly valuable. In addition, the decomposition of the problem into subgoals is crucial.

Even in this simplified setting we found some remarkable and unexpected characteristics: the distance encoding of the state space is inferior to an encoding in terms of directions, even though the latter provides less information about the underlying state. Multi-step actions improve the convergence behaviour, but simple multi-step actions which repeat each action three times provide better performance than a larger action spectrum including MSAs of different repeat counts. The many rewards reinforcement function proved superior to the two alternative reinforcement signals given by end rewards or each success. However, unlike each success, end rewards converged to short solutions although no punishment of unnecessary steps was incorporated in this setting. These findings stress the importance of creating a map of simplified situations to allow a thorough investigation of reinforcement paradigms. The results found in these reduced settings can guide design choices and heuristics for more complex, real-life reinforcement scenarios.

An important issue closely connected to the scaling behaviour of learning algorithms is their generalisation ability. The design of the state space defines the representation of the value function and thus widely determines the generalisation ability of a reinforcement learner. So far, we have not yet explicitly addressed the generalisation ability of the various representations considered in this article. It can be expected that the direction representation facilitates generalisation to different settings within the same grid and, moreover, to grid structures of different sizes. The latter property is particularly striking if an underlying continuous problem is investigated. As demonstrated, smaller scenarios require less exploration, so that a good strategy can be learned in a short time for small grids. Appropriate generalisation behaviour would then allow us to immediately transfer the trained Q-matrix to larger grids. This could be combined with automatic grid adaptation strategies as proposed e.g. in [10]. Since a successful generalisation to larger grid spaces can only be expected if large parts of the Q-matrix have been updated, further reduction techniques which drop irrelevant attributes, as proposed e.g. in [4] for supervised learning, could also be valuable.

Another innovative possibility for achieving inherent generalisation is to expand the capacity of the action primitives. Instead of simple one-step moves, restricted though powerful actions which depend on the current perception of the agent could be introduced. Locally linear control models constitute particularly promising candidates for such a design because of their inherent generalisation capability combined with impressive, though linearly restricted, capacity [5, 8, 11]. So far we have tackled the above problem as an artificial setting which mirrors important structures of real-life problems such as differently sized grids, subgoals, and a large state space. The incorporation of powerful action primitives, however, turns this simple strategy learned by reinforcement learning into a powerful control strategy applicable to real-life scenarios like grasping. Local finger movements can be reliably controlled by basic actions such as locally linear controllers, where the applicability of a single controller is limited to a linear regime of the process. A global control of these local moves shares important characteristics with our setting: there is a comparably small number of possible basic actions which have to be coordinated.


Thus, our setting can be interpreted as a high-level reinforcement learner on top of basic independent actions. Consequently, a transfer of multi-step actions to this domain seems particularly promising: the iteration of the same action might correspond to the choice of the same local linear controller until its local application domain has been left. An interesting further direction within this context is the design of perception-dependent stop criteria for the durations of actions.

References

[1] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In Proc. 14th ICML, pages 12–20. Morgan Kaufmann, 1997.
[2] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proc. 12th ICML, pages 30–37. Morgan Kaufmann, 1995.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[4] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In M. Verleysen, editor, ESANN'2001, pages 271–276. D-facto publications, 2001.
[5] T. W. Cacciatore and S. J. Nowlan. Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in NIPS, volume 6, pages 719–726. Morgan Kaufmann Publishers, Inc., 1994.
[6] C. Gaskett, D. Wettergreen, and A. Zelinsky. Q-learning in continuous state and action spaces. In Australian Joint Conference on Artificial Intelligence, pages 417–428, 1999.
[7] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[8] Z. Kalmar, C. Szepesvari, and A. Lorincz. Module-based reinforcement learning: Experiments with a real robot. Machine Learning, 31(1–3):55–85, April 1997.
[9] J. Morimoto and K. Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36:37–51, 2001.
[10] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. Advances in Neural Information Processing Systems, 10:1036–1042, 1998.
[11] J. Randløv, A. G. Barto, and M. T. Rosenstein. Combining reinforcement learning with a local control algorithm. In Proc. 17th ICML, pages 775–782, 2000.
[12] M. Riedmiller, A. Merke, D. Meier, A. Hoffmann, A. Sinner, O. Thate, and R. Ehrmann. Karlsruhe Brainstormers — a reinforcement learning approach to robotic soccer. In P. Stone, T. Balch, and G. Kraetzschmar, editors, RoboCup-2000: Robot Soccer World Cup IV, pages 367–372. Springer, Berlin, 2001.
[13] H. Ritter, J. Steil, C. Nolker, F. Rothling, and P. McGuire. Neural architectures for robotic intelligence. Reviews in the Neurosciences, 14(1–2):121–143, 2003.
[14] J. C. Santamaría, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998.
[15] R. Schoknecht and M. Riedmiller. Reinforcement learning on explicitly specified time scales. Neural Computing & Applications Journal, 12(2):61–80, 2003.
[16] J. Steil, F. Rothling, R. Haschke, and H. Ritter. Learning issues in a multi-modal robot-instruction scenario. In Proc. IROS Workshop on "Robot Programming Through Demonstration", Oct 2003.
[17] P. Stone and R. S. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Proc. 18th ICML, pages 537–544, San Francisco, CA, 2001. Morgan Kaufmann.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
[19] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
[20] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.
[21] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
[22] T. Wengerek. Reinforcement-Lernen in der Robotik. PhD thesis, Techn. Fak., Universität Bielefeld, 1995.
[23] S. D. Whitehead and D. H. Ballard. Active perception and reinforcement learning. In Proc. 10th ICML, pages 179–188. Morgan Kaufmann, 1990.