
Evolution of Reinforcement Learning Algorithms
Krishna C Podila N V

ABSTRACT

This paper explains why agents need to learn and why reinforcement learning (RL) is the best way to do so. It outlines the issues posed to reinforcement learning and how they were tackled. The main aim of the paper is to show how reinforcement learning algorithms have evolved over time, and it discusses how several threats to traditional RL algorithms were decomposed and handled.

INTRODUCTION

Why learning?

Robots have traditionally carried out predefined tasks. Two major limitations of this approach are that every new task needs a new program and that the system is not robust to disturbances. Robots need to learn by themselves to overcome these problems.

What is learning?

Firstly, supervised learning:

Needs training data similar to the task, and the possible variations must be trained in advance.

Now, unsupervised learning:

Needs only to be told what the goal is. Learns its own way towards the goal. Adapts to changes (compliance). Needs very little, if any, human interaction.

Reinforcement learning is unsupervised in this sense.

Reinforcement Learning

RL lets agents learn from their own experience. Every move earns the agent a reward, and these rewards are generally scalar (e.g. -1, 0, 1). Mistakes punish the agent with a negative reward, while reaching goals rewards it positively.

State, Action, Reward, Policy are the terms we generally encounter.

State: the situation of the environment with respect to the agent.

Action: the immediate step taken by the agent.

Reward: the punishment or accolade received for an action.

Policy: the plan the agent follows.

State: Agent in place 29

Agent: Take left, move forward for 4 places

State: Hit the wall at 26, reward -1

Agent: Take left, move 6 places

State: Subgoal 3 reached at 86, reward +5

...

Fig. 1: An example state-action-reward exchange between environment and agent (Kaelbling, Littman, & Moore, 1996)

The states and actions described are interlinked: every state has a set of possible actions, and every action leads the agent to a new state. These links are captured by Markov Decision Processes.

Markov Decision Processes (MDP)

The algorithms we deal with in the rest of the paper assume that the state space is known either completely or partially, and that the effect of an action is known either with certainty or with some probability.

An MDP (Markov Decision Process) is a way to represent how states are connected through actions. An MDP incorporates rewards and transition probabilities for states and actions respectively (Sutton & Barto, 1998).
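As a rough illustration (not from the paper), a small MDP can be written down directly as data; the states S1, S2 and S6 echo the figure used later, while the actions, probabilities and rewards below are invented purely for illustration.

```python
# A toy MDP sketch. P[s][a] is a list of (next_state, probability) pairs,
# R[(s, a, s_next)] is the reward for that transition. All numbers are illustrative.
P = {
    "S1": {"a1": [("S2", 1.0)], "a2": [("S6", 0.8), ("S1", 0.2)]},
    "S2": {"a4": [("S6", 1.0)], "a5": [("S1", 1.0)]},
    "S6": {},  # terminal (goal) state with no further actions
}
R = {
    ("S1", "a1", "S2"): 0.0,
    ("S1", "a2", "S6"): 5.0,
    ("S1", "a2", "S1"): -1.0,
    ("S2", "a4", "S6"): 5.0,
    ("S2", "a5", "S1"): -1.0,
}
gamma = 0.9  # discount factor used by the algorithms discussed later
```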

We will also see how problems in RL algorithms have influenced the evolution of MDPs: POMDPs (Partially Observable MDPs), SMDPs (Semi-MDPs) and others were formulated to overcome issues left unhandled by traditional MDPs (White, 1993).

DISCUSSION

In this section we look at how RL is accomplished in general. The algorithms developed to achieve reinforcement learning are the main point of discussion, and we will see how they have evolved from the very basic idea of "evaluative feedback".


Evaluative feedback works by taking the results of the action taken into consideration. Every possible action from a specific state is given a scalar reward relative to the other actions. The system learns the best possible sequence of actions to reach the ultimate goal by learning the best action from each state; this can be termed trial and error. The method is refined using exploitation and exploration techniques for enhanced performance.

Exploitation means acting greedily, while exploration means searching for all possible ways. Exploitation always aims to maximise the reward (of course, only over known actions): every action taken during exploitation tries to achieve the best possible reward for the next state. Exploration, on the other hand, focuses on finding other ways to achieve the goal, which may or may not be profitable. For an agent to decide which action to take, it needs an estimate of the value each action can fetch.

State Value

Now let us see how the values are computed. Assume the agent is in state S1 and an action a1 is chosen to be executed from S1. The value V of action a1 is the reward r1 obtained from a1, added to the rewards r2, r3, ..., rx obtained by the actions taken from subsequently attained states, averaged over the total number x of actions taken (Sutton & Barto, 1998).

V(a_1) = \frac{r_1 + r_2 + \dots + r_x}{x}    (1)
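Equation (1) is just a sample average. Below is a minimal Python sketch of keeping that average incrementally; the class name and the reward sequence are assumptions for illustration, not the paper's code.

```python
class ActionValue:
    """Sample-average estimate V(a) = (r1 + ... + rx) / x, kept incrementally (equation 1)."""
    def __init__(self):
        self.value = 0.0
        self.count = 0

    def update(self, reward):
        self.count += 1
        # Incremental mean: new_mean = old_mean + (r - old_mean) / n
        self.value += (reward - self.value) / self.count

est = ActionValue()
for r in [1, 0, -1, 1]:   # illustrative rewards observed after taking the action
    est.update(r)
print(est.value)          # 0.25, the plain average of the observed rewards
```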

Exploration and Exploitation

Exploitation is a euphemism for selecting greedy actions: an action a* is chosen based on the best value any available action can provide,

a^{*} = \arg\max_{a} V(a)    (2)

Always choosing the greedy action might not yield an optimal path to the goal; an unexplored path might be more promising than the current best policy.

To strike a balance, the ε-greedy method was created. Epsilon (ε) is the fraction of the time a random action is selected; the rest of the time a greedy action is selected. Initially ε can be set high and then reduced gradually over time to attain better results. During the random selection, however, all possible actions are given equal importance irrespective of their effect.
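A hedged sketch of ε-greedy selection as described above; the value estimates and the decay schedule for ε are illustrative assumptions, not values from the paper.

```python
import random

def epsilon_greedy(values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    `values` maps action -> estimated value."""
    if random.random() < epsilon:
        return random.choice(list(values))   # explore: all actions equally likely
    return max(values, key=values.get)       # exploit: best known action

values = {"left": 0.2, "right": 0.5, "forward": -0.1}   # illustrative estimates
epsilon = 0.3
for step in range(5):
    action = epsilon_greedy(values, epsilon)
    epsilon *= 0.99   # optional: start high and decay over time, as the text suggests
```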

Using a Gibbs/Boltzmann temperature, we can instead select each action with a probability that reflects the estimated reward it can fetch, so better actions are much more likely to be chosen than worse ones:

P(a) = \frac{\exp(V(a)/\tau)}{\sum_{x} \exp(V(x)/\tau)}    (3)

In the above equation, a is the action being considered and the sum runs over all actions x; τ is the temperature. This method is called softmax (Syafiie & Tadeo, 2004).
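A short sketch of equation (3) in code, with τ as the temperature; the action names and values are made up for illustration.

```python
import math
import random

def softmax_action(values, tau=1.0):
    """Select an action with probability exp(V(a)/tau) / sum_x exp(V(x)/tau)."""
    actions = list(values)
    prefs = [math.exp(values[a] / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]

values = {"left": 0.2, "right": 0.5, "forward": -0.1}
print(softmax_action(values, tau=0.5))   # "right" is most likely, but not certain
```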

Weighted Averaging

While we are trying to maximise the reward obtained by taking an action, we should also consider how far into the future we look; otherwise we may lose sight of near and short-term goals. To tackle this, we assign weights to rewards based on when they occur: a discount factor γ is kept below 1 and used to weight each reward, so that in the equation below the first reward takes precedence over the next, and so on.

R = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots    (4)

So the value of a state can be rewritten as

V^{\pi}(s) = E_{\pi}\left\{ \sum_{x=0}^{\infty} \gamma^{x} r_{t+x+1} \,\middle|\, s_t = s \right\}    (5)

In the above equation, π is the policy being followed, from which actions such as a are selected in states such as s; the sum runs over time steps from 0 to infinity, and E_π is the expected value obtained if policy π is followed. This gives the value of a state. But how good an action is can only be identified if the goodness of taking that action from a particular state is computed.
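Equation (4) in code form: a small helper (an assumed name, not from the paper) that computes the discounted return of a reward sequence, showing how earlier rewards dominate later ones.

```python
def discounted_return(rewards, gamma=0.9):
    """R = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...  (equation 4)."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Earlier rewards carry more weight than later ones:
print(discounted_return([0, 0, 5], gamma=0.9))   # 0 + 0 + 0.81*5 = 4.05
```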

Bellman Equation

The goodness of a state-action pair can be computed by obtaining its value in the following way (Peng, 1992):


V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]    (6)

Here π(s,a) is the probability of choosing action a in the current state s, s' is the state reached from s by taking action a, P^{a}_{ss'} is the probability of transitioning from s to s' under action a, and R^{a}_{ss'} is the expected reward for that transition. This is known as the Bellman equation (Sutton & Barto, 1998).
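A hedged sketch of a single Bellman backup for one state, assuming the toy dictionary format for P and R sketched earlier and a stochastic policy given as policy[s] = {action: probability}; none of these names come from the paper.

```python
def bellman_backup(s, V, policy, P, R, gamma=0.9):
    """V^pi(s) = sum_a pi(s,a) * sum_s' P^a_{ss'} * [ R^a_{ss'} + gamma * V(s') ]   (equation 6)."""
    value = 0.0
    for a, pi_sa in policy[s].items():          # pi(s, a)
        for s_next, prob in P[s][a]:            # P^a_{ss'}
            r = R.get((s, a, s_next), 0.0)      # R^a_{ss'}
            value += pi_sa * prob * (r + gamma * V.get(s_next, 0.0))
    return value
```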

Once the feedback method is in place and value functions can be computed, techniques for maximising reward (in layman's terms, choosing the best possible path to the goal), such as dynamic programming, come into play.

Dynamic Programming (DP)

Dynamic programming sweeps through every state to find the agent the best path. Consider the diagram below.

Fig. 2: State and action representation (circles are states such as S1, S2 and S6; edges are actions such as a1-a5)

In this figure (Fig. 2), all possible actions from S1 (a1, a2, a3) are checked individually. The transition from S1 to S2 is taken as a case, and the possible value of S2 is computed using actions a4 and a5. Likewise, all actions and all states are swept. The actions yielding the maximum value are then chosen and executed. Pseudocode for basic dynamic programming is shown below (Fig. 3).

Get policy ∏

Do
    iDiff = 0
    For s = 0 to allStates
        V = V(s)
        V(s) = Σ_a ∏(s,a) Σ_s' P^a_ss' [ R^a_ss' + γ V(s') ]
        iDiff = iDiff > abs(V - V(s)) ? iDiff : abs(V - V(s))
    End
While (iDiff > iThreshold)

Execute ∏   // act based on V(s)

Fig. 3: Iterative policy evaluation (dynamic programming) pseudocode

The pseudocode above shows one way to compute the values and follow the policy to reach the goal. A small tweak to the technique maximises the reward:

\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)    (7)

As per the above equation, the optimal action choice can be obtained directly; this is known as policy improvement. The agent follows the policy, but when an action proves more valuable than the one chosen by the policy, π is modified to use the better action. π'(s) is the modified, or improved, policy (Busoniu, Babuska, Schutter, & Ernst, 2010).

The improvement discussed above is greedy improvement, and with dynamic programming it is possible to obtain the optimal policy this way: by evaluating the policy (computing its values) and improving it over repeated iterations, we converge to an optimal policy.

\pi_0 \xrightarrow{\text{evaluate}} V^{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluate}} V^{\pi_1} \xrightarrow{\text{improve}} \dots

Fig. 4: Policy improvement - evaluation and improvement alternate until the policy converges
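A compact sketch of the evaluate/improve loop of Fig. 4, again assuming the toy dictionary representation of P and R plus an `actions` map listing the actions available in each state; the threshold and the arbitrary initial policy are choices made here, not taken from the paper.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states if actions[s]}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: sweep the states until values stop changing much
        while True:
            delta = 0.0
            for s, a in policy.items():
                v_old = V[s]
                V[s] = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in P[s][a])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V
        stable = True
        for s in policy:
            best = max(actions[s], key=lambda a: sum(
                p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```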

Dynamic programming can prove expensive when the state space Ω is large and there are numerous actions per state, so a way to stop the sweep is required: the sweep is stopped when the change in a state's value is too small to affect decision making. A bigger drawback is dynamic programming's dependence on a full MDP model, which is a problem when the environment is new and there are no predefined rewards or transition probabilities.

This complete dependence on the MDP model led researchers to a different method, Monte-Carlo, which does not need to know the probabilities of every action from a state.

Monte-Carlo Method (MC)

Monte-Carlo methods discover optimal policies by trying many different ways to reach the goal. They assume neither knowledge of the probability of reaching a particular state from the current one nor of the rewards obtained several states ahead. They follow sample sequences generated by the policy and determine the best one over many iterations; in most cases the best policy found converges to an optimal one.

Each sample sequence is called an episode. At the end of every episode, the total reward obtained is averaged over the states encountered, and all encountered states are updated with this value. The policy should be written so that every possible state is covered a sufficient number of times (Kaelbling, Littman, & Moore, 1996).

When a state is visited multiple times, its value is assigned either by the "first visit" or the "every visit" Monte-Carlo method. In first-visit MC, the weighted return (see equation (4)) from the first time a state is visited is assigned to that state. In every-visit MC, the value is averaged over all visits to the state encountered while achieving the goal.

Episode = GenerateEpisode(∏)

For x = 0 to allStates(Episode)
    aReturns(allStates(x)->StateID) += allStates(x)->Return   // return observed from this visit onwards
    aVisited(allStates(x)->StateID)++
End

For x = 0 to iTotalStates
    aReturns(x) /= aVisited(x)   // the average return becomes the value estimate
End

Fig. 5: Every-visit Monte-Carlo pseudocode
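For comparison with the every-visit pseudocode above, here is a hedged Python sketch of first-visit Monte-Carlo evaluation; `generate_episode` is an assumed helper that returns (state, reward received after leaving that state) pairs produced by following the policy.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, gamma=0.9, n_episodes=1000):
    """Estimate V(s) by averaging the return from the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode()          # [(state, reward), ...] -- assumed helper
        G = 0.0
        first_return = {}
        # Walk backwards so G is the discounted return from each step onwards;
        # the last assignment for a state corresponds to its earliest (first) visit.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = reward + gamma * G
            first_return[state] = G
        for state, G_first in first_return.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```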

The greatest advantage is that the computational complexity of dynamic programming's recursive sweeps is avoided entirely in Monte-Carlo. Transitions are made only according to the policy; learning is done from samples, updating the root nodes and giving us values for all visited states. Once converged, the optimal path is easy to obtain.

The disadvantage of this method is that improving the policy requires thousands of iterations to complete. This was eased using the Exploring Starts algorithm.

Exploring Starts is a method where episode generation is followed by execution, evaluation, and a change of policy based on the results, which speeds up the policy-improvement wagon. Even though the improvement looks promising, sampling forever, or getting to know all states, is almost impossible.

Exploration is the backbone of the Monte-Carlo method, as it handles unknown environments. Two approaches exist: "on-policy" (where ε-greedy is used) and "off-policy" (where one policy is used for exploration while another tries to determine the optimal way to achieve the goal).

The on-policy method is fairly intuitive. After an episode is executed, each state's policy is made greedy with respect to the estimated values, except for ε of the time (see (8)):

\pi(s,a) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & \text{if } a = a^{*} \\[4pt] \dfrac{\epsilon}{|A(s)|} & \text{otherwise} \end{cases}    (8)

Here a* is the greedy choice, A(s) is the set of actions available in state s, and ε is the fraction of the time the policy is not greedy.

The off-policy method follows a behaviour policy while learning an estimation policy based on weighted returns. More of this is explained under the Q-learning method.

One major disadvantage of Monte-Carlo is having to wait for an episode to finish, which can take an extremely long time in large state spaces. This is solved by updating the root nodes with an estimate from the current state, otherwise known as temporal difference learning.

Temporal Difference Learning (TD)

The agent learns from experience directly: once V(S_{t+1}) is attained, V(S_t) is updated with the newly estimated value. Consider TD(0):

V(S_t) \leftarrow V(S_t) + \alpha \left[ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]    (9)

where V(S_t) is the previously estimated value of the state. It is adjusted using the estimated value of the next state and the exact reward obtained while reaching it, which gives a more exact value. The update is moderated by a constant step size α.
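A one-function sketch of the TD(0) update in equation (9); the state numbers reused from the earlier trace are only illustrative.

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * [ r + gamma*V(s') - V(s) ]   (equation 9)."""
    v_s = V.get(s, 0.0)
    target = reward + gamma * V.get(s_next, 0.0)   # bootstrapped estimate of the return
    V[s] = v_s + alpha * (target - v_s)
    return V[s]

V = {}
# One illustrative transition: from place 26 the agent hit a wall (reward -1) and moved on
td0_update(V, s=26, reward=-1, s_next=27)
```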

By modifying values on the fly (bootstrapping), TD resembles DP; but TD does not sweep all possible actions and states, instead following a sample path like MC. Thus TD is a hybrid of DP and MC. TD converges faster thanks to bootstrapping and is not as computationally complex as DP.

Because of these advantages, on-policy and off-policy variants of TD were developed for different situations: the on-policy method is called SARSA, while the off-policy method is Q-learning (Sutton & Barto, 1998).


SARSA

State-action pair values are computed to obtain a greedy behaviour policy; for learning, an ε-greedy method can be used. The method is called SARSA because each update uses the State and Action, the Reward obtained, and the next State and Action reached.
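A hedged sketch of the SARSA update and the ε-greedy behaviour policy it learns with; the action names and parameters are placeholders, not the paper's.

```python
import random
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: Q(s,a) <- Q(s,a) + alpha * [ r + gamma*Q(s',a') - Q(s,a) ]."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)
actions = ["left", "right", "forward"]            # illustrative action set

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behaviour policy: the same policy that the SARSA update is evaluating."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```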

Q-Learning

This off-policy technique follows a behaviour policy while updating the estimation policy towards the maximum-value path, which can lead to the shortest path to the goal.

While(1)
    s = StartState()
    While(s != Goal)
        a = BehaviourPolicy(Q(s, ·))        // ε-greedy
        r, s' = Execute(a)
        Q(s,a) = Q(s,a) + α[ r + γ Max_a' Q(s',a') - Q(s,a) ]
        s = s'
    End
End

Fig. 6: Q-learning pseudocode

As the value of a state increases, the chance of that state being selected next time increases. This has pitfalls, however, if the nearby states are disastrous.

Q-learning is suitable where the shortest path is desired and an occasional negative outcome does not have a huge impact. SARSA is the safest way to learn, but it might not be optimal all the time.

All the methods discussed so far are feasible for a limited number of states. In the real world, where robots have to interact, the state space can be enormous, and states and actions may need to handle continuous time. To solve this, RL has evolved to treat states differently, as can be seen in function approximation.

STATE SPACES

The plain MDP is the only style of state space discussed so far, but state spaces can be a lot more complicated. One possibility is that the robot's working area is too large for the traditional state-space representation, in which case function approximation is needed. Sometimes an agent does not get a full view of the MDP, which calls for Partially Observable MDPs (POMDPs). And there are situations in which performing a single step and re-planning is too costly, demanding the use of Semi-MDPs (SMDPs). These are a few examples of how efficient use of RL demands evolving the structures used to handle problems.

I am going to explain function approximation and SMDPs with the help of RoboCup soccer (Stone & Sutton, 2005).

Function Approximation

Function approximation works by estimating a model of the state space: the huge state space is reduced to a manageable set of states using generalisation techniques (Kostiadis & Hu, n.d.). Once the model is constructed, reinforcement learning algorithms such as Q-learning and SARSA are applied.

Let us see how function approximation is put to use in RoboCup soccer. The playing region is 30 m x 30 m, and it is difficult for agents to treat such a large arena as a discrete state space. Instead, the state space is generated from a model: typical state variables are the distance between the agent and the ball, the distances between opponent agents and the ball, the angle between two opponents with respect to the ball, possession of the ball, and so on. In this way each agent has only a few states to think about.
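A rough sketch of the idea, assuming made-up positions and bucket sizes (these are not the features or parameters used by Stone & Sutton): raw coordinates are collapsed into a few coarse distance/angle variables, and values are stored per feature tuple rather than per grid cell.

```python
import math

def keepaway_features(agent, ball, opponent):
    """Collapse raw (x, y) positions into a few coarse state variables
    (an assumption, loosely modelled on the distance/angle features in the text)."""
    dist_ball = math.dist(agent, ball)                 # agent-to-ball distance
    dist_opp_ball = math.dist(opponent, ball)          # opponent-to-ball distance
    angle = math.atan2(opponent[1] - ball[1], opponent[0] - ball[0])
    # Discretise each variable into coarse buckets so the state space stays small
    return (round(dist_ball), round(dist_opp_ball), round(math.degrees(angle) / 30))

Q = {}   # values indexed by (features, action) instead of raw positions
state = keepaway_features(agent=(3.0, 4.0), ball=(10.0, 12.0), opponent=(11.0, 9.0))
Q[(state, "hold_ball")] = 0.0
```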

Now that the state space is conveniently mapped, an agent has to act on it. But because these states are functions of distances and angles rather than adjoining grid locations, the traditional way of moving one step at a time no longer works. Problems of this kind demanded actions that are internally composed of sets of actions, otherwise known as SMDP actions.

Semi-Markov Decision Process (SMDP)

An SMDP lets the agent model changes in the system in real time (Rasanen, 2006) and lets it change its plan every time the system changes state. The value of the state reached by an SMDP action depends on time; it can be estimated from the probability that the agent reaches its next decision point within a particular time and the probability that the final goal is achieved within a certain time. An SMDP action can also be seen as a macro.

An SMDP action can be broken down into multiple primitive actions carried out over several time units; consider, for example, a robot stealing the ball from an opponent (Stone & Sutton, 2005).


Such an SMDP action might be called StealBall from location X, but it can be split into the agent turning to the proper angle to reach the ball, travelling to the probable location of the ball, adjusting to the ball and moving against the direction of the opposing agent. If the agent discovers that the SMDP action cannot possibly be completed, the old action can be terminated. All of this assumes the field is fully observable; there are situations in which not all states are observable to the robot, and POMDPs are designed to handle them.

Partially Observable Markov Decision Process (POMDP)

Since a POMDP is a partially observable MDP, the agent works with beliefs, i.e. probabilities or confidences over states (Monahan, 1982), generally maintained as a posterior distribution over the states.

V_t(b) = \max_{a} \left[ r(b,a) + \gamma \sum_{o} P(o \mid b, a)\, V_{t-1}(b^{a,o}) \right]

Here the value function is defined over belief states b: the belief is the probability that the agent is in a particular state, updated after taking action a from the previous state and observing o (Roy & Gordon, 2005).
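A hedged sketch of maintaining such a belief as a posterior over discrete states; T and O stand for assumed transition and observation models, which the paper does not specify.

```python
def update_belief(belief, action, observation, T, O):
    """Bayesian belief update: b'(s') is proportional to
    O(obs | s', a) * sum_s T(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(T.get((s, action, s_next), 0.0) * b for s, b in belief.items())
        new_belief[s_next] = O.get((s_next, action, observation), 0.0) * predicted
    total = sum(new_belief.values())
    if total == 0.0:
        return belief   # observation impossible under the model; keep the old belief
    return {s: b / total for s, b in new_belief.items()}
```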

CONCLUSION

The importance of reinforcement learning is indisputable, and this paper has supported that stance from the Introduction onwards. The paper depicts how reinforcement learning algorithms have evolved gradually, from primitive RL techniques to complex, real-time algorithms. As a survey it has highlighted the pitfalls of each method and how later methods were derived to solve them; some issues that were fatal to traditional RL were shown, along with the ways in which they were handled. In conclusion, this paper covers only a part of what RL algorithms have been, how far they have come, the problems faced and solved, and, implicitly, how RL is the future of artificial intelligence.

Bibliography

Busoniu, L., Babuska, R., Schutter, B. D., & Ernst, D. (2010). Reinforcement Learning and Dynamic Programming using Function Approximators. Taylor & Francis CRC Press.

Conn, K. G. (2005). Supervised-Reinforcement Learning For A Mobile Robot In A Real-World. Nashville, Tennessee: Vanderbilt University.


Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237-285.

Kostiadis, K., & Hu, H. (n.d.). KaBaGe-RL: Kanerva-based Generalisation and Reinforcement Learning for Possession Football.

Monahan, G. E. (1982). A Survey of Partially Observable Markov Decision Processes: Theory, Models and Algorithms. Management Science.

Roy, N., & Gordon, G. (2005). Finding Approximate POMDP Solutions Through Belief Compression. Journal of Artificial Intelligence Research.

Stone, P., & Sutton, R. S. (2005). Reinforcement Learning for RoboCup-Soccer Keepaway.

Rasanen, O. (2006). Semi-Markov Decision Processes. Seminar on MDP.

Peng, S. (1992). A Generalized Dynamic Programming Principle and Hamilton-Jacobi-Bellman Equation. Stochastics: An International Journal of Probability.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Syafiie, S., & Tadeo, F. (2004). Softmax and ε-greedy Policies Applied to Process Control. IFAC Workshop on Adaptive and Learning Systems.

White, D. J. (1993). Markov Decision Processes. Manchester: Wiley.
