
A Proposal of Q-Learning to Control the Attack of a 2D Robot Soccer Simulation Team

José Rodrigo Ferreira Neri, Maicon Rafael Zatelli
Department of Automation and Systems (DAS), Federal University of Santa Catarina (UFSC)
Florianópolis, SC, Brazil
e-mail: {jrneri,maicon}@das.ufsc.br

João Alberto Fabro
Computational Intelligence Research Group, Informatics Department (DAINF), Federal University of Technology of Paraná (UTFPR)
Curitiba, PR, Brazil
e-mail: [email protected]

Carlos Henrique Farias dos Santos
Robotic Research Group, Center of Engineering and Exact Sciences (CECE), State University of West of Paraná (UNIOESTE)
Foz do Iguaçu, PR, Brazil
e-mail: [email protected]

Abstract—This document presents a novel approach to control the attack behavior of a team of simulated soccer-playing robots of the RoboCup 2D category. The presented approach modifies the behavior of each player only when it is in the state “controlling the ball”. The approach is based on a modified Q-Learning algorithm that implements a continuous machine learning process. After an initial learning phase, each player uses the experience gathered in previous games to decide whether it should dribble, pass or kick the ball towards the adversary goal. Simulation results show that the proposed approach is capable of learning how to score goals, even when facing the vice-champion of the previous world championship.

Keywords - Simulated robot soccer, machine learning, reinforcement learning, Q-Learning algorithm.

I. INTRODUCTION

Robotic soccer was first proposed as a research problem in 1996 [1]. The idea is to use soccer as a standard platform for comparing different approaches to a diversity of problems associated with Artificial Intelligence and Robotics. The RoboCup 2D simulation category [2] focuses on the development of distributed control algorithms. Each of the eleven simulated soccer agents receives partial information about the state of the game and has to take actions so that the team can achieve its objectives (defend and, ultimately, score more goals than the adversary). Each player must identify objects in the environment, such as the ball and the robots around it (from its own team and the opposing team). Each player should be able to represent the environment, select sub-objectives, and plan and execute actions to achieve the objectives of the team [2]. As these objectives depend upon individual decisions taken by each player, this paper proposes the use of an on-line version of a Reinforcement Learning algorithm (Q-Learning [3]). The learning algorithm is used to modulate the decision process of each simulated player, based on a training process that evaluates the decisions that best suited the players in previous (and the current) matches. This algorithm was proposed and used in the GPR-2D team, which achieved first place in the Brazilian RoboCup 2D simulation category in 2011.

This paper has six sections. This first section presented the problem and the motivation for the proposal. Section 2 presents the main idea behind the development of the GPR-2D team, along with a brief description of the team used as the base for the development, the Helios team from Japan [4]. The third section describes the reinforcement learning method implemented (Q-Learning). Section 4 presents the modeling of the problem. In Section 5 the implementation of the algorithm is presented. In the final section the results are discussed, and the conclusions as well as some ideas that can serve as directions for future research are presented.

II. THE GPR-2D TEAM

The GPR-2D team is the current champion of the RoboCup Open Brazil 2011 [5] and is being developed by GPR (Grupo de Pesquisas em Robótica - Robotics Research Group) of the State University of West of Paraná (UNIOESTE), in collaboration with the Federal University of Technology of Paraná (UTFPR) and the Federal University of Santa Catarina (UFSC), all from Brazil. The team is based on the source code of the Helios team (Japan) that participated in RoboCup 2011. The main proposed advance is the modification of the behavior of each player using Q-learning [3], a Reinforcement Learning algorithm. This strategy allows the agents to act appropriately when they are in possession of the ball, picking the best action to take in a given state of the environment. The main difference of the proposed approach is the use of different learning matrices for different areas of the field. The proposal is detailed in Sections 4 and 5. In the next subsection, a brief description of the team used as the base for the development (Helios [4], from Japan) is presented.


A. The Helios 2011 Base Team

Helios [4] is a team from Japan that released the source code of its 2D simulation base team (agent2D). This team uses probability estimation to improve the ball-intercept skill and a game-tree search methodology for online tactics planning, and it has become a base for many teams, including ours. In the following subsections some details about agent2D are briefly presented.

B. Decision Making of Ball Intercept Based on Probability Estimation

The Helios team uses probability estimation to provide the skill to intercept the ball during the game. To execute this action, the agent has to maneuver itself towards the ball during passes of the other team and reach the ball before any opponent. This skill was divided into three subtasks:

1: Estimation of the number of cycles necessary to intercept the ball.
2: Decision making, considering whether the agent should move to the ball or not.
3: Maneuvering the agent to the ball.

To execute subtask 1, the agent must sense the field and estimate the number of cycles required to intercept the ball. Subtask 2 (the decision) is based on the information obtained from subtask 1 and on other information from the field (the position of the closest opponent, for example). In subtask 3, if the agent decides to try to intercept the ball, the most appropriate action is selected (turn a certain amount of degrees, and dash towards the ball). The decision making (subtask 2) is done using probability estimation, and the probabilities are updated according to the success of the decisions taken.
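The three subtasks can be summarized in a small sketch. This is a minimal illustration only: the structure and all names (WorldModel, estimateInterceptCycles, shouldIntercept, maneuverToBall) are hypothetical and do not correspond to the actual agent2D code, and the logistic score merely stands in for the probability estimator that Helios trains from past decisions.

```cpp
// Minimal sketch of the three intercept subtasks described above. All names
// are hypothetical and do not correspond to the actual agent2D code.
#include <algorithm>
#include <cmath>

struct WorldModel {                    // hypothetical snapshot of what the agent senses
    double ballDistance;               // distance to the predicted interception point (m)
    double ballSpeed;                  // current ball speed (m per cycle)
    double playerSpeed;                // agent's maximum speed (m per cycle)
    int    closestOpponentCycles;      // opponent's estimated cycles to reach the ball
};

// Subtask 1: estimate how many cycles this agent needs to reach the ball.
int estimateInterceptCycles(const WorldModel& wm) {
    double closingSpeed = std::max(wm.playerSpeed - wm.ballSpeed, 0.1);
    return static_cast<int>(std::ceil(wm.ballDistance / closingSpeed));
}

// Subtask 2: decide whether to go for the ball. A simple logistic score stands
// in for the probability estimator that Helios updates from past successes.
bool shouldIntercept(const WorldModel& wm, int myCycles) {
    double successScore = 1.0 / (1.0 + std::exp(myCycles - wm.closestOpponentCycles));
    return successScore > 0.5;         // intercept only if we expect to arrive first
}

// Subtask 3: issue the movement command (turn towards the ball, then dash).
void maneuverToBall(const WorldModel& /*wm*/) {
    // placeholder: the real agent would send turn and dash commands here
}
```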

C. Online Tactics Planning through Search Trees

The “ball intercept” behavior is responsible for the defensive actions (when the other team has ball possession). To cope with the decisions when the agents of the team are in possession of the ball, a different approach is used in Helios. Each agent continuously searches for a sequence of actions using a tree search. To achieve cooperative behavior, each agent uses an “Action Generator”, a computational function that provides options for the player: each player can move with the ball (dribble), pass the ball to a close teammate, or kick the ball towards the goal. It is important to realize that, to obtain cooperative behavior among several teammate agents, even the agents that don't have the ball must position themselves to receive it, and then plan the next steps in case they receive a pass. As all the agents have the same objective (scoring a goal), even planning separately allows the team to achieve cooperation. As an example, suppose that one agent is dribbling with the ball: it can continue to dribble, kick to the goal, or pass to a teammate. The agent evaluates each of these options (using heuristics) and chooses the most promising one: passing the ball. The other agent, which is in position to receive, uses the same mechanism to decide its actions, getting closer to the agent with the ball in the first step, dribbling in the second, and passing to another agent in the third. The evaluation of each decision is done by a “Field Evaluator”. Both the “Action Generator” and the “Field Evaluator” are described in the next subsection. By continuously executing these short-term, independent plans, it is possible to achieve a truly cooperative behavior among teammates.

D. Tree Search of Action Sequences

The “Action Generator” is a function that generates several action-state pairs. After generating an action, this function also estimates the state of the agent after the execution of that action. Each action-state pair is a branch of the search tree, and the best-evaluated branch (evaluated by the “Field Evaluator”) can be expanded, generating sequences of actions (plans) that the agent then executes. For efficiency reasons, only the “best” action-state pair is expanded, and the search is limited to just 2 or 3 levels of actions, due to the processing time available.

The “Field Evaluator” evaluates a generated action-state pair or an entire action sequence. This function returns a real number proportional to the evaluation: a higher result implies that the action, if taken, can lead the agent to a better position or situation. Each agent then chooses the best action sequence based on this value and executes its first action. The process is repeated for each step of the simulated game [4].
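As a rough illustration of this best-first, depth-limited search, the following sketch expands only the best-evaluated branch for a few levels and returns the first action of the resulting plan. Everything here (State, Action, generateCandidates, evaluateField, planFirstAction, the toy heuristics) is an assumption made for the example; the real Action Generator and Field Evaluator in agent2D are considerably richer.

```cpp
// Sketch of a best-first, depth-limited search over action-state pairs, in the
// spirit of the "Action Generator" / "Field Evaluator" pair described above.
#include <algorithm>
#include <vector>

struct State  { double ballX, ballY; };            // predicted situation after an action
struct Action { enum Kind { Dribble, Pass, Shoot } kind; int targetTeammate; };
struct Node   { Action action; State state; double value; };

// "Action Generator": propose candidate actions and the state each would lead to.
std::vector<Node> generateCandidates(const State& s) {
    // Toy generator: one dribble forward and one shot at the goal (assumed at x = 52.5).
    return {
        { {Action::Dribble, -1}, {s.ballX + 3.0, s.ballY}, 0.0 },
        { {Action::Shoot,   -1}, {52.5, 0.0},              0.0 },
    };
}

// "Field Evaluator": a higher value means a more promising resulting situation.
double evaluateField(const State& s) {
    return s.ballX;                                // toy heuristic: prefer advancing the ball
}

// Expand only the best candidate at each level, down to a small fixed depth,
// and return the first action of the best plan found (executed this cycle).
Action planFirstAction(const State& start, int maxDepth = 3) {
    State current = start;
    std::vector<Action> plan;
    for (int depth = 0; depth < maxDepth; ++depth) {
        std::vector<Node> candidates = generateCandidates(current);
        if (candidates.empty()) break;
        for (Node& n : candidates) n.value = evaluateField(n.state);
        auto best = std::max_element(candidates.begin(), candidates.end(),
            [](const Node& a, const Node& b) { return a.value < b.value; });
        plan.push_back(best->action);
        current = best->state;                     // expand only the best branch
    }
    if (plan.empty()) return { Action::Dribble, -1 };
    return plan.front();
}
```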

III. REINFORCEMENT LEARNING

According to [3], Reinforcement Learning (RL) is a formalism that allows an AI agent to learn from its interaction with the environment in which it is inserted. The main objective is to find the best rewarding action among all possible actions in any situation. Learning takes place by selecting a certain action in a certain situation and evaluating the outcome. The outcome is then used to reinforce the probability of taking certain actions, or to decrease the probability of taking others. There are several RL techniques; one of the most widely used is the Q-learning algorithm, which is also used in related work on simulated robot soccer in the RoboCup [6-8].

A. Q-Learning

The Q-learning method can be applied when an agent is embedded in a finite and discrete world. At each iteration, the agent observes the current state of the environment, chooses among a finite number of actions, and receives an immediate reward for each of its actions [9].

In its simplest form, one-step Q-learning can be defined by the following expression:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$

where $s_t$ corresponds to the current state; $a_t$ is the action taken at state $s_t$; $r_t$ is the reward received for taking action $a_t$ at state $s_t$; $s_{t+1}$ is the next state; $\gamma$ is the discount factor ($0 < \gamma < 1$); and $\alpha$ is the learning rate ($0 \le \alpha < 1$).

The function Q(st, at) is the value associated with the state-action pair (st, at) and represents how good the choice of this action is for maximizing the cumulative return. The action-value function Q(st, at), which stores the reinforcements received, is updated from its current value for each state-action pair. Thus, part of the reinforcement received in a state is transferred to the state prior to it.

In the Q-learning algorithm, the choice of the action to be performed in a given state of the environment can be made with any exploration/exploitation criterion, including randomly. A widely used policy is called ε-greedy, in which the agent chooses either a random action or the action with the largest value of Q(st, at) [3]. The choice with the ε-greedy policy occurs as follows (equation 2):

$$a = \begin{cases} a_{\text{random}}, & \text{if } q < \epsilon \\ \arg\max_{a} Q(s_t, a), & \text{otherwise} \end{cases} \qquad (2)$$

If the value of q, chosen at random, is less than the predefined value ε, a random action is selected; otherwise, the action in the table Q(st, at) with the largest assigned value is selected. In the simulations ε was set to 0.2 (i.e., 20% of the time the action of the player was selected randomly).
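A minimal sketch of the ε-greedy choice (equation 2) and of the one-step update (equation 1) is shown below, using the parameters reported in this paper (ε = 0.2, α = 0.5, γ = 0.8) and anticipating the 7x7 state/action space defined in Section IV. The table layout and function names are illustrative assumptions, not the team's actual code.

```cpp
// Minimal sketch of the epsilon-greedy policy (Eq. 2) and the one-step
// Q-learning update (Eq. 1), with the parameters reported in the paper.
#include <algorithm>
#include <array>
#include <random>

constexpr int NUM_STATES  = 7;                    // states defined in Section IV
constexpr int NUM_ACTIONS = 7;                    // actions defined in Section IV
using QTable = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

// Equation (2): with probability epsilon pick a random action, otherwise argmax_a Q(s, a).
int chooseAction(const QTable& Q, int state, std::mt19937& rng, double epsilon = 0.2) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<int> pick(0, NUM_ACTIONS - 1);
        return pick(rng);                         // explore
    }
    int best = 0;                                 // exploit
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (Q[state][a] > Q[state][best]) best = a;
    return best;
}

// Equation (1): move Q(s, a) towards r + gamma * max_a' Q(s', a').
void updateQ(QTable& Q, int s, int a, double reward, int sNext,
             double alpha = 0.5, double gamma = 0.8) {
    double maxNext = *std::max_element(Q[sNext].begin(), Q[sNext].end());
    Q[s][a] += alpha * (reward + gamma * maxNext - Q[s][a]);
}
```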

IV. PROBLEM MODELING

To apply the Q-Learning algorithm, one of the first steps is the discretization of the states that each simulated player can reach. The proposed states for this problem were:

1. Lead_pass: it is possible to launch the ball to a fellow player that is better positioned;
2. AdversaryBehind: the closest adversary player is behind the player in possession of the ball;
3. AdversaryFarAway: the distance to the closest adversary is greater than 7 meters;
4. AdversaryNotSoClose: the distance to the closest adversary is greater than 6 meters;
5. AdversaryIsClose: the distance to the closest adversary is greater than 5 meters;
6. AdversaryVeryClose: the distance to the closest adversary is greater than 4 meters;
7. KickOpportunity: there is a direct line from the player that possesses the ball to the adversary goal.

The following are the possible actions that can be taken by each agent (a sketch combining these states and actions is given after this list):

1. Launch: launch the ball towards a fellow player that is better positioned, i.e., much closer to the adversary goal;
2. SlowForward: slowly advance with the ball;
3. FastForward: quickly advance with the ball;
4. Pass: pass the ball to a fellow player that is close, but not necessarily better positioned;
5. Dribble: try to dribble past the closest adversary player;
6. Hold_the_ball: just hold the ball;
7. Kick: kick the ball in the direction of the adversary goal.
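A minimal sketch of how these seven states and seven actions could be represented, and how the current state could be discretized from the agent's perception, follows. The Perception structure, its fields, and the order in which the conditions are tested are assumptions made for the example; the paper does not specify them.

```cpp
// Illustrative representation of the seven states and seven actions above.
// The Perception fields and the order of the checks are assumptions.
enum class AttackState {
    LeadPass = 0, AdversaryBehind, AdversaryFarAway, AdversaryNotSoClose,
    AdversaryIsClose, AdversaryVeryClose, KickOpportunity
};

enum class AttackAction {
    Launch = 0, SlowForward, FastForward, Pass, Dribble, HoldTheBall, Kick
};

struct Perception {                      // hypothetical view available to the ball holder
    bool   teammateBetterPositioned;     // a teammate is free much closer to the goal
    bool   closestAdversaryBehind;       // the nearest opponent is behind the ball holder
    double closestAdversaryDist;         // distance to the nearest opponent (m)
    bool   clearLineToGoal;              // direct line from the ball to the adversary goal
};

AttackState discretizeState(const Perception& p) {
    if (p.clearLineToGoal)            return AttackState::KickOpportunity;
    if (p.teammateBetterPositioned)   return AttackState::LeadPass;
    if (p.closestAdversaryBehind)     return AttackState::AdversaryBehind;
    // Distance thresholds from the state list, tested from farthest to closest.
    if (p.closestAdversaryDist > 7.0) return AttackState::AdversaryFarAway;
    if (p.closestAdversaryDist > 6.0) return AttackState::AdversaryNotSoClose;
    if (p.closestAdversaryDist > 5.0) return AttackState::AdversaryIsClose;
    return AttackState::AdversaryVeryClose;   // > 4 m in the list; catch-all here
}
```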

Thus, seven states were defined for the environment and seven actions for the agents. After defining the states of the environment and the actions of the players, it is necessary to define the matrices R (reinforcement matrix) and Q. The matrix R is where the reinforcements that guide the actions of the agent are stored.

In the definition of the matrix R (Table I), the rows are the states of the environment and the columns are the players' actions. Each state of the environment can have one or more allowed actions. A value of 0 (zero) means that the action has no reinforcement, -1 (minus one) means that the action is not performed in that state, and values greater than zero are reinforcements for that action. Note that in state 7 (KickOpportunity) the agent can choose action 7 (Kick), to which a reinforcement of 100 is assigned.

TABLE I. MATRIX R.

State\Action    1    2    3    4    5    6    7
     1          0   -1    0    0   -1   -1   -1
     2         -1    0    0    0   -1   -1   -1
     3         -1    0    0   -1   -1   -1   -1
     4         -1    0    0    0   -1   -1   -1
     5         -1   -1   -1    0    0    0   -1
     6         -1   -1   -1    0    0    0   -1
     7         -1   -1   -1   -1   -1   -1  100
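The contents of Table I can be written directly as a constant table. The sketch below simply transcribes the table (rows are states, columns are actions); only the choice of type and names is ours.

```cpp
// Matrix R from Table I: rows are the seven states, columns the seven actions.
// 0 = no reinforcement, -1 = action not performed in that state, 100 = reward
// for kicking (action 7) when in KickOpportunity (state 7).
#include <array>

constexpr int NUM_STATES = 7, NUM_ACTIONS = 7;
using RTable = std::array<std::array<int, NUM_ACTIONS>, NUM_STATES>;

constexpr RTable R = {{
    //  1    2    3    4    5    6    7      (actions)
    {{  0,  -1,   0,   0,  -1,  -1,  -1 }},  // 1 Lead_pass
    {{ -1,   0,   0,   0,  -1,  -1,  -1 }},  // 2 AdversaryBehind
    {{ -1,   0,   0,  -1,  -1,  -1,  -1 }},  // 3 AdversaryFarAway
    {{ -1,   0,   0,   0,  -1,  -1,  -1 }},  // 4 AdversaryNotSoClose
    {{ -1,  -1,  -1,   0,   0,   0,  -1 }},  // 5 AdversaryIsClose
    {{ -1,  -1,  -1,   0,   0,   0,  -1 }},  // 6 AdversaryVeryClose
    {{ -1,  -1,  -1,  -1,  -1,  -1, 100 }},  // 7 KickOpportunity
}};
```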

The Q matrix stores the values of Q(st, at) assigned by the Q-learning algorithm. At first, the matrix contains only zeros, but at each new game the values obtained at the end of the previous game are kept. The parameters used for γ and α were 0.8 and 0.5, respectively. It is relevant to note that this matrix is updated during each game and, after the end of the game, the resulting matrix is stored in a file. So, the final matrix of one game is the initial matrix of the next, successively allowing the team to modify and improve its performance.
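The persistence of the Q matrix between games can be sketched as follows. Only the behavior (start from zeros, then reload the matrix saved at the end of the previous game) is taken from the text; the file name handling and the plain-text format are assumptions for illustration.

```cpp
// Sketch of keeping the Q matrix between games: start from zeros on the first
// game, then reload the matrix saved at the end of the previous one.
#include <array>
#include <fstream>
#include <string>

constexpr int NUM_STATES = 7, NUM_ACTIONS = 7;
using QTable = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

QTable loadQ(const std::string& path) {
    QTable Q{};                                   // all zeros: the state of the first game
    std::ifstream in(path);
    for (auto& row : Q)
        for (double& v : row)
            if (!(in >> v)) return QTable{};      // missing/short file: keep the zero matrix
    return Q;
}

void saveQ(const QTable& Q, const std::string& path) {
    std::ofstream out(path);                      // called once at the end of the game
    for (const auto& row : Q) {
        for (double v : row) out << v << ' ';
        out << '\n';
    }
}
```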


V. THE Q-LEARNING ALGORITHM

The implemented Q-learning algorithm is as follows:

    Do
        Observe the current state $s_t$
        Choose the action to be executed according to the ε-greedy policy
        Execute the selected action
        If there was a change of state in relation to the previous cycle, then:
            Determine the reinforcement to be assigned to the previous state
            Update $Q(s_t, a_t)$ with the reinforcement value using
                $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]$
        Save the current state to compare against in the next cycle
    While the game doesn't end
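A self-contained toy version of this loop is sketched below. The stubs (observeState, executeAction) stand in for the real perception and command code, the reward is read directly from a matrix R, and actions marked as not allowed (R = -1) are not filtered out, which the real implementation would have to do; parameters follow the paper (ε = 0.2, α = 0.5, γ = 0.8).

```cpp
// Self-contained toy version of the loop above: the Q value of the previous
// state-action pair is updated only when the observed state changes.
#include <algorithm>
#include <array>
#include <random>

constexpr int NUM_STATES = 7, NUM_ACTIONS = 7;
using Table = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

std::mt19937 rng{42};

int observeState() {                               // stub: random state transitions
    static std::uniform_int_distribution<int> d(0, NUM_STATES - 1);
    return d(rng);
}

int chooseEpsilonGreedy(const Table& Q, int s, double eps = 0.2) {   // Eq. (2)
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < eps) {
        std::uniform_int_distribution<int> pick(0, NUM_ACTIONS - 1);
        return pick(rng);
    }
    return static_cast<int>(std::max_element(Q[s].begin(), Q[s].end()) - Q[s].begin());
}

void executeAction(int /*action*/) {}              // stub: would send the server command

void attackLearningLoop(Table& Q, const Table& R, int cycles,
                        double alpha = 0.5, double gamma = 0.8) {
    int prevState  = observeState();
    int prevAction = chooseEpsilonGreedy(Q, prevState);
    executeAction(prevAction);
    for (int t = 0; t < cycles; ++t) {             // "while the game doesn't end"
        int state = observeState();
        if (state != prevState) {                  // update only when the state changed
            double r = R[prevState][prevAction];   // reinforcement taken from matrix R
            double maxNext = *std::max_element(Q[state].begin(), Q[state].end());
            Q[prevState][prevAction] += alpha * (r + gamma * maxNext - Q[prevState][prevAction]);
            prevState = state;                     // save the current state for later
        }
        prevAction = chooseEpsilonGreedy(Q, state);
        executeAction(prevAction);
    }
}
```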

Two interest areas were defined on the field, as can be seen in Fig. 1.

For each of these areas, a set of Q and R matrices was defined. If the agent is inside the area labeled 2, for example, it uses the values of matrices R2 and Q2. If the agent moves to the area labeled 1, it uses the values of matrices R1 and Q1. Depending on its position on the field, the agent will behave differently even if the environment is in the same state. The main difference between this approach and the base Helios team is the use of the Q-learning matrices to decide the actions when the agents enter the interest areas, thus providing the ability to “learn” how to score goals against different adversaries.
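Selecting the per-area matrices might look like the sketch below. The numeric boundaries of areas 1 and 2 come from Fig. 1 and are not stated in the text, so the x-coordinate thresholds (and the assumption that area 1 is the one nearest the adversary goal) are placeholders only.

```cpp
// Sketch of selecting the per-area learning matrices; boundaries are assumed.
#include <array>

constexpr int NUM_STATES = 7, NUM_ACTIONS = 7;
using Table = std::array<std::array<double, NUM_ACTIONS>, NUM_STATES>;

struct AreaTables { Table* Q; Table* R; };         // matrices used in the current area

AreaTables selectTables(double ballHolderX,
                        Table& Q1, Table& R1, Table& Q2, Table& R2) {
    if (ballHolderX > 36.0) return { &Q1, &R1 };   // assumed inner area, nearest the goal
    if (ballHolderX > 15.0) return { &Q2, &R2 };   // assumed outer area
    return { nullptr, nullptr };                   // outside the interest areas: no learning
}
```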

The initial training of the policy was executed during 10 matches against the team used as the base, the Helios team [4]. Each simulated game has 6000 simulation steps, so the training phase of the algorithm comprised 60000 steps. The learning matrices were updated only when the players of our team had possession of the ball and approached the adversary goal (thus entering area 2 or 1). After these first 10 matches, the stabilization of the attack behavior was observed.

Fig. 1. The two different interest areas (each one has its own Q-learning matrices).

To evaluate the performance of the team, 10 more simulations were performed against the Helios team. The results are shown in Table II.

TABLE II. RESULTS AGAINST THE HELIOS TEAM.

Match   Team          Result   Team
1       Helios_Base   0x0      GPR-2D
2       Helios_Base   0x0      GPR-2D
3       Helios_Base   0x1      GPR-2D
4       Helios_Base   0x0      GPR-2D
5       Helios_Base   0x0      GPR-2D
6       Helios_Base   0x1      GPR-2D
7       Helios_Base   0x0      GPR-2D
8       Helios_Base   0x2      GPR-2D
9       Helios_Base   0x0      GPR-2D
10      Helios_Base   0x2      GPR-2D
Total   Helios_Base   0x6      GPR-2D

It can be seen that the new attack, based on reinforcement learning techniques, was capable of scoring against and winning against the Helios team, which was the vice-champion of RoboCup 2011. To evaluate the ability of the proposed approach to outperform its base team against another adversary, simulated matches were also performed against another available team: PetSoccer [10] (Brazilian champion of 2009). The PetSoccer team uses fuzzy logic to make decisions about positioning, passing and kicking, and even to decide when to send communication messages between players, but performs no reinforcement learning. The results of these matches are presented in Tables III and IV. From the results presented in Tables III and IV, it can be concluded that the proposed approach obtained better results than the base team (Helios_Base) when playing against a different team.

TABLE III. RESULTS FROM MATCHES OF HELIOS AGAINST PETSOCCER.

Match   Team          Result   Team
1       Helios_Base   7x0      Pet_Soccer
2       Helios_Base   3x0      Pet_Soccer
3       Helios_Base   2x0      Pet_Soccer
4       Helios_Base   6x0      Pet_Soccer
5       Helios_Base   5x0      Pet_Soccer
6       Helios_Base   7x0      Pet_Soccer
7       Helios_Base   3x0      Pet_Soccer
8       Helios_Base   5x0      Pet_Soccer
9       Helios_Base   2x0      Pet_Soccer
10      Helios_Base   7x0      Pet_Soccer
Total   Helios_Base   47x0     Pet_Soccer


TABLE IV. RESULTS FROM MATCHES OF GPR-2D AGAINST PETSOCCER.

Match   Team     Result   Team
1       GPR-2D   8x0      Pet_Soccer
2       GPR-2D   7x0      Pet_Soccer
3       GPR-2D   5x0      Pet_Soccer
4       GPR-2D   5x0      Pet_Soccer
5       GPR-2D   7x0      Pet_Soccer
6       GPR-2D   6x1      Pet_Soccer
7       GPR-2D   3x0      Pet_Soccer
8       GPR-2D   8x0      Pet_Soccer
9       GPR-2D   4x0      Pet_Soccer
10      GPR-2D   9x0      Pet_Soccer
Total   GPR-2D   61x1     Pet_Soccer

To further assess the validity of the proposed approach,

simulations were also performed against the champion of RoboCup 2011 (WrightEagle). The results of these simulations are shown in Table V. The GPR-2D team uses the same code as the Helios_Base team, except for the use of reinforcement learning when the players are in area 1 or 2 of the attack field and in possession of the ball.

In the results against WrightEagle (Table V), the GPR-2D team scored 6 goals against 4 from the adversary, winning 5 games during regular time and 2 in penalty shootouts (7 wins against only 3), showing that the Q-Learning algorithm represents an advance, winning most of the games against the former champion of RoboCup.

TABLE V. RESULTS FROM MATCHES OF GPR-2D AGAINST WRIGHT EAGLE.

Match   Team     Result    Team
1       GPR-2D   0_2x0_0   WrightEagle
2       GPR-2D   0_0x0_1   WrightEagle
3       GPR-2D   0_4x0_1   WrightEagle
4       GPR-2D   2x0       WrightEagle
5       GPR-2D   1x0       WrightEagle
6       GPR-2D   0x1       WrightEagle
7       GPR-2D   0x1       WrightEagle
8       GPR-2D   1_2x1_1   WrightEagle
9       GPR-2D   1x0       WrightEagle
10      GPR-2D   1_2x1_0   WrightEagle
Total   GPR-2D   6x4       WrightEagle

VI. CONCLUSIONS

This paper has presented a proposal to use the Q-learning algorithm to provide a learning mechanism for simulated robotic agents of the RoboCup 2D simulation category. With this learning mechanism, the agents have shown improved behavior, adapting themselves individually to the different situations presented to them, in order to maximize the global performance of the team in scoring goals.

As future work, we suggest using the Q-learning method also for players without ball possession, allowing these agents to better position themselves on the field, both in the center of the field and in defensive positions, preventing the opposing team from scoring goals, and also positioning themselves better to send or receive a pass. We also suggest a study of further refinement of the states of the environment and of the actions that can be taken by the simulated robot soccer players. The authors plan to apply this approach to other RoboCup categories, such as 3D Simulation and Small Size.

REFERENCES

[1] Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I., "The RoboCup Synthetic Agent Challenge 97", International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 1997.
[2] Neri, J. R. F.; Santos, C. H. F.; Fabro, J. A., "Sistema Especialista Fuzzy para posicionamento dos jogadores aplicado ao Futebol de Robôs" (Fuzzy expert system for player positioning applied to robot soccer), IX Brazilian Symposium of Intelligent Automation (Simpósio Brasileiro de Automação Inteligente - SBAI), Brasília, DF, 2009 (in Portuguese).
[3] Sutton, R. S.; Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[4] Akiyama, H., Helios RoboCup Simulation League Team, online, available at: http://rctools.sourceforge.jp/pukiwiki/, accessed: January 2012.
[5] Brazilian Robotics Competition Results Page, http://www.cbr2011.org/Sim2D.htm, accessed: March 2012.
[6] Farahnakian, F.; Mozayani, N., "Reinforcement Learning for Soccer Multi-agents System", International Conference on Computational Intelligence and Security (CIS '09), vol. 2, pp. 50-52, 11-14 Dec. 2009.
[7] Li Xiong; Chen Wei; Guo Jing, "A new passing strategy based on Q-learning algorithm in RoboCup", International Conference on Computer Science and Software Engineering, Wuhan, Hubei, 12-14 Dec. 2008, vol. 1, pp. 524-527.
[8] Rabiee, A.; Ghasem-Aghaee, N., "A Scoring Policy for Simulated Soccer Agents using Reinforcement Learning", 2nd International Conference on Autonomous Robots and Agents, New Zealand, 2004.
[9] Dayan, P., "Technical Note: Q-learning", Centre for Cognitive Science, University of Edinburgh, Scotland, 1992.
[10] Segundo, G. A. S.; Favoreto, R. C.; Silva, E. N.; Quirino, G. K. S.; Floriano, A. S. P., "PET Soccer 2D Team Description Paper", online, available at: http://julia.ist.tugraz.at/robocup2010/tdps/2D_TDP_PET_Soccer_2D.pdf, accessed: March 2012.
