

Automated State Feature Learning for Actor-Critic Reinforcement Learning through NEAT

Yiming Peng, Gang Chen, Scott Holdaway, Yi Mei, Mengjie Zhang
School of Engineering and Computer Science, Victoria University of Wellington
yiming.peng,aaron.chen,scott.holdaway,yi.mei,[email protected]

ABSTRACT
Actor-Critic (AC) algorithms are important approaches to solving sophisticated reinforcement learning problems. However, the learning performance of these algorithms relies heavily on good state features that are often designed manually. To address this issue, we propose to adopt an evolutionary approach based on NeuroEvolution of Augmenting Topologies (NEAT) to automatically evolve neural networks that directly transform the raw environmental inputs into state features. Following this idea, we have successfully developed a new algorithm called NEAT+AC which combines Regular-gradient Actor-Critic (RAC) with NEAT. It can simultaneously learn suitable state features as well as good policies that are expected to significantly improve the reinforcement learning performance. Preliminary experiments on two benchmark problems confirm that our new algorithm can clearly outperform the baseline algorithm, i.e., NEAT.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Computer systems organization → Embedded systems; Redundancy; Robotics;

KEYWORDS
NeuroEvolution, NEAT, Actor-Critic, Reinforcement Learning, Feature Extraction, Feature Learning

ACM Reference format:
Yiming Peng, Gang Chen, Scott Holdaway, Yi Mei, Mengjie Zhang. 2017. Automated State Feature Learning for Actor-Critic Reinforcement Learning through NEAT. In Proceedings of GECCO '17 Companion, Berlin, Germany, July 15-19, 2017, 2 pages.
DOI: http://dx.doi.org/10.1145/3067695.3076035

1 INTRODUCTION
Reinforcement Learning (RL) aims to learn an optimal policy for sequential action selection while observing states in an unknown environment [7]. As an important RL algorithm family, Actor-Critic Reinforcement Learning (ACRL) algorithms are designed to directly search for effective policies (a.k.a. the actor) guided by value functions (a.k.a. the critic) [2].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO '17 Companion, Berlin, Germany
© 2017 Copyright held by the owner/author(s). 978-1-4503-4939-0/17/07. . . $15.00
DOI: http://dx.doi.org/10.1145/3067695.3076035

Many ACRL algorithms assume the availability of suitable state features that are immediately accessible during reinforcement learning. However, for effective RL, these state features must be carefully designed with the support of domain experts via a time-consuming and error-prone procedure [1]. During this procedure, even experienced domain experts may overlook important state feature information, resulting in serious degradation of learning performance [4, 8].

To address this important issue, state-of-the-art learning algorithms have considered switching across different parametric functions (e.g., Radial Basis Function networks) [3] or optimizing some predefined score functions [5]. These techniques inevitably require substantial domain knowledge. Additionally, when neural networks are chosen as the feature base, their topology must also be carefully designed before any learning algorithm is applied.

These new issues motivate us to consider exploiting a NeuroEvolution-based approach towards fully automated state feature learning that can be performed simultaneously with any ACRL algorithm. Specifically, we are interested in a major EC method for NeuroEvolution, i.e., NeuroEvolution of Augmenting Topologies (NEAT). This is for several reasons: 1) Neural Networks (NNs) are well recognized as good feature bases for various learning paradigms including RL [1]. 2) NEAT has a strong capability of evolving both structure and weights simultaneously for effective reinforcement learning. 3) NEAT assigns a unique innovation number to each structural innovation so that useful structural innovations are preserved for future learning. 4) NEAT adopts a strategy of evolving increasingly complicated NNs starting from the simplest structures. These properties are very important for our feature learning tasks.

Goals: Motivated by this understanding, the overall goal of this research is to develop a new algorithm (NEAT+AC) based on NEAT and the Regular-gradient Actor-Critic (RAC) algorithm [2]. Through the seamless integration of NEAT and AC, we can learn good state features and, in the meantime, use the learned features to identify desirable policies.

2 NEAT+AC
As seen in Figure 1, our NEAT+AC algorithm consists of four phases: initialization, evolution, evaluation, and termination.
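To make this four-phase structure concrete, the following minimal Python sketch outlines how the overall loop could be organised. The function names (init_population, evaluate_with_rac, should_terminate, evolve) are hypothetical placeholders rather than part of the original algorithm description; the individual phases are detailed below.

```python
def neat_plus_ac(pop_size, max_generations, init_population, evolve,
                 evaluate_with_rac, should_terminate):
    """A minimal sketch of the four-phase NEAT+AC loop (initialization,
    evolution, evaluation, termination). All callables are assumed to be
    supplied by the surrounding NEAT and RAC implementations."""
    # Initialization: a fixed-size population of random individuals.
    population = init_population(pop_size)
    best = None
    for generation in range(max_generations):
        # Evaluation: run RAC on each individual to obtain its fitness.
        for individual in population:
            individual.fitness = evaluate_with_rac(individual)
        best = max(population, key=lambda ind: ind.fitness)
        # Termination: stop once the predefined condition is met.
        if should_terminate(generation, best):
            break
        # Evolution: apply NEAT crossover/mutation to produce offspring.
        population = evolve(population)
    # The best individual carries the evolved feature network and its policy.
    return best
```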

Initialization: Similar to the standard NEAT, NEAT+AC starts with a population containing a fixed number of randomly generated individuals. However, each individual differs from those in standard NEAT: it is composed of three main parts, namely a NN, an actor, and a critic.
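As an illustration only (the paper does not prescribe an implementation), such an individual could be represented roughly as below; the field names and the use of linear actor/critic parameters over the evolved features are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Individual:
    """Sketch of a NEAT+AC individual: a NEAT-evolved feature network
    plus the parameters of the RAC actor and critic."""
    feature_genome: object       # NEAT genome encoding phi(s)
    actor_weights: np.ndarray    # policy parameters defined over the features
    critic_weights: np.ndarray   # value-function parameters over the features
    fitness: float = 0.0

def random_individual(num_features, num_action_params, rng):
    """Create one randomly initialised individual (hypothetical helper)."""
    return Individual(
        feature_genome=None,  # would be a minimal NEAT genome in practice
        actor_weights=rng.normal(scale=0.1,
                                 size=(num_action_params, num_features)),
        critic_weights=np.zeros(num_features),
    )
```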

Evolution: Aiming to search for good state features, we use the standard evolutionary operators defined in [6], including crossover and mutation, to evolve solely the state feature extraction function $\phi(\vec{s})$, as sketched below.
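The sketch illustrates this design choice: NEAT's variation operators are applied only to the genome encoding $\phi(\vec{s})$, while the actor and critic parameters are left to RAC. The operator signatures are assumptions, and the speciation and innovation-number bookkeeping of the full NEAT algorithm [6] are omitted.

```python
import copy
import random

def evolve_features_only(population, crossover, mutate, elite_fraction=0.1):
    """Sketch: produce the next generation by varying only the NEAT genomes
    that encode phi(s); the actor and critic weights are not touched here
    because RAC adjusts them during evaluation."""
    ranked = sorted(population, key=lambda ind: ind.fitness, reverse=True)
    elites = ranked[:max(1, int(elite_fraction * len(ranked)))]
    offspring = [copy.deepcopy(e) for e in elites]
    while len(offspring) < len(population):
        # Pick two parents from the better half of the population.
        mother, father = random.sample(ranked[:len(ranked) // 2], 2)
        child = copy.deepcopy(mother)
        child.feature_genome = mutate(
            crossover(mother.feature_genome, father.feature_genome))
        offspring.append(child)
    return offspring
```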


Figure 1: The proposed NEAT+AC algorithm. (Diagram omitted: it depicts the initial population; an individual composed of a neural network, an actor, and a critic; the interaction with the environment via state features, actions, rewards, and the TD error; the termination check; and the selection of the best individual.)

Evaluation: In this phase, we compute the fitness value of each individual in the hope of discovering good features and, in the meantime, gradually improving the performance of the corresponding policies. The fitness is defined as the average cumulative reward obtainable through simulation over all $N$ training episodes, i.e., $fitness \leftarrow \tilde{R}_{avg}$, where $\tilde{R}_{avg}$ denotes this average. The policy search is conducted by following RAC.
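A minimal sketch of this evaluation phase is given below. The paper does not list the RAC update equations; the discounted one-step TD form used here is a common textbook variant of actor-critic learning [2, 7], and the helper functions (decode_features, sample_action, grad_log_policy) as well as the environment interface are assumptions.

```python
import numpy as np

def evaluate_with_rac(individual, env, num_episodes, max_steps,
                      decode_features, sample_action, grad_log_policy,
                      alpha_actor=0.001, alpha_critic=0.01, gamma=0.99):
    """Sketch of the evaluation phase: run RAC for a fixed number of episodes
    using the individual's evolved feature network phi(s), and return the
    average cumulative reward over all episodes as the fitness."""
    phi = decode_features(individual.feature_genome)  # raw state -> features
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        cumulative = 0.0
        for _ in range(max_steps):
            features = phi(state)
            action = sample_action(individual.actor_weights, features)
            next_state, reward, done = env.step(action)
            next_features = phi(next_state)
            # Critic: one-step TD error computed on the learned features.
            v = individual.critic_weights @ features
            v_next = 0.0 if done else individual.critic_weights @ next_features
            td_error = reward + gamma * v_next - v
            individual.critic_weights += alpha_critic * td_error * features
            # Actor: regular (vanilla) policy-gradient step scaled by the TD error.
            individual.actor_weights += (
                alpha_actor * td_error *
                grad_log_policy(individual.actor_weights, features, action))
            cumulative += reward
            state = next_state
            if done:
                break
        returns.append(cumulative)
    # Fitness: average cumulative reward over all training episodes.
    return float(np.mean(returns))
```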

Termination: Since NEAT+AC consists of two components (NEAT and RAC), each of them has its own termination condition. The feature learning process terminates either when the predefined maximum number of generations is reached, or when the highest fitness value has not improved for 50 consecutive generations. Meanwhile, the RAC learning process terminates when the maximum number of training episodes is reached.
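As a small illustration (all names are ours), the two NEAT-side stopping rules could be combined as follows; the RAC-side rule is simply a fixed episode budget inside the evaluation loop.

```python
def make_termination_check(max_generations, stagnation_limit=50):
    """Sketch of the NEAT-side termination rule: stop after the maximum
    number of generations, or when the best fitness has not improved for
    `stagnation_limit` consecutive generations."""
    state = {"best": float("-inf"), "stagnant": 0}

    def should_terminate(generation, best_individual):
        if best_individual.fitness > state["best"]:
            state["best"] = best_individual.fitness
            state["stagnant"] = 0
        else:
            state["stagnant"] += 1
        return (generation + 1 >= max_generations or
                state["stagnant"] >= stagnation_limit)

    return should_terminate
```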

3 EXPERIMENTAL RESULTS
To verify any significant performance difference, we conduct 30 independent runs of both learning algorithms on each benchmark problem. In these runs, the population size and the number of generations are both set to 100. While evaluating any individual in a single generation, 5000 training episodes are performed, each containing 200 steps. After every generation, 50 independent tests are conducted to verify the learning effectiveness of the evolved NN with the highest fitness.
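For reference, these settings can be collected into a single configuration; the dictionary keys below are ours, while the values are taken directly from the text.

```python
EXPERIMENT_CONFIG = {
    "independent_runs": 30,               # runs per algorithm per benchmark
    "population_size": 100,
    "generations": 100,
    "training_episodes": 5000,            # per individual evaluation
    "steps_per_episode": 200,
    "test_episodes_per_generation": 50,   # independent tests of the best NN
}
```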

Promising experimental results have been collected on the Mountain Car problem [7] and the Cart Pole problem [7]. Figure 2 shows that NEAT+AC performs significantly better than NEAT on the Cart Pole problem. This is supported by a statistical test with a p-value of $1.18366 \times 10^{-11}$. The Mountain Car problem, on the other hand, is much harder for NEAT+AC than for NEAT, as NEAT considers only three optimal actions, while NEAT+AC must learn to select suitable actions from a continuous range. Nevertheless, Figure 3 shows that NEAT+AC performs comparably with NEAT. Both algorithms reach the optimal solution very quickly (within 10 generations).

4 CONCLUSIONS
In this paper, we have successfully achieved the research goal of evolving useful NNs as feature extractors which accept raw state information as their input and produce a vector of state features that is subsequently utilized by RAC to learn desirable policies. The experimental results clearly show that NEAT+AC is an effective algorithm for reinforcement learning. Meanwhile, NEAT+AC is purposefully designed to ensure that every newly evolved NN will always be trained for the same number of episodes, starting from identical initial settings.

Figure 2: Average Balancing Steps Per Generation on the Cart Pole problem. (Plot omitted: x-axis, Number of Generations, 0-100; y-axis, Average steps, 0-140; curves for NEAT and NEAT+AC.)

Figure 3: Average Steps Per Generation on the Mountain Car problem. (Plot omitted: x-axis, Number of Generations, 0-100; y-axis, Average steps, 0-1000; curves for NEAT and NEAT+AC.)

In view of this fact, the steady improvement of learning performance during the evolutionary process, as witnessed in our experiments, confirms that NEAT+AC is capable of learning useful state features embodied in NNs.

There is ample room for future research. We plan to conduct more comprehensive experiments involving a wide range of benchmark problems to fully understand the efficacy of NEAT+AC. We will also study the possibility of exploiting other cutting-edge reinforcement learning algorithms under the same NEAT-based learning framework.

REFERENCES
[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798-1828.
[2] Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. 2009. Natural actor-critic algorithms. Automatica 45, 11 (2009), 2471-2482.
[3] George Konidaris, Sarah Osentoski, and Philip Thomas. 2011. Value function approximation in reinforcement learning using the Fourier basis. Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (2011), 380-385.
[4] Ishai Menache, Shie Mannor, and Nahum Shimkin. 2005. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research 134, 1 (2005), 215-238.
[5] Ronald Parr, Christopher Painter-Wakefield, and Lihong Li. 2007. Analyzing feature generation for value-function approximation. Proceedings of the 24th International Conference on Machine Learning (ICML) (2007), 737-744.
[6] Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99-127.
[7] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
[8] M. Wiering and M. van Otterlo. 2012. Reinforcement Learning: State-of-the-Art. Springer Berlin Heidelberg.
