
2038 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 12, DECEMBER 2013

Goal Representation Heuristic Dynamic Programming on Maze Navigation

Zhen Ni, Haibo He, Senior Member, IEEE, Jinyu Wen, Member, IEEE, and Xin Xu, Senior Member, IEEE

Abstract— Goal representation heuristic dynamic programming (GrHDP) is proposed in this paper to demonstrate online learning in the Markov decision process. In addition to the (external) reinforcement signal in the literature, we develop an adaptive internal goal/reward representation for the agent with the proposed goal network. Specifically, we keep the actor-critic design in heuristic dynamic programming (HDP) and include a goal network to represent the internal goal signal, to further help the value function approximation. We evaluate our proposed GrHDP algorithm on two 2-D maze navigation problems, and later on one 3-D maze navigation problem. Compared to the traditional HDP approach, the learning performance of the agent is improved with our proposed GrHDP approach. In addition, we also include the learning performance of two other reinforcement learning algorithms, namely Sarsa(λ) and Q-learning, on the same benchmarks for comparison. Furthermore, to demonstrate the theoretical guarantee of our proposed method, we provide a characteristics analysis toward the convergence of the weights in the neural networks of our GrHDP approach.

Index Terms— Adaptive dynamic programming, goal representation heuristic dynamic programming, maze navigation/path planning, Markov decision process, reinforcement learning.

I. INTRODUCTION

MARKOV decision process (MDP) is a long-standing research topic in decision-making models for stochastic processes [1]. Maze navigation is a typical MDP problem, in which the control action depends only on the current state rather than on previous states. A value function (or state-action pair value) is introduced to evaluate how good a given state is for the agent. If the state and action spaces are finite, the process is called a finite MDP, which is particularly important to the theory of reinforcement learning (RL) [2].

Manuscript received December 21, 2012; revised May 3, 2013; accepted June 18, 2013. Date of publication July 22, 2013; date of current version November 1, 2013. This work was supported in part by the National Science Foundation under grant CAREER ECCS 1053717, the Army Research Office under grant W911NF-12-1-0378, the NSF-DFG Collaborative Research on "Autonomous Learning" (a supplement grant to CNS 1117314), the National Natural Science Foundation of China under grants 51228701 and 61075072, and the Program for New Century Excellent Talents in University under grant NCET-10-0901.

Z. Ni and H. He are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]; [email protected]).

J. Wen is with the State Key Laboratory of Advanced Electromagnetic Engineering and Technology, School of Electrical and Electronic Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).

X. Xu is with the College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2013.2271454

For instance, given any state and action, x and u, the probability of each possible next state x′ is defined as

P^u_{xx'} = Pr{x_{t+1} = x' | x_t = x, u_t = u}.   (1)

The expected value of the next reward is defined as

R^u_{xx'} = E{r_{t+1} | x_t = x, u_t = u, x_{t+1} = x'}.   (2)

In the optimal control area [3]–[5], Bellman's optimality principle suggests that an optimal policy can be built for the tail subproblem involving the last stage and then extended backward until the optimal strategy is built for the entire process. With the notation in (1) and (2), Bellman's optimality equation for Q^* can be written as

Q^*(x, u) = Σ_{x'} P^u_{xx'} [ R^u_{xx'} + γ max_{u'} Q^*(x', u') ]   (3)

where Q^*(x, u) refers to the value function of the current state x and Q^*(x', u') refers to the value function of the possible next state x'.
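As a concrete illustration (not taken from the paper), the Bellman optimality backup in (3) can be iterated to a fixed point on a toy finite MDP. The Python sketch below assumes a small deterministic three-state chain with an invented reward of 1 for the single goal-reaching transition; the tables P and R are illustrative only.

```python
import numpy as np

# Toy deterministic chain with states {0, 1, 2} and actions {0: stay, 1: right}.
# P[u][x] gives the next state and R[u][x] the expected reward, in the sense of (1)-(2).
P = {0: [0, 1, 2], 1: [1, 2, 2]}
R = {0: [0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
gamma = 0.95

Q = np.zeros((3, 2))
for _ in range(500):                      # repeated Bellman optimality backups, eq. (3)
    Q_new = np.array([[R[u][x] + gamma * Q[P[u][x]].max() for u in (0, 1)]
                      for x in range(3)])
    if np.abs(Q_new - Q).max() < 1e-9:    # stop once the backup reaches a fixed point
        Q = Q_new
        break
    Q = Q_new

print(Q)  # e.g., Q[1, 1] = 1.0: moving right from state 1 collects the reward immediately
```

Because the toy chain is deterministic, the sum over x′ in (3) collapses to a single term; the maze benchmarks considered later are deterministic in the same sense.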

In past decades, RL, especially Q-learning and temporal difference (TD) learning, was employed to solve Bellman's equation in MDPs. For instance, in [6], a robust RL approach, basically a Q-learning approach, was proposed to help the agent find the optimal control policy with minimum cost. In addition, the Dyna-Q learning architecture was later introduced in [7]. This architecture integrates trial-and-error (reinforcement) learning and execution-time planning into a single process that operates alternately on the world/maze and on a learned model of the world/maze. In addition, TD(λ) was also developed to improve the convergence speed of solving MDP problems in [1], [8], [9].

In recent years, adaptive dynamic programming (ADP) has demonstrated the capability to find the optimal control policy over time and solve Bellman's equation in a principled way. More recently, high-level discussions of ADP [10]–[12] also implied that ADP approaches are able to learn and optimize the control policy over time and find the solution of Bellman's optimality equation efficiently. Heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), and globalized dual heuristic dynamic programming (GDHP), together with their action-dependent versions, were proposed in [13], [14] to seek the optimal policy (the solution of Bellman's equation). The online model-free HDP was developed in [15]–[17], where the authors took advantage of the potential scalability of adaptive critic designs and the intuitiveness of Q-learning. It is also an online learning scheme that simultaneously updates the value function and the control policy.



Fig. 1. Diagram of the interaction between the agent (left block) and the environment (right block). The agent obtains the state from the environment, executes the corresponding action, and also receives the reward based on this action from the environment.

The model-based HDP has also been proposed, with a rigorous convergence proof, to solve the optimal control problem for discrete-time nonlinear systems [18]. For model-based DHP/GDHP design, the authors in [19], [20] proposed that efficient learning can be achieved with different weights for different error terms on the auto-lander helicopter problem. In [21]–[23], the authors also demonstrated the convergence analysis for model-based DHP/GDHP in terms of the cost function and control law. In addition, the Levenberg-Marquardt method was proposed to be integrated into the ADP design, to improve the online learning of both the action network and the critic network from the algorithm viewpoint [24]–[26]. In [27]–[29], the authors proposed to improve the online learning of the ADP design from the aspect of framework design, namely with three-network/dual-critic HDP designs. Furthermore, the hierarchical HDP design, namely with multiple reference/goal networks, was proposed to show the significant difference with respect to the control performance on numerical balancing benchmarks [30]–[32]. The real-time simulation with these three-network and dual-critic HDP designs was also developed and demonstrated on a virtual reality platform in [33]–[35].

The typical MDP benchmark, namely the maze navigation problem, was tested with adaptive-critic designs in a closed-loop form with a simultaneous recurrent neural network (SRN) in [36], [37]. This learning process was later improved by integrating a cellular SRN and a Kalman filter into the ADP design in [38]–[40]. On the other side, comparisons among classical Q-learning, Sarsa(λ), the conventional actor-critic design, and the proposed QV-learning on the maze navigation benchmark were provided in [41]. All of these approaches provided important insights and techniques for the learning-based maze navigation problem. Although recent development of ADP research has demonstrated many critical applications across different domains, it is widely recognized that the maze navigation problem remains a significant challenge for the community [36]–[42].

Motivated by our previous work on three-network ADP [27] and the dual-critic HDP design [28], we further investigate the goal representation idea on the general MDP problem. That is, we propose goal representation heuristic dynamic programming (GrHDP) to tackle this challenge by developing a powerful internal reward signal for the learning system. In general, an online model-free ADP design only includes an action network and a critic network, and adopts either a discrete reward signal (e.g., 0 or −1) or a fixed reward/utility function over time. It is desirable to find a general reward function that can be adaptively tuned according to possible changes of the environment/system. Here we propose to integrate one additional network, namely the goal network, into the online model-free HDP design, to provide an informative internal reward/goal signal that can be adaptively tuned and updated according to the system state over time. Specifically, we build a general mapping between the system state (plus the control action) and the internal reward/goal signal, hence we can update the internal reward/goal adaptively corresponding to the system state over time. The advantage of this design is that when there is not sufficient information to define the reward/utility function at the very beginning, our proposed GrHDP can start from scratch and learn the proper mapping online. The key objective of our method is to use the goal network to adaptively build an internal goal/reward signal to guide the system's decision-making process. To demonstrate the learning performance of our GrHDP approach, we test both the GrHDP and HDP approaches on 2-D maze navigation problems (e.g., maze sizes 5*5 and 16*16) and later on a 3-D maze navigation problem (e.g., maze size 5*5*5), with the same parameter and environment settings (including the same learning rates, internal stopping criteria, initial weights, and so on). Meanwhile, we also include the solutions with classical Q-learning from [43] and Sarsa(λ) from [2] for comparison. The learning curves show that GrHDP and HDP can learn the value table online faster than the classical approaches.


Compared with the HDP approach, our proposed GrHDP approach not only shows a faster convergence speed but also achieves a lower sum of squared errors in the end. In addition, we provide both the value table learned with the GrHDP approach and the reference value table so that others can easily verify and follow our results. A surface plot of the learned value table is provided to visualize the results learned by our proposed approach. In addition, we also discuss the characteristics analysis toward the convergence of the weights in the neural networks, to show the theoretical analysis of our proposed GrHDP approach on the maze navigation benchmark.

The rest of this paper is organized as follows. Section II shows the architecture design of our proposed GrHDP agent on the maze navigation problem, and also provides the learning algorithms for the goal network, critic network, and action network, respectively. Two 2-D maze navigation benchmarks and one additional 3-D maze navigation benchmark are tested with the GrHDP, HDP, Sarsa(λ), and Q-learning approaches under the same environment settings in Sections III-C to III-E. Simulation results are compared with learning curves and discussed in the same section. In Section IV, we provide the characteristics analysis toward the convergence of our proposed controller. Finally, the conclusion is provided in Section V. The implementation-level pseudocode (Algorithm 2) is also included in the Appendix.

II. ONLINE LEARNING OF GRHDP

The interaction between our proposed GrHDP agent and the maze/environment is shown in Fig. 1, where one can see that the agent observes the system state from the environment and provides the action based on the current state. The corresponding reward will be provided by the environment based on the performance of the action. In the agent block, we keep a design similar to the traditional HDP in [15]. That is to say, we adopt a model-free action-dependent design for our GrHDP and also use online learning for the neural networks in our agent. Different from the traditional reward assignment in maze navigation (i.e., 0 for a regular movement before reaching the goal location and 1 for reaching the goal), our proposed GrHDP design integrates a goal network to learn from the external reward r, and provides the critic network with a detailed internal reward s instantly. In this paper, we will demonstrate the learning and control capability of such an architecture on the challenging 2-D maze navigation problem, and further on the 3-D maze navigation problem.

We employ the multilayer perceptron (MLP) structure for all the neural networks used in our proposed approach. To closely connect the critic network with the goal network, we include the internal reward s in the inputs of the critic network. Therefore, the inputs of the goal network and the critic network can be denoted as xg = [X, u] and xc = [X, u, s], respectively, while the input of the action network is still the current state vector. In the following part of this section, we will first introduce the error (objective) functions of these three networks, and then discuss the online learning rules with their MLP structures, respectively.

The motivation of this design is to introduce the goal network to approximate the discounted total future external reward with the internal reward signal s, in the hope that this internal reward s can help the value function approximation. Therefore, we can define the internal goal s as

s(k) = r(k+1) + γ r(k+2) + γ^2 r(k+3) + · · ·   (4)

where γ is the discount factor, r is the (external) reward signal as in [15], [17], [27], [44], and r(k+1), r(k+2), r(k+3), . . . are the future reward signals. Therefore, the error function for the goal network is defined as

e_g(k) = γ s(k) − [s(k−1) − r(k)].   (5)

The objective function that we need to minimize can be written as

E_g(k) = (1/2) e_g^2(k).   (6)

The critic network here is to approximate the discounted total future internal reward s with the value function J. Therefore, the value function J can be written as

J(k) = s(k+1) + γ s(k+2) + γ^2 s(k+3) + · · ·   (7)

Following the error definition in (5), we can define the error function for the critic network as

e_c(k) = γ J(k) − [J(k−1) − s(k)].   (8)

Here the internal goal/reward s is applied to the critic network rather than the (external) reward signal r as in the literature. The error function (8) is therefore different from those in the literature. The objective function for the critic network can then be written as

E_c(k) = (1/2) e_c^2(k).   (9)

The error function for the action network is defined as

e_a(k) = J(k) − U_c.   (10)

E_a(k) = (1/2) e_a^2(k)   (11)

where U_c is the ultimate utility function. The value of U_c is critical in the ADP design and can vary in different applications.
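To make the roles of the three residuals concrete, the short Python sketch below (not from the paper) writes out (5), (8), and (10) directly; the default values γ = 0.95 and U_c = 1 follow the simulation settings reported in Section III, and the function names are illustrative.

```python
def goal_error(s_k, s_prev, r_k, gamma=0.95):
    """e_g(k) = gamma * s(k) - [s(k-1) - r(k)], eq. (5): the internal goal s acts as a
    discounted predictor of the external reward sequence."""
    return gamma * s_k - (s_prev - r_k)

def critic_error(J_k, J_prev, s_k, gamma=0.95):
    """e_c(k) = gamma * J(k) - [J(k-1) - s(k)], eq. (8): J predicts the discounted total
    future *internal* reward s rather than the external reward r."""
    return gamma * J_k - (J_prev - s_k)

def action_error(J_k, Uc=1.0):
    """e_a(k) = J(k) - Uc, eq. (10): the action network drives J toward the ultimate
    utility Uc."""
    return J_k - Uc
```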

A. Goal Representation

For maze navigation problems, the instant reward is normally assigned to be 0 unless the agent reaches the goal location [2], [43]. In recent years, however, there has been growing attention to whether any improvement can be obtained if a nonzero instant reward is assigned to the agent during the learning process [29]. In the literature, different reward/cost functions can be defined for different applications. However, such reward/cost functions are strongly domain-oriented, and it is therefore difficult to define such a proper function in general. In this paper, we propose to build a general mapping with a neural network to represent s = f(X, u) and integrate such a network into the ADP design framework.

Fig. 2. MLP structure of the goal network. The sigmoid function is applied to both the hidden nodes and the output node.

The inner structure of the goal network is provided in Fig. 2, where one can see that the input vector is xg = [X, u] and the output is the internal reward s. The system state vector is X = [x_1, x_2, ..., x_n], where n is the number of state elements, and the control action is u = [u_1, u_2, ..., u_m], where m is the number of action elements. The internal goal signal s is a scalar. The sigmoid function is defined as

φ(x) = (1 − e^{−x}) / (1 + e^{−x})   (12)

to constrain the output to [−1, 1]. Here the sigmoid function is applied to all hidden nodes and the output node, as shown by the nodes with both a summation sign and a sigmoid sign in Fig. 2. The forward path of the goal network is given as follows:

s(k) = φ(l(k))
l(k) = Σ_{i=1}^{N_gh} ω^{(2)}_{g,i}(k) y_i(k)
y_i(k) = φ(z_i(k)), i = 1, ..., N_gh
z_i(k) = Σ_{j=1}^{n} ω^{(1)}_{g,i,j}(k) x_j(k) + Σ_{j=1+n}^{m+n} ω^{(1)}_{g,i,j}(k) u_{j−n}(k)   (13)

where z_i and y_i refer to the input and the output of the i-th hidden node, and l is the input of the output node. ω^{(1)}_g and ω^{(2)}_g denote the weights from the input to the hidden layer and from the hidden to the output layer in the goal network, respectively. N_gh is the number of hidden nodes in the goal network. Fig. 2 includes all the notations of these parameters.
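As a minimal sketch (not the authors' implementation), the forward path (12)–(13) can be written directly in NumPy; the layer sizes and random weights below are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Bipolar sigmoid of eq. (12); its output lies in [-1, 1]."""
    return (1 - np.exp(-x)) / (1 + np.exp(-x))

def goal_forward(X, u, Wg1, Wg2):
    """Forward path of the goal network, eq. (13).
    X: state (n,), u: action (m,), Wg1: (N_gh, n+m) input-to-hidden weights,
    Wg2: (N_gh,) hidden-to-output weights."""
    xg = np.concatenate([X, u])     # x_g = [X, u]
    z = Wg1 @ xg                    # hidden inputs z_i
    y = phi(z)                      # hidden outputs y_i
    l = Wg2 @ y                     # input of the output node
    return phi(l)                   # internal goal s(k)

# Example with assumed sizes n = 2, m = 4 and small random weights in [-0.3, 0.3]
rng = np.random.default_rng(0)
s = goal_forward(rng.uniform(0, 2, 2), rng.uniform(-1, 1, 4),
                 rng.uniform(-0.3, 0.3, (6, 6)), rng.uniform(-0.3, 0.3, 6))
```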

We adopt the gradient descent method to minimize the approximation error in (6) as

ω_g(k+1) = ω_g(k) − η_g(k) [∂E_g(k)/∂ω_g(k)]   (14)

where η_g is the learning rate of the goal network, which is adaptively tuned during the learning process. According to the chain rule of backpropagation, we can derive the weight-updating formula from the hidden to the output layer as

∂E_g(k)/∂ω^{(2)}_{g,i}(k) = [∂E_g(k)/∂s(k)] [∂s(k)/∂l(k)] [∂l(k)/∂ω^{(2)}_{g,i}(k)].   (15)

Fig. 3. MLP structure of the critic network. The sigmoid function is applied only to the hidden nodes.

The weights from input to hidden layer are then updated as

∂E_g(k)/∂ω^{(1)}_{g,i,j}(k) = [∂E_g(k)/∂s(k)] [∂s(k)/∂l(k)] [∂l(k)/∂y_i(k)] [∂y_i(k)/∂z_i(k)] [∂z_i(k)/∂ω^{(1)}_{g,i,j}(k)].   (16)

In this paper, we tune the weights in the order of goal network, critic network, and action network. After the weights in the goal network are tuned, we fix them thereafter and start to tune the weights in the critic network.
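For illustration, one gradient-descent step of (14)–(16) on the goal network can be sketched as below (a simplified rendering under assumed layer sizes, not the authors' code); the chain-rule factors mirror (15) and (16), using φ′(x) = (1 − φ(x)²)/2 for the bipolar sigmoid (12). The critic and action networks are updated analogously with (18)–(20) and (22)–(24).

```python
import numpy as np

def phi(x):                          # bipolar sigmoid, eq. (12)
    return (1 - np.exp(-x)) / (1 + np.exp(-x))

def dphi(x):                         # phi'(x) = (1 - phi(x)^2) / 2
    p = phi(x)
    return 0.5 * (1 - p * p)

def goal_update(xg, s_prev, r, Wg1, Wg2, eta_g, gamma=0.95):
    """One gradient-descent step on E_g(k) = 0.5 * e_g(k)^2, eqs. (5), (14)-(16)."""
    z = Wg1 @ xg                                     # forward pass, eq. (13)
    y = phi(z)
    l = Wg2 @ y
    s = phi(l)
    e_g = gamma * s - (s_prev - r)                   # eq. (5)
    dE_dl = e_g * gamma * dphi(l)                    # dE_g/ds * ds/dl
    grad_Wg2 = dE_dl * y                             # eq. (15)
    grad_Wg1 = np.outer(dE_dl * Wg2 * dphi(z), xg)   # eq. (16)
    Wg2 = Wg2 - eta_g * grad_Wg2                     # eq. (14)
    Wg1 = Wg1 - eta_g * grad_Wg1
    return Wg1, Wg2, 0.5 * e_g ** 2

# Example call with assumed sizes n + m = 6 and the learning rate from Section III-B
rng = np.random.default_rng(0)
Wg1, Wg2 = rng.uniform(-0.3, 0.3, (6, 6)), rng.uniform(-0.3, 0.3, 6)
Wg1, Wg2, Eg = goal_update(rng.uniform(0, 2, 6), s_prev=0.0, r=0.0,
                           Wg1=Wg1, Wg2=Wg2, eta_g=0.012)
```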

B. Value Function Approximation

The critic network in our design is different from the designs in the literature [15], [17], [45]. We include one more input, the internal reward signal s, for the critic network, with the aim of helping the value function approximation. Here, we only apply the sigmoid function to the hidden nodes. The neural network structure of the critic network is shown in Fig. 3 and the forward path of the critic network is given as follows:

J(k) = Σ_{i=1}^{N_ch} ω^{(2)}_{c,i}(k) p_i(k)
p_i(k) = φ(q_i(k)), i = 1, ..., N_ch
q_i(k) = Σ_{j=1}^{n} ω^{(1)}_{c,i,j}(k) x_j(k) + Σ_{j=1+n}^{m+n} ω^{(1)}_{c,i,j}(k) u_{j−n}(k) + ω^{(1)}_{c,i,(m+n+1)}(k) s(k).   (17)

As shown in Fig. 3, q_i and p_i refer to the input and output of the i-th hidden node. ω^{(1)}_c and ω^{(2)}_c denote the weights from the input to the hidden layer and from the hidden to the output layer, respectively, and N_ch is the number of hidden nodes. Only the hidden nodes of the critic network use the sigmoid function. We adopt the gradient descent method to minimize the approximation error in (9) as

ω_c(k+1) = ω_c(k) − η_c(k) [∂E_c(k)/∂ω_c(k)].   (18)

The chain rule of backpropagation is applied to obtain the weight tuning from the hidden to the output layer as

∂E_c(k)/∂ω^{(2)}_{c,i}(k) = [∂E_c(k)/∂J(k)] [∂J(k)/∂ω^{(2)}_{c,i}(k)].   (19)


Fig. 4. MLP structure of the action network. The sigmoid function is applied to both the hidden nodes and the output nodes. u_1, ..., u_m refer to the m possible directions that the agent could take. The WTA method is applied to the output nodes to decide the direction.

The weights from input to the hidden layer are tuned as

∂E_c(k)/∂ω^{(1)}_{c,i,j}(k) = [∂E_c(k)/∂J(k)] [∂J(k)/∂p_i(k)] [∂p_i(k)/∂q_i(k)] [∂q_i(k)/∂ω^{(1)}_{c,i,j}(k)].   (20)

For the maze navigation benchmark, the Q-value is closely associated with the state-action pair, which defines the policy for the agent's decision-making process. Here, we adopt the critic network to approximate the Q reference value of each state-action pair with J. After the weights in the critic network are tuned under certain conditions, we fix them thereafter and start to tune the weights in the action network.
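For reference, the critic forward path (17), with the internal goal s appended to the input and a linear output node, can be sketched as follows (an illustrative rendering with assumed sizes, not the authors' code):

```python
import numpy as np

def phi(x):                                    # bipolar sigmoid, eq. (12)
    return (1 - np.exp(-x)) / (1 + np.exp(-x))

def critic_forward(X, u, s, Wc1, Wc2):
    """Forward path of the critic network, eq. (17); only the hidden nodes use the
    sigmoid, and the output node is linear."""
    xc = np.concatenate([X, u, [s]])           # x_c = [X, u, s]
    q = Wc1 @ xc                               # hidden inputs q_i
    p = phi(q)                                 # hidden outputs p_i
    return Wc2 @ p                             # J(k)

# Example with assumed sizes n = 2, m = 4, N_ch = 6
rng = np.random.default_rng(1)
J = critic_forward(rng.uniform(0, 2, 2), rng.uniform(-1, 1, 4), 0.1,
                   rng.uniform(-0.3, 0.3, (6, 7)), rng.uniform(-0.3, 0.3, 6))
```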

C. Control Policy Optimization

The MLP structure of the action network is shown in Fig. 4, where one can see that the action network is actually a multi-input multi-output neural network. The objective of the action network is to minimize the total future cost by seeking the optimal control policy. The sigmoid function is applied to all hidden nodes and output nodes. We adopt the winner-take-all (WTA) method on the outputs of the action network to determine the direction for the agent. The forward path of the action network is given as follows:

u(k) = φ(v(k))
v(k) = Σ_{i=1}^{N_ah} ω^{(2)}_{a,i}(k) g_i(k)
g_i(k) = φ(h_i(k)), i = 1, ..., N_ah
h_i(k) = Σ_{j=1}^{n} ω^{(1)}_{a,i,j}(k) x_j(k).   (21)

The notations in (21) can be directly obtained from Fig. 4, where one can see that h_i and g_i are the input and output of the i-th hidden node. ω^{(1)}_a and ω^{(2)}_a refer to the weights from the input to the hidden layer and from the hidden to the output layer, respectively, and N_ah is the number of hidden nodes in the action network.
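The WTA step mentioned above simply executes the direction whose output node is largest; a minimal sketch (index-to-direction mapping as in Section III-C, i.e., up, down, left, right) is:

```python
import numpy as np

def wta_direction(u_outputs):
    """Winner-take-all over the action-network outputs u_1, ..., u_m:
    return the index of the direction with the largest output."""
    return int(np.argmax(u_outputs))

# e.g., with m = 4 directions ordered as (up, down, left, right) in the 2-D maze
direction = wta_direction(np.array([0.1, -0.4, 0.7, 0.2]))   # -> 2, i.e., "left"
```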

We also adopt the gradient descent method to minimize the error function in (11) as

ω_a(k+1) = ω_a(k) − η_a(k) [∂E_a(k)/∂ω_a(k)].   (22)

As above, we adopt the chain rule to obtain the weight update from the hidden to the output layer as

∂E_a(k)/∂ω^{(2)}_{a,i}(k) = [∂E_a(k)/∂J(k)] [∂J(k)/∂u(k)] [∂u(k)/∂v(k)] [∂v(k)/∂ω^{(2)}_{a,i}(k)]   (23)

The weights from input to hidden layer are tuned as

∂E_a(k)/∂ω^{(1)}_{a,i,j}(k) = [∂E_a(k)/∂J(k)] [∂J(k)/∂u(k)] [∂u(k)/∂v(k)] [∂v(k)/∂g_i(k)] [∂g_i(k)/∂h_i(k)] [∂h_i(k)/∂ω^{(1)}_{a,i,j}(k)].   (24)

An online learning episode is regarded as completed once the errors of the goal network, critic network, and action network all drop below their thresholds, or the iterations of these three networks reach the maximum internal iteration number.

We would like to note that there are two possible ways to tune the parameters of the action network. The first is as discussed in this section, i.e., (23) and (24). In this way, we only consider the backpropagation path from the J function to the control signal u through the critic network. From Fig. 1 we can see that the control signal u also goes through the goal network and produces the s signal, which in turn goes through the critic network and affects the J function. This means we could also add this additional path to the backpropagation chain when updating the weights of the action network. This is the second approach to train the action network. To do so, we would need to modify (23) and (24) to add the additional error contribution arising from the path u → s → J, so that (23) and (24) would each contain the summation of two terms. In this paper, we implement the algorithm based on the first option to train the action network.
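To make the two options concrete, the self-contained sketch below (illustrative sizes and random weights, not the authors' code) computes ∂J/∂u through the critic only, and then adds the extra contribution of the u → s → J path; the paper's implementation uses only the first term.

```python
import numpy as np

def phi(x):                          # bipolar sigmoid, eq. (12)
    return (1 - np.exp(-x)) / (1 + np.exp(-x))

def dphi(x):                         # phi'(x) = (1 - phi(x)^2) / 2
    p = phi(x)
    return 0.5 * (1 - p * p)

rng = np.random.default_rng(0)
n, m, Ngh, Nch = 2, 4, 6, 6                       # assumed sizes
Wg1 = rng.uniform(-0.3, 0.3, (Ngh, n + m))        # goal net, input -> hidden
Wg2 = rng.uniform(-0.3, 0.3, Ngh)                 # goal net, hidden -> output
Wc1 = rng.uniform(-0.3, 0.3, (Nch, n + m + 1))    # critic, input -> hidden (last column: s)
Wc2 = rng.uniform(-0.3, 0.3, Nch)                 # critic, hidden -> output (linear)

X, u = rng.uniform(0, 2, n), rng.uniform(-1, 1, m)

# forward passes, eqs. (13) and (17)
z = Wg1 @ np.concatenate([X, u]); y = phi(z); l = Wg2 @ y; s = phi(l)
q = Wc1 @ np.concatenate([X, u, [s]]); p = phi(q); J = Wc2 @ p

# option 1: backpropagate J -> u through the critic only, as in (23)-(24)
dJ_dq = Wc2 * dphi(q)                             # dJ/dq_i
dJ_du_direct = Wc1[:, n:n + m].T @ dJ_dq          # sum_i dJ/dq_i * dq_i/du_j

# option 2: add the path u -> s -> J through the goal network
dJ_ds = Wc1[:, n + m] @ dJ_dq                     # scalar dJ/ds
ds_du = dphi(l) * (Wg1[:, n:n + m].T @ (Wg2 * dphi(z)))
dJ_du_total = dJ_du_direct + dJ_ds * ds_du
```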

III. SIMULATION RESULTS AND ANALYSIS

A. Design of Maze Navigation Simulation

The 2-D maze navigation is a typical multistage decision problem, which can be solved using the techniques of RL and dynamic programming. In this simulation, we denote the instant reward between any two states x and x′ as r(x, x′). We also assume that there are N possible states in the maze and that the probabilities P_{xx'} in (1) can only take the value 0 or 1 (i.e., the maze navigation problem is a deterministic and finite MDP). Thus, Bellman's equation (3) can be rewritten as

J^*(x, u) = max_u ( r(x, u) + γ Σ_{j=1}^{N} J^*(x', u') )   (25)

where J^*(x, u) is the maximum total reward at state x when taking action u. The objective of this maze navigation task is to employ learning algorithms to learn the value table of the maze online, so that the agent can move in the direction that maximizes the total reward (toward the goal location).

To describe the learning steps of our proposed GrHDP on the maze navigation benchmarks, we provide the flowchart of the entire simulation process in Fig. 5. The simulation steps can be described as follows.


Fig. 5. Simulation setup: flowchart of the GrHDP approach on the maze navigation problem. The value table is updated at the end of each trial. The learning process is terminated when the trial number reaches the maximum trial number.

1) Load the predefined updating sequence. Each updating sequence is assumed to visit all the states enough times.
2) Obtain the output from the action network. Apply the WTA method to decide the direction, and execute it.
3) Obtain the new state from the maze, update the inputs of the goal network, and obtain the internal goal s.
4) Update the inputs of the critic network and obtain the J value. Update the J(x, u) table.
5) Check whether the agent moves out of bounds. If yes, turn to another trial (a punishment is assigned) and load the next initial state to start again; if no, move the agent another step (same trial).
6) Check whether the agent reaches the goal. If yes, turn to another trial (a reward is assigned) and load the next initial state to start again; if no, move the agent another step (same trial).
7) Terminate the entire learning process when the trial number reaches the maximum trial number.

B. Algorithm Implementation

To compare the learning performance of our proposed GrHDP approach with the HDP, Sarsa(λ), and Q-learning approaches, we conduct the simulation on the 2-D maze navigation benchmarks and further on the 3-D maze navigation benchmark. The implementation details are provided as follows:

Algorithm 1 GrHDP on value table learning
1) For each state-action pair (x, u), initialize the table entry J(x, u) to zero
2) Observe the current state x
3) Do forever
4) Obtain action u, decide the direction with WTA, and execute it;
5) Receive an immediate reward r, and observe the new state x′;
6) Obtain the internal goal s and the value function J;
7) Update the weights in the goal network, (14) to (16);
8) Update the weights in the critic network, (18) to (20);
9) Update the weights in the action network, (22) to (24);
10) Update the table entry for J(x, u);
11) x ← x′;

1) Q-learning
The Q-learning algorithm is one of the earliest RL algorithms that provides a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. Here, we implement the Q-learning algorithm based on [43] to build the Q-value table for the maze navigation problem. We set the discount parameter γ to 0.95.

2) Sarsa(λ)
TD learning is a combination of Monte Carlo and dynamic programming ideas. TD can learn directly from raw experience without a model of the environment's dynamics, and it also updates estimates without waiting for the final stage. Here, we implement one of the typical TD learning algorithms, namely Sarsa(λ), based on [2] to build the state-action pairs for the agent in the maze navigation. The parameter settings are: γ = 0.95, λ = 0.9, and α = 0.4.

3) HDP
Online model-free HDP is one of the typical ADP approaches, proposed in [15]. The initial learning parameters are set as η_c = 0.005 and η_a = 0.01, where η_c and η_a refer to the learning rates of the critic network and the action network, respectively. The stopping criteria are N_c = 20, N_a = 30, T_c = 1e−4, and T_a = 1e−4. That is to say, the learning process of the critic/action network is terminated either when the error drops below the threshold T_c/T_a or when the iteration number reaches the threshold N_c/N_a.

4) GrHDP
Our proposed GrHDP is implemented according to the key pseudocode listed in Algorithm 1. The initial parameters for the goal network are η_g = 0.012, T_g = 1e−4, and N_g = 25. For a fair comparison, we also ensure that GrHDP and HDP start with the same initial weights in [−0.3, 0.3]. All other parameters are kept the same as in the HDP approach (the settings of all four algorithms are collected in the sketch after this list).
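For convenience, the parameter settings listed above are collected in one place below; the values simply restate this subsection.

```python
# Parameter settings reported in Section III-B.
params = {
    "Q-learning": {"gamma": 0.95},
    "Sarsa(lambda)": {"gamma": 0.95, "lambda": 0.9, "alpha": 0.4},
    "HDP": {"eta_c": 0.005, "eta_a": 0.01,
            "N_c": 20, "N_a": 30, "T_c": 1e-4, "T_a": 1e-4},
    "GrHDP": {"eta_c": 0.005, "eta_a": 0.01, "eta_g": 0.012,
              "N_c": 20, "N_a": 30, "N_g": 25,
              "T_c": 1e-4, "T_a": 1e-4, "T_g": 1e-4,
              "init_weight_range": (-0.3, 0.3)},
}
```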

C. Simulation Study One

Fig. 6. Diagram of the 2-D maze (5*5) navigation benchmark. The agent receives reward 1 if it reaches the goal, and receives punishment −0.2 if it hits the bound. Otherwise, the agent only gets 0 reward.

We assume that 1) every state in the maze is visited enough times; 2) every action (up, down, left, and right) is taken enough times for each state; and 3) from every initial state, the agent can move an unlimited number of steps unless it reaches the goal or hits the bound. The input of the action network is the current state vector xa = [x_1, x_2], where x_1 is the horizontal coordinate and x_2 is the vertical coordinate. The inputs of the goal network and the critic network are xg = [x_1, x_2, u_1, u_2, u_3, u_4] and xc = [x_1, x_2, u_1, u_2, u_3, u_4, s], respectively, where u_1 is for the direction up, u_2 for down, u_3 for left, and u_4 for right. The diagram of the 2-D maze is presented in Fig. 6 and the external reward r is defined as

r = {  1,    reach the goal
      −0.2,  out of bound
       0,    regular move.   (26)
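A direct transcription of (26) into code could look like the following sketch (1-indexed coordinates are an assumption of this sketch):

```python
def external_reward(next_state, maze_size=(5, 5), goal=(5, 5)):
    """External reward r of eq. (26): 1 for reaching the goal, -0.2 for moving
    out of bound, 0 for a regular move."""
    x1, x2 = next_state
    if (x1, x2) == goal:
        return 1.0
    if not (1 <= x1 <= maze_size[0] and 1 <= x2 <= maze_size[1]):
        return -0.2
    return 0.0
```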

In our simulation, U_c is assigned as 1 and the inputs of the action network are scaled to [0, 2]. We set 10 independent updating sequences for 10 independent runs (i.e., the updating sequence in each run is independent). Each run includes 150 trials (i.e., each updating sequence includes 150 initial starting states) and each trial starts with the initial state loaded from the updating sequence. Each trial is terminated when the agent reaches the goal or hits the bound. Under this setting, the numbers of steps the agent moves in different trials are not necessarily the same. The J(x, u) table is updated after the agent finishes each trial and is then normalized to [0, 1] to show the difference from the reference value table.

Fig. 7. Learning curves with the GrHDP, HDP, Sarsa(λ), and Q-learning approaches in the 5*5 maze navigation. Both the GrHDP and HDP approaches show a faster learning speed than the Sarsa(λ) and Q-learning approaches.

For a fair comparison, our proposed GrHDP approach and the traditional HDP approach are set to start with the same initial weights (uniformly initialized in [−0.3, 0.3]) and the same updating sequence. The learning rates and internal stopping criteria for both approaches are also the same. An adaptive learning rate (ALR) is used in our simulation. The initial learning rates for the action network, critic network, and goal network are set to 1e−2, 5e−3, and 1.2e−2, respectively, and they are halved every 10 trials. Once a learning rate falls below 1e−5, it is kept at 1e−5 thereafter. In addition, we also set a counter for the four actions/directions taken at every state. If, for a specific state, any action (i.e., up, down, left, or right) has been taken more than a preset number of times (10 in this case study), we randomly pick a direction from the remaining choices as the final decision. In this way all directions are tried enough times, to guarantee that the agent can learn from both failure and success trials to achieve the final value table.
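The adaptive learning-rate schedule described above (halved every 10 trials and floored at 1e−5) can be sketched as:

```python
def adaptive_learning_rate(eta0, trial, floor=1e-5, period=10):
    """ALR used in this case study: the initial rate eta0 is halved every `period`
    trials and kept at `floor` once it drops below it."""
    return max(eta0 / (2 ** (trial // period)), floor)

# e.g., the action-network rate 1e-2 after 0, 10, and 100 trials
rates = [adaptive_learning_rate(1e-2, t) for t in (0, 10, 100)]   # [0.01, 0.005, 1e-05]
```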

Similar to the literature [38]–[40], we assign the Q reference value table according to the distance between the current location and the goal. The values of the states right next to the goal (one step away) are assigned as 1, and the values of the other states drop by 1/(L+W) for each additional step, where L and W refer to the length and width of the maze, respectively. In this case, the difference in value between two consecutive states is 0.1. We define the sum of squared errors as

E = (1/2) Σ_{i=1}^{N} (J(i) − Q_ref(i))^2   (27)

where N is the number of states in the maze. As presented in Fig. 6, we set the maze size as 5*5 and the goal is located at [5, 5]. The learning curves (sum of squared errors) are shown in Fig. 7, where the x axis refers to the trial number and the y axis refers to the sum of squared errors. All the curves presented here are the average values over ten independent runs.
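For reproducibility, the reference table and the error measure (27) for the 5*5 maze can be sketched as below; using the Manhattan distance as the step count and assigning the goal state itself the value 1 are assumptions of this sketch.

```python
import numpy as np

L = W = 5
goal = (5, 5)

# Reference value: 1.0 one step away from the goal, dropping by 1/(L+W) = 0.1
# for every additional step.
Q_ref = np.zeros((L, W))
for i in range(1, L + 1):
    for j in range(1, W + 1):
        d = abs(goal[0] - i) + abs(goal[1] - j)
        Q_ref[i - 1, j - 1] = 1.0 - (d - 1) / (L + W) if d >= 1 else 1.0

def sum_squared_error(J_table, Q_ref):
    """Eq. (27): E = 0.5 * sum_i (J(i) - Q_ref(i))^2 over the N maze states,
    with J_table normalized to [0, 1] before the comparison."""
    return 0.5 * np.sum((J_table - Q_ref) ** 2)
```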

From Fig. 7, one can see that the error curves of the Sarsa(λ) and Q-learning algorithms drop more slowly than those of the GrHDP and HDP approaches, although the steady-state error of Sarsa(λ) is competitive with those of GrHDP and HDP. The GrHDP and HDP approaches show comparable learning curves and steady-state errors here. We also present the predefined reference value table in Fig. 8(a) and the value table learned with our proposed GrHDP in Fig. 8(b). From both tables, we can conclude that the optimal control policy is to move toward the next state with the maximum value, and the solution provided by the GrHDP approach can guide the agent to the goal online along the shortest path.

Fig. 8. Value tables for the 5*5 maze navigation: (a) the reference value table and (b) the value table learned with the GrHDP approach.

D. Simulation Study Two

To clearly show the difference between our proposed GrHDP and HDP, together with the other two traditional RL algorithms, we further apply these four algorithms to a relatively large maze navigation problem, namely 16*16. We set the goal at [16, 16] and still adopt ten independent runs in this case study. We extend the trial number to 1000 for each run and set the counter threshold to 30 here. We adopt the same initial learning rates as above and a similar ALR scheme: the learning rates are halved every 10 trials and are kept at 1e−10 thereafter once they fall below 1e−10. The Q reference value table is still assigned according to the distance between the current location and the goal (i.e., the value difference between two consecutive states is 1/32). The other environment and parameter settings are kept the same as in Section III-C.

As in case study one, all the learning curves in Fig. 9 are the average values over 10 independent runs. From Fig. 9, we can see that GrHDP and HDP converge within a few hundred trials, while Sarsa(λ) and Q-learning require more than 1000 trials to learn. The learning curves of both the GrHDP and HDP approaches drop quickly at the very beginning, yet our proposed GrHDP can achieve a lower steady-state error (i.e., the value table learned by the GrHDP approach is closer to the reference value table). In addition, we also take the average of the value tables learned in the 10 runs with our proposed GrHDP approach, and present it as a color surface in Fig. 10. It is clear that the values smoothly become higher as the agent approaches the goal location.

We would like to discuss two important observations from Fig. 9. First, both HDP and GrHDP converge to the optimal policy faster than the Q-learning and Sarsa(λ) approaches. This may indicate that, for the 2-D maze navigation problem, ADP methods can provide better learning performance. Further observation also suggests that, with the help of the goal network, our proposed GrHDP converges faster than the regular HDP approach. Second, as far as the final sum of squared errors is concerned, our GrHDP approach also achieves the best performance in this case. The reason that the Q-learning and Sarsa(λ) approaches did not converge to zero in this case is that we adopted a different reference table, as discussed in case study one (Fig. 8(a)). This type of reference table is reasonable for the maze navigation benchmark, as discussed in [38]–[40]. Certainly, if one adopts the Q reference table as in the traditional RL literature [2], [43], we would expect the final errors of Q-learning and Sarsa(λ) to approach zero. Our key interests from this perspective in this paper are the convergence speed and the optimal policy, in which our method achieves much better performance compared with the regular HDP, Q-learning, and Sarsa(λ) methods.

Fig. 9. Learning curves with the GrHDP, HDP, Sarsa(λ), and Q-learning approaches in the 16*16 maze navigation. The GrHDP approach shows the fastest learning speed and the lowest final sum of squared errors among the four approaches.

Fig. 10. Surface plot of the value table learned by the GrHDP approach on the maze navigation problem (16*16). The x and y axes refer to the coordinates of the agent, while the z axis refers to the J-value.

E. Simulation Study Three

Fig. 11. Diagram of the 3-D maze (5*5*5) navigation benchmark. The goal is located at the upper-right corner of the maze and the agent needs to try six directions before it can learn the policy.

Here we follow our previous algorithm settings and test them on a 3-D maze navigation problem with size 5*5*5. The structure of the 3-D maze is presented in Fig. 11, where we assume that the goal is located at the upper-right corner (i.e., [5, 5, 5]) and the agent starts from a random position within this maze. This 3-D maze navigation benchmark is more difficult than the 2-D benchmarks above, as the agent needs to learn over more directions (i.e., it needs more trials to learn). The agent has to try six actions (i.e., forward, backward, up, down, left, and right) before it can make the right decision. We keep the trial-and-error method for the agent in this case study and adopt 10 independent runs here. The only difference is that we extend the trial number to 1500 for each run. The initial learning rates and ALR settings are the same as those in case study two. The Q reference value table is still assigned according to the distance between the current location and the goal (i.e., the value difference between two consecutive states is 1/15). The other environment and parameter settings are kept the same as those in Section III-D.

Because the 3-D maze and its value function are higher dimensional, we do not provide a value surface as in Section III-D. Instead, we provide the learning curves of the four algorithms in Fig. 12. One can see that our proposed GrHDP approach achieves the fastest convergence with respect to the sum of squared errors, as well as the lowest final sum of squared errors. For both the Q-learning and Sarsa(λ) approaches, a convergence tendency can be observed, yet they need more trials to learn.

Fig. 12. Learning curves with the GrHDP, HDP, Sarsa(λ), and Q-learning approaches in the 5*5*5 maze navigation. The GrHDP approach shows the fastest learning speed and the lowest final sum of squared errors among the four approaches.

IV. CHARACTERISTICS ANALYSIS OF GRHDP APPROACH

The characteristics of our proposed online GrHDP learning approach are discussed in this section. As with other ADP approaches, there is no readily available training data (input/output pairs) to be used for approximating J in the GrHDP approach. The control action u, the internal goal signal s, and the value function J are updated according to the corresponding error functions, which change from one time step to another. Thus, the convergence argument for the steepest descent algorithm does not hold for any of the three networks (i.e., the action, goal, and critic networks). Sutton et al. proposed to formulate the TD(0) learning algorithm (parameter/weight updating rules) in a gradient descent manner, and showed that the expected values of the predictions converge to their correct values, given enough samples and learning iterations [2], [8]. In addition, Dayan et al. provided stronger results for more general cases, showing that the predictions of TD learning converge with probability one [46], [47]. On the other hand, in [15], Si et al. proposed to take the expected value (average) of the error function, and formulated the search for the optimal weights as root seeking for the error equations. In this paper, we follow this line of work and provide the analysis toward the convergence of the weights in the goal network, critic network, and action network, respectively. We adopt the Robbins-Monro algorithm [48], [49] to show that the roots of the error functions can be found with the gradient descent method, and that these roots provide the (local) minimum expected error value for the three networks in our proposed GrHDP approach.

A. Stochastic Approximation Algorithms

The original work on recursive stochastic algorithms was introduced by Robbins and Monro in [48], [49], where a recursive procedure for seeking the root of a real-valued function g(ω) of a real variable ω is developed and analyzed. In general, the function is assumed to be unknown, but noise-corrupted observations can be taken for given values of ω. Let the function g(ω) have the form g(ω) = E_x[f(ω)], where E_x is the expectation. Then g(ω) can be regarded as the regression function of f(ω), and f(ω) can be regarded as the sample function of g(ω). The following assumptions were summarized in [15] and are the necessary conditions to obtain the Robbins-Monro algorithm.

1) g(ω) has a single root ω^* such that g(ω^*) = 0, and

g(ω) < 0  if  ω < ω^*   (28)

g(ω) > 0  if  ω > ω^*.   (29)

This assumption involves little loss of generality, since most functions with a single root that do not satisfy this condition can be made to do so by multiplying the function by −1.

2) The variance of f (ω) from g(ω) is finite

σ^2(ω) = E_x[g(ω) − f(ω)]^2 < ∞.   (30)

3) This is a relatively mild condition that

|g(ω)| < B_1 |ω − ω^*| + B_0 < ∞.   (31)

B_0 and B_1 need not be known to prove the validity of the algorithm; as long as the root lies in some finite interval, the existence of B_0 and B_1 can always be assumed.

If the conditions (28)–(31) are satisfied, the Robbins-Monro algorithm can be used to seek the solution ω^* of the function g(ω) as

ω(k+1) = ω(k) − η(k) f[ω(k)]   (32)

where η(k) is a sequence of positive step sizes (also known as the learning rate in the ADP/RL community). Here, we summarize the conditions on η(k) as follows.

1) The step size/learning rate decreases during the learning process and approaches zero given enough learning steps:

lim_{k→∞} η(k) = 0.   (33)

2) The sum of the sequence η(k) is infinite as k goes to infinity:

Σ_{k=0}^{∞} η(k) = ∞.   (34)

3) The sum of squares of the sequence is finite:

Σ_{k=0}^{∞} η^2(k) < ∞.   (35)

In addition, ω(k) converges toward ω^* in the mean squared sense and with probability 1, i.e.,

lim_{k→∞} E_x[ ||ω(k) − ω^*||^2 ] = 0   (36)

Pr{ lim_{k→∞} ω(k) = ω^* } = 1.   (37)

In this paper, the Robbins-Monro algorithm is applied to optimization problems within the ADP method. With the above settings, let g(ω) = ∂E/∂ω, where E is an objective function to be optimized. If E has a local optimum at ω^*, g(ω) satisfies condition (28) locally at ω^*; if E has a quadratic form, g(ω) satisfies condition (28) globally.
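A toy illustration (not from the paper) of the recursion (32): the regression function g(ω) = ω − 2 and its observation noise are invented purely for this sketch, and the step size η(k) = 1/k satisfies (33)–(35).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w):
    """Noisy observation of g(w) = E[f(w)] = w - 2, whose root is w* = 2."""
    return (w - 2.0) + rng.normal(scale=0.5)

w = 0.0
for k in range(1, 20001):
    eta = 1.0 / k              # satisfies (33)-(35)
    w = w - eta * f(w)         # Robbins-Monro recursion, eq. (32)

print(w)                       # approaches the root w* = 2 in mean square and w.p. 1
```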

B. Toward Convergence Analysis for the Goal, Critic, and Action Networks

Here, we provide asymptotic convergence results for each component of our GrHDP controller. The Robbins-Monro algorithm introduced in Section IV-A is the main tool used to obtain the results in this subsection. In the following, we analyze one component of the GrHDP system at a time. That is to say, when one component (e.g., the goal network) is under consideration, the other components (e.g., the critic network and the action network) are considered to have completed their last learning step, namely their weights are fixed at this stage.

To study the learning process in the goal network, we recall the error function of the goal network:

e_g(k) = γ s(k) − [s(k−1) − r(k)]   (38)

where e_g(k) is the sampled (instantaneous) error at time step k. The instantaneous squared error of this residual can be written as

E_g(k) = (1/2) e_g(k)^2.   (39)

Instead of the instantaneous squared error, let

Ē_g(k) = E_x[E_g].   (40)

The partial derivative of Ē_g with respect to the weights in the goal network is written as

∂Ē_g/∂ω_g = E_x[ e_g ∂e_g/∂ω_g ].   (41)

With the Robbins-Monro algorithm, the root (possibly a local root) of ∂Ē_g/∂ω_g (as a function of ω_g) can be obtained with the following recursive procedure:

ω_g(k+1) = ω_g(k) − η_g(k) · [ e_g ∂e_g/∂ω_g ]   (42)

given that the root exists and η_g meets the conditions (33) to (35). Equation (42) is actually equivalent to the learning algorithm for the goal network in (14). From this viewpoint, the online learning algorithm for the goal network converges to a (local) minimum of the expected squared residual of the objective function.

To study the learning process occurring in the critic network, we recall the error function of the critic network:

e_c(k) = γ J(k) − [J(k−1) − s(k)]   (43)

and the instantaneous squared error of this residual can be written as

E_c(k) = (1/2) e_c(k)^2.   (44)

Instead of the instantaneous squared error, let

Ē_c(k) = E_x[E_c].   (45)

The partial derivative of Ē_c with respect to the weights in the critic network is written as

∂Ē_c/∂ω_c = E_x[ e_c ∂e_c/∂ω_c ].   (46)

With the Robbins-Monro algorithm, the root (possibly a local root) of ∂Ē_c/∂ω_c (as a function of ω_c) can be obtained by the following recursive procedure:

ω_c(k+1) = ω_c(k) − η_c(k) · [ e_c ∂e_c/∂ω_c ]   (47)

given that the root exists and η_c meets the conditions (33) to (35). Equation (47) is actually equivalent to the learning algorithm for the critic network in (18). From this viewpoint, the online learning algorithm for the critic network can converge to a (local) minimum of the expected squared residual of the error function.

To study the learning process taking place in the action network, we recall the error function of the action network:

e_a(k) = J(k) − U_c   (48)

and the instantaneous squared error of this residual can be written as

E_a(k) = (1/2) e_a(k)^2.   (49)

Instead of the instantaneous squared error, let

Ē_a(k) = E_x[E_a].   (50)

The partial derivative of Ē_a with respect to the weights in the action network is then written as

∂Ē_a/∂ω_a = E_x[ e_a ∂e_a/∂ω_a ].   (51)

With the Robbins-Monro algorithm, the root (possibly a local root) of ∂Ē_a/∂ω_a (as a function of ω_a) can be obtained by the following recursive procedure:

ω_a(k+1) = ω_a(k) − η_a(k) · [ e_a ∂e_a/∂ω_a ]   (52)

given that the root exists and η_a meets the conditions (33) to (35). Equation (52) is equivalent to the learning algorithm for the action network in (22). From this viewpoint, the online learning algorithm for the action network can converge to a (local) minimum of the expected squared residual of the error function.

V. CONCLUSION

In this paper, a goal representation heuristic dynamic programming approach was studied and analyzed for a classical MDP problem, namely the maze navigation benchmark. We presented our GrHDP architecture and its learning algorithms. To demonstrate the improved results, we compared the learning performance of our proposed GrHDP approach with the traditional HDP approach, together with two classical RL algorithms (Sarsa(λ) and Q-learning), under the same environment settings. We started the simulations with a 5*5 maze benchmark to demonstrate how our approach learns an optimal policy. Then, we considered a relatively large 16*16 maze and provided detailed comparative studies of our approach against the others. In addition, we applied these four algorithms to a 3-D maze navigation problem, namely the 5*5*5 maze, to show the improved learning performance of our proposed GrHDP approach. These three case studies demonstrated that our GrHDP approach converges much faster in comparison with the regular HDP, Sarsa(λ), and Q-learning approaches on the maze navigation applications. In addition, we also provided the characteristics analysis toward the convergence of the weights in the goal network, critic network, and action network, respectively. Under the assumptions and conditions mentioned above, the weights in the neural networks converge recursively to the expected optimal weights.

APPENDIX

PSEUDOCODE

Algorithm 2 Implementation of GrHDP on Maze Navigation
u ← f_a(x_a, ω_a), control action selection;
f_a: the action network;
x_a: input of the action network;
ω_a: weights in the action network;
u: control action;
s ← f_g(x_g, ω_g), internal goal representation;
f_g: the goal network;
x_g: input of the goal network;
ω_g: weights in the goal network;
s: internal goal signal;
J ← f_c(x_c, ω_c), value function approximation;
f_c: the critic network;
x_c: input of the critic network;
ω_c: weights in the critic network;
J: value function approximation;
/* Note: the parameters N_a, T_a, η_a, N_c, T_c, η_c, N_g, T_g, η_g are all described in Sec. III. */

1) for 1 to MaxRun do2) Load initial starting point X(0);3) Uniformly initialize ωa(0), ωg(0), ωc(0) in[−0.3, 0.3];

4) Initialize J (x, u) and Qcounter(x, u) to 0, for all xand u;5) u(0)← fa (xa(0), ωa(0));6) Update inputs of goal network xg(0) = [xa(0) u(0)];7) s(0)← fg

(xg(0), ωg(0)

);

8) Update inputs of critic networkxc(0) = [xa(0) u(0) s(0)];

9) J (0)← fc (xc(0), ωc(0));10) for 1 to MaxTr do11) Drop ηc, ηg and ηa every 10 trial;12) if ηc, ηg or ηa is under 1e− 5 then13) Set it as 1e − 5 thereafter;14) end if15) while (1) //infinite step looking ahead16) if x(k) /∈ maze or x(k) == goal then17) break;18) end if19) Apply WTA method on u(k) and choose the

direction;20) Qcounter(x,u) = Qcounter(x,u) + 1; save the num

of direction taken in this specific state21) if Qcounter(x,u) > threshold then22) Uniformly select action/direction from remaining

choices;23) end if24) x(k)← (x(k− 1), u(k − 1)); execute action and

obtain current state//weights are carried on through each trial

25) ωa(k) = ωa(k − 1);26) ωc(k) = ωc(k − 1);27) ωg(k) = ωg(k − 1);

NI et al.: GOAL REPRESENTATION HEURISTIC DYNAMIC PROGRAMMING 2049

28)      xa(k) = x(k); // update the input of fa
29)      u(k) ← fa(xa(k), ωa(k));
30)      xg(k) = [x(k) u(k)]; // update the input of fg
31)      s(k) ← fg(xg(k), ωg(k));
32)      xc(k) = [x(k) u(k) s(k)]; // update the input of fc
33)      J(k) ← fc(xc(k), ωc(k));
34)      Assign the reward r(k) according to (26);
         // online learning of the goal network
35)      Eg(k) = 1/2 (γ s(k) − (s(k − 1) − r(k)))²;
36)      cyc = 0;
37)      while (Eg(k) > Tg and cyc < Ng) do
38)        ωg(k) = ωg(k) + Δωg(k) via (15) and (16);
39)        s(k) ← fg(xg(k), ωg(k));
40)        Eg(k) = 1/2 (γ s(k) − (s(k − 1) − r(k)))²;
41)        cyc = cyc + 1;
42)      end while
         // online learning of the critic network
43)      Ec(k) = 1/2 (γ J(k) − (J(k − 1) − s(k)))²;
44)      cyc = 0;
45)      while (Ec(k) > Tc and cyc < Nc) do
46)        ωc(k) = ωc(k) + Δωc(k) via (19) and (20);
47)        s(k) ← fg(xg(k), ωg(k));
48)        J(k) ← fc(xc(k), ωc(k));
49)        Ec(k) = 1/2 (γ J(k) − (J(k − 1) − s(k)))²;
50)        cyc = cyc + 1;
51)      end while
         // online learning of the action network
52)      Ea(k) = 1/2 (J(k) − Uc)²;
53)      cyc = 0;
54)      while (Ea(k) > Ta and cyc < Na) do
55)        ωa(k) = ωa(k) + Δωa(k) via (23) and (24);
56)        u(k) ← fa(xa(k), ωa(k));
57)        s(k) ← fg(xg(k), ωg(k));
58)        J(k) ← fc(xc(k), ωc(k));
59)        Ea(k) = 1/2 (J(k) − Uc)²;
60)        cyc = cyc + 1;
61)      end while
62)    end while // corresponding to step 15
63)  end for // corresponding to step 10
64)  Update the J(x, u) table;
65)  Calculate the sum of squared errors according to (27);
66) end for // corresponding to step 1
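To make the update ordering of Algorithm 2 concrete, the following Python sketch implements one online GrHDP step with simple linear approximators standing in for the action, goal, and critic networks. The learning rates, thresholds, cycle limits, and the finite-difference estimate of ∂J/∂u are illustrative assumptions; the exact gradient rules of (15)–(16), (19)–(20), and (23)–(24) are not reproduced, and the WTA direction selection and Qcounter-based exploration of steps 19–23 are omitted for brevity.

import numpy as np

# Illustrative GrHDP step with linear stand-ins for fa, fg, fc; all
# hyperparameter values below are assumptions, not the paper's settings.
GAMMA = 0.95                                # discount factor gamma
ETA_G, ETA_C, ETA_A = 0.01, 0.01, 0.005     # learning rates eta_g, eta_c, eta_a
N_G, N_C, N_A = 50, 80, 60                  # internal cycle limits Ng, Nc, Na
T_G, T_C, T_A = 1e-5, 1e-5, 1e-5            # internal error thresholds Tg, Tc, Ta
U_C = 0.0                                   # ultimate objective Uc

rng = np.random.default_rng(0)

def feat(*parts):
    # Concatenate state/action/goal components and append a bias term.
    return np.concatenate([np.ravel(p) for p in parts] + [np.ones(1)])

class LinearNet:
    # Linear stand-in for a network; weights initialized uniformly in [-0.3, 0.3].
    def __init__(self, dim):
        self.w = rng.uniform(-0.3, 0.3, dim)
    def __call__(self, x):
        return float(self.w @ x)

def grhdp_step(x, prev, r, action, goal, critic):
    # One online update at step k, following Algorithm 2's ordering:
    # goal network, then critic network, then action network.
    s_prev, J_prev = prev                   # s(k-1), J(k-1)
    u = action(feat(x))                     # u(k) <- fa(xa(k), wa(k))
    xg = feat(x, u)
    s = goal(xg)                            # s(k) <- fg(xg(k), wg(k))

    # Goal network: minimize Eg = 1/2 (gamma*s(k) - (s(k-1) - r(k)))^2 (steps 35-42).
    for _ in range(N_G):
        e = GAMMA * goal(xg) - (s_prev - r)
        if 0.5 * e * e <= T_G:
            break
        goal.w -= ETA_G * e * GAMMA * xg    # gradient step on the goal weights
    s = goal(xg)

    # Critic network: minimize Ec = 1/2 (gamma*J(k) - (J(k-1) - s(k)))^2 (steps 43-51).
    xc = feat(x, u, s)
    for _ in range(N_C):
        e = GAMMA * critic(xc) - (J_prev - s)
        if 0.5 * e * e <= T_C:
            break
        critic.w -= ETA_C * e * GAMMA * xc  # gradient step on the critic weights
    J = critic(xc)

    # Action network: minimize Ea = 1/2 (J(k) - Uc)^2 (steps 52-61).
    for _ in range(N_A):
        J = critic(feat(x, u, goal(feat(x, u))))
        e = J - U_C
        if 0.5 * e * e <= T_A:
            break
        du = 1e-3                           # finite-difference estimate of dJ/du
        dJ_du = (critic(feat(x, u + du, goal(feat(x, u + du)))) - J) / du
        action.w -= ETA_A * e * dJ_du * feat(x)   # chain rule through u(k)
        u = action(feat(x))

    return u, (s, J)

# Usage on a 2-D maze state (coordinates as features):
action = LinearNet(3)    # input [x, y, bias]
goal   = LinearNet(4)    # input [x, y, u, bias]
critic = LinearNet(5)    # input [x, y, u, s, bias]
prev = (0.0, 0.0)        # s(0), J(0)
u, prev = grhdp_step(np.array([0.0, 0.0]), prev, r=-0.01,
                     action=action, goal=goal, critic=critic)

In a full trial, this step would be called at every state transition of the maze, with the weights carried over between steps as in steps 25–27 of Algorithm 2.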

REFERENCES

[1] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins, "Learning and sequential decision making," in Learning and Computational Neuroscience. Cambridge, MA, USA: MIT Press, 1989, pp. 539–602.
[2] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[3] R. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1957.
[4] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. New York, NY, USA: Wiley, 2012.
[5] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1. Belmont, MA, USA: Athena Scientific, 1995.
[6] S. Singh, A. Barto, R. Grupen, and C. Connolly, "Robust reinforcement learning in motion planning," in Proc. Adv. Neural Inf. Process. Syst., 1994, pp. 655–662.
[7] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proc. 7th Int. Conf. Mach. Learn., 1990, pp. 216–224.
[8] R. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
[9] R. Sutton, "Temporal credit assignment in reinforcement learning," Ph.D. dissertation, Dept. Comput. Sci., Univ. Massachusetts Amherst, Amherst, MA, USA, Jan. 1984.
[10] P. Werbos, "Reinforcement learning and approximate dynamic programming (RLADP)—foundations, common misconceptions and challenges ahead," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, NY, USA: Wiley, 2013, pp. 3–30.
[11] P. J. Werbos, "ADP: The key direction for future research in intelligent control and understanding brain intelligence," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 898–900, Aug. 2008.
[12] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, pp. 200–212, 2009.
[13] P. J. Werbos, Handbook of Intelligent Control. New York, NY, USA: Van Nostrand, 1992.
[14] P. J. Werbos, "Consistency of HDP applied to a simple reinforcement learning problem," Neural Netw., vol. 3, no. 2, pp. 179–189, 1990.
[15] J. Si and Y.-T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001.
[16] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003.
[17] L. Yang, J. Si, K. S. Tsakalis, and A. A. Rodriguez, "Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 6, pp. 1617–1622, Dec. 2009.
[18] D. Liu and Q. Wei, "Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems," IEEE Trans. Cybern., vol. 43, no. 2, pp. 779–789, Apr. 2013.
[19] D. Prokhorov and D. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[20] D. V. Prokhorov, "Adaptive critic designs and their applications," Ph.D. dissertation, Dept. Electr. Eng., Texas Tech Univ., Lubbock, TX, USA, Oct. 1997.
[21] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628–634, Jul. 2012.
[22] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, pp. 1825–1832, Jul. 2012.
[23] D. Liu and D. Wang, "Optimal control of unknown nonlinear discrete-time systems using the iterative globalized dual heuristic programming algorithm," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. New York, NY, USA: Wiley, 2013, pp. 52–74.
[24] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1133–1148, Jun. 2011.
[25] Z. Ni, H. He, D. V. Prokhorov, and J. Fu, "An online actor-critic learning approach with Levenberg-Marquardt algorithm," in Proc. IJCNN, 2011, pp. 2333–2340.
[26] J. Fu, H. He, and Z. Ni, "Adaptive dynamic programming with balanced weights seeking strategy," in Proc. IEEE Symp. ADPRL, Apr. 2011, pp. 210–217.
[27] H. He, Z. Ni, and J. Fu, "A three-network architecture for on-line learning and optimization based on adaptive dynamic programming," Neurocomputing, vol. 78, no. 1, pp. 3–13, 2012.
[28] Z. Ni, H. He, and J. Wen, "Adaptive learning in tracking control based on the dual critic network design," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913–928, Jun. 2013.
[29] H. He, Self-Adaptive Systems for Machine Intelligence. New York, NY, USA: Wiley, 2011.
[30] Z. Ni, H. He, D. Zhao, and D. Prokhorov, "Reinforcement learning control based on multi-goal representation using hierarchical heuristic dynamic programming," in Proc. IJCNN, Jun. 2012, pp. 1–8.
[31] H. He, Z. Ni, and D. Zhao, "Learning and optimization in hierarchical adaptive critic design," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Piscataway, NJ, USA: IEEE Press, 2013, pp. 78–95.


[32] H. He, Z. Ni, and D. Zhao, "Data-driven learning and control with multiple critic networks," in Proc. 10th WCICA, Jul. 2012, pp. 523–527.
[33] Z. Ni, X. Fang, H. He, D. Zhao, and X. Xu, "Real-time tracking control on adaptive critic design with uniformly ultimately bounded condition," in Proc. IEEE Symp. ADPRL, Apr. 2013.
[34] X. Fang, H. He, Z. Ni, and Y. Tang, "Learning and control in virtual reality for machine intelligence," in Proc. 3rd ICICIP, Jul. 2012, pp. 63–67.
[35] Virtual Reality (VR) Platform for Adaptive Learning and Control Based on Adaptive Dynamic Programming [Online]. Available: http://www.youtube.com/watch?v=OeZEDBz6ki0&feature=youtu.be
[36] X. Pang and P. J. Werbos, "Neural network design for J function approximation in dynamic programming," Math. Model. Sci. Comput., vol. 5, nos. 2–3, pp. 1–3, 1996.
[37] D. Wunsch, "The cellular simultaneous recurrent network adaptive critic design for the generalized maze problem has a simple closed-form solution," in Proc. IEEE IJCNN, vol. 3, Jul. 2000, pp. 79–82.
[38] R. Ilin, R. Kozma, and P. Werbos, "Cellular SRN trained by extended Kalman filter shows promise for ADP," in Proc. IEEE IJCNN, Jul. 2006, pp. 506–510.
[39] R. Ilin, R. Kozma, and P. Werbos, "Efficient learning in cellular simultaneous recurrent neural networks—The case of maze navigation problem," in Proc. IEEE Int. Symp. ADPRL, Apr. 2007, pp. 324–329.
[40] R. Ilin, R. Kozma, and P. Werbos, "Beyond feedforward models trained by backpropagation: A practical training tool for a more efficient universal approximator," IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929–937, Jun. 2008.
[41] M. Wiering and H. van Hasselt, "Two novel on-policy reinforcement learning algorithms based on TD(λ)-methods," in Proc. IEEE Int. Symp. ADPRL, Apr. 2007, pp. 280–287.
[42] P. Werbos and X. Pang, "Generalized maze navigation: SRN critics solve what feedforward or Hebbian nets cannot," in Proc. IEEE Int. Conf. Syst., Man, Cybern., vol. 3, Oct. 1996, pp. 1764–1769.
[43] T. M. Mitchell, Machine Learning. New York, NY, USA: McGraw-Hill, 1997.
[44] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, "A boundedness result for the direct heuristic dynamic programming," Neural Netw., vol. 32, pp. 229–235, Aug. 2012.
[45] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 2, pp. 425–436, Apr. 2007.
[46] P. Dayan and T. J. Sejnowski, "TD(λ) converges with probability 1," Mach. Learn., vol. 14, no. 3, pp. 295–301, 1994.
[47] P. Dayan, "The convergence of TD(λ) for general λ," Mach. Learn., vol. 8, nos. 3–4, pp. 341–362, 1992.
[48] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.
[49] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications. New York, NY, USA: Springer-Verlag, 1997.

Zhen Ni received the B.S. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2010, and the M.S. degree from the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA, in 2012. Currently, he is pursuing the Ph.D. degree with the same department.

His current research interests include computational intelligence and reinforcement learning, specifically adaptive dynamic programming and optimal/adaptive control.

Haibo He (SM'11) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, OH, USA, in 2006.

He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 2006 to 2009. He is currently an Associate Professor with the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA. His research has been covered by national and international media, such as the IEEE Smart Grid Newsletter, The Wall Street Journal, and Providence Business News. He has published one research book (Wiley), edited one research book (Wiley-IEEE) and six conference proceedings (Springer), and authored or co-authored over 120 peer-reviewed journal and conference papers. His current research interests include adaptive dynamic programming, machine learning, computational intelligence, hardware design for machine intelligence, and various applications, such as smart grid and renewable energy systems.

Dr. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID. He received the National Science Foundation CAREER Award in 2011 and the Providence Business News Rising Star Innovator Award in 2011.

Jinyu Wen (M'10) received the B.Eng. and Ph.D. degrees in electrical engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1992 and 1998, respectively.

He was a Visiting Student from 1996 to 1997 and a Research Fellow from 2002 to 2003 with the University of Liverpool, Liverpool, U.K., and a Senior Visiting Researcher with the University of Texas at Arlington, Arlington, TX, USA, in 2010. From 1998 to 2002, he was a Director Engineer with XJ Electric Co., Ltd., Xuchang, China. In 2003, he joined HUST, where he is currently a Full Professor. His current research interests include renewable energy integration, energy storage application, multiterminal HVDC, and power system operation and control.

Xin Xu (M'07–SM'12) received the B.S. degree in electrical engineering from the Department of Automatic Control, National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation, NUDT, in 2002.

He is currently a Full Professor with the College of Mechatronics and Automation, NUDT. He has co-authored more than 100 papers in international journals and conferences, and co-authored four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles.

Dr. Xu was a recipient of the 2nd Class National Natural Science Award of China in 2012. He is currently an Associate Editor of the Information Sciences journal and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a Committee Member of the IEEE TC on Approximate Dynamic Programming and Reinforcement Learning and the IEEE TC on Robot Learning.