
A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning

Majid Moghadam 1 Gabriel Hugh Elkaim 1

Abstract

Tactical decision making is a critical feature of advanced driving systems that incorporates several challenges, such as the complexity of the uncertain environment and the reliability of the autonomous system. In this work, we develop a multi-modal architecture that includes environmental modeling of the ego surroundings and train a deep reinforcement learning (DRL) agent that yields consistent performance in stochastic highway driving scenarios. To this end, we feed the occupancy grid of the ego surroundings into the DRL agent and obtain high-level sequential commands (i.e., lane changes) to send to lower-level controllers. We show that dividing the autonomous driving problem into a multi-layer control architecture enables us to leverage the power of AI to solve each layer separately and achieve an admissible reliability score. Compared with end-to-end approaches, this architecture yields a more reliable system which can be implemented in actual self-driving cars.

1. Introduction

Advanced Driving Assistance Systems (ADAS) are developed to increase traffic safety by reducing the impact of human errors. The evolution of the various levels of driving autonomy has seen a significant speed-up in recent years, aiming to enhance comfort, safety, and the driving experience. For a long time, with a limited amount of technological resources, automotive stakeholders focused on steady-state maneuvers to achieve some degree of autonomy. However, in recent years the major focus of research in the field of autonomous driving has been directed to transition maneuvers, most importantly tactical lane changing, required

1 Computer Science and Engineering Department, University of California, Santa Cruz, USA. Correspondence to: Majid Moghadam <[email protected]>.

Real-world Sequential Decision Making workshop at ICML 2019, Long Beach, California. Copyright 2019 by the author(s).

Figure 1. A sketch of our hierarchical approach for the autonomous driving problem

for both fully and partially autonomous driving systems (Alizadeh et al., 2019).

Before AI, control and guidance of ground vehicles were tackled using feedback control techniques (Falcone et al., 2007; Broggi et al., 1999; Moghadam & Caliskan, 2015) that attempt to stabilize the vehicle using the information collected from sensory measurements. The controller is used to generate input commands to actuators such as the steering angle, accelerator, and brake to perform the driving tasks. Because of the extraordinary success of neural networks (NN) in classification (Krizhevsky et al., 2012) and regression (Specht, 1991) problems, researchers have applied learning-based approaches to control problems. Leveraging deep reinforcement learning (DRL) agents to play Atari games (Mnih et al., 2015) was one of the earliest methods that gained considerable attention in the field. After the success of DRL on video games, researchers applied DRL to real-world problems such as the control and guidance of autonomous systems, where it has shown great potential to become a reliable alternative to classical control approaches. Various deep learning (DL) techniques have been applied to the problem of autonomous vehicles. One approach is to use expert knowledge of the problem and train a neural network to learn the expert policies in an end-to-end manner. This method is known as imitation learning and has been applied to both self-driving cars (Bojarski et al., 2016) and unmanned aerial vehicles (Bicer et al., 2019).


Although the results are promising for training an intelligent agent that shows performance comparable to a human, such agents may never outperform the human expert, since the training dataset is annotated using solely expert demonstrations. A similar end-to-end architecture has also been used to train DRL agents to drive autonomous cars in simulation environments such as TORCS (Koutník et al., 2013) and World Rally Championship (Perot et al., 2017). Here, the term end-to-end indicates that the raw pixel information from the on-board camera of the vehicle is used as the state representation, and continuous steering angle and acceleration commands are computed via the DRL algorithm. End-to-end DRL approaches have the ability to outperform the human expert in various driving tasks, as verified by the results published in the mentioned articles. However, the safety and reliability of such algorithms become critical problems when they are applied to the real world. In addition, the complexity of the environment in real-world applications may degrade the performance of these algorithms. In other words, combining the different layers of ADAS systems, which took a long time to reach their current performance, into an end-to-end architecture that maps observations directly to actions does not seem to guarantee similar performance in actual systems.

1.1. Our approach

As discussed above, in most of the studies in the literature, raw sensory inputs such as the video frames of an on-board camera or RGB-D sensors are used to train a neural network that estimates the action required to control the vehicle (see Fig.2). Most of these studies validated the performance of their methods in unrealistic simulations or video games. We believe that, when it comes to real-world implementation, end-to-end learning techniques may suffer from a lack of reliability as a result of the large number of uncertainties encountered while interacting with the actual world. To this end, we divide the problem into different control-level problems and solve each level separately. In other words, the raw sensory inputs are used to perceive the environment and map it to a simpler two-dimensional world. Accordingly, the AI agent in the decision-making layer generates high-level decision commands, such as left/right lane changes, and sends them to lower-level control layers, which deal with steering angle and gas pedal commands to stabilize the vehicle while following the generated high-level actions. This approach is summarized in Fig.1, and the ADAS architecture is provided in Fig.2. Note that we use the occupancy grid as the environment model around the ego vehicle. We believe that this approach is more reliable in real-world implementations than the end-to-end techniques available in the literature, because we do not perform a single mapping from the highest-level observations, with their considerable number of uncertainties, to the lowest-level actuator

control layer. This approach is much like the multi-loop control techniques in control engineering, which have proven to show better performance in stabilizing complex systems (Slotine et al., 1991; Moghadam et al., 2018).

Figure 2. Hierarchical architecture of general ADAS systems vs. end-to-end approaches

In this study, we address the problem of high-level decision making for an autonomous car using the classical reinforcement learning technique known as Q-learning. We apply the ε-greedy algorithm to the problem defined in the DeepCars simulation environment, which is also designed and implemented by the authors. After commenting on the performance of the Q-learning algorithm, we implement a deep reinforcement learning method for the same problem. Finally, we present the results for both approaches and compare them.

1.2. Deep Q-Learning with real-time validation

The main idea in reinforcement learning (RL) and dynamic programming (DP) is to control an agent or a process while interacting with an environment, using the states and rewards observed from the environment (Busoniu et al., 2017). DP and RL are algorithmic methods for solving decision-making problems to achieve a desired goal through interaction with the world. DP methods require a model of the system's behavior, whereas RL is a model-free approach that improves the produced policy while interacting with the environment. RL uses the Markov Decision Process (MDP) (Busoniu et al., 2017) to mathematically formalize a discrete stochastic environment. As the MDP works in discrete time, states and actions in RL are usually discrete, which leads to a sequential decision-making problem. Rewards provide an informative metric of the agent's performance, and the goal is to maximize the accumulated long-term return over the course of the interaction with the environment. Although recent advances in RL (Lillicrap et al., 2015) enable us to apply it to continuous problems, here we leverage the discrete MDP to formalize the decision-making problem in autonomous driving and train RL algorithms to achieve a favorable performance.


Q-learning evaluates how good taking an action might be at a particular state by learning the action-value function Q(s, a). In Q-learning, a memory table Q[s, a] is built to store the Q-values for all possible combinations of states and actions. By sampling an action from the current state, the reward R and the new state are found, from which the next action a′ with the maximum Q(s′, a′) in the memory table is taken. Taking an action in a particular state has a Q-value, which is given in eq. 1.

Q(s, a) = R(s, a, s′) + γ max_{a′} Q_k(s′, a′)    (1)
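For instance, with γ = 0.9, an immediate reward R(s, a, s′) = +1, and a current best next-state value max_{a′} Q_k(s′, a′) = 2.0, eq. 1 assigns Q(s, a) = 1 + 0.9 × 2.0 = 2.8 (the numbers here are purely illustrative).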

where s′ and a′ are the next state and action, respectively. However, if the number of state-action combinations is too large, or if the states and actions are continuous, the memory and computation requirements for the action-value function Q become too high. To address this issue, the Deep Q-Network (DQN) (Mnih et al., 2015) is utilized, which approximates the action-value function Q(s, a).

In this work, as shown in Algorithm 1 (Mnih et al., 2015; Alizadeh et al., 2019), two networks, θ⁻ and θ, are created and trained: one for retrieving Q-values and one that receives all updates during training. Ultimately, we synchronize θ⁻ and θ to fix the Q-value targets temporarily, which keeps the target function from changing abruptly. The loss is calculated as in eq. 2.

L_i(θ_i) = E_{s,a,s′,r∼D} [ (r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i))² ]    (2)

where the transitions (s, a, s′, r) are retrieved from the experience replay D. The experience replay is used as a buffer, from which we sample a mini-batch of transitions to train the deep neural network. Since we sample randomly from the replay buffer, the data are closer to i.i.d. and more independent of each other, which in turn makes the training more stable. In addition, E indicates the expectation over the probability distribution.
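To make the sampling step concrete, here is a minimal Python sketch of an experience replay buffer and the target computation of eq. 2; the names (ReplayBuffer, dqn_targets) and the explicit done flag are illustrative assumptions, not taken from our released code.

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity experience replay D storing (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform sampling -> roughly i.i.d. mini-batches
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def dqn_targets(r, q_next_target, done, gamma):
    """Targets y = r + gamma * max_a' Q(s', a'; theta^-) from eq. 2, cut off at terminal states.
    q_next_target: array of shape (batch, n_actions) produced by the frozen target network theta^-."""
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)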

By utilizing the experience replay and the target network, the input and output of the model become more stable to train, and the network behaves more like a supervised learning algorithm.

We also appended a real-time validation phase to the traditional DQN in order to record the best trained model during training. For this purpose, we define two periods by which the validation-phase flag is activated; the latest network weights are recorded and exploited during validation. Depending on the period, the agent's performance is evaluated for a number of episodes, and the achieved mean reward is compared to the latest maximum value. This enables the agent to record the best trained model by validating it on unseen scenarios. Defining two different periods with different numbers of episodes helps the training to be faster and, at the same time, to record a more generalized model.

Algorithm 1 Deep Q-Network (DQN) with experience replay and real-time validation

Input:
  replay memory D with capacity M
  action-value function Q with random weights θ
  target action-value function Q⁻ with weights θ⁻ = θ
Output: Q*
initialize sequence s_1 = {x_1}
initialize preprocessed sequence φ_1 = φ(s_1)
for t = 1, ..., T do
  a_t = random action             with probability ε
        argmax_a Q(φ(s_t), a; θ)  otherwise
  r_t, x_{t+1} ← apply(a_t)
  s_{t+1} = (s_t, a_t, x_{t+1})
  φ_{t+1} = φ(s_{t+1})
  D ← (φ_t, a_t, r_t, φ_{t+1})
  (φ_j, a_j, r_j, φ_{j+1}) ← random(D)
  y_j = r_j + γ max_{a′} Q(φ_{j+1}, a′; θ⁻)
  perform a gradient descent step on ∂(y_j − Q(φ_j, a_j; θ))² / ∂θ
  if validation phase then
    fix network parameters θ
    for t′ = 1, ..., val_episodes do
      take greedy actions: a = argmax_a Q(φ(s_{t′}), a; θ)
      record episode rewards
    if mean validation reward is increased then
      θ_backup ← θ
  every C steps: Q⁻ = Q
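The validation branch of Algorithm 1 can be sketched in Python as follows; agent, env, run_episode, and the attribute names are hypothetical placeholders for illustration, not our actual interface.

import copy

def validate_and_checkpoint(agent, env, best_mean_reward, val_episodes=100):
    """Validation branch of Algorithm 1: run greedy episodes with frozen weights and keep the best model."""
    rewards = []
    for _ in range(val_episodes):
        # run_episode is assumed to roll out one episode with epsilon = 0 and return its accumulated reward
        rewards.append(run_episode(env, agent, greedy=True))
    mean_reward = sum(rewards) / len(rewards)
    if mean_reward > best_mean_reward:                      # mean validation reward increased
        best_mean_reward = mean_reward
        agent.best_weights = copy.deepcopy(agent.weights)   # theta_backup <- theta
    return best_mean_reward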

In order to improve the DQN performance, we also implemented the Double DQN (DDQN) algorithm (Van Hasselt et al., 2016), which uses two Q-networks in order to deal with overoptimistic value estimations. Target value estimation in Double Q-learning is performed using eq. 3

Q_{DoubleQ}(s, a) = R(s, a, s′) + γ Q(s′, argmax_a Q(s′, a; θ); θ′)    (3)

where θ and θ′ represent the parameters of two different Q-networks, one used to determine the greedy policy and the other to decide on its value.
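The difference between the two targets can be illustrated with a short NumPy sketch; q_next_online and q_next_target stand for the batched Q-value outputs of the two networks and are assumed names used only for this example.

import numpy as np

def dqn_target(r, q_next_target, gamma):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_online, q_next_target, gamma):
    """Double DQN (eq. 3): the online network (theta) selects the action, the second network (theta') evaluates it."""
    best_actions = q_next_online.argmax(axis=1)
    return r + gamma * q_next_target[np.arange(len(r)), best_actions]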


2. DeepCars Simulation Environment 2

As discussed before, we plan to map the 3-dimensional world into a much simpler 2D simulation environment and form the occupancy grid of the actors surrounding the ego vehicle. To make this possible, we designed our own environment using pygame (a free and open-source Python library). The source code is available in our GitHub repository (see footnote). We call the environment DeepCars, and a screenshot of the game screen is shown in Fig.3.a.

Figure 3. A screenshot from the DeepCars 2D environment and state representation

The game receives high-level control commands as the input vector and returns the game state and reward as the output. In order to formalize the problem as an MDP to be used in the RL algorithms, the states and actions are discrete. In fact, the action space consists of three actions:

A = {left, stay, right}    (4)

i.e., switch to the left/right lane or stay in the same lane. The observed Markov state for each RL algorithm is explained in the corresponding sections.
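The input/output contract described above can be pictured with a gym-style loop; env.reset, env.step, and agent.act are assumed method names for illustration, not necessarily the exact API of the released DeepCars environment.

def run_deepcars_episode(env, agent):
    """One episode of the assumed gym-style loop: high-level lane command in, (state, reward, done) out."""
    state = env.reset()                           # initial observation of the grid world
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                 # one of the three commands of eq. 4
        state, reward, done = env.step(action)    # reward as defined later in eq. 6
        total_reward += reward
    return total_reward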

3. Results

We have implemented tabular Q-learning, DQN, and Double DQN (DDQN) algorithms in DeepCars and compared the results. We will see that DDQN demonstrates better results in terms of the performance-training speed trade-off for this problem.

2 https://github.com/MajidMoghadam2006/gym-deepcars

3.1. Tabular Reinforcement Learning on DeepCars

As a starting point, we implement the tabular Q-learning algorithm (Fig.4) for controlling the agent in the DeepCars environment. The observed Markov state for the classical RL is defined as follows:

S = [ego lane ID, x_0, x_1, ..., x_n]    (5)

in which the first element is the ego lane ID and x_i indicates the distance to the closest car in lane i. Note that the lane numbers and grid numbers start from 0. For instance, the state vector for Fig.3.b is S = [2, 6, 3, 8, 0, 8].

Also note that the line-of-sight of the vehicle is 9 (from 0 to 8). This means that the furthest car the agent can perceive is 9 grid cells (number of rectangles) away. This constraint in fact simulates the range of the sensors implemented on the vehicle.
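A small sketch of how the state vector of eq. 5 could be derived from the occupancy grid; the grid layout (rows indexed by distance ahead of the ego) and the function name are assumptions made for illustration.

import numpy as np

def tabular_state(grid, ego_lane, max_sight=8):
    """Build the state of eq. 5: [ego lane ID, x_0, ..., x_n], where x_i is the distance (in grid
    cells) to the closest car ahead in lane i, saturating at the sensor range when the lane is empty."""
    n_cells, n_lanes = grid.shape          # assumed layout: grid[d, i] == 1 -> car d cells ahead in lane i
    state = [ego_lane]
    for lane in range(n_lanes):
        occupied = np.flatnonzero(grid[:, lane])
        state.append(int(occupied[0]) if occupied.size else max_sight)
    return state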

Figure 4. Tabular Q-learning algorithm (Busoniu et al., 2017), where x and u indicate the observed state and input, respectively

The main objective is to train the agent to avoid collisions with the other vehicles in the environment. Thus, we define a simple reward function

ρ(s, a, s′) = { +1,  s′ ≠ s_T
              { −1,  s′ = s_T    (6)

where s_T indicates the terminal state in which the agent makes a collision. For hyper-parameter tuning, we performed a simple grid search in the parameter space and found the optimal settings (a minimal training-loop sketch using these values follows the list):

• discount factor: γ = 0.9

• learning rate: α = 0.1

• exploration: ε = 0.2

• exploitation: 1− ε = 0.8
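As referenced above, here is a minimal tabular Q-learning sketch that plugs in these values; the state is assumed to be a hashable tuple built from eq. 5, and the helper names are illustrative rather than the released training script.

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2      # values found by the grid search above
ACTIONS = (0, 1, 2)                        # left, stay, right

Q = defaultdict(float)                     # tabular Q[(state, action)], zero-initialized

def epsilon_greedy(state):
    """Explore with probability epsilon, otherwise act greedily w.r.t. the current Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state, a])

def q_update(s, a, r, s_next):
    """One tabular Q-learning step with learning rate alpha and discount gamma."""
    target = r + GAMMA * max(Q[s_next, a2] for a2 in ACTIONS)
    Q[s, a] += ALPHA * (target - Q[s, a])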

We trained the agent for 50000 steps. The resulting Q-table was recorded to be used in the evaluation phase. We then evaluated the agent's performance for 100'000 steps and recorded the results.


In order to comment on the performance of the agent, we define accuracy as the consistent comparison metric between algorithms:

Accuracy = (# of passed cars) / (# of all cars (passed + collided)) × 100    (7)

The results were promising, though imperfect. For the training and evaluation phases, the results were as follows:

• Training: Accuracy = 98.28%

• Evaluation: Accuracy = 99.14%

This shows that after 50000 frames the agent has learned to control the vehicle while avoiding the other vehicles. However, this is not the ideal performance, because there are still a number of collisions in the test set, which is not acceptable. In fact, our aim is to design an agent that can avoid collisions in all cases. If we look at the Q-table recorded after training (Fig.5), we can see that most of the rows are still zeros, which indicates a sparse Q-table, as expected in tabular Q-learning. This situation arises because not all state-action combinations were experienced during the training phase, and, since the agent does not use any function approximator, similar situations cannot be approximated and collisions are unavoidable. We expect to overcome this problem by approximating the Q-values using a deep neural network architecture, i.e., the DQN algorithm. Note that the evaluation performance is better than the training performance because in the evaluation phase we use the best trained model. In addition, the actions are greedy in the evaluation phase, in contrast to the ε-greedy strategy in the training phase.

Figure 5. A crop of the sparse Q-table in tabular RL (# of lanes: 3). The red rectangle indicates the greedy optimal action where s = [1 6 3 0].

3.2. Deep Reinforcement Learning on DeepCars 3

As discussed in the previous section, the tabular Q-learning approach lacks the generalization property. In addition, the curse of dimensionality is another problem when using tabular Q-learning techniques: in our case, the number of Markov states grows exponentially as the number of lanes increases.

3 https://github.com/MajidMoghadam2006/deepcars-reinforcement-learning

In addition, we plan to feed the occupancy grid of the environment into the agent, which makes the use of classical RL impossible in this case. This leads us to the DQN and DDQN algorithms. The source code for the DQN and DDQN algorithms is provided in our GitHub repository (see footnote). Here, we modified the original DQN algorithm and added a real-time validation phase. This lets us evaluate the most recent model during training and select the best trained model so far. This is made possible by evaluating the latest model on a number of game episodes, say 100, and recording the agent's performance. Finally, we record the model that had the best performance at the end of training.

Note that in DQN the observed MDP state is the occupancy grid, while in tabular Q-learning, in order to reduce the state-space dimension, we defined the game state as in eq. 5. The occupancy grid incorporates the actors' positions in a simple matrix with the same size as the environment grid (Fig.3.a). Empty cells are filled with zeros, and ones indicate cells occupied by an actor. The flattened version of this matrix forms the state vector. Finally, we fuse in the ego lane number as a binary code of the lane ID to create the MDP state. Including the ego lane ID as a binary code enables us to have consistent binary values in the state vector, although it increases the observation-space dimension.
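A sketch of this state construction, assuming a NumPy occupancy grid and a fixed-width binary encoding of the ego lane ID; the function name, bit width, and bit order are assumptions made for illustration.

import numpy as np

def dqn_state(occupancy_grid, ego_lane_id, n_lane_bits=3):
    """Flatten the 0/1 occupancy grid and append the ego lane ID encoded as a binary code."""
    flat = occupancy_grid.astype(np.float32).ravel()                      # e.g. an 8 x 5 grid -> 40 values
    lane_bits = [(ego_lane_id >> b) & 1 for b in range(n_lane_bits)]      # assumed bit width of 3
    return np.concatenate([flat, np.asarray(lane_bits, dtype=np.float32)])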

In addition, in order to have a smooth graph of the reward values, we keep track of the mean accumulated reward every 100 episodes as the performance metric. We implemented our algorithm on a PC equipped with an Intel Core i9 2.60GHz CPU and a GeForce GTX 1080 Ti GPU. We trained the agent for 500'000 steps with three different neural network architectures, shallow, medium, and deep, which differ in the number of layers. We used {32} for the shallow, {32, 64, 32} for the medium, and {64, 128, 128, 64} for the deep network architecture as the layer sizes. The rest of the hyper-parameters were equal for all cases. The performance comparison of these networks is summarized in Fig.6.
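The three architectures can be sketched as fully connected Q-networks; we assume a Keras Sequential model here, with ReLU activations, Adam, and an MSE loss as illustrative choices, since these details are not specified in the text.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def build_q_network(input_dim, n_actions, hidden_layers):
    """hidden_layers: [32] (shallow), [32, 64, 32] (medium), or [64, 128, 128, 64] (deep)."""
    model = Sequential()
    model.add(Dense(hidden_layers[0], activation="relu", input_shape=(input_dim,)))
    for units in hidden_layers[1:]:
        model.add(Dense(units, activation="relu"))
    model.add(Dense(n_actions, activation="linear"))   # one Q-value per discrete action
    model.compile(optimizer="adam", loss="mse")        # optimizer and loss are assumptions
    return model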

Figure 6. DQN performance for three network architectures


Obviously, the shallow network showed better performance compared with the other two. However, the best comparison is made in the evaluation phase, where we save the trained models, evaluate them on consistent sets of evaluation episodes, and keep track of the number of collisions. As the shallow network promised better performance, we use the same architecture to compare DQN and DDQN. However, we also adjust the number of layers in the network to search for the best option. The performance of these configurations in the training phase is represented in Fig.7.

Figure 7. DQN vs. DDQN with different network architectures: mean 100-episode reward vs. training steps for DQN-16, DQN-16x16, DDQN-16, and DDQN-16x16. Here 16x16 indicates two dense layers with 16 neurons each.

Obviously, the deep version of DDQN achieved a higher peak reward value compared with the others, while the shallow DDQN converged much faster than its rivals. DQN showed a mediocre performance among these agents. Following the real-time validation policy in Algorithm 1, we recorded the best model in all training phases and evaluated their performance for 100'000 steps, where all of the agents achieved 100% accuracy (eq. 7). This means that all agents are thoroughly trained and succeed in performing the evaluation phase without any collisions. This accomplishment ascertains the generalization performance of DQN compared with the tabular Q-learning of the previous section. Among the deep approaches, DDQN-16 showed the best performance-speed trade-off. Intuitively, the more advanced RL algorithm (DDQN) is much faster than its predecessor (DQN), which was expected. Also, a shallow network with one hidden layer and 16 neurons seems a reasonable choice for this setting, where the size of the occupancy grid as the input is 8 × 5 = 40. A rule of thumb for choosing the number of neurons in the hidden layer is to select a number between the input and output vector sizes. These facts justify the choice of the DDQN-16 agent as the best performance-speed RL algorithm for the DeepCars simulation environment.

Achieving 100% evaluation accuracy on scenarios unseen during the training phase strengthens the assertion that it is possible to decompose the autonomous driving problem into a hierarchical architecture and leverage the power of AI to solve each layer separately with an admissible reliability score.

4. Conclusion

In this study, we addressed the problem of autonomous driving in a highway scenario. We considered the hierarchical architecture of ADAS systems and focused on the decision-making layer, where the sequential high-level decision commands are generated using a deep reinforcement learning algorithm. The agent receives the occupancy grid of the environment as the state representation and produces lane-change commands in order to avoid collisions with the other vehicles. The environment is stochastic in terms of traffic density and the lane positions of the other actors, which challenges the agent's performance and, at the same time, imitates the fusion-layer output that may occur in actual test scenarios. We have shown that it is possible to leverage the provided multi-layer architecture to generate high-level commands using DRL with an acceptable reliability score.

Future Work

We plan to implement the multi-layer architecture in Fig.2 and tackle each layer separately using the same approach that we used in this study. We will exploit DeepCars as the 2D environmental model of the ego surroundings; this scheme is summarized in Fig.1. We also plan to use Unreal Engine-powered simulators such as CARLA (Dosovitskiy et al., 2017) or Microsoft AirSim (Shah et al., 2018) to generate the occupancy grid of the ego surroundings from the ground-truth information that we receive from the simulator. This way, we may bypass the fusion layer in ADAS (Fig.2) and study the performance of the presented hierarchical approach (Fig.1) in more realistic and complex situations. In addition, we will compare the original DQN with its most recent extensions, such as dueling double DQN (Wang et al., 2015), prioritized experience replay (Schaul et al., 2015), and the Rainbow agent (Hessel et al., 2018), which combines the most recent improvements in RL. We will also perform a hyper-parameter optimization to search for the best parameter setting in the parameter space.


References

Alizadeh, A., Moghadam, M., Bicer, Y., Ure, N. K., Yavas, U., and Kurtulus, C. Tactical lane changing with deep reinforcement learning in dynamic and uncertain traffic scenarios. In 22nd Intelligent Transportation Systems Conference (ITSC2019-submitted), 2019.

Bicer, Y., Moghadam, M., Sahin, C., Eroglu, B., and Ure, N. K. Vision-based UAV guidance for autonomous landing with deep neural networks. In AIAA Scitech 2019 Forum, pp. 0140, 2019.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Broggi, A., Bertozzi, M., Fascioli, A., Bianco, C. G. L., and Piazzi, A. The ARGO autonomous vehicle's vision and control systems. International Journal of Intelligent Control and Systems, 3(4):409–441, 1999.

Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. Reinforcement learning and dynamic programming using function approximators. CRC Press, 2017.

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.

Falcone, P., Borrelli, F., Asgari, J., Tseng, H. E., and Hrovat, D. Predictive active steering control for autonomous vehicle systems. IEEE Transactions on Control Systems Technology, 15(3):566–580, 2007.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. Evolving large-scale neural networks for vision-based TORCS. 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Moghadam, M. and Caliskan, F. Actuator and sensor fault detection and diagnosis of quadrotor based on two-stage Kalman filter. In 2015 5th Australian Control Conference (AUCC), pp. 182–187. IEEE, 2015.

Moghadam, M., Ure, N. K., and Inalhan, G. Autonomous execution of aircraft supermaneuvers with switching nonlinear backstepping control. In 2018 AIAA Guidance, Navigation, and Control Conference, pp. 1594, 2018.

Perot, E., Jaritz, M., Toromanoff, M., and De Charette, R. End-to-end driving in a realistic racing game with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 3–4, 2017.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Shah, S., Dey, D., Lovett, C., and Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635. Springer, 2018.

Slotine, J.-J. E., Li, W., et al. Applied Nonlinear Control, volume 199. Prentice Hall, Englewood Cliffs, NJ, 1991.

Specht, D. F. A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568–576, 1991.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.