


Energy 160 (2018) 544–555


Online Markov Chain-based energy management for a hybrid tracked vehicle with speedy Q-learning

Teng Liu a, b, Bo Wang c, *, Chenglang Yang c

a State Key Laboratory of Automotive Simulation and Control, Jilin University, Changchun 130025, China
b Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L3G1, Canada
c School of Mathematics and Statistics, Beijing Key Laboratory on MCAACI, Beijing Institute of Technology, No. 5 South Zhongguancun Street, Haidian District, Beijing 100081, China

Article info

Article history:
Received 26 June 2017
Received in revised form 21 June 2018
Accepted 7 July 2018
Available online 11 July 2018

Keywords:
Hybrid tracked vehicle
Markov chain
Induced matrix norm
Onboard learning algorithm
Reinforcement learning
Speedy Q-learning

* Corresponding author. E-mail address: [email protected] (B. Wang).

https://doi.org/10.1016/j.energy.2018.07.022

Abstract

This brief proposes a real-time energy management approach for a hybrid tracked vehicle to adapt to different driving conditions. To characterize different route segments online, an onboard learning algorithm for Markov Chain models is employed to generate transition probability matrices of power demand. The induced matrix norm is presented as an initialization criterion to quantify differences between multiple transition probability matrices and to determine when to update them at a specific road segment. Since a series of control policies are available onboard for the hybrid tracked vehicle, the induced matrix norm is also employed to choose the control policy that best matches the current driving condition. To accelerate the convergence rate in Markov Chain-based control policy computation, a reinforcement learning-enabled energy management strategy is derived using the speedy Q-learning algorithm. Simulation is carried out on two driving cycles, and the results indicate that the proposed energy management strategy can greatly improve the fuel economy and can be employed in real time when compared with the stochastic dynamic programming and conventional RL approaches.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Hybrid electric vehicles (HEVs) seem to be the most promising solution to the increasing energy crisis and environmental pollution of recent decades [1]. Two types of energy sources, electricity and gasoline, are placed in an HEV to make it possible to improve fuel economy and reduce exhaust emissions [2]. Energy management strategies are the critical technology for an HEV to achieve the best performance and energy efficiency through power-split control [3]. One major difficulty in achieving this goal is how to adapt to multiple driving cycles. Along with the development of HEVs, an effective and real-time energy management strategy is necessary for an HEV to accommodate different driving conditions.

1.1. Literature review

Currently, the energy management strategies for HEVs are mainly optimization-enabled strategies considering necessary physical constraints [4], such as restraints on the state of charge of the battery, the torque and rotational speed of the engine, and the output power of the battery and engine. Optimization-based energy management strategies can be further divided into global optimization and real-time optimization cases. When complete knowledge of the driving cycle is predefined, the dynamic programming (DP) algorithm can be employed to make a globally optimal control decision. Ref. [5] leveraged DP to optimize the fuel economy for a velocity-coupling HEV system with eleven modes. Serrao et al. [6] compared DP with two other methods to demonstrate its optimality. Additionally, the Pontryagin's Minimum Principle (PMP) technique is also adopted to improve the energy efficiency of the propulsion system via globally optimal control. A piecewise linear approximation strategy is combined with PMP to derive the optimal control for a plug-in HEV in Ref. [7]. Zhang et al. [8] applied PMP to optimize the control strategy for a dual-motor-driven electric bus under three different driving cycles. Convex programming (CP) is another global optimization method that derives the energy management strategy based on convex modeling and rapid solution search. In Ref. [9], CP is used to implement a framework for simultaneous optimal energy storage system sizing and energy management. Hu et al. [10] presented a high-efficiency CP framework to construct charging/power management controls that adapt swiftly to wind intermittency. However, the global optimization strategies are only feasible in off-line simulation, since the driving cycle is generally unknown in practical applications.

Besides, the stochastic dynamic programming (SDP) algorithm has also been developed to search for the optimal energy management strategy of an HEV by taking the random characteristics of the vehicle speed and driver behavior into account. Ref. [11] employed SDP to address the energy management of a series hybrid tracked vehicle based on a Markov chain driver model. Xi et al. [12] co-optimize the use of energy storage for multiple applications with SDP while accounting for market and system uncertainty. In real-time optimization, the equivalent consumption minimization strategy (ECMS) [13] and model predictive control (MPC) [14] are the two most representative optimization-based approaches. The ECMS explores the precise co-state value to achieve local optimization, which strongly depends on the validity of velocity predictions [15]. For MPC, the controller derives an energy management strategy via DP, genetic algorithms, quadratic programming, or nonlinear programming. For example, the information provided by the onboard navigation system is utilized in the MPC framework in Ref. [16]. An adaptive approach based on MPC is developed to consider load torque estimation and prediction in the energy management problem [17]. A genetic algorithm and MPC are combined in Ref. [18] to minimize the energy consumption. Furthermore, a multi-layer perception model is combined with MPC and proved to guarantee globally-bounded closed-loop stability [19]. Nevertheless, the performance of MPC control is highly influenced by the future information, such as the prospective speed or power prediction [20].

Two inspiring innovative techniques, namely reinforcement learning (RL) and game theory (GT), have also been proposed to build an optimal controller for HEVs. RL can derive a model-free and adaptive control for the energy management problem [21], and the global optimality of GT is evaluated in Ref. [22] via comparison with the DP method. Liu et al. [23] proposed a bi-level control framework that combines predictive learning with RL to formulate the energy management strategy. Ref. [24] presents a GT controller with a cost penalizing fuel consumption, NOx emissions, battery state of charge deviation, and vehicle operating condition deviation. Over the new European driving cycle, the GT controller achieves the control performance closest to the existing DP controller. Markov Chain (MC) models are quite well-suited to represent the uncertainty in the driving environment, which can lower both the information required for implementation and the on-board computing burden [25]. Based on MC models, Liu et al. compared the control performance of RL and SDP as well as two different RL-based algorithms [26], and the results indicated the advantages of RL over SDP in fuel economy and computational time [27]. However, the issue that the popular Q-learning algorithm overestimates action values under certain conditions has not been considered in previous energy management of HEVs [28]. Meanwhile, to the best of our knowledge, combining an RL algorithm with on-board learning of MC models has not been investigated, and the existing RL-enabled energy management strategies cannot guarantee adaptation to various driving conditions.

1.2. Motivation and innovation

The main purpose of this brief is to construct a real-time energy management strategy through the collaboration of an MC-based onboard learning algorithm and the speedy Q-learning (SQL) algorithm. Three primary contributions are presented in this paper. Firstly, an onboard learning algorithm is proposed for MC models to learn the transition probability of power demand in real time. Secondly, the induced matrix norm (IMN) serves as an initialization criterion for MC model learning. Thus, a set of models representing different segments of power demand can evolve, and the IMN is applied to select the control policy that best matches the current driving condition. Finally, the SQL algorithm is developed to evaluate the onboard learning algorithm and to avoid selecting overestimated values in control policy computation. In addition, the proposed energy management strategy is compared with the SDP and conventional Q-learning algorithms to estimate its performance in different driving conditions.

1.3. Organization

The remainder of this paper is organized as follows: the induced matrix norm and the recursive algorithm for updating the transition probability matrix are illuminated in Section 2; in Section 3, the onboard learning algorithm for MC models and the SQL algorithm are discussed; the comparative study of different energy management strategies is conducted in Section 4; conclusions are given in Section 5.

2. Problem formulation and background

The vehicle being studied is a hybrid tracked vehicle (HTV) with a series topology. The powertrain configuration is sketched in Fig. 1. The main power components consist of a battery pack, an engine-generator set (EGS), and two traction motors. The EGS and battery constitute the main power sources to propel the powertrain. For the EGS, the rated power of the engine is 52 kW at a speed of 6200 rpm, and the rated output power of the generator is 40 kW within the speed range from 3000 rpm to 3500 rpm. Power split control between the EGS and battery is the key technology to realize the fuel efficiency improvement. The elementary parameters of the powertrain are shown in Table 1. The modeling of the EGS and battery is introduced in Subsection 2.1. Since the historical vehicle speed is known in real time, the power demand Pdem can be calculated as follows

$$
\begin{cases}
P_{dem} = (F_r + F_i + F_a)\,v + M\omega \\
F_r = m g f \\
F_i = m a \\
F_a = (C_D A / 21.15)\,v^2
\end{cases} \tag{1}
$$

where $F_r$, $F_i$ and $F_a$ are the rolling resistance, inertial force and aerodynamic drag, respectively; $m$ is the vehicle mass, $g$ is the gravitational acceleration, $a$ is the vehicle acceleration, $C_D$ is the aerodynamic coefficient and $A$ is the frontal area. $M$ is the resisting yaw moment, and $v$ and $\omega$ are the average velocity and rotational speed of the tracked vehicle.
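For illustration, a minimal Python sketch of Eq. (1) is given below. The function name and the default rolling resistance coefficient, drag coefficient and frontal area are placeholders (only the vehicle mass of 2500 kg comes from Table 1), and units are assumed to follow the paper's convention for the 21.15 drag factor.

```python
def power_demand(v, a, omega, M_yaw, m=2500.0, f=0.03, C_D=1.0, A=6.0, g=9.81):
    """Power demand of the tracked vehicle in the form of Eq. (1).

    v     : average velocity
    a     : vehicle acceleration
    omega : rotational (yaw) speed of the vehicle
    M_yaw : resisting yaw moment
    The default f, C_D and A are illustrative placeholders.
    """
    F_r = m * g * f                     # rolling resistance
    F_i = m * a                         # inertial force
    F_a = (C_D * A / 21.15) * v ** 2    # aerodynamic drag
    return (F_r + F_i + F_a) * v + M_yaw * omega
```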

2.1. Optimization objective

The generator speed is selected as a state variable that can be calculated according to the torque equilibrium constraint

$$
\begin{cases}
\dfrac{dn_g}{dt} = \left(\dfrac{T_e}{i_{e\text{-}g}} - T_g\right) \Big/ \left[0.1047\left(\dfrac{J_e}{i_{e\text{-}g}^2} + J_g\right)\right] \\
n_e = n_g / i_{e\text{-}g}
\end{cases} \tag{2}
$$

where $n_g$ and $n_e$ are the rotational speeds, and $T_g$ and $T_e$ are the torques of the generator and engine, respectively. $T_e$ is decided by the throttle variable $th(t)$ using the expression $T_e = th \cdot \mathrm{interp}(n_g, T_{e,max})$, wherein interp indicates the interpolation function and $T_{e,max}$ is the maximum value of the engine torque. The constant 0.1047 is the transformation factor between speed units (1 r/min = 0.1047 rad/s).

Fig. 1. Configuration of the series HTV powertrain (engine and generator forming the EGS, battery, integrated power electronic module, two motors, and the electric and mechanical transmission).

Table 1. Main parameters of the HTV powertrain.

Name                                   Value       Unit
Vehicle mass Mv                        2500        kg
Generator inertia Jg                   0.1         kg·m²
Engine inertia Je                      0.2         kg·m²
Gear ratio parameter ie-g              1           /
Electromotive force parameter Ke       0.8092      V·s·rad⁻²
Electromotive force parameter Kx       0.0005295   N·m·A⁻²
Minimum state of charge SoCmin         0.5         /
Maximum state of charge SoCmax         0.9         /
Battery capacity Cb                    37.5        Ah

The torque and output voltage of the generator can be derived as follows:

$$
\begin{cases}
T_g = K_e I_g - K_x I_g^2 \\
U_g = K_e n_g - K_x n_g I_g
\end{cases} \tag{3}
$$

where $K_e$ is the electromotive force coefficient, and $U_g$ and $I_g$ are the generator voltage and current, respectively. Furthermore, $K_x n_g$ is the electromotive force, and $K_x = 3Y L_g / 3.14$, in which $L_g$ is the armature synchronous inductance and $Y$ is the number of poles.

The state of charge (SoC) of the battery is chosen as another state variable, which is computed by:

$$
\frac{d\,SoC}{dt} = -\frac{I_b(t)}{C_b} \tag{4}
$$

where $I_b$ and $C_b$ denote the current and rated capacity of the battery, respectively. According to the internal resistance model [29], the derivative of the SoC and the battery output voltage can be computed by

$$
\begin{cases}
\dfrac{d\,SoC}{dt} = -\dfrac{V_{oc} - \sqrt{V_{oc}^2 - 4 r P_b(t)}}{2 C_b r} \\
U_{bat} = V_{oc} - I_b\, r(SoC)
\end{cases} \tag{5}
$$

where $V_{oc}$ is the open circuit voltage and $P_b$ is the battery power. Furthermore, $U_{bat}$ is the battery output voltage and $r$ denotes the internal resistance. As the onboard energy sources are the EGS and battery, the power demand in Eq. (1) can also be calculated as follows

$$
P_{dem} = \left(U_g I_g + U_{bat} I_b\right) \eta_m^{\pm 1} = \left(P_g + P_b\right) \eta_m^{\pm 1} \tag{6}
$$

where $\eta_m$ is the motor efficiency; the positive sign corresponds to propulsion and the negative sign to regenerative braking. Hence, the battery power can be derived from Eq. (6) based on the operating conditions of the motor and generator.
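A minimal sketch of Eqs. (5)–(6) is shown below, assuming a constant open-circuit voltage, internal resistance and motor efficiency for readability; in the model above these generally depend on SoC and the operating point, and the helper names are hypothetical.

```python
import math

def soc_derivative(P_b, V_oc, r, C_b_As):
    """dSoC/dt from Eq. (5); C_b_As is the battery capacity in ampere-seconds
    (e.g. 37.5 Ah * 3600), P_b in W, V_oc in V, r in ohm."""
    return -(V_oc - math.sqrt(V_oc ** 2 - 4.0 * r * P_b)) / (2.0 * C_b_As * r)

def battery_power(P_dem, P_g, eta_m, propelling=True):
    """Battery power from Eq. (6): P_dem = (P_g + P_b) * eta_m^(+1) when
    propelling and * eta_m^(-1) when regenerating."""
    return P_dem / eta_m - P_g if propelling else P_dem * eta_m - P_g
```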

The optimal control strategy is derived by minimizing the cost function, which represents a trade-off between fuel consumption and charge sustaining, as follows [29]:

$$
\begin{cases}
J = \displaystyle\int_{t}^{t+\Delta t} \left[\dot{m}_f(t) + \beta\left(\Delta SoC(t)\right)^2\right] dt \\
\Delta SoC(t) = \begin{cases} SoC(t) - SoC_{pre} & SoC(t) < SoC_{pre} \\ 0 & SoC(t) \geq SoC_{pre} \end{cases}
\end{cases} \tag{7}
$$

where $\Delta t$ is the specific time interval used to trigger (or not) an update of the control policy, $\dot{m}_f$ is the fuel consumption rate, $\beta$ is a positive weight coefficient (set to 1000 in this manuscript), $SoC$ is the state of charge of the battery, and $SoC_{pre}$ is a pre-defined constant that keeps the SoC within a reasonable range [30]. The charge sustaining constraint is enforced at each time step by the second expression in Eq. (7).

The state variables are set as the combination of the SoC and the rotational speed of the generator $n_g$ [26], $s = [SoC, n_g]^T$. The cost function is observably influenced by the power split between the engine and battery, so the control variable is configured as the throttle variable $th(t)$ of the engine [27], $a = [th]^T$. The following inequality constraints should be obeyed in the control policy computation:

$$
\begin{cases}
SoC_{min} \leq SoC(t) \leq SoC_{max} \\
n_{g,min} \leq n_g(t) \leq n_{g,max} \\
0 \leq T_e(t) \leq T_{e,max} \\
n_{e,min} \leq n_e(t) \leq n_{e,max} \\
I_{b,min} \leq I_b(t) \leq I_{b,max} \\
0 \leq I_g(t) \leq I_{g,max}
\end{cases} \tag{8}
$$
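As a small illustration of the trade-off in Eq. (7), the sketch below evaluates the instantaneous cost that is later used as the RL reward; β = 1000 follows the paper, while the fuel-rate input is assumed to come from an engine map that is not reproduced here.

```python
def stage_cost(fuel_rate, soc, soc_pre, beta=1000.0):
    """Instantaneous cost of Eq. (7): fuel rate plus the charge-sustaining
    penalty, which is active only when SoC drops below SoC_pre."""
    d_soc = soc - soc_pre if soc < soc_pre else 0.0
    return fuel_rate + beta * d_soc ** 2
```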

2.2. Online updating for transition probability

The power demand, which varies with the vehicle speed, is modeled as a finite-state MC denoted as $S = \{P_{dem}^i \,|\, i = 1, \ldots, L\} \subset \mathbb{R}$; its transition probability $p_{o,j}^i$ can be approximated based on the frequency count as the speed trajectory becomes available:

$$
\begin{cases}
p_{o,j}^i = \dfrac{C_{o,j}^i}{C_o^i} \\
C_o^i = \displaystyle\sum_{j=1}^{L} C_{o,j}^i
\end{cases} \tag{9}
$$

where $C_{o,j}^i$ denotes the number of observed transitions from $P_{dem}^i$ to $P_{dem}^j$ at the vehicle velocity $v_o$, and $C_o^i$ is the total number of transitions originating in $P_{dem}^i$ at speed $v_o$. Since the power demand measurements are taken in real time and have length $k$, Eq. (9) can be rewritten for on-board application [31] as:

$$
p_{o,j}^i \approx \frac{C_{o,j}^i(k)/k}{C_o^i(k)/k} = \frac{F_{o,j}^i(k)}{F_o^i(k)} \tag{10}
$$

where $F_{o,j}^i(k)$ is the mean frequency of the transition events $f_{o,j}^i(k)$ from $P_{dem}^i$ to $P_{dem}^j$ at speed $v_o$, and $F_o^i(k)$ is the mean frequency of the transition events $f_o^i(k)$ that are initiated from $P_{dem}^i$ at the vehicle velocity $v_o$, within a specific window of $k$ measurements:

$$
\begin{cases}
F_{o,j}^i(k) = C_{o,j}^i(k)/k = \dfrac{1}{k}\displaystyle\sum_{t=1}^{k} f_{o,j}^i(t) \\
F_o^i(k) = C_o^i(k)/k = \dfrac{1}{k}\displaystyle\sum_{t=1}^{k} f_o^i(t) = \dfrac{1}{k}\displaystyle\sum_{t=1}^{k}\sum_{j=1}^{L} f_{o,j}^i(t)
\end{cases} \tag{11}
$$

and $f_{o,j}^i(t) = 1$ if a transition occurs from $P_{dem}^i$ to $P_{dem}^j$ at speed $v_o$ and time instant $t$; $f_o^i(t) = 1$ if a transition is initiated from $P_{dem}^i$ at vehicle velocity $v_o$ and time instant $t$; otherwise either has a zero value. For on-board use, a recursive expression for calculating the mean frequencies can be deduced as follows:

$$
\begin{aligned}
F_{o,j}^i(k) &= \frac{1}{k}\sum_{t=1}^{k} f_{o,j}^i(t) = \frac{1}{k}\left[(k-1)F_{o,j}^i(k-1) + f_{o,j}^i(k)\right] \\
&= F_{o,j}^i(k-1) + \frac{1}{k}\left[f_{o,j}^i(k) - F_{o,j}^i(k-1)\right] \\
&= (1-\varphi)F_{o,j}^i(k-1) + \varphi f_{o,j}^i(k)
\end{aligned} \tag{12}
$$

$$
\begin{aligned}
F_o^i(k) &= \frac{1}{k}\sum_{t=1}^{k} f_o^i(t) = \frac{1}{k}\left[(k-1)F_o^i(k-1) + f_o^i(k)\right] \\
&= F_o^i(k-1) + \frac{1}{k}\left[f_o^i(k) - F_o^i(k-1)\right] \\
&= (1-\varphi)F_o^i(k-1) + \varphi f_o^i(k)
\end{aligned} \tag{13}
$$

where $\varphi \in (0, 1)$ is called the forgetting factor, resulting in exponential forgetting of the older transition events. Subsequently, a recursive form of the transition probability is formulated by substituting Eqs. (12) and (13) into Eq. (10):

$$
p_{o,j}^i \approx \frac{F_{o,j}^i(k)}{F_o^i(k)} = \frac{(1-\varphi)F_{o,j}^i(k-1) + \varphi f_{o,j}^i(k)}{(1-\varphi)F_o^i(k-1) + \varphi f_o^i(k)} \tag{14}
$$

During a specific time interval $\Delta t$, the transition probability matrix (TPM) of the power demand can be updated online according to Eq. (14).
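A rough Python sketch of the recursive update in Eqs. (12)–(14) is given below. The class name and the uniform initialization are illustrative; the default forgetting factor of 0.01 is the value reported in Section 4.1.

```python
import numpy as np

class OnlineTPM:
    """Recursive on-board estimate of the power-demand TPM, Eqs. (12)-(14).
    F[o, i, j] stores the mean transition frequency F^i_{o,j} for speed grade o."""

    def __init__(self, n_power_levels, n_speed_grades, phi=0.01):
        self.L = n_power_levels
        self.phi = phi                              # forgetting factor
        self.F = np.full((n_speed_grades, n_power_levels, n_power_levels),
                         1.0 / n_power_levels)      # small uniform start

    def observe(self, o, i, j):
        """One observed transition P^i_dem -> P^j_dem at speed grade o, Eq. (12)."""
        f = np.zeros(self.L)
        f[j] = 1.0
        self.F[o, i] = (1.0 - self.phi) * self.F[o, i] + self.phi * f

    def tpm(self, o):
        """Row-normalized TPM for speed grade o, Eq. (14)."""
        return self.F[o] / self.F[o].sum(axis=1, keepdims=True)
```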

2.3. Induced matrix norm

The similarity of two TPMs can be quantified by the induced matrix norm (IMN) to determine when to update the control policy in real time and to choose the control policy that best matches the current driving condition. For finite state spaces, the IMN of two TPMs, P and B, is defined as:

$$
\mathrm{IMN}(P \,\|\, B) = \|P - B\|_2 = \sup_{x \in \mathbb{R}^L \setminus \{0\}} \frac{|(P - B)x|}{|x|} \tag{15}
$$

where $x$ is an $L \times 1$ non-zero vector, and sup denotes the supremum. For online applications, this second-order norm can be reformulated as in Ref. [32]:

$$
\mathrm{IMN}(P \,\|\, B) = \|P - B\|_2 = \max_{1 \leq i \leq L} |\lambda_i(P - B)| = \max_{1 \leq i \leq L} \sqrt{\lambda_i\!\left[(P - B)^T (P - B)\right]} \tag{16}
$$

where $\lambda_i(B)$ denotes the $i$-th eigenvalue of matrix $B$ for $i = 1, \ldots, L$, and $B^T$ denotes the transpose of $B$. It is obvious that the closer $\mathrm{IMN}(P\|B)$ is to zero, the more similar the transition probability matrix $P$ is to $B$. Thus, the scalar IMN can be employed to quantify the difference when $B$ is used for old power demand transition events while the current transitions are governed by $P$.
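For reference, a minimal numpy sketch of Eq. (16) is given below; the function name is hypothetical.

```python
import numpy as np

def imn(P, B):
    """Induced matrix (spectral) norm of P - B, Eq. (16): the square root of
    the largest eigenvalue of (P - B)^T (P - B)."""
    D = np.asarray(P, dtype=float) - np.asarray(B, dtype=float)
    eigvals = np.linalg.eigvalsh(D.T @ D)   # symmetric positive semi-definite
    return float(np.sqrt(eigvals.max()))

# equivalently: np.linalg.norm(P - B, 2)
```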

3. Onboard learning for Markov Chain models and speedy Q-Learning algorithm

The onboard learning algorithm and the SQL algorithm are illustrated in this section. The former makes the optimal control adapt to the current driving condition by updating the TPMs of power demand, and the latter accelerates the convergence rate of the action-value function in the RL framework. First, the storing and updating processes of the onboard learning algorithm are introduced based on the IMN. Then, the update rule of the action-value function in the SQL algorithm is given in detail. Finally, the proposed energy management strategy is constructed by combining these two algorithms.

3.1. Onboard learning algorithm for Markov Chain models

In the real world, the variation in the road characteristics can be captured by an appropriate adjustment of the forgetting factor $\varphi$ in Eqs. (12)–(14) to adapt the MC model gradually. For example, when a transition from a highway region to an urban area occurs, the learning algorithm slowly transforms the MC model into a new one while forgetting the previously identified characteristics. In order to update previously learned MC models, a set of these models $B_i$, $i = 1, 2, \ldots, m$, is stored to provide a characterization of the whole area frequently travelled by the HTV. The online learning algorithm for MC-based TPMs is divided into two procedures: first, a TPM of power demand that represents the current driving condition is selected by the IMN; second, the real-time TPM is compared with the existing models to realize real-time control.

In the first procedure, the IMN is applied as a measure of similarity between two TPMs to determine the convergence of the on-board learning process as:

$$
\mathrm{IMN}(P(t) \,\|\, P(t - \Delta t)) < \varepsilon_{conv} \tag{17}
$$

where $P(t)$ and $P(t - \Delta t)$ denote the TPMs learned at the current time instant $t$ and at the previous sampling instant $t - \Delta t$, $\Delta t$ is the specific (tunable) time interval, and $\varepsilon_{conv}$ is called the convergence threshold, a small positive parameter that defines the closeness between two consecutively learned TPMs. When condition (17) is satisfied, the current TPM $P(t)$ provides an adequate representation of the region in which the HTV is being operated. This TPM is added to the existing TPMs $B_i$ if it is significantly different from the other models in the set, and $m$ is incremented, indicating that the set of TPMs evolves to match the current transition events of power demand. Otherwise, the set of stored TPM models remains unchanged.

Fig. 2. Computational workflow of the learning algorithm for MC models (Procedure 1: compute the current TPM and check the convergence criterion of Eq. (17); Procedure 2: compare the current TPM with the stored models $B_i$, either appending a new TPM and control policy or applying the most similar stored policy).

In the second procedure, the IMN is once again utilized to check the similarity between the current TPM $P(t)$ and the stored TPM models $B_i$ in real time as follows:

$$
\mathrm{IMN}(P(t) \,\|\, B_i) < \varepsilon_{sim}, \quad i = 1, 2, \ldots, m \tag{18}
$$

Fig. 3. The standard RL interaction framework between agent and environment.

Table 2. Pseudo-code of the SQL algorithm.

Algorithm: Speedy Q-learning (SQL)
1. Initialize s, γ, initial action-value function Q0, number of iterations N
2. Repeat k = 0, 1, 2, 3, …, define Q−1 = Q0, αk = 1/(k + 1)
3.   Repeat for each (s, a): based on p^s_{a,s'}, observe r(s, a) and s'
4.     Compute zQk−1(s, a) = r(s, a) + γ min_{a'} Qk−1(s', a') and zQk(s, a) = r(s, a) + γ min_{b'} Qk(s', b')
5.     Update Qk+1(s, a) as: Qk+1(s, a) = Qk(s, a) + αk (zQk−1(s, a) − Qk(s, a)) + (1 − αk)(zQk(s, a) − zQk−1(s, a))
6.   End
7. End
8. Return QN

where $\varepsilon_{sim}$ is a similarity threshold. If condition (18) fails, meaning the currently learned TPM does not match any of the existing TPMs in the database, then the database $B_i$ is appended with the TPM $P(t)$ and $m$ is incremented. When the IMN does not exceed the similarity threshold in Eq. (18), the model $B_j$ that is most similar to the current model $P(t)$ is selected to describe the current route as:

$$
j \in \arg\min_i \left[\mathrm{IMN}(P(t) \,\|\, B_i)\right], \quad i = 1, 2, \ldots, m \tag{19}
$$

Once the model $B_j$ best matching the current transition events of power demand has been decided, the relevant optimal control policy can also be activated for online control. It can be discerned that the updating algorithm for the set of TPMs $B_i$ and the two procedures of the onboard learning algorithm can both be implemented in real time [33]. In practical application, the evolving set of TPMs $B_i$, $i = 1, 2, \ldots, m$, is computed by Eqs. (12)–(14) and (17) (with a different threshold value). The calculation process is usually executed dynamically based on multiple pre-collected driving cycles (representing the driving conditions) of the HTV. Eqs. (12)–(14) are applied to determine the TPMs for the various driving cycles online, and Eq. (17) is used to store the different cases of $B_i$. The values of the convergence threshold and similarity threshold are discussed in Section 4.1. Finally, the on-board learning algorithm for MC-based TPMs is implemented in Matlab, and its main workflow is summarized in Fig. 2.

Fig. 4. Flow diagram of the SQL-based control strategy.

Fig. 5. Two driving cycles (track speeds versus time) for learning MC models, together with the corresponding power demand TPMs at v = 25 km/h: (a) Driving cycle 1; (b) Driving cycle 2.

Fig. 6. IMN value variation versus time for two driving cycles: (a) Driving cycle 1; (b) Driving cycle 2.
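To make the two-procedure logic of Fig. 2 and Eqs. (17)–(19) concrete, a rough Python sketch is given below. It reuses the imn() helper sketched in Section 2.3, the thresholds default to the values reported in Section 4 (εconv = 0.3, εsim = 0.075), and solve_policy stands in for the SQL-based policy computation of Section 3.2; all function names are illustrative.

```python
def select_or_store(P_t, P_prev, stored_tpms, policies, solve_policy,
                    eps_conv=0.3, eps_sim=0.075):
    """One pass of the Fig. 2 workflow for the current TPM P_t."""
    # Procedure 1: has the on-board learning converged? (Eq. (17))
    if imn(P_t, P_prev) >= eps_conv:
        return None                              # keep learning, no switch yet

    # Procedure 2: compare with the stored models B_i (Eqs. (18)-(19))
    distances = [imn(P_t, B_i) for B_i in stored_tpms]
    if not distances or min(distances) >= eps_sim:
        stored_tpms.append(P_t)                  # append new TPM, m = m + 1
        policies.append(solve_policy(P_t))       # compute and store its policy
        return policies[-1]
    j = distances.index(min(distances))          # Eq. (19)
    return policies[j]                           # apply the j-th control policy
```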

3.2. Speedy Q-learning algorithm

The standard reinforcement learning (RL) framework [34], in which a learning agent interacts with a stochastic environment, is shown in Fig. 3. The interaction can be modeled as a quintuple (S, A, P, R, γ), where S and A are the sets of states and actions, P is the TPM, R is the reward function, and γ ∈ (0, 1) is a discount factor. The transition probability from state $s$ to the next state $s'$ under action $a$ and the corresponding reward are denoted as $p_{a,s'}^s$ and $r(s, a)$. In this paper, the set of state variables is $s \in S = \{(SoC(t), n_g(t)) \,|\, 0.5 \leq SoC(t) \leq 0.9,\ 0 \leq n_g(t) \leq 3100\}$, the set of control variables is $a \in A = \{th(t) \,|\, 0 \leq th(t) \leq 1\}$, and the reward function is $r \in R = \{\dot{m}_f(s, a) + \beta(\Delta SoC(t))^2\}$.

Fig. 7. IMN values for the two driving cycles with a time interval of 150 s, for speed grades from 0 to 30 km/h (driving cycle 1) and 0 to 35 km/h (driving cycle 2), with the threshold IMN = 0.25 indicated: (a) Driving cycle 1; (b) Driving cycle 2.

The control policy $\pi$ is the distribution of the control action $a$ at a specific current state $s$. The value function is defined as the expected future reward $V(s) = E_\pi\!\left[\sum_t \gamma^t r(s)\right]$. Then, the optimal value function $V^*(s)$ is expressed as a recursion over the rewards as:

$$
V^*(s) = \min_a \left( r(s, a) + \gamma \sum_{s' \in S} p_{a,s'}^s V^*(s') \right), \quad \forall s \in S \tag{20}
$$

Once the optimal value function is determined, the optimal control policy is calculated as follows:

$$
\pi^*(s) = \arg\min_a \left( r(s, a) + \gamma \sum_{s' \in S} p_{a,s'}^s V^*(s') \right) \tag{21}
$$

Additionally, the action-value function $Q(s, a)$ and its optimal value $Q^*(s, a)$ are expressed by the following formulas:

$$
\begin{cases}
Q(s, a) = r(s, a) + \gamma \displaystyle\sum_{s' \in S} p_{a,s'}^s Q(s', a') \\
Q^*(s, a) = r(s, a) + \gamma \displaystyle\sum_{s' \in S} p_{a,s'}^s \min_{a' \in A} Q(s', a')
\end{cases} \tag{22}
$$

Fig. 8. IMN values for the two driving cycles with a time interval of 100 s, for speed grades from 0 to 30 km/h (driving cycle 1) and 0 to 35 km/h (driving cycle 2), with the threshold IMN = 0.25 indicated: (a) Driving cycle 1; (b) Driving cycle 2.

The variable $V^*(s)$ is the value of $s$ assuming that an optimal action is taken initially; therefore, $V^*(s) = Q^*(s, a)$ and $\pi^*(s) = \arg\min_a Q^*(s, a)$. The action-value function at time step $k$ is denoted as $Q_k(s, a)$; the update rule of standard Q-learning at time instant $k$ is then written as follows:

$$
Q_{k+1}(s, a) = Q_k(s, a) + \alpha_k\left( r(s, a) + \gamma \min_{a'} Q_k(s', a') - Q_k(s, a) \right) \tag{23}
$$

where $\alpha_k \in [0, 1]$ is a decaying factor of the Q-learning algorithm. Defining $zQ_k(s, a) = r(s, a) + \gamma \min_{a'} Q_k(s', a')$, Eq. (23) can be rewritten in the identical form:

$$
\begin{aligned}
Q_{k+1}(s, a) &= Q_k(s, a) + \alpha_k\left(zQ_k(s, a) - Q_k(s, a)\right) \\
&= Q_k(s, a) + \alpha_k\left(zQ_{k-1}(s, a) - Q_k(s, a)\right) + \alpha_k\left(zQ_k(s, a) - zQ_{k-1}(s, a)\right)
\end{aligned} \tag{24}
$$

To avoid slow convergence when the discount factor $\gamma$ is close to 1, the SQL algorithm is introduced in this manuscript. From the speedy learning perspective, $\alpha_k$ decays with time as $\alpha_k = 1/(k+1)$, and the second $\alpha_k$ in Eq. (24) is replaced with $(1 - \alpha_k)$ to formulate the update rule of the SQL algorithm as [35]:

$$
Q_{k+1}(s, a) = Q_k(s, a) + \alpha_k\left(zQ_{k-1}(s, a) - Q_k(s, a)\right) + (1 - \alpha_k)\left(zQ_k(s, a) - zQ_{k-1}(s, a)\right) \tag{25}
$$

Comparing Eqs. (24) and (25), it can be noticed that the same terms $zQ_{k-1}(s, a) - Q_k(s, a)$ and $zQ_k(s, a) - zQ_{k-1}(s, a)$ appear in the update rules of the standard Q-learning and SQL algorithms. However, Q-learning employs the same learning rate $\alpha_k$ for both terms, whereas SQL uses $\alpha_k$ for the first term and a bigger learning rate $(1 - \alpha_k) = k/(1 + k)$ for the second one. For small $k$, $\alpha_k \approx 1$ and $(1 - \alpha_k) \approx 0$, so the second term has little effect at the beginning. As the iteration index $k$ increases, the second term $zQ_k(s, a) - zQ_{k-1}(s, a)$ converges to zero while $Q_k$ approaches its optimal value $Q^*$, so it is not necessary for its learning rate to approach zero. This different learning rate gives SQL a faster convergence rate than Q-learning, which has been demonstrated mathematically in Ref. [35].

Fig. 9. Two driving cycles for TPMs computation (a) and controls comparison (b).

Table 3. Total number of the MC-based TPMs for two driving cycles.

Time interval Δt    Store times (Driving cycle 1)    Store times (Driving cycle 2)
150 s               3                                4
100 s               5                                8

At each time instant $k$ and for each state-action pair $(s, a)$, the SQL algorithm works as follows: 1) generate the next state $s'$ based on the transition probability $p_{a,s'}^s$; 2) compute $zQ_k(s, a)$ and $zQ_{k-1}(s, a)$ to estimate the action-value function at the current and previous time steps; 3) finally, update $Q_{k+1}(s, a)$ using Eq. (25). The SQL algorithm is carried out in Matlab using the Markov decision process (MDP) toolbox introduced in Ref. [36], wherein the inputs are the number of iterations and the transition probability and reward matrices, and the outputs are the learned action-value function, value function and optimal policy. The transition probability matrix is obtained as in Section 3.1, and the reward matrix is generated based on the powertrain model, the definition of the cost function and the transition probability matrix. The calculated optimal policy is related to the state variables, power demand and vehicle velocity. The pseudo-code of the SQL algorithm is described in Table 2. The energy management strategy is computed by combining the onboard learning algorithm for MC-based TPMs with the SQL algorithm, and its calculation flow diagram is depicted in Fig. 4. The online updating of the transition probability integrates the various power demand information into the TPM computation, and the IMN is applied to measure the similarity of multiple TPMs and to choose the appropriate control policy. To emphasize the merits of the proposed energy management strategy in fuel economy, the road gradient is not considered in the MC model. The decaying factor $\alpha_k$ is correlated with the time step $k$ and taken as $1/(k+1)$, the number of iterations $N$ is 10,000, and the sample time is 1 s.
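A compact Python sketch of the SQL update (Table 2, Eq. (25)) on a single TPM is given below. It performs synchronous full-sweep backups rather than the per-sample updates described above, uses min-backups because the reward of Eq. (7) is a cost, and the discount factor value is an assumption (the paper does not report it).

```python
import numpy as np

def speedy_q_learning(P, R, gamma=0.95, n_iter=10000):
    """P: transition probabilities, shape (n_actions, n_states, n_states).
    R: immediate cost r(s, a), shape (n_states, n_actions).
    Returns the learned Q table and the greedy (cost-minimizing) policy."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    Q_prev = Q.copy()                                   # Q_{-1} = Q_0 (Table 2)
    for k in range(n_iter):
        alpha = 1.0 / (k + 1)
        v_prev = Q_prev.min(axis=1)                     # min_a' Q_{k-1}(s', a')
        v_curr = Q.min(axis=1)                          # min_a' Q_k(s', a')
        zq_prev = R + gamma * np.einsum('asn,n->sa', P, v_prev)
        zq_curr = R + gamma * np.einsum('asn,n->sa', P, v_curr)
        # Eq. (25): small rate on the old target, large rate on the correction
        Q_next = Q + alpha * (zq_prev - Q) + (1 - alpha) * (zq_curr - zq_prev)
        Q_prev, Q = Q, Q_next
    return Q, Q.argmin(axis=1)                          # greedy policy, cf. Eq. (21)
```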

4. Simulation and discussion

Two aspects of the simulation study are executed to evaluate the effectiveness of the proposed energy management strategy. The onboard learning algorithm for MC-based TPM models is first verified on different driving cycles that represent varying driving conditions. Then, SDP and a conventional RL-based control strategy are adopted as the benchmark strategies to demonstrate the superiority of the SQL-based control strategy in optimality and computation time.

4.1. Verification of onboard learning algorithm for Markov Chain models

The steering power is a significant component of the power demand of an HTV, and it can be computed based on the speeds of the two tracks (Eq. (1)). Two representative real-world driving cycles, shown in Fig. 5, are chosen for the development of the MC-based TPM models via the onboard learning algorithm. The specific TPMs at speed v = 25 km/h for these two driving cycles are also depicted in Fig. 5. Fig. 6 illustrates the IMN variation versus elapsed time for these two driving cycles. According to the definition of the IMN, the spikes in Fig. 6 mean that the difference between successive TPMs is large, which is frequently caused by the speed variation. With the forgetting factor $\varphi$ and the similarity threshold value $\varepsilon_{sim}$ defined as 0.01 and 0.075, the convergence rates of the IMN in these two situations are different. For driving cycle 1, the learned MC models converge quickly due to the regular nature of the speed transitions. In contrast, the IMN has a spike around 400 s in Fig. 6(b) that is caused by the transition from low speed to high speed in Fig. 5(b). Thus, the convergence rate of the IMN is influenced by abrupt speed changes, and the differences in the road characteristics can be represented by the variation of the IMN. Hence, the existing TPMs and optimal control policies can be employed to accommodate the current driving conditions when the IMN value does not exceed the similarity threshold.

When the time interval $\Delta t$ is set to 150 s and 100 s, the corresponding alternation of the TPMs at specific time instants for different speed grades is shown in Figs. 7 and 8, respectively. That is, the TPMs of power demand are compared every 150 s and 100 s along the driving cycle to clarify their differences. With the IMN convergence threshold value set to 0.25, the stored number of MC-based TPMs and control strategies for these two driving cycles is decided. Table 3 depicts the total number of TPMs for the different driving cycles and time intervals. It can be discerned that the total number is related to the time interval $\Delta t$ and the convergence threshold value $\varepsilon_{conv}$. To consider both control performance and computation efficiency, $\Delta t$ and $\varepsilon_{conv}$ are predefined as 100 s and 0.3 for the comparison of the different control strategies in the next section.

Fig. 10. SoC trajectories and power split between engine and battery for the three control strategies (SQL, RL and SDP) under the simulation cycle.

Fig. 11. IMN values with a time interval of 100 s under SQL control (update instants marked).

4.2. Comparison of different control strategies

The Dyna and Q-learning algorithms were compared with DP and SDP in Refs. [26] and [27], respectively. The comparative fuel consumption indicates that these two algorithms are better than SDP and close to DP. Based on these discussions, this paper aims to evaluate the optimality of the SQL algorithm via comparison with the standard Q-learning and SDP methods. That the SQL-based controls are close to the globally optimal DP-based controls can then be demonstrated by showing that the SQL algorithm outperforms these two benchmark methods. The standard Q-learning algorithm is given in Eq. (23), and the SDP-based control policy is calculated by minimizing the cost function J over an infinite horizon:

$$
J_\pi(t) = \dot{m}_f(t) + \beta\left(SoC(t) - SoC_{pre}\right)^2 + \varepsilon \sum_{t+1} p_{t,t+1} J_\pi(t+1) \tag{26}
$$

where $J_\pi(t)$ indicates the resulting expected cost when the system starts at the given time and follows the policy $\pi$ thereafter. The variable $\varepsilon \in [0, 1]$ is a discount rate that weights the expected cost.


Fig. 12. Engine working points (torque versus engine speed, with fuel consumption contours and the lower fuel-consumption area indicated) for the three control strategies: SQL, RL and SDP.

Table 4. Results of fuel consumption after SoC-correction.

Control strategy    Fuel consumption (g)    Relative increase (%)
SQL                 2821.3                  –
RL                  2903.6                  2.92
SDP                 2978.5                  5.57

Table 5. Computation time comparison for three control strategies.

Control strategy    Computation time (s)*    Relative increase (%)
SQL                 1.35                     –
RL                  2.12                     371.11
SDP                 4.26                     846.67

* A 2.4 GHz microprocessor with 12 GB RAM was used.

In SDP, a policy iteration algorithm is used to solve the energy management problem, which consists of two steps: policy evaluation and policy improvement. In the policy evaluation step, given a desired power request $P_{dem}^d$ and an initial control policy $\pi_0$, the corresponding cost function $J_{\pi_0}(t)$ is calculated by Eq. (26). A new control policy is then computed by

$$
\pi\!\left(P_{req}^d\right) = \arg\min\left[J_{\pi_0}(t)\right] \tag{27}
$$

In the policy improvement step, the cost function is updated with the new policy. These two steps are repeated until the cost function converges within a selected tolerance level.
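For comparison with the SQL sketch above, a rough Python sketch of the SDP benchmark's policy iteration (Eqs. (26)–(27)) on a single TPM is given below; it uses an exact linear solve for policy evaluation instead of iterating the cost to a tolerance, and the discount rate value is an assumption.

```python
import numpy as np

def policy_iteration(P, C, eps=0.95):
    """P: transition probabilities, shape (n_actions, n_states, n_states).
    C: immediate cost C(s, a) = m_dot_f + beta*(SoC - SoC_pre)^2,
       shape (n_states, n_actions). eps: discount rate of Eq. (26)."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # policy evaluation: solve (I - eps * P_pi) J = C_pi exactly
        P_pi = P[policy, np.arange(n_states), :]
        C_pi = C[np.arange(n_states), policy]
        J = np.linalg.solve(np.eye(n_states) - eps * P_pi, C_pi)
        # policy improvement, Eq. (27)
        Q = C + eps * np.einsum('asn,n->sa', P, J)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, J
        policy = new_policy
```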

A long driving cycle used for computing the set of TPMs ($B_i$, $i = 1, 2, \ldots, m$) and another driving cycle used for the comparison of the three methods are shown in Fig. 9. For the SQL-based controls, the starting set of TPMs is chosen randomly among the $B_i$. Once the corresponding controls are not suitable for the current driving conditions, the workflow described in Fig. 2 is applied to update the TPM and the controls. The Q-learning-based and SDP-based controls are calculated based on the specific set of TPMs among the $B_i$, $i = 1, 2, \ldots, m$, which approximates the current driving conditions and is selected by Eq. (19). The SoC evolutions and the power split results between the engine and battery are indicated in Fig. 10. It is obvious that the trend of the SoC trajectory under the SQL-enabled strategy is different from that under the conventional RL-based strategy and the SDP-based strategy. This difference in the SoC trajectory can be ascribed to the update of the TPMs in real time, which prompts the control strategy to match the current driving conditions suitably. Fig. 11 describes the IMN values at different speed grades under the simulation cycle. As the IMN surpasses the threshold value at 200 s, 600 s and 700 s, the SQL-based control strategy is triggered to update so as to accommodate the current power demand better. This update also helps the engine and battery achieve a more appropriate power split, as described in Fig. 10, resulting in much higher fuel economy.

The engine working areas under the different control strategies are given in Fig. 12. Compared with the SDP-based strategy, the operating points under the SQL-enabled strategy lie frequently in the lower fuel-consumption area. The results in Table 4 depict the fuel consumption after SoC-correction (linear interpolation) [37] for the three control strategies. It can be recognized that the fuel consumption of the SQL-enabled strategy is lower than that of the other two strategies, which demonstrates the optimality of the proposed control strategy. This high fuel economy is contributed by the onboard learning algorithm for MC models, which results in the updating of the TPMs; the relevant SQL-based controls can thus match the current driving conditions best and produce lower fuel consumption. The computation times of these control strategies are contrasted in Table 5. This time represents the time needed by each of the three algorithms to derive the optimal policy based on the same TPM, and it also reflects the number of iterations needed for these algorithms to converge. From Table 5 it can be seen that the proposed control strategy is the fastest among the three kinds of control strategies, which results from the bigger learning rate of the second term in the SQL algorithm. The target computation time for an application in a real vehicle is about 1 s, and thus the speedy learning operation of the SQL algorithm makes it possible to apply the corresponding control strategy in real-time applications.

5. Conclusion

In this article, an onboard MC model learning algorithm and the SQL algorithm are proposed for real-time energy management of an HTV. In this approach, the MC-based TPM models are updated in real time via an onboard learning algorithm. Then the IMN is applied as the critical criterion for selecting the control policy that best matches the current driving condition. Moreover, the SQL algorithm is adopted to accelerate the convergence rate of the RL framework and to derive the corresponding optimal control strategy. The simulation results illustrate that the onboard learning algorithm for MC models can adapt to different driving conditions properly. Furthermore, the proposed SQL-enabled strategy is compared with the SDP and conventional RL-based strategies to validate its optimality and potential for real-time control. Further simulation and experimental investigations are underway to demonstrate the proposed control strategy in the real world under different working and driving conditions.

Acknowledgements

The work was in part supported by the Foundation of the State Key Laboratory of Automotive Simulation and Control (Grant No. 20171108), NNSF (Grant No. 11701027) and the Beijing Institute of Technology Research Fund Program for Young Scholars.

References

[1] Caux S, Gaoua Y, Lopez P. A combinatorial optimisation approach to energy management strategy for a hybrid fuel cell vehicle. Energy 2017;133:219–30.
[2] Shen P, Zhao Z, Zhan X, Li J, Guo Q. Optimal energy management strategy for a plug-in hybrid electric commercial vehicle based on velocity prediction. Energy 2018;155:838–52.
[3] Khayyam H, Bab-Hadiashar A. Adaptive intelligent energy management system of plug-in hybrid electric vehicle. Energy 2014;69:319–35.
[4] Du J, Chen J, Song Z, Gao M, Ouyang MG. Design method of a power management strategy for variable battery capacities range-extended electric vehicles to improve energy efficiency and cost-effectiveness. Energy 2017;121:32–42.
[5] Škugor B, Deur J. Dynamic programming-based optimisation of charging an electric vehicle fleet system represented by an aggregate battery model. Energy 2015;92:456–65.
[6] Serrao L, Onori S, Rizzoni G. A comparative analysis of energy management strategies for hybrid electric vehicles. J Dyn Syst 2011;133(3):1–9.
[7] Hou C, Ouyang M, Xu L, Wang H. Approximate Pontryagin's minimum principle applied to the energy management of plug-in hybrid electric vehicles. Appl Energy 2014;115:174–89.
[8] Yang C, Li L, You S, Yan B, Du X. Cloud computing-based energy optimization control framework for plug-in hybrid electric bus. Energy 2017;125:11–26.
[9] Hu XS, Murgovski N, Johannesson LM, Egardt B. Comparison of three electrochemical energy buffers applied to a hybrid bus powertrain with simultaneous optimal sizing and energy management. IEEE T Intell Transp Syst 2014;15(3):1193–205.
[10] Hu XS, Zou Y, Yang Y. Greener plug-in hybrid electric vehicles incorporating renewable energy and rapid system optimization. Energy 2016;111:971–80.
[11] Zou Y, Kong Z, Liu T, Liu DX. A real-time Markov chain driver model for tracked vehicles and its validation: its adaptability via stochastic dynamic programming. IEEE T Veh Technol 2017;66(5):3571–82.
[12] Du Y, Zhao Y, Wang Q, Zhang Y, Xia H. Trip-oriented stochastic optimal energy management strategy for plug-in hybrid electric bus. Energy 2016;115:1259–71.
[13] Tulpule P, Marano V, Rizzoni G. Energy management for plug-in hybrid electric vehicles using equivalent consumption minimization strategy. Int J Electr Hybr Veh 2010;2(4):329–50.
[14] Son D, Yeung RW. Optimizing ocean-wave energy extraction of a dual coaxial-cylinder WEC using nonlinear model predictive control. Appl Energy 2017;187:746–57.
[15] Musardo C, Rizzoni G, Guezennec Y, Staccia B. A-ECMS: an adaptive algorithm for hybrid electric vehicle energy management. Eur J Contr 2005;11(4):509–24.
[16] Johannesson L, Asbogard M, Egardt B. Assessing the potential of predictive control for hybrid vehicle powertrains using stochastic dynamic programming. IEEE T Intell Transp Syst 2007;8(1):71–83.
[17] Hou C, Sun J, Hofmann H. Adaptive model predictive control with propulsion load estimation and prediction for all-electric ship energy management. Energy 2018;150:877–89.
[18] Reynolds J, Rezgui Y, Kwan A, Piriou S. A zone-level, building energy optimization combining an artificial neural network, a genetic algorithm, and model predictive control. Energy 2018;151:729–39.
[19] Dong Z, Zhang Z, Dong Y, Huang X. Multi-layer perception-based model predictive control for the thermal power of nuclear superheated-steam supply systems. Energy 2018;151:116–25.
[20] Sun C, Hu X, Moura SJ. Velocity predictors for predictive energy management in hybrid electric vehicles. IEEE T Contr Syst Technol 2015;23(3):1197–204.
[21] Liu T, Hu X, Li S, Cao D. Reinforcement learning optimized look-ahead energy management of a parallel hybrid electric vehicle. IEEE/ASME T Mechatronics 2017;22(4):1497–507.
[22] Chen H, Kessels J, Donkers MCF. Game-theoretic approach for complete vehicle energy management. In: Proceedings of Vehicle Power and Propulsion Conference (VPPC); 2014. p. 1–6.
[23] Liu T, Hu X. A bi-level control for energy efficiency improvement of a hybrid tracked vehicle. IEEE T Ind Informat; 2018.
[24] Dextreit C, Kolmanovsky IV. Game theory controller for hybrid electric vehicles. IEEE T Contr Syst Technol 2014;22(2):652–63.
[25] Filev DP, Kolmanovsky I. Generalized Markov models for real-time modeling of continuous systems. IEEE T Fuzzy Syst 2014;22(4):983–98.
[26] Liu T, Zou Y, Liu DX, Sun FC. Reinforcement learning-based energy management strategy for a hybrid electric tracked vehicle. Energies 2015;8:7243–60.
[27] Zou Y, Liu T, Liu DX. Reinforcement learning-based real-time energy management for a hybrid tracked vehicle. Appl Energy 2016;171:372–82.
[28] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In: Proceedings of AAAI; 2016. p. 2094–100.
[29] Liu T, Zou Y, Liu DX, Sun FC. Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle. IEEE T Ind Electron 2015;62:7837–46.
[30] Zheng J, Zhang L, Shellikeri A, Cao W, Wu Q, Zheng J. A hybrid electrochemical device based on a synergetic inner combination of Li ion battery and Li ion capacitor for energy storage. Sci Rep 2017;7:41910.
[31] Filev DP, Kolmanovsky I. A generalized Markov chain modeling approach for on board applications. In: Proceedings of Neural Networks (IJCNN); 2010. p. 1–8.
[32] Wang CR, Shi RC. Matrix analysis. 1st ed. Beijing: Beijing Institute of Technology Press; 1989 (in Chinese).
[33] Hoekstra A, Filev D, Szwabowski S. Evolving Markov chain models of driving conditions using onboard learning. In: Proceedings of IEEE International Conference; 2013. p. 1–6.
[34] Sutton RS, Barto AG. Reinforcement learning: an introduction. A Bradford Book. Cambridge, Massachusetts; London, England: The MIT Press; 2011.
[35] Azar MG, Munos R, Ghavamzadeh M, Kappen HJ. Speedy Q-learning. In: Proceedings of NIPS; 2011. p. 2411–9.
[36] Chadès I, Chapron G, Cros MJ, Garcia F, Sabbadin R. MDPtoolbox: a multi-platform toolbox to solve stochastic dynamic programming problems. Ecography 2014;37:916–20.
[37] Chen R. Energy management strategy for hybrid electric tracked vehicle based on dynamic programming (in Chinese). Master's dissertation. Beijing Institute of Technology; 2011.