
Received December 24, 2018, accepted January 5, 2019, date of publication January 24, 2019, date of current version February 14, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2894756

Multiobjective Reinforcement Learning-Based Intelligent Approach for Optimization of Activation Rules in Automatic Generation Control

HUAIZHI WANG1, (Member, IEEE), ZHENXING LEI1, XIAN ZHANG2, (Member, IEEE), JIANCHUN PENG1, (Senior Member, IEEE), AND HUI JIANG3
1 College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen 518060, China
2 Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong
3 College of Optoelectronic Engineering, Shenzhen University, Shenzhen 518060, China

Corresponding author: Xian Zhang ([email protected])

This work was supported in part by the National Natural Science Foundations of China under Grant 51707123, in part by the Natural Science Foundations of Guangdong Province under Grant 2017A030310061 and Grant 2016A030313041, and in part by the Foundation of Shenzhen Science and Technology Committee under Grant JCYJ20170302153607971.

ABSTRACT This paper proposes a novel hybrid intelligent approach to solve the dynamic optimization problem of activation rules for automatic generation control (AGC) based on multiobjective reinforcement learning (MORL) and small population-based particle swarm optimization (SPPSO). The activation rule for AGC is to dynamically allocate the AGC regulating commands among various AGC units, and subsequently, the secondary control reserve of those units can be activated. Therefore, the activation rule for AGC is vital to ensure the overall control performance of AGC schemes. In this paper, MORL is applied to provide a customized platform for interactive self-learning to maximize the long-run discounted reward, i.e., minimize the generation cost, regulating error, and emission from a long-term viewpoint. SPPSO is utilized to effectively and efficiently obtain the optimality of the activation rule with a fast convergence speed to fulfill the real-time requirement of AGC activation. Furthermore, a novel analytic hierarchy process-based coordination factor is introduced to identify the optimum multi-objective tradeoff in various power system operation scenarios. At last, the validation of the proposed hybrid method has been demonstrated via comprehensive tests using practical data from the dispatch center of China Southern Power Grid.

INDEX TERMS Activation rule, automatic generation control, multiobjective reinforcement learning, particle swarm optimization.

I. INTRODUCTION
Automatic generation control (AGC) plays an essential role in restoring the system frequency and minimizing power flow deviations over network interchanges [1] by regulating the generation outputs of AGC units to accommodate the supply and demand imbalance. A critical issue concerning AGC schemes is the activation rule, which specifies how the real-time central regulating commands are optimally allocated to each dispatchable generating unit so that the secondary control reserves of those units can be activated [2]. Therefore, the activation rule for AGC is vital to the overall control performance enhancement of AGC schemes, and can be viewed as a real-time nonlinear multiobjective optimization problem considering the generation cost, regulating

performance, and emission because of air pollution and environmental considerations. Consequently, this paper focuses on investigating a new and advanced activation rule for AGC to solve the dynamic multiobjective optimal allocation of the AGC generation signal among various types of AGC providers.

Over the years, various mathematical and intelligent control strategies such as optimal control [3], [4], robust control [5]–[8], distributed control [9]–[11], soft computing models [12]–[16], and reinforcement learning [19]–[23] have been developed in the literature as existing AGC solutions. However, despite many works on AGC strategies, the optimal activation rule for the AGC signal is rarely investigated. In practice, the activation is usually simplified by using either


a fixed participating factor or one equivalent AGC unit in each control authority. This simplification indeed reduces the implementation effort but cannot ensure optimal performance, considering that multiple AGC participating units exist and may vary from time to time. Moreover, previous experiments in the dispatch centers of China Southern Power Grid (CSG) as well as the analysis in [2] showed that different activation rules may lead to different quality of control. In fact, the current practice to handle the activation rule is to utilize either the pro-rata rule, in which the AGC regulation participating factor for each unit is determined proportionally to the adjustable reserve capacity of the unit [2], [23], or the merit-order list [24], in which the AGC units with the lowest energy price [25] are prioritized. The pro-rata rule has been widely accepted by power utilities in Switzerland, France [2], and China [23]. Nevertheless, this method cannot address the cost and emission concerns involved in dispatching the AGC units, thus leading to non-optimal results. Power utilities in Germany and Italy [2] adopt the merit-order list to activate their secondary control. However, this method cannot ensure optimal performance either [2]. In [26] and [27], the regulating performance of AGC participating units for the provision of the frequency regulation service was considered in the AGC signal allocation process. Yet, these efforts fail to address the emissions of providers. As presented in [28], power utilities all over the world are facing ever increasing pressure on emission reduction due to, e.g., the Industrial Emissions Directive in Europe and similar regulations in other countries, which reinforces the need for low emission resources.

In general, the primary difficulty in determining the activation rule comes from the real-time requirement, since the energy management system (EMS) usually uses 2-8 seconds for the decision cycles of the AGC system [1], [26]. Besides various technical and economic constraints [1], the requirements of system reliability and security [22], together with external environmental uncertainties [13], make the activation rule problem involving multiple objectives hardly solvable by conventional or heuristic optimization techniques [29], [30]. Recently, the authors proposed in [31] a novel hierarchical Q-learning algorithm for such a problem, which proved to have competitive performance compared to other methods.

In recent years, a new branch of RL theory, multiobjective reinforcement learning (MORL), has received increased attention and been successfully applied in a variety of fields, including collaborative decision support systems, distributed control, robotic teams, and economics [32]. Previous applications have demonstrated that MORL can provide a customized platform for interactive self-learning rules to mitigate the environmental uncertainties and to maximize the discounted rewards from the long-term viewpoint. In this paper, the rewards are interpreted as the regulating performance, generation cost, and emission associated with the AGC signal allocation process. Based on our achievements in [31], this work is further devoted to investigating an optimal activation rule framework and a hybrid intelligent approach based on MORL and SPPSO that can enhance the AGC performance

and dynamic optimization efficiency. The main contributions of this paper include: (1) the formulation of the optimal activation rule with considerations of regulating performance, generation cost, and emission; (2) the transformation of the response of participating providers to AGC instructions into Markov decision processes (MDPs); (3) the development of a hybrid intelligent approach based on MORL and SPPSO, which is suitable to solve the MDPs, and the establishment of the framework for activation rule optimization. The proposed hybrid approach has been thoroughly tested and benchmarked on the CSG model under various operational scenarios.

II. MULTIOBJECTIVE REINFORCEMENT LEARNING
A. BASIC RL ALGORITHM
As originally inspired by animal intelligence, RL is a novel algorithm for enabling agents to learn an optimal policy through trial-and-error interactions with an unknown environment [33]. Due to advantages such as being model free, RL has been growing rapidly and has developed into a major branch of machine learning for solving sequential decision making problems.

Generally, RL features four basic elements: a model of the environment, a reward function, value functions, and an action policy [33]. Here, the model of the environment is characterized as a group of system states termed the state space S. The reward function converts the observed states into a single number to better express the desirability of the previously executed action. The value function represents the discounted sum of the future sequence of rewards starting from the state and following the action policy thereafter. Lastly, the action policy specifies a stimulus-response rule to choose and execute an appropriate action based on the value functions so as to maximize the expected long-term reward in each state-action pair. The RL based multi-step Q(λ), R(λ) and correlated equilibrium based CEQ(λ) have been applied in [19]–[21] by the authors to mitigate the effect of the long time delay of thermal plants and to provide a satisfactory AGC performance over various operation scenarios, where the activation rule optimization, however, has not been considered at all.

FIGURE 1. Basic architecture of MORL.

B. MULTIOBJECTIVE RL
Significantly different from basic RL algorithms, MORL has two or more objectives to be optimized by the learning agent, where each objective has its own reward signal. Given N objectives, the basic architecture of MORL is shown in Fig. 1, where ri (1 ≤ i ≤ N) denotes the ith indicator of the agent's reward offered by the external unknown environment. Obviously, this basic architecture describes the case of a single agent that is simultaneously faced with a set of different objectives.


With respect to any objective, there always exists a corresponding state-action value function in the form of a Q lookup table to express the long-term desirability of the state-action pairs. The state-action (s, a) value function, that is, the Q function, can be updated based on various rules to satisfy the Bellman equations, such as the single-agent Q-learning algorithm adopted in this paper, as follows,

Q_i(s, a) ← (1 − α) Q_i(s, a) + α [ r_i + max_{a'} Q_i(s', a') ]   (1)

where α denotes the learning rate parameter. If α decays appropriately while obeying the usual stochastic approximation conditions [33] and meanwhile all state-action pairs (s, a) are encountered infinitely often, then the Q lookup table converges. Consequently, the optimal action policy can be finalized and executed based on the converged Q lookup table.
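For illustration, update rule (1) for one objective can be written as a tabular operation as in the sketch below; the table sizes, the state/action indexing, and the optional discount factor are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma=1.0):
    """One tabular Q-learning step for one objective, following Eq. (1).

    Q       : array of shape (num_states, num_actions) for one objective
    s, a    : indices of the current state-action pair
    r       : immediate reward r_i observed for this objective
    s_next  : index of the next state
    alpha   : learning rate
    gamma   : discount factor; Eq. (1) as printed corresponds to gamma = 1,
              while a long-run discounted reward would use gamma < 1
    """
    target = r + gamma * np.max(Q[s_next])           # r_i + max_a' Q_i(s', a')
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# Example: one update on a toy table (sizes are illustrative only).
Q_obj2 = np.zeros((12, 20))
Q_obj2 = q_update(Q_obj2, s=3, a=7, r=-0.42, s_next=5, alpha=0.1)
```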

Regarding the entire set of objectives, the state-action value function can be defined in a vector form, as follows

MQ(s, a) = [Q_1(s, a), Q_2(s, a), ..., Q_N(s, a)]   (2)

where MQ(s, a) is the vectored state-action value function. The definition of the optimal policy π* for MORL is similar to the optimal policy definition of single-agent RL [34], as follows

π*(s) = argmax_a [ max_π MQ^π(s, a) ]   (3)

Apparently, the multiobjective optimization problem of maximizing MQ(s, a) can be solved using single-policy approaches and multiple-policy approaches [34]. The former strives to obtain the best single policy that satisfies the preference among multiple objectives, as specified according to the problem domain. The latter seeks to find a group of policies that approximate the Pareto front in order to provide diversity within the policy space. As for the activation rule, the optimal policy that minimizes the objectives is determined every 4 seconds in one AGC decision cycle. Therefore, a single-policy approach, namely the weighted-sum approach [35], is applied in this paper. The weight, termed the coordination factor, is determined using preset preference information applying the analytic hierarchy process (AHP) approach [36].
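To make the weighted-sum single-policy idea concrete, the sketch below scalarizes the per-objective Q tables of Eq. (2) with coordination factors and picks the greedy action of Eq. (3); the array shapes and the weight values are illustrative assumptions.

```python
import numpy as np

def scalarize(Q_list, weights):
    """Combine per-objective Q tables into one weighted-sum table.

    Q_list  : list of arrays, each (num_states, num_actions), one per objective
    weights : coordination factors tau_obj, assumed to sum to 1
    """
    return sum(w * Q for w, Q in zip(weights, Q_list))

def greedy_action(Q_list, weights, state):
    """Pick the action that maximizes the scalarized value in the given state."""
    MQ = scalarize(Q_list, weights)
    return int(np.argmax(MQ[state]))

# Example with three objectives (cost, ITAE, emission) and equal preference.
Q_cost, Q_itae, Q_emis = (np.random.rand(12, 20) for _ in range(3))
action = greedy_action([Q_cost, Q_itae, Q_emis], [1/3, 1/3, 1/3], state=4)
```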

III. PROBLEM FORMULATION
A. OVERVIEW OF AGC IMPLEMENTATION
In today's AGC schemes, the activation rule and the control pulses of each control authority for interconnected power grids are generally determined and generated by a central grid facility, i.e., the power dispatch center [11]. In this center, the control authority's AGC scheme is always materialized by two main control modules: the optimal AGC controller and the activation rule determination, as depicted in Fig. 2. The former module is a closed-loop feedback control unit that optimizes the solution of the total regulating command ΔP_C∑ in response to uninstructed deviations from schedule. To date, extensive AGC schemes [3]–[22], including several RL based AGC strategies under the Control Performance Standard (CPS), have been

FIGURE 2. Functional diagram of AGC implementation in each control authority.

proposed to generate the optimal overall command ΔP_C∑ over various operation scenarios. However, the AGC control systems in real-life power utilities are commonly established based on PI control methodologies, and most power dispatch centers in China have adopted an improved-PI based AGC control system developed by the Nanjing Automation Research Institute (NARI) [23].

The second module is to optimize the participation factors for each secondary control reserve (SCR) provider according to the activation rule. The overall AGC command ΔP_C∑ is weighted with the obtained participation factors, resulting in a set of reference signals ΔP_Ci, each of which will be transmitted to the SCADA system of a remote SCR provider and trigger the change of its production [2]. However, to the best of the authors' knowledge, there are few published papers that tackle the activation rule optimization problem.

It should be stressed that AGC, in which the activation rule is involved, is distinguished from economic dispatch (ED) because of the different time horizons and control objectives they have. AGC is an EMS function operated on a timescale of seconds to mitigate uninstructed deviations that manifest in the area control error (ACE) [27], according to the capacity margin procured in the day-ahead market and hour-ahead market. In contrast, ED is performed every 5-15 minutes to distribute the system base load economically amongst all available resources based on the dispatch instructions issued by the real-time market [27]. In the case studies of this paper, the AGC decision cycle is set to 4 seconds in the CSG power system model. Adopting an AGC decision cycle of 4 s does not mean the AGC system will dispatch a control command to each SCR provider every 4 s, because load cannot change at that rate due to turbine and generator inertia. Actually, an SCR provider will not receive any dispatch signal whenever the ACE is within an acceptable MW range.

B. ACTIVATION RULE OBJECTIVES
In the proposed activation rule optimization architecture, multiple objectives have been considered and designed. The primary objective is to minimize the regulating cost of all participating units and can be represented as follows

Obj_1 = Σ_{i=1}^{M} C_i(pf_i × ΔP_C∑)   (4)


where C_i denotes the regulating cost function of the ith AGC unit, pf_i is its participating factor, ΔP_C∑ is the overall AGC reference signal, and M is the number of participating units.

The second objective to be considered in the activation rule problem is the AGC regulating performance, which has the potential to improve the operational and economic efficiency of the power grid. In this paper, the regulating performance for all participating units is interpreted as minimizing the integral of time multiplied absolute error (ITAE), that is

Obj_2 = Σ_{i=1}^{M} ∫_0^{T_i} t × |pf_i × ΔP_C∑ − P_Gi| dt   (5)

where P_Gi is the real-time output of provider i, and T_i denotes the time required for the ith provider to follow its reference signal. In [27], the performance of a provider for the provision of the frequency regulation service is quantified by the ramping ability and the accuracy in following the AGC dispatch signal. In this paper, those two abilities are measured by the ITAE, since a smaller ramp rate or worse accuracy corresponds to an increasing ITAE error.
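For illustration, the per-provider ITAE term inside Eq. (5) can be approximated from sampled output data by numerical integration, as sketched below; the sampling interval and the signal values are illustrative assumptions.

```python
import numpy as np

def itae(reference, output, dt):
    """Approximate the ITAE term of Eq. (5) for one provider.

    reference : constant reference pf_i * DeltaP_Csum followed by the provider (MW)
    output    : array of sampled real-time outputs P_Gi (MW)
    dt        : sampling interval in seconds
    """
    t = np.arange(len(output)) * dt                 # time axis
    err = np.abs(reference - np.asarray(output))    # |pf_i * DeltaP_Csum - P_Gi|
    return np.trapz(t * err, dx=dt)                 # integral of t * |error| dt

# Example: a provider ramping toward a 50 MW reference over 60 s, sampled each second.
out = np.minimum(np.arange(61) * 1.2, 50.0)
print(itae(50.0, out, dt=1.0))
```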

Due to the concerns over global warming and air pollution, policy makers have been promoting low emission technologies and developing national emission limits since 1990 [37]. Power utilities have been required to reduce the emissions from their units [38]. The emission reduction pressure faced by power utilities has been ever increasing since the start of several directives controlling emissions, including, e.g., the Kyoto Protocol, the Large Combustion Plant Directive, and the National Emissions Ceilings Directive [28]. Considerable research has been reported on minimizing emission in ED [38]. However, little has been done in relation to AGC dispatch. As demonstrated in our case studies, the emission of AGC participating units can be reduced by 5%-10% by adopting the proposed hybrid activation rule compared to the conventional pro-rata method, which indicates a valuable improvement in this regard. Therefore, the activation rule optimization objective concerning emission reduction can be formulated as follows

Obj_3 = Σ_{i=1}^{M} E_i(pf_i × ΔP_C∑)   (6)

where E_i denotes the discharged emission function.
It is clear that those three objectives may conflict with

each other. With respect to the regulating performance, the typical time delay in AGC for thermal units ranges from 0.5 to 2 minutes, since the command for the turbine-boiler control system is executed slowly with large time constants. In contrast, liquefied natural gas (LNG), hydro units, and other sustainable units with power electronic interfaces can follow the reference signal with far smaller time delays. Regarding the regulating cost of units of the same type, units with smaller capacity generally exhibit lower efficiency and higher cost. As for different types of units, engineering practice reveals that the regulating costs per unit are ranked from LNG down to thermal, hydro, and wind turbine [29]. With concern over emission,

sustainable units are given priority to take part in the regulating activities compared to fossil-fueled units because of their lower emissions. Therefore, a preference tradeoff among the three conflicting objectives should be preset so as to optimize the overall performance in the current system state, as specified in Section IV.

C. ACTIVATION RULE CONSTRAINTS
The following constraints are required to be considered in the optimal activation rule problem:

1) PARTICIPATING FACTOR CONSTRAINTS
All the participating factors should be set within their lower and upper limits and meanwhile their sum should be equal to 1. This guarantees that the sum of the reference signals for all SCR providers equals the overall AGC dispatch signal ΔP_C∑, as follows

0 ≤ pf_i ≤ 1   (1 ≤ i ≤ M)   (7)

Σ_{i=1}^{M} pf_i = 1   (8)

On the other hand, engineering practice manifests that the participating factor of a certain SCR provider should be set to 0 if the provider is still involved in responding to dispatch signals from previous AGC decision cycles. The criterion for this involvement is determined as

|pf_i × ΔP_C∑ − P_Gi| > ξ_i   (9)

where ξ_i denotes an appropriately small positive number. That is to say, if (9) is satisfied, the AGC system will not dispatch any command to the ith SCR provider even when the ACE is beyond the acceptable range and needs to be reduced and managed. SCR providers with fast ramping ability may contribute more frequently in response to AGC commands because of their high operational flexibility.
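The following sketch illustrates how criterion (9) can be used to separate providers that are still tracking an earlier command (whose participating factors are set to 0) from those available for a new command; the data layout and threshold values are illustrative assumptions.

```python
def split_providers(pf, dPC_total, P_G, xi):
    """Return indices of busy and available providers according to Eq. (9).

    pf        : participating factors from the previous decision cycle
    dPC_total : overall AGC command DeltaP_Csum of that cycle (MW)
    P_G       : current real-time outputs of the providers (MW)
    xi        : per-provider tolerances xi_i (MW)
    """
    busy, available = [], []
    for i, (f, p, tol) in enumerate(zip(pf, P_G, xi)):
        if abs(f * dPC_total - p) > tol:   # Eq. (9): still following its last reference
            busy.append(i)                 # participating factor will be set to 0
        else:
            available.append(i)            # finished; eligible for a new command
    return busy, available

# Example with three providers and a 100 MW overall command.
busy, avail = split_providers([0.5, 0.3, 0.2], 100.0, [49.8, 15.0, 20.1], [0.5, 0.5, 0.5])
print(busy, avail)   # provider 1 is still ramping; providers 0 and 2 are available
```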

2) CAPACITY CONSTRAINTS
The real power generated by the SCR providers should be bounded within their minimum and maximum capacities,

P_Gi^min ≤ P_Gi + pf_i × ΔP_C∑ ≤ P_Gi^max   (10)

where P_Gi^min and P_Gi^max are the minimum and maximum adjustable capacities of the ith provider, and P_Gi denotes its corresponding output power.

3) GOVERNOR DEAD BAND CONSTRAINTS
The speed governor dead band will inevitably produce a delay in the unit response [39] if the absolute value of the given reference power for a certain provider is lower than its dead band value, resulting in a worse AGC performance because of the time criterion used in the ITAE. Therefore, to mitigate the effect of the governor dead band on the unit response, the given reference for the ith provider to follow should be greater than or equal to its dead band limit, described as follows

P_Di ≤ pf_i × ΔP_C∑   (11)

where P_Di denotes the dead band limit for the ith provider.


FIGURE 3. Basic activation rule determination architecture based on proposed hybrid intelligent approach.

IV. PROPOSED HYBRID INTELLIGENT ALGORITHM
It is obvious that the activation rule performance depends only upon the present decisions, not on the sequence of decision events that preceded them. This indicates that the Markov property is satisfied for the activation rule optimization problem. In other words, the activation rule problem can be transformed into a dynamic MDP that can be solved using MORL. It has been mathematically proved in [33] that RL pursues the maximization of the discounted reward from the long-term perspective, that is, optimization of the overall AGC performance given that the reward is related to the performance. This implies that the converged Q functions for all providers specify their abilities to follow the AGC dispatch signal. However, considering all the providers and all objectives, the Q functions form a high-dimensional problem space whose objectives cannot be minimized using conventional optimization methods due to their non-differentiable attributes. Therefore, in this research, SPPSO is applied to efficiently obtain the optimality of the high-dimensional Q-function space. Meanwhile, the real-time requirement of the activation rule problem can also be satisfied due to its very fast execution speed. The proposed hybrid approach based on MORL and SPPSO not only can provide the AGC controller with the capability of self-learning by experiencing the consequences of actions, but is also flexible in accommodating various control scenarios and control objectives for solving this challenging activation rule optimization problem.

In every AGC cycle, the activation rule optimization section observes the current system state, updates the Q-functions, and then seeks to find the optimal policy, that is, the combination of participating factors, using SPPSO. Therefore, the design of the optimal activation rule involves system state and action space discretization, the definition of reward functions, the design of the objective function, and SPPSO exploitation so as to fully explore the benefits of the proposed hybrid algorithm. The proposed overall control structure based on MORL and SPPSO is presented in Fig. 3.

A. SYSTEM STATE AND ACTION SPACE DISCRETIZATION
From the analysis in [40], it is clear that the frequency restoration process decisively depends on the imbalance between supply and demand. Thus, the state space S of the MORL algorithm is described in this paper by a set of ranges to which the imbalance belongs. The power imbalance ΔP_Im can be estimated in real time as [40],

ΔP_Im = −(1/R_sys + D) Δf   (12)

where R_sys is the equivalent droop characteristic, D is the equivalent damping coefficient, and Δf is the frequency bias.

In our previous work [31], participating factors were adopted as the action variable. However, they are discretized values rather than ranges, since a specific signal is required for each provider to follow. Obviously, the participation factor can make AGC activation simple, but further optimization of the activation rule becomes difficult. In other words, the participating factors can only be chosen from a preset discretized action space, limiting the possibility of achieving further optimized performance. Therefore, in this paper, the control adjustment per unit for each SCR provider, defined as the ratio of the regulating power in each AGC decision cycle to its acceptable maximum power, is adopted as the action variable. The relationship between the participating factor and the control adjustment per unit is described as

dP_i = (pf_i × ΔP_C∑) / ΔP_max,i   (13)

where dP_i is the control adjustment per unit for provider i, and ΔP_max,i is its maximum regulation power. Because it is pf_i × ΔP_C∑ that decides the reference signal for each provider to follow, dP_i can be defined in the form of ranges. This indicates that the participating factors are more accurately expressed, because the participating factors are not confined to a set of discretized values but become a continuous and controllable variable that can be adaptively optimized.
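The mapping of Eq. (13) between a participating factor and the per-unit control adjustment, and its inverse, can be written as below; the function names and example data are illustrative assumptions.

```python
def pf_to_dp(pf, dPC_total, dP_max):
    """Eq. (13): per-unit control adjustment from a participating factor."""
    return (pf * dPC_total) / dP_max

def dp_to_pf(dp, dPC_total, dP_max):
    """Inverse mapping used when the learned action is a per-unit adjustment."""
    return (dp * dP_max) / dPC_total

# Example: a provider with 80 MW maximum regulation power, a 200 MW overall
# command, and a participating factor of 0.2.
dp = pf_to_dp(0.2, 200.0, 80.0)     # 0.5 per unit
pf = dp_to_pf(dp, 200.0, 80.0)      # back to 0.2
print(dp, pf)
```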

Thereafter, the state and action space should be reasonably quantized, since the degree of state and action space


discretization can have a decisive influence upon the optimization performance of the AGC signal allocation process. By means of space discretization, the state-action space is classified into a set of finite regions, termed state-action pairs. A small number of state-action pairs would degrade the AGC performance because of the low-resolution solution it offers. Nevertheless, the real-time requirement with the decision cycle of 4 seconds may not be satisfied when there exists a too large number of state-action pairs. Therefore, the discretization of the system state and action space should be appropriately determined. The specific state and action spaces for the activation rule optimization are detailed in the case studies in Section V.

B. REWARD FUNCTION
The choice of a reward function for each objective is essential for the implementation of the MORL algorithm for the activation rule optimization, so that the Q-functions can be iteratively converged using (1). In [19], a piecewise relaxed reward function was proposed to avoid the over-compliance problem in existing AGC strategies. In [21], a multi-criteria piecewise reward function under CPS was presented to effectively enhance the overall AGC control performance. However, the previously defined reward functions in [19]–[21] cannot be directly implemented in the activation rule, since they involve different units and use fixed participating factors in each control authority for simplification, which cannot reflect the practical operation scenarios. Therefore, in order to apply MORL to the activation rule, the reward function should be defined in a decentralized manner, i.e., defined on the basis of the regulating performance of each provider, rather than in the centralized manner of [19]–[21], as follows,

R_obj1−i = −C_i(pf_i × ΔP_C∑) / (pf_i × ΔP_C∑)   (14)

R_obj2−i = −∫_0^{T_i} t × |pf_i × ΔP_C∑ − P_Gi| dt / (pf_i × ΔP_C∑)   (15)

R_obj3−i = −E_i(pf_i × ΔP_C∑) / (pf_i × ΔP_C∑)   (16)

where R_obj1−i, R_obj2−i, and R_obj3−i are the immediate reward functions of the ith provider for each objective. The reward functions in (14)-(16) are normalized by pf_i × ΔP_C∑ so that the rewards are measured per MW, and thus the accumulated performance per MW for each provider can be learned accordingly. Smaller values of R_obj1−i, R_obj2−i, and R_obj3−i correspond to high regulating cost, worse ITAE, or large emissions per MW for the ith provider, while larger values imply better regulating performance per MW.

The immediate rewards of the ith provider are calculated whenever the provider has just completed its reference command, and so its Q-functions can be iteratively converged. The Q-functions of all providers can be obtained in a similar manner. Obviously, the Q-functions of each provider account for its long-term regulating performance per MW.
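A compact sketch of the per-MW rewards (14)-(16) for one provider is given below; the quadratic cost and emission curves are placeholders for C_i(·) and E_i(·), not the functions used in the paper.

```python
import numpy as np

def rewards_per_mw(pf, dPC_total, P_G_trace, dt, cost_fn, emis_fn):
    """Immediate rewards (14)-(16) for one provider, normalized per MW.

    pf, dPC_total    : participating factor and overall AGC command (MW)
    P_G_trace        : sampled output of the provider while it followed the command
    dt               : sampling interval (s)
    cost_fn, emis_fn : regulating cost C_i(.) and emission E_i(.) functions
    """
    ref = pf * dPC_total                               # reference pf_i * DeltaP_Csum
    t = np.arange(len(P_G_trace)) * dt
    itae = np.trapz(t * np.abs(ref - np.asarray(P_G_trace)), dx=dt)
    r1 = -cost_fn(ref) / ref                           # Eq. (14)
    r2 = -itae / ref                                   # Eq. (15)
    r3 = -emis_fn(ref) / ref                           # Eq. (16)
    return r1, r2, r3

# Example with placeholder quadratic cost/emission curves.
cost = lambda p: 0.02 * p**2 + 1.5 * p
emis = lambda p: 0.01 * p**2
trace = np.minimum(np.arange(31) * 2.0, 40.0)
print(rewards_per_mw(0.2, 200.0, trace, dt=1.0, cost_fn=cost, emis_fn=emis))
```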

At each AGC decision cycle, the optimal AGC controller generates an overall command ΔP_C∑. Subsequently,

the participation factors are optimized in the activation rule module so as to minimize the cost, regulating error, and emissions, based on the objective function, the available providers, and their respective long-term regulating performance per MW, i.e., the Q-functions. An action with a larger Q value is more likely to be assigned a larger participation factor so that the objective can be minimized.

C. OBJECTIVE FUNCTION
Comprehensive performance evaluation criteria have been established and presented in Section III. To guarantee the quality of the proposed approach, the participating factors are optimized to account for cost, response performance, and emission simultaneously, which can be considered a multi-objective optimization problem. The proposed hybrid approach adopts MORL to provide a customized platform for interactive self-learning rules so as to maximize the long-run discounted reward for each objective.

A multi-objective function is developed based on the MORL platform and the well-established performance evaluation criteria, as described

min_π − Σ_{i=1}^{M} pf_i × ΔP_C∑ × [ τ_obj1 Q_obj1−i(s, a) + τ_obj2 Q_obj2−i(s, a) + τ_obj3 Q_obj3−i(s, a) ]

s.t.  0 ≤ pf_i ≤ 1   (1 ≤ i ≤ M)
      Σ_{i=1}^{M} pf_i = 1
      P_Gi^min ≤ P_Gi + pf_i × ΔP_C∑ ≤ P_Gi^max
      P_Di ≤ pf_i × ΔP_C∑   (17)

where Q_obj1−i, Q_obj2−i, and Q_obj3−i are the Q functions of the ith provider for objectives Obj_1, Obj_2, and Obj_3, respectively, and τ_obj1, τ_obj2, and τ_obj3 are the corresponding coordination factors, which are used to express preference information [34]. The Q functions can be obtained using the Q-learning rule specified in (1) according to the given rewards (14)-(16).
Apparently, the relative importance between objectives

depends on the system states. When the frequency bias and area control error (ACE) suffer serious deviations, the transient response performance should be given priority by the AGC controller. In this paper, the criterion for such deviations is whether the ACE is bounded within the Balancing Authority ACE Limit (BAAL) [41] or not. The BAAL has been released recently to replace CPS2 in North America to limit the balancing authority's unscheduled power flow, defined as follows

BAAL = −10 B_i × (3 ε_1)^2 / Δf   (18)

where B_i is the frequency bias setting for a balancing authority, and ε_1 is a targeted frequency bound for the power system. On the other hand, due to the concerns with respect to global warming and air pollution, there has been a reduction of national emission levels assigned by policy makers [28].


Therefore, the system operator's primary objective is to minimize the system emissions when they are still beyond the assigned levels. Otherwise, the system operator's cardinal rule is to minimize the regulating cost for each regulating authority. With the aid of this preference information, the preference condition can be determined in the form of qualitative language, as presented in Fig. 4. The coordination factors can then be determined from the qualitative language by applying the analytic hierarchy process (AHP) approach in [36].

FIGURE 4. Flow chart for the determination of preference condition.

TABLE 1. Evaluation of the relative importance between two objectives.

The AHP method divides the relative importance of pairs of objectives into 6 grades, and each grade is assigned an appropriate value, as described in Table 1, in which c_ij is the relative importance of objective i relative to objective j. From the relative importance matrix, the coordination factor of objective i (i ∈ {1, 2, 3}) can be determined as follows,

τ_obji = Σ_{j=1, j≠i}^{3} c_ij / Σ_{i=1}^{3} Σ_{j=1, j≠i}^{3} c_ij   (19)
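As an illustration of Eq. (19), the sketch below converts a 3×3 relative-importance matrix into coordination factors; the example matrix entries are illustrative and are not the grades of Table 1.

```python
import numpy as np

def coordination_factors(C):
    """Eq. (19): coordination factors from a relative-importance matrix.

    C : 3x3 array where C[i, j] is the relative importance of objective i
        to objective j (diagonal entries are ignored).
    """
    C = np.asarray(C, dtype=float)
    off_diag = C - np.diag(np.diag(C))      # drop c_ii terms (only j != i are summed)
    row_sums = off_diag.sum(axis=1)         # sum over j != i of c_ij for each objective
    return row_sums / row_sums.sum()        # normalize by the double sum in Eq. (19)

# Example: regulating performance judged more important than cost and emission.
C = [[1, 3, 2],
     [1/3, 1, 1],
     [1/2, 1, 1]]
tau = coordination_factors(C)
print(tau, tau.sum())   # the factors sum to 1
```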

D. SPPSO ALGORITHM
Particle swarm optimization (PSO) is a heuristic, population based stochastic search method that aims to mimic the behavior of flocks of birds or schools of fish, and has proved to be an efficient and robust optimization algorithm. However, PSO requires a large number of particles and thus involves a significant amount of computation time [42]. This will undoubtedly cause a serious problem for real-time optimization systems such as the activation rule optimization problem, which calls for fast convergence. Therefore, SPPSO is applied herein for the objective function optimization so as to further reduce the time burden and satisfy the real-time requirement.

SPPSO is a classical PSO algorithm but with a minimized population. In classical PSO, each particle, consisting of a velocity and a position, represents a potential solution to the problem. The particles change their positions within the search

space by keeping track of the previous best position and the corresponding fitness. Given the previous best position with the smallest objective function of the ith particle pb_i, and the best position of all the particles in the swarm pb_g, the velocity and position of the particles are updated according to the following equations,

υ_i(k + 1) = ψ · υ_i(k) + a_1 · κ_1 · (pb_i − x_i(k)) + a_2 · κ_2 · (pb_g − x_i(k))   (20)

x_i(k + 1) = x_i(k) + υ_i(k + 1)   (21)

where x_i(k) and υ_i(k) are the current position and velocity of the ith particle at instant k, a_1 and a_2 are two acceleration constants standing for the particle's desirability to move toward pb_i and pb_g, ψ is the inertia weight which takes control of the exploration and exploitation of the search space, and κ_1 and κ_2 are two random numbers within [0, 1].

SPPSO takes advantage of the concept of regeneration to give the particles the ability to keep performing the search with a small population. After every N_R iterations, all the particles except for the best one are regenerated with random velocities and positions. The choice of the value of N_R plays a key role in realizing an efficient SPPSO algorithm, since a low value of N_R will drive the particles to move erratically, and a higher value will delay the search process [43]. The regeneration process helps the particles avoid the dilemma of local minima and achieve a faster convergence speed compared to conventional PSO.
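A minimal SPPSO sketch following (20)-(21) with regeneration every N_R iterations is given below; the fitness function, bounds handling, and random-number generator are illustrative assumptions (the case studies later use 4 particles and N_R = 6).

```python
import numpy as np

def sppso(fitness, dim, bounds, n_particles=4, n_iter=60, NR=6,
          psi=0.7, a1=2.0, a2=2.0, seed=0):
    """Small population-based PSO with regeneration every NR iterations."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))        # positions
    v = np.zeros_like(x)                                     # velocities
    pb = x.copy()                                            # personal best positions
    pb_val = np.array([fitness(p) for p in x])
    g = pb[np.argmin(pb_val)].copy()                         # global best position

    for k in range(1, n_iter + 1):
        k1, k2 = rng.random((2, n_particles, dim))
        v = psi * v + a1 * k1 * (pb - x) + a2 * k2 * (g - x)   # Eq. (20)
        x = np.clip(x + v, lo, hi)                             # Eq. (21) + bound handling
        vals = np.array([fitness(p) for p in x])
        improved = vals < pb_val
        pb[improved], pb_val[improved] = x[improved], vals[improved]
        g = pb[np.argmin(pb_val)].copy()
        if k % NR == 0:                       # regeneration: keep only the best particle
            keep = np.argmin(pb_val)
            x = rng.uniform(lo, hi, size=x.shape)
            v = np.zeros_like(x)
            x[keep] = g
    return g, fitness(g)

# Example: minimize a simple quadratic over 5 decision variables in [0, 1].
best, val = sppso(lambda p: np.sum((p - 0.3) ** 2), dim=5, bounds=(0.0, 1.0))
print(best.round(3), round(val, 6))
```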

E. PROCEDURES OF THE HYBRID ALGORITHM
The proposed hybrid algorithm is devoted to optimizing the participating factors with respect to objective function (17) on the established activation rule determination architecture using SPPSO. The main procedures of the developed hybrid intelligent algorithm are illustrated as follows,
Step 1) Initialize the Q-functions, rewards, learning

parameters, PSO optimization parameters, and the AGC decision cycle [19]–[21]. Set the initial system state and preference condition, and establish the optimal AGC controller [23].

Step 2) At each time interval, divide all the AGC providers into two groups based on criterion (9): the first group consists of the units for which (9) is satisfied, and their participating factors are set to 0. The second group includes the other providers, which have just completed their previous regulating commands.

Step 3) For each provider in the second group, do
   I) Evaluate its transient response performance, regulating cost, and emission.
   II) Obtain an immediate reward from (14)-(16).
   III) Compute the system state using (12).
   IV) Update the Q function according to (1).

Step 4) Calculate the BAAL value (18), determine the preference condition applying the AHP approach, and compute the coordination factors using (19).


Step 5) Set the iteration counter N_L = 0 and set N_R.
Step 6) Randomize a population of particles in the search space.
Step 7) WHILE a sufficiently good fitness or the maximum number of iterations has not been reached, do
   a) For each particle, evaluate the particle's objective function and desired fitness function.
   b) Compare every fitness evaluation with that of its previous best position pb_i. Set pb_i equal to the current location if the current value is better than that of pb_i.
   c) Compare the updated pb_i value with the previous pb_g value in the swarm. Update the pb_g value and its parameters if any pb_i position is better than that of pb_g.
   d) Calculate the new positions and velocities of the particles according to (20) and (21).
   e) Examine whether the decision variables in all the newborn particles are within their valid boundaries. If not, set them to the value of the corresponding boundary.
   f) Increment the iteration counter N_L = N_L + 1.
   g) Regenerate a new population of particles with random positions and velocities when N_L is divisible by N_R.
Step 8) END WHILE
Step 9) According to the optimized participating factors, allocate the optimal reference command to each unit.
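As a rough illustration of how the converged Q functions, the coordination factors, and SPPSO fit together within one decision cycle, the sketch below builds a fitness function in the spirit of Eq. (17) over the participating factors; the Q-table layout, the action binning, and the constraint handling (only the sum-to-one constraint is enforced here) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def make_fitness(Q_list, tau, dPC_total, dP_max, state, n_actions=20):
    """Build an SPPSO fitness in the spirit of Eq. (17) for the current state.

    Q_list : per-objective Q tables, one per objective, each indexed as
             Q[provider, state, action_bin]  (an illustrative layout)
    tau    : coordination factors (tau_obj1, tau_obj2, tau_obj3)
    """
    def fitness(pf):
        pf = np.clip(pf, 0.0, 1.0)
        pf = pf / pf.sum()                                   # enforce sum(pf) = 1
        dP = pf * dPC_total / dP_max                         # Eq. (13), per provider
        bins = np.clip(((dP + 1) / 2 * n_actions).astype(int), 0, n_actions - 1)
        score = 0.0
        for i, b in enumerate(bins):
            q = sum(t * Q[i, state, b] for t, Q in zip(tau, Q_list))
            score += pf[i] * dPC_total * q
        return -score                                        # Eq. (17): minimize the negative sum
    return fitness

# Example: 3 providers, 12 states, 20 action bins per objective (illustrative sizes).
rng = np.random.default_rng(1)
Qs = [rng.random((3, 12, 20)) for _ in range(3)]
fit = make_fitness(Qs, (1/3, 1/3, 1/3), dPC_total=120.0,
                   dP_max=np.array([80.0, 60.0, 100.0]), state=5)
print(fit(np.array([0.4, 0.3, 0.3])))
```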

V. CASE STUDIES
A. MODEL DESCRIPTION
The performance of the proposed hybrid approach for activation rule optimization has been comprehensively evaluated on a practical CSG power system simulator, which was previously developed and validated by the Guangdong (GD) power dispatch center. The topology of the CSG network can be found in [44]. The CSG system contains four provincial control areas, GD, Guangxi (GX), Yunnan (YN), and Guizhou (GZ), interconnected by five parallel HVDC-HVAC transmission systems. The HVDC systems are established as a first-order constant-power model. In the established CSG model, there are 35 generators in the GD area, 16 in GX, 19 in YN, and 13 in GZ. The generator models for fossil-fuel fired (thermal), LNG, and hydro generators [45] and wind turbines (WT) [46] are included and presented in Fig. 5. Each generator output depends on the governor and on the set-point of the regulating command from the AGC controller according to the respective optimized participating factors. The typical parameters of the AGC units in the GD area are partly shown in Table 2.

In addition, the frequency bias coefficients B_i are set to −255 in GD, −35 in GX, −37.5 in YN, and −40 in GZ. The equivalent inertia constant H, equivalent damping coefficient D, and tie-line synchronizing coefficient T are tabulated in Table 3. Other relevant system data and parameters are detailed in [29].

FIGURE 5. Transfer function block diagram for thermal unit, LNG, hydro and WT.

TABLE 2. Model parameters for AGC units in GD power grid.

TABLE 3. CSG power system parameters.

In the CSG simulator, the AGC controller adopts NARI's improved PI control system [47], which has been implemented in the EMS system of the CSG dispatch center. The control system classifies the ACE into four operating areas, including the dead band, normal, near-emergency, and emergency areas. In each operating area, a well-tuned proportional and integral power component as well as a CPS based power component comprises the total regulating power ΔP_C∑. The CPS parameters are set the same as in [19].

B. LEARNING SCENARIO DESCRIPTION
The proposed hybrid algorithm for the activation rule should be scheduled to experience a series of off-line trial-and-error procedures, termed the pre-learning process, before its onsite operation [8]. As a preconditioning technique, the pre-learning process involves numerous exploration iterations in the system state space to consummate the Q functions, which describe the long-term operating performance of each provider. GD is selected as the research area to elaborate the pre-learning process. Herein, the system state variable obtained by means of (12), that is, the estimated power imbalance, is discretized into the following 12 ranges: (−∞, −500], (−500, −300], (−300, −200], (−200, −100], (−100, −50], (−50, 0], (0, 50], (50, 100], (100, 200], (200, 300], (300, 500], (500, +∞), accounting for HVDC-HVAC faults, generator tripping, network events, and load disturbances in the real system, respectively. However, in this paper, the power imbalance


triggered by all the faults is transformed into load disturbances with the same amplitude for simplification. Similarly, the action space [−1, 1] for each provider is divided evenly into 20 intervals in each system state. As a result, the total number of state-action pairs of a certain AGC unit for each objective is 240.
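The 12-range state discretization and the 20-interval action discretization described above can be indexed as in the sketch below; the helper names and the bin-search implementation are assumptions for illustration.

```python
import numpy as np

# Boundaries of the 12 power-imbalance ranges (MW), as listed above.
STATE_EDGES = [-500, -300, -200, -100, -50, 0, 50, 100, 200, 300, 500]
N_ACTIONS = 20            # the action space [-1, 1] split into 20 equal intervals

def state_index(dP_im):
    """Map an estimated power imbalance (Eq. (12)) to one of the 12 state ranges."""
    return int(np.searchsorted(STATE_EDGES, dP_im, side="left"))

def action_index(dP):
    """Map a per-unit control adjustment dP in [-1, 1] to one of the 20 intervals."""
    return min(int((dP + 1.0) / 2.0 * N_ACTIONS), N_ACTIONS - 1)

# Example: a -629.51 MW imbalance falls in the first range; dP = 0.37 maps to bin 13.
print(state_index(-629.51), action_index(0.37))
# 12 states x 20 actions = 240 state-action pairs per unit and per objective.
```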

The pre-learning process starts with a power imbalance sampling, followed by determining the load disturbance accounting for the imbalance, and applying the disturbance to the GD power model. The load disturbance leads to an unacceptable ACE which needs to be taken care of by the AGC controller. Hence, at each AGC decision cycle, the rewards can be computed and the Q functions can be updated accordingly, as presented in Section IV-E. Given a power imbalance of −629.51 MW, Fig. 6, in which one point indicates one Q function update, illustrates the detailed participating frequency for all AGC providers in the GD area. It can be seen that each Q function is updated approximately 6.77 times in the entire AGC process, which takes 51.92 seconds to run. The simulations are carried out with an Intel(R) Core i7-3612QM 2.1-GHz CPU and 16.00 GB of RAM.

FIGURE 6. The participating frequency for all AGC providers in GD.

FIGURE 7. Q-function convergence of G7/G19/G28 for objective Obj2.

The power imbalance sampling should be repeated in the pre-learning process until all the concerned Q functions have converged. The termination criterion for the Q-function convergence in the pre-learning process can be determined using the matrix 2-norm of the Q-function differences, ‖Q_{k+1}(s, a) − Q_k(s, a)‖_2 ≤ ϑ, where ϑ is a given small precision factor [19]. Fig. 7 demonstrates the 2-norm Q-function convergence of AGC providers G7, G19, and G28 for objective Obj_2 during the pre-learning process. Following the initialization rules in [33], all the Q-functions are initialized as zero matrices and the learning rate α is set to 0.1 with a decrease rate of 0.00001. In addition, ϑ is

adopted as 0.3 and the governor dead band for all providers is set to 0.5 MW. As seen from Fig. 7, all three estimated Q-functions tend to become stable after going through approximately 10000 Q-learning steps, implying that the optimal strategy has been learned. At this time, the Q functions of the other providers for each objective have also converged. The average computation time over 10 trials for the pre-learning process was 21.05 hours, with a minimum of 20.64 hours and a maximum of 21.39 hours. Furthermore, all AGC providers are required to be online in every imbalance sampling. Otherwise, the pre-learning process will take more time, since the regulating performances of the offline providers cannot be learned.
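The 2-norm termination test for the pre-learning process can be written as below; the function name and calling pattern are illustrative assumptions (ϑ = 0.3 follows the text).

```python
import numpy as np

def q_converged(Q_prev, Q_new, theta=0.3):
    """Check the pre-learning termination criterion ||Q_k+1 - Q_k||_2 <= theta.

    Q_prev, Q_new : Q lookup tables (num_states x num_actions) before and after
                    the latest batch of updates
    theta         : precision factor (0.3 in the case studies)
    """
    return np.linalg.norm(Q_new - Q_prev, ord=2) <= theta   # matrix 2-norm

# Example: two nearly identical 12 x 20 tables pass the test.
Q_old = np.random.rand(12, 20)
Q_new = Q_old + 1e-3
print(q_converged(Q_old, Q_new))
```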

All parameters, such as the Q-function matrices, should be stored after the completion of the pre-learning process, and the proposed hybrid algorithm for activation rule optimization can then be put into normal operation. Furthermore, the MORL will continue to make steady online optimization, and so the behavior of the proposed hybrid algorithm can be improved by interaction with the real power system.

TABLE 4. Sensitivity analysis on discrete number of state-action space.

C. PARAMETER SENSITIVITY ANALYSIS
The performance of the hybrid algorithm mostly depends on several parameters, such as the discrete numbers of the state-action space, the number of particles in the SPPSO swarm, and N_R. A series of sensitivity analyses regarding those parameters was conducted so as to fully explore the potential of the proposed algorithm. In Table 4, the regulating performance and the mean execution time of the proposed algorithm over 20 trials are presented for a varying discrete number of the state-action space. The Q functions have been learned as specified in Section V-B. In addition, the values of the parameters used in this simulation are set as follows: τ_obj1 = τ_obj2 = τ_obj3 = 1/3, a_1 = a_2 = 2, ψ = 0.7, N_R = 6. The disturbance is a step load disturbance with an amplitude of 100 MW, and the number of particles in SPPSO is set to 4. The results


from Table 4 demonstrate that a small number of state-action pairs could lead to worse AGC performance because of the low-resolution solution that it can offer. Furthermore, the performance of the hybrid algorithm no longer increases, while the execution time of the algorithm increases as expected, when the numbers of states and actions are larger than 12×20.

TABLE 5. Sensitivity analysis on number of particles and number of iterations before regeneration in SPPSO.

Table 5 lists the regulating performance of the hybrid algorithm and its computation time over 20 trials for a varying number of particles and N_R. The number of states and actions adopts 12×20, and the other parameters remain unchanged. It can be seen that the regulating performance of the hybrid algorithm will not increase if the number of particles is larger than 4 and N_R > 6. Meanwhile, the mean execution time of the algorithm increases by approximately 30%. Therefore, taking into account the real-time requirement for activation rule optimization and the regulating performance variation trends, the number of particles and N_R in SPPSO are set to 4 and 6, respectively. In summary, this group of well-tuned parameters is employed in the following experiments.

D. STUDY ON COORDINATION FACTORS
The regulating performance of the hybrid algorithm can be appropriately adjusted using the coordination factors in (17), since they reflect the preference of the system operators. To validate the effects of different coordination factors on the algorithm performance, simulations with different preference conditions under a typical step load disturbance with an amplitude of 400 MW were carried out. Here, the output of the WT is stochastically determined according to the annual wind profile applying the output curve [48], and its tuning range in an AGC cycle is limited to [−20%, 5%] of its initial output power. It is assumed that the output of the WT remains constant in the AGC regulation process. The results are comprehensively compared with the practically used pro-rata method and the merit-order method presented in [2].

Fig. 8 shows the frequency response of all participating units under the four preference conditions. The detailed regulating cost and emission are tabulated in Table 6.

FIGURE 8. The frequency response under different preference conditions.

TABLE 6. Simulation results under different preference conditions.

Merit-order I, II, and III indicate that the list of secondary reserve providers is sorted based on energy prices, ramp rate capabilities, and emissions, respectively. The results from the experiment imply that the proposed hybrid algorithm works better than the classical pro-rata method not only on the dynamic transient response performance but also on regulating cost and emissions under all four preference conditions. This is because the regulating objectives are further optimized according to the given generator information in the Q table. In addition, merit-order I exhibits the worst transient response but leads to the minimal regulating cost among all the tested activation rule determination approaches. This can be explained by the coal-fired thermal units, which have lower energy cost but worse ramping abilities. Furthermore, it can be readily observed from Fig. 8 that merit-order II performs better in transient response than the proposed method under conditions C and D but worse under conditions A and B. This is owing to the fact that, under conditions A and B, the proposed algorithm involves more participating units with high ramp rates than merit-order II in each AGC decision cycle. Meanwhile, merit-order III exhibits a slightly worse regulating performance in response than merit-order II due to the difference in their sorting lists. Furthermore, it can be concluded that each merit-order list has relatively worse performance on the other two control objectives because it concerns only one objective.

On the other hand, the proposed algorithm performs best in transient response under preference condition D among the four preference conditions, because more LNG units are required to participate in the AGC cycle for emission mitigation. In addition, the proposed algorithm has exhibited the minimal regulating cost under preference condition B, since more coal-fired units with lower regulating cost are given priority in the AGC tuning process. Furthermore, preference


condition A possesses minimal emissions compared to the other conditions due to the higher weight on emissions provided in objective function (17). Obviously, the proposed method under each condition usually provides a compromise solution in which two objectives are relatively optimal according to the preset preference information. Consequently, the dynamic performance of AGC can be regulated online through coordination factor adjustment in the AGC decision cycle.

TABLE 7. AGC performance metrics of GD power grid over 24-hours.

TABLE 8. AGC performance metrics of GX power grid over 24-hours.

TABLE 9. AGC performance metrics of GZ power grid over 24-hours.

E. STATISTICAL EXPERIMENTS ON THE CSG SYSTEM
In addition, the long-term performance of the activation rule should be completely evaluated based on the statistical results of comparative experiments, in which the CSG simulators have been implemented with the preset disturbance scenarios over 24 hours [21]. The adaptability and dynamic optimization of the proposed approach can be examined and analyzed under the representative stochastic load disturbance [49]. The performance of the proposed hybrid method has been benchmarked and compared with the real-world pro-rata and merit-order methods. The obtained statistics with an assessment period of 10 min for the GD, GX, GZ, and YN power grids on various AGC performance indices are presented in Tables 7-10, where |Δf| and |ACE| are the averages of the absolute values of the frequency deviation and ACE over the entire

TABLE 10. AGC performance metrics of YN power grid over 24-hours.

simulation period, and BAAL is the daily compliance percentage. It is clear from the statistical experiments that the proposed hybrid approach is more adaptable to the changing scenarios, revealing its superior online self-learning capability compared to the classical pro-rata and merit-order methods. In comparison with the parallel pro-rata activation rule, the proposed hybrid method can take full account of the preference information and current system states, and thus the regulating performance can be improved accordingly. In addition, the hybrid method under any single condition generally stayed ahead of the merit-order method in the regulating performance on the two optimized objectives, which is consistent with the conclusions in Section V-D. Furthermore, as supported by Tables 7-10, the regulating performance of the proposed method on each objective can be adjusted adaptively via preference condition selection, demonstrating the high flexibility of the proposed method.

As presented in [21], communication and computation infrastructures with high design flexibility are the core technologies required in a future smart dispatch center. An intelligent automatic generation control framework with grid-to-grid coordination has been proposed in [21]. However, the activation rule for the secondary control reserve has not been considered there, which may render the advanced control framework unable to be implemented in real power grids because of the changing scenarios that the grid faces. Therefore, the hybrid optimal activation rule method using MORL and SPPSO has been proposed herein to achieve a performance-adaptive control framework. As outlined in Fig. 3, the framework consists of six subparts, including preference state determination, system state estimation, performance evaluation, reward calculation, Q function update, and the SPPSO algorithm. It can be concluded that only the performance evaluation subpart may need extra infrastructure investment, since the other five subparts can be completely fulfilled using the real-time dataset and software equipment already existing in the EMS. As described in Section IV-E, the performance evaluation subpart is designed to calculate the energy cost, ITAE, and emission objectives for the providers in the second group. However, the computation of the energy cost already exists in most current dispatch centers, including CSG. In addition, the necessary data used to calculate the ITAE (15) are also accessible. Similarly, the discharged emission of each provider can be evaluated using the existing database and communication infrastructure.


Therefore, no extra control infrastructure investment beyond the development cost is required to implement the hybrid activation rule approach, making the proposed method more practical for real power grids.

VI. CONCLUSIONS
In this paper, a novel hybrid intelligent algorithm combining MORL and SPPSO is developed and successfully applied to the dynamic optimization of the activation rule, which is critical to AGC performance enhancement. A novel objective function based on coordination factors accounting for regulating cost, transient performance, and emission is constructed to obtain optimal AGC regulating performance. The coordination factors are designed from the preference information using the analytic hierarchy process. The effectiveness of the proposed hybrid intelligent algorithm has been comprehensively validated via comparative tests against several well-established benchmarks using a practical CSG dataset. The proposed approach provides a customized platform for the activation rule problem with high flexibility because of the self-learning attributes that the MORL algorithm offers. In addition, it has been demonstrated that the real-time requirement is also satisfied by the proposed approach.

REFERENCES
[1] N. Jaleeli, L. S. VanSlyck, D. N. Ewart, L. H. Fink, and A. G. Hoffmann, "Understanding automatic generation control," IEEE Trans. Power Syst., vol. 7, no. 3, pp. 1106–1122, Aug. 1992.
[2] I. Avramiotis-Falireas, P. Zolotarev, A. Ahmadi-Khatir, and M. Zima, "Analysis and comparison of secondary frequency control reserve activation rules: Pro-rata vs. merit order," presented at the Power Syst. Comput. Conf. (PSCC), Wroclaw, Poland, Aug. 2014, pp. 1–7.
[3] C. E. Fosha and O. I. Elgerd, "The megawatt-frequency control problem: A new approach via optimal control theory," IEEE Trans. Power App. Syst., vol. PAS-89, no. 4, pp. 563–577, Apr. 1970.
[4] Y. Hain, R. Kulessky, and G. Nudelman, "Identification-based power unit model for load-frequency control purposes," IEEE Trans. Power Syst., vol. 15, no. 4, pp. 1313–1321, Nov. 2000.
[5] D. Rerkpreedapong, A. Hasanovic, and A. Feliachi, "Robust load frequency control using genetic algorithms and linear matrix inequalities," IEEE Trans. Power Syst., vol. 18, no. 2, pp. 855–861, May 2003.
[6] H. Bevrani and T. Hiyama, "On load–frequency regulation with time delays: Design and real-time implementation," IEEE Trans. Energy Convers., vol. 24, no. 1, pp. 292–300, Mar. 2009.
[7] H. Bevrani, Y. Mitani, and K. Tsuji, "Robust decentralised load-frequency control using an iterative linear matrix inequalities algorithm," IEE Proc.-Gener. Transmiss. Distrib., vol. 151, no. 3, pp. 347–354, May 2004.

[8] S. Saxena and Y. V. Hote, "Load frequency control in power systems via internal model control scheme and model-order reduction," IEEE Trans. Power Syst., vol. 28, no. 3, pp. 2749–2757, Aug. 2013.

[9] T.-C. Yang, H. Cimen, and Q. M. Zhu, "Decentralised load-frequency controller design based on structured singular values," IEE Proc.-Gener. Transmiss. Distrib., vol. 145, no. 1, pp. 7–14, Jan. 1998.
[10] B. Tyagi and S. C. Srivastava, "A decentralized automatic generation control scheme for competitive electricity markets," IEEE Trans. Power Syst., vol. 21, no. 1, pp. 312–320, Feb. 2006.
[11] M. H. Variani and K. Tomsovic, "Distributed automatic generation control using flatness-based approach for high penetration of wind generation," IEEE Trans. Power Syst., vol. 28, no. 3, pp. 3002–3009, Aug. 2013.
[12] F. Beaufays, Y. Abdel-Magid, and B. Widrow, "Application of neural networks to load-frequency control in power systems," Neural Netw., vol. 7, no. 1, pp. 183–194, 1994.
[13] L. D. Douglas, T. A. Green, and R. A. Kramer, "New approaches to the AGC non-conforming load problem," IEEE Trans. Power Syst., vol. 9, no. 2, pp. 619–628, May 1994.
[14] A. E. Gegov and P. M. Frank, "Decomposition of multivariable systems for distributed fuzzy control," Fuzzy Sets Syst., vol. 73, no. 3, pp. 329–340, Aug. 1995.
[15] J. Talaq and F. Al-Basri, "Adaptive fuzzy gain scheduling for load frequency control," IEEE Trans. Power Syst., vol. 14, no. 1, pp. 145–150, Feb. 1999.
[16] Y. L. Abdel-Magid and M. M. Dawoud, "Optimal AGC tuning with genetic algorithms," Electr. Power Syst. Res., vol. 38, no. 3, pp. 231–238, Sep. 1996.
[17] S. P. Ghoshal, "Application of GA/GA-SA based fuzzy automatic generation control of a multi-area thermal generating system," Electr. Power Syst. Res., vol. 70, no. 2, pp. 115–127, Jul. 2004.
[18] T. P. I. Ahamed, P. S. N. Rao, and P. S. Sastry, "A reinforcement learning approach to automatic generation control," Electr. Power Syst. Res., vol. 63, no. 1, pp. 9–26, Aug. 2002.
[19] T. Yu, B. Zhou, K. W. Chan, L. Chen, and B. Yang, "Stochastic optimal relaxed automatic generation control in non-Markov environment based on multi-step Q(λ) learning," IEEE Trans. Power Syst., vol. 26, no. 3, pp. 1272–1282, Aug. 2011.
[20] T. Yu, B. Zhou, K. W. Chan, Y. Yuan, B. Yang, and Q. H. Wu, "R(λ) imitation learning for automatic generation control of interconnected power grids," Automatica, vol. 48, no. 9, pp. 2130–2136, Sep. 2012.
[21] T. Yu, H. Z. Wang, B. Zhou, K. W. Chan, and J. Tang, "Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids," IEEE Trans. Power Syst., vol. 30, no. 4, pp. 1669–1679, Jul. 2015.
[22] I. Nasiruddin, P. Kumar, and D. P. Kothari, "Recent philosophies of automatic generation control strategies in power systems," IEEE Trans. Power Syst., vol. 20, no. 1, pp. 346–357, Feb. 2005.
[23] Y. Xichang and Z. Quanren, "Practical implementation of the SCADA+AGC/ED system of the Hunan power pool in the central China power network," IEEE Trans. Energy Convers., vol. 9, no. 2, pp. 250–255, Jun. 1994.
[24] G. A. Angelidis and A. D. Papalexopoulos, "On the operation and pricing of real-time competitive electricity markets," in Proc. Power Eng. Soc. Winter Meeting, vol. 1, Jan. 2002, pp. 420–427.
[25] H. Singh and A. Papalexopoulos, "Competitive procurement of ancillary services by an independent system operator," IEEE Trans. Power Syst., vol. 14, no. 2, pp. 498–504, May 1999.
[26] (Feb. 22, 2012). Pay for Performance Regulation, Draft Final Proposal Addendum, CAISO. [Online]. Available: http://www.caiso.com/Documents/Addendum-DraftFinalProposal-Pay_PerformanceRegulation.pdf
[27] A. D. Papalexopoulos and P. E. Andrianesis, "Performance-based pricing of frequency regulation in electricity markets," IEEE Trans. Power Syst., vol. 29, no. 1, pp. 441–449, Jan. 2014.

[28] E. Denny and M. O'Malley, "Wind generation, power system operation, and emissions reduction," IEEE Trans. Power Syst., vol. 21, no. 1, pp. 341–347, Feb. 2006.

[29] The Operation Mode of China Southern Power Grid, (in Chinese), China Southern Power Grid Co. Ltd., Beijing, China, 2013.
[30] M. A. Abido and J. M. Bakhashwain, "Optimal VAR dispatch using a multiobjective evolutionary algorithm," Int. J. Electr. Power Energy Syst., vol. 27, no. 1, pp. 13–20, Jan. 2005.
[31] T. Yu, Y. M. Wang, W. J. Ye, B. Zhou, and K. W. Chan, "Stochastic optimal generation command dispatch based on improved hierarchical reinforcement learning approach," IET Gener., Transmiss. Distrib., vol. 5, no. 8, pp. 789–797, Aug. 2011.
[32] P. Stone and M. Veloso, "Multiagent systems: A survey from a machine learning perspective," Auton. Robots, vol. 8, no. 3, pp. 345–383, Jun. 2000.
[33] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[34] C. Liu, X. Xu, and D. Hu, "Multiobjective reinforcement learning: A comprehensive overview," IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 3, pp. 385–398, Mar. 2015.
[35] D. C. K. Ngai and N. H. C. Yung, "A multiple-goal reinforcement learning method for complex vehicle overtaking maneuvers," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 509–522, Jun. 2011.
[36] Y. Zhao, Q. W. Chen, and W. Hu, "Multi-objective reinforcement learning algorithm for MOSDMP in unknown environment," in Proc. 8th World Congr. Intell. Control Autom., Jul. 2010, pp. 3190–3194.
[37] (2001). The Promotion of Electricity Produced From Renewable Energy Sources in the Internal Electricity Market, Directive 2001/77/EC of the European Parliament and of the Council. [Online]. Available: http://www.europa.eu.int/scadplus/Ieg/en/lvb/l27035.htm


[38] M. Benidris and J. Mitra, "Reliability and sensitivity analysis of composite power systems under emission constraints," IEEE Trans. Power Syst., vol. 29, no. 1, pp. 404–412, Jan. 2014.
[39] I. Egido, F. Fernandez-Bernal, L. Rouco, E. Porras, and A. Saiz-Chicharro, "Modeling of thermal generating units for automatic generation control purposes," IEEE Trans. Control Syst. Technol., vol. 12, no. 1, pp. 205–210, Jan. 2004.
[40] H. Bevrani, Robust Power System Frequency Control. New York, NY, USA: Springer, 2010.
[41] NERC, Real Power Balancing Control Performance, BAL-001-1. Accessed: Feb. 2010. [Online]. Available: http://www.nerc.com/docs/standards/sar/Project_2010-14-1_BAL-001-1_Standard_Clean_20120604_final_rev1.pdf
[42] T. K. Das, G. K. Venayagamoorthy, and U. O. Aliyu, "Bio-inspired algorithms for the design of multiple optimal power system stabilizers: SPPSO and BFA," IEEE Trans. Ind. Appl., vol. 44, no. 5, pp. 1445–1457, Sep. 2008.
[43] J. Zhang, J. Wang, and C. Yue, "Small population-based particle swarm optimization for short-term hydrothermal scheduling," IEEE Trans. Power Syst., vol. 27, no. 1, pp. 142–152, Feb. 2012.
[44] T. Yu, B. Zhou, K. W. Chan, L. Chen, and E. Lu, "Stochastic optimal CPS relaxed control methodology for interconnected power systems using Q-learning method," J. Energy Eng., vol. 137, no. 3, pp. 116–129, Sep. 2011.

[45] P. Kundur, Power System Stability and Control. New York, NY, USA: McGraw-Hill, 1994.

[46] D.-J. Lee and L. Wang, "Small-signal stability analysis of an autonomous hybrid renewable energy power generation/energy storage system part I: Time-domain simulations," IEEE Trans. Energy Convers., vol. 23, no. 1, pp. 311–320, Mar. 2008.
[47] G. Zonghe, T. Xianliang, and T. Liqun, "Hierarchical AGC mode and CPS control strategy for interconnected power systems," (in Chinese), Autom. Electr. Power Syst., vol. 28, no. 1, pp. 78–81, Jan. 2004.
[48] Z. Bie, P. Zhang, G. Li, B. Hua, M. Meehan, and X. Wang, "Reliability evaluation of active distribution systems including microgrids," IEEE Trans. Power Syst., vol. 27, no. 4, pp. 2342–2350, Nov. 2012.
[49] Y. Manichaikul, Industrial Electric Load Modeling. Cambridge, MA, USA: MIT Press, 1978.

HUAIZHI WANG (M'16) received the B.Eng. and M.Eng. degrees in control science and engineering from Shenzhen University, Shenzhen, China, in 2009 and 2012, respectively, and the Ph.D. degree in electrical engineering from the South China University of Technology, Guangzhou, China, in 2015. He was a Research Assistant with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong, from 2014 to 2015. He is currently an Assistant Professor with Shenzhen University. His research interest includes automatic generation control in cyber-physical power systems.

ZHENXING LEI received the B.Eng. degree in automation science from Zhengzhou University, Zhengzhou, China, in 2016. He is currently pursuing the M.Eng. degree with the Department of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, China. His research interest includes automatic generation control in smart grid.

XIAN ZHANG received the B.Sc. degree in electrical engineering from North China Electric Power University, China, in 2009, and the M.Sc. degree in electrical engineering from Tsinghua University, China, in 2012. She is currently pursuing the Ph.D. degree with The Hong Kong Polytechnic University. She received The Hong Kong Polytechnic University Research Studentship for her Ph.D. study. Her research interests include smart grid and electric vehicles.

JIANCHUN PENG (M'04–SM'17) received the B.S. and M.S. degrees from Chongqing University, Chongqing, China, in 1986 and 1989, respectively, and the Ph.D. degree from Hunan University, Hunan, China, in 1998, all in electrical engineering. He was a Visiting Professor with Arizona State University, Tempe, AZ, USA, from 2002 to 2003, and with Brunel University, London, U.K., in 2006. He is currently a Professor with Shenzhen University and the Director of the Department of Control Science and Engineering. His interests include electricity markets and power system optimal operation and control.

HUI JIANG received the B.S. degree from Chongqing University, Chongqing, China, in 1990, and the M.S. and Ph.D. degrees from Hunan University, Hunan, China, in 1999 and 2005, respectively, all in electrical engineering. From 2005 to 2006, she was a Visiting Scholar with Brunel University, London, U.K. She is currently a Professor with Shenzhen University. Her research interests include power system economics and power system planning and operation.
