
IT 19 089

Examensarbete 30 hp, December 2019

Temperature handler in radios using machine learning

Arnthor Helgi Sverrisson

Institutionen för informationsteknologi
Department of Information Technology



Abstract

Temperature handler in radios using machine learning

Arnthor Helgi Sverrisson

Machine learning is revolutionising the field of automation in various industries, but in a number of cases there exist powerful methods and tools that do not involve a learning process the way machine learning does. In this thesis, controllers for compensating for overheating in radio stations are built, evaluated and compared. The controllers are based on two different approaches: the first is based on model predictive control (MPC), and the second on methods of reinforcement learning (RL). This report compares those two approaches and reports qualitative and quantitative differences.

Printed by: Reprocentralen ITC
IT 19 089
Examiner: Mats Daniels
Subject reviewer: Kristiaan Pelckmans
Supervisor: Lebing Jin


Contents

1 Introduction
  1.1 Current implementation
  1.2 Expert system
  1.3 Contribution

2 Theory
  2.1 System identification
    2.1.1 State space models
  2.2 Model predictive control
    2.2.1 Objective function
  2.3 Reinforcement learning
    2.3.1 Markov decision process
    2.3.2 Q learning
    2.3.3 Deep Q-learning

3 Experiment setup
  3.1 System identification
    3.1.1 Climate chamber experiment
    3.1.2 State space model
  3.2 Model predictive control
  3.3 Reinforcement learning
    3.3.1 Simulation
    3.3.2 States
    3.3.3 Actions
    3.3.4 Reward function
    3.3.5 Training process

4 Results
  4.1 Model predictive control results
    4.1.1 Comparison on different control horizons
    4.1.2 Comparison on different weights
    4.1.3 MPC controller results
  4.2 Reinforcement learning results
    4.2.1 Hyperparameter test
    4.2.2 Train on validation data
  4.3 Comparison on MPC and RL

5 Discussion
  5.1 Conclusion
  5.2 Future work

References


1. Introduction

Ericsson is a provider of Information and Communication Technology (ICT). The company offers services, software and infrastructure in ICT for telecommunications operators, traditional telecommunications and Internet Protocol (IP) networking equipment, mobile and fixed broadband, operations and business support services, cable television, IPTV, video systems, and an extensive services operation [23]. Their products, for example radios, are deployed all around the world and must therefore sustain all kinds of conditions. One of the challenges for the radios is heat. As an example, on the hottest officially recorded day in Phoenix, Arizona, the temperature went up to 50°C [15]. On top of that, radio products must handle traffic, which requires a lot of power consumption. The power amplifiers generate additional heat that the system needs to sustain. When conditions become this harsh, the radio needs to reduce its output power to cool down for hardware protection. The timing of when to start to reduce output power, and by how much, is important. If done too early, unnecessary reduction and diminished serviceability might occur. If done too late, the operational temperature might continue to rise, reach a critical level, and the system shuts down.

1.1 Current implementation

Right now the reduction of output power (back off power) in hot conditions is handled by rule-based control. The idea behind rule-based control is to encode human knowledge into automatic control. The temperature handling function in the radio unit is designed with if/else statements and manually defined thresholds. Several thresholds are set for the temperature: one triggers a timer, another triggers back off power, and yet another is the critical threshold that triggers shutdown of the whole system. When the timer is started, the time integral of the temperature is calculated, and this integral is not allowed to surpass another (manually defined) threshold, otherwise back off starts. The amount of back off power is decided by a formula and is proportional to the internal temperature, that is, the higher the temperature the more back off power is needed to reduce the internal component temperature. These formulas and thresholds have been manually set and tuned through trial and error.
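As an illustration of this kind of rule-based handler, the sketch below mimics the structure described above: a timer threshold, a temperature integral, a back off proportional to temperature, and a critical shutdown threshold. All names, threshold values and the back-off formula are hypothetical and do not reflect Ericsson's actual implementation.

```python
# Hypothetical illustration of a rule-based temperature handler.
# All thresholds and the back-off formula are made up for this sketch.

TIMER_THRESHOLD = 85.0      # degC, starts the integration timer
BACKOFF_THRESHOLD = 95.0    # degC, triggers immediate back off
CRITICAL_THRESHOLD = 105.0  # degC, triggers shutdown
INTEGRAL_LIMIT = 600.0      # degC * s, allowed time integral above TIMER_THRESHOLD

def rule_based_handler(temp, integral, dt):
    """Return (back_off_dB, shutdown, updated_integral) for one time step."""
    if temp >= CRITICAL_THRESHOLD:
        return 0.0, True, integral                   # shut the radio down

    if temp >= TIMER_THRESHOLD:
        integral += (temp - TIMER_THRESHOLD) * dt    # timer / integration running
    else:
        integral = 0.0

    if temp >= BACKOFF_THRESHOLD or integral > INTEGRAL_LIMIT:
        # back off in proportion to how far the temperature is above the threshold
        back_off = 0.1 * (temp - TIMER_THRESHOLD)
        return back_off, False, integral

    return 0.0, False, integral
```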

The radio contains several temperature sensors located at different places in the radio. Each sensor measures the temperature of a different component.


These components can sustain different maximum temperature values (thresholds defined by the manufacturer). The temperature thresholds are therefore different for each component, which makes the job of manually setting them even more difficult. Since these crucial parameters, thresholds and formulas are manually decided, there is room for exploring whether a more scientific approach works better. In this thesis I propose to solve this problem with model predictive control (MPC) [12] and reinforcement learning (RL) [19]. To do so I also perform a system identification of how the internal temperature of the radio reacts to changes in its environment and build a simulator from that. The simulator is needed for both RL and MPC.

1.2 Expert system

One branch of artificial intelligence (AI) is expert systems: systems whose intelligence and decision making are based on human expert knowledge. Knowledge engineering is the act of encoding human expert knowledge into a set of rules a system can follow. Rules contain an IF part and a THEN part. Expert systems whose knowledge is represented in rule form are called rule-based systems [5]. Expert systems used to be a popular research field and were one of the first truly successful forms of AI software, but in recent years the focus of research has moved towards machine learning approaches and away from expert systems [8]. In a machine learning approach, for example supervised learning, the system is told what to look for or what the solution should be, but not how to find the solution. Similarly, in RL the agent is told what is a (relatively) good or bad action, but it is not told how to solve the problem; the algorithm finds out how. The current temperature controller in the radios is a rule-based system. Since methods like MPC and RL are emerging and have proved good in some cases, it is interesting to see whether such controllers would outperform the current rule-based controller.

1.3 Contribution

Machine learning (ML) is a popular research field and a hot topic nowadays, and rightly so. The improvement in recent years, fuelled by improvements in processing power, is impressive. ML algorithms, however, sometimes require a lot of data and computing power, so it is worth asking whether it is really necessary to use ML everywhere when there may exist a non-ML method that requires less computing power but can still solve the task as well or even better. In this thesis I explore two methods: reinforcement learning (an ML method) on one hand, and model predictive control (a non-ML method) on the other. This thesis compares them for this particular problem and sheds some light on the drawbacks and advantages of both methods.


2. Theory

This chapter introduces the theory behind the algorithms used. The main focus is on system identification, model predictive control and reinforcement learning.

2.1 System identification

A dynamic system is an object in which single or multiple inputs or variables produce an observable signal, usually referred to as the output. The relationship between the inputs and outputs can be described with a mathematical formula, but identifying the system's behaviour can be tricky in some cases, especially when dealing with MIMO (multiple input, multiple output) systems. In some cases a physical model of the system is simple and/or known, but in most cases it is complicated and/or non-linear. Mathematical models can then be generated from statistical data. A dynamic system can be described by, for example, differential or difference equations, transfer functions, state-space equations, or pole-zero-gain models. The methodology for building a mathematical model of the system is called system identification [10].

2.1.1 State space models

A common way of representing a model of a system is the state space representation. For a continuous time-invariant system the state space model is shown in equation 2.1:

x'(t) = A x(t) + B u(t)
y(t)  = C x(t) + D u(t)    (2.1)

where

• x is the state vector, x ∈ R^n
• y(·) is the output vector, y(·) ∈ R^q
• u(·) is the input vector, u(·) ∈ R^p
• A is the state matrix, A ∈ R^{n×n}
• B is the input matrix, B ∈ R^{n×p}
• C is the output matrix, C ∈ R^{q×n}
• D is the feedforward matrix, D ∈ R^{q×p}

When the model is transformed from continuous time to discrete time it becomes

x(k+1) = A x(k) + B u(k)
y(k)   = C x(k) + D u(k)    (2.2)

[24] [18]
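As a small illustration of the continuous-to-discrete transformation, the sketch below discretizes an arbitrary two-state system with SciPy; the matrices are placeholders and not the radio model, and the 67-second sample time simply matches the sampling interval used later in the thesis.

```python
import numpy as np
from scipy.signal import cont2discrete

# Arbitrary continuous-time example system (not the radio model).
A = np.array([[-0.01, 0.002],
              [0.001, -0.02]])
B = np.array([[0.05, 0.0],
              [0.0, 0.03]])
C = np.array([[1.0, 0.0]])
D = np.zeros((1, 2))

Ts = 67.0  # sample time in seconds

# Zero-order-hold discretization gives the matrices of equation 2.2.
Ad, Bd, Cd, Dd, _ = cont2discrete((A, B, C, D), Ts, method="zoh")
print(Ad, Bd, sep="\n")
```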

2.2 Model predictive control

MPC originated in the late seventies [17] but has improved a lot since then. MPC is not a specific algorithm but rather an umbrella term that encompasses a range of control methods which make use of a model of the process to obtain the control signal by minimizing an objective function. What all MPC algorithms have in common is that they

• make explicit use of a model to predict the process output
• obtain the control signal by minimizing an objective function
• use a receding strategy, that is, at every time instance the prediction horizon is moved forward by one step and the optimization calculations are repeated over the new horizon

At the same time, MPC algorithms differ among themselves in, for example, how the cost function or noise is handled, or the type of model used.

Figure 2.1. MPC strategy [4]. The input or control variable is shown as u and the output as y. N is the prediction horizon.

The strategy that MPC algorithms follow, shown in figure 2.1, can essentially be described in three steps:

1. The future outputs are predicted over a prediction horizon N using the model of the system.

2. The future control signals u(t+k|k), k = 1...N are calculated by optimizing the objective function.

3. The first control signal u(t|t) is sent to the process and executed, the prediction horizon is moved to t+1, and the whole process is repeated.

[4]
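These three steps can be sketched as a generic receding-horizon loop. The code below is an illustration using scipy.optimize with a made-up first-order model and quadratic cost; it is not the MATLAB MPC Toolbox implementation used later in this thesis.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(x, model_step, cost, horizon):
    """One receding-horizon iteration: optimize the next `horizon` inputs,
    then return only the first control move (step 3 of the strategy)."""
    def objective(u_seq):
        x_pred, total = x, 0.0
        for u in u_seq:                      # step 1: predict outputs over the horizon
            x_pred, y = model_step(x_pred, u)
            total += cost(y, u)              # step 2: accumulate the objective
        return total

    res = minimize(objective, np.zeros(horizon), method="Powell")
    return float(res.x[0])

# Toy usage with a made-up first-order model and a quadratic tracking cost.
def model_step(x, u):
    x_next = 0.9 * x + 0.1 * u
    return x_next, x_next                    # state and output coincide here

cost = lambda y, u: (y - 1.0) ** 2 + 0.01 * u ** 2

x = 0.0
for _ in range(5):
    u0 = mpc_step(x, model_step, cost, horizon=10)
    x, _ = model_step(x, u0)                 # the horizon then recedes by one step
```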

2.2.1 Objective function

MPC's objective is to find a solution, that is, a control policy, that minimizes an objective function. The objective function therefore represents how good a control policy is: the higher the value of the objective function for a given control policy, the 'worse' that policy is. In order to design an MPC controller the objective function needs to be defined with appropriate criteria. In control theory it is desirable that a controller can optimize several things at the same time, so the objective function usually contains three or four terms. As equation 2.3 shows, the objective function J is the sum of four terms:

J(z_k) = J_y(z_k) + J_u(z_k) + J_{Δu}(z_k) + J_ε(z_k)    (2.3)

The different J terms in equation 2.3 measure different features of the controller:

• J_y is output reference tracking
• J_u is manipulated variable tracking
• J_{Δu} is manipulated variable move suppression
• J_ε is constraint violation

J_y is the term that measures how closely the outputs follow a reference value. For example, if a thermostat is set to 25°C, J_y becomes higher the further the current temperature is from the 25°C reference value.

J_y(z_k) = \sum_{j=1}^{n_y} \sum_{i=1}^{p} { (w^y_{i,j} / s^y_j) [ r_j(k+i|k) − y_j(k+i|k) ] }^2    (2.4)

Where,

• k - Current control interval.
• p - Prediction horizon.
• n_y - Number of plant output variables.
• z_k - Control parameters selected (quadratic program decision).
• y_j(k+i|k) - Predicted value of jth plant output at ith prediction horizon step, in engineering units.
• r_j(k+i|k) - Reference value for jth plant output at ith prediction horizon step.
• s^y_j - Scale factor for jth plant output.
• w^y_{i,j} - Tuning weight for jth plant output at ith prediction horizon step.

The term J_u, shown in equation 2.5, measures how well a manipulated variable (MV) u follows a reference (target) signal.

J_u(z_k) = \sum_{j=1}^{n_u} \sum_{i=0}^{p−1} { (w^u_{i,j} / s^u_j) [ u_j(k+i|k) − u_{j,target}(k+i|k) ] }^2    (2.5)

where u represents the MV and w^u_{i,j} is the tuning weight for the jth MV at the ith prediction horizon step.

In some cases it is not desirable that the controller makes sharp changes in the MV, so J_{Δu} measures the change in the MV:

J_{Δu}(z_k) = \sum_{j=1}^{n_u} \sum_{i=0}^{p−1} { (w^{Δu}_{i,j} / s^u_j) [ u_j(k+i|k) − u_j(k+i−1|k) ] }^2    (2.6)

Lastly, J_ε accounts for constraint violations (see equation 2.7):

J_ε(z_k) = ρ_ε ε_k^2    (2.7)

Where

• ε_k - Slack variable at control interval k (dimensionless).
• ρ_ε - Constraint violation penalty weight (dimensionless).

[1]
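As a small numerical illustration of the output-tracking term in equation 2.4, the sketch below evaluates J_y for a single output (n_y = 1); the predicted outputs, references, weights and scale factor are made-up values.

```python
import numpy as np

def output_tracking_cost(y_pred, r, w, s):
    """J_y for a single output: sum over the prediction horizon of
    ((w_i / s) * (r_i - y_i))**2, following equation 2.4 with n_y = 1."""
    y_pred, r, w = map(np.asarray, (y_pred, r, w))
    return float(np.sum(((w / s) * (r - y_pred)) ** 2))

# Made-up example: a thermostat-like output tracking a 25 degC reference.
y_pred = [23.0, 24.0, 24.5, 24.8]   # predicted outputs over p = 4 steps
r      = [25.0, 25.0, 25.0, 25.0]   # reference values
w      = [1.0, 1.0, 1.0, 1.0]       # tuning weights
print(output_tracking_cost(y_pred, r, w, s=1.0))  # 5.29
```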

2.3 Reinforcement learning

Reinforcement learning is a branch of machine learning. The idea is to learn from experience through trial and error. The decision maker is put into an environment to solve a task. It is then told, through a so-called reward function, whether its actions are good or bad. The decision maker builds up a memory of its experiences [19].

2.3.1 Markov decision process

To formulate this problem mathematically, a framework called Markov decision process (MDP) is used. The decision maker or controller is called the agent. The agent interacts with the environment, which can be a simulated or a real environment. At each time step t, the agent receives some representation of the environment's state, s_t ∈ S, and based on the state s_t the agent selects an action a_t ∈ A.

For the action it selects, the agent receives a reward r_{t+1}, which is a numerical measurement of how 'good' or 'bad' the action was. The environment sends a new state, s_{t+1}, once the action has been executed (see figure 2.2).


The agent does not only try to maximize the immediate reward but also the cumulative reward in the long run. The cumulative sum of future rewards, called the value G_t, is discounted with a factor γ called the discount factor. The value represents how good the action is for future states, while the reward only represents the immediate effect of the action. This discounted total sum of future rewards is shown in equation 2.8. Usually γ is set to a value in the interval 0 ≤ γ ≤ 1. If γ = 0, only the immediate reward is considered in the maximization. If γ = 1 the sum in 2.8 can become infinite, but if 0 ≤ γ < 1 the sum stays finite (for bounded rewards). The closer γ is to 0, the more 'myopic' or shortsighted the agent is, while a γ closer to 1 gives more importance to future rewards when the agent tries to maximize its objective.

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = \sum_{k=0}^{∞} γ^k R_{t+k+1}    (2.8)

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) = R_{t+1} + γ G_{t+1}    (2.9)

By following a policy π, i.e. a rule for selecting actions, it is then possible to calculate the expected value q_π(s,a) of choosing action a in state s under policy π. The equation for this value is shown in equation 2.10. This value function is called the action-value function or q-value function:

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]    (2.10)

E_π[·] denotes the expected value given that the agent follows policy π. For every problem there is an optimal policy π* that yields the highest possible q-value q*(s,a). Equation 2.9 shows that if the value of the next state, S_{t+1}, is known, then the value of the current state, S_t, can be found [19].

Figure 2.2. The agent-environment interaction in a Markov decision process. [19]
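A short sketch of computing the discounted return of equation 2.8 from a finite sequence of rewards, using the recursion in equation 2.9; the reward values and γ are arbitrary.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... computed backwards via
    the recursion G_t = R_{t+1} + gamma * G_{t+1} (equation 2.9)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, -2.0, 3.0], gamma=0.9))
```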

2.3.2 Q learning

Q learning is a reinforcement learning algorithm that uses state-action pairs and estimates the value function q(s,a) for each state and action.


For a finite MDP problem M = (S, A, P) and a given discount factor γ, the q-value q(s,a) for each possible state-action pair is stored as an entry in a memory table. During training, after each action, the q-value for the state-action pair is updated according to equation 2.11, which is based on the Bellman equation.

q(s_t, a_t) ← q(s_t, a_t) + α [ r_{t+1} + γ max_a q(s_{t+1}, a) − q(s_t, a_t) ]    (2.11)

The α in equation 2.11 is the learning rate. The larger the learning rate, the more weight is given to the new value relative to the old one [19].
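A minimal sketch of the tabular update in equation 2.11, using a dictionary as the q-table; the states, actions, rewards and parameter values are placeholders.

```python
from collections import defaultdict

q = defaultdict(float)          # q-table: (state, action) -> value, default 0
ACTIONS = [0, 1, 2]

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step following equation 2.11."""
    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

# Example transition (state, action, reward, next state) with made-up values.
q_update(s=0, a=1, r=-1.0, s_next=1)
print(q[(0, 1)])   # -0.1 after this first update
```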

2.3.3 Deep Q-learning

Some reinforcement learning problems are too complex to be set up as a finite MDP, for example because the state space is infinite or too big to be stored in a table; the larger the q-table, the longer the training process takes. In Deep Q-learning the table is replaced by a neural network. The state space does not have to be discrete and finite (as in Q-learning) but can instead be a set of continuous values, which are then inputs to the neural network. The output is an approximation of the q-values for each a ∈ A, so the size of the output layer equals the number of actions in the action set A. Since neural networks are function approximators, they can work well for approximating the q-values. The quantity max_{a_{t+1}} q(s_{t+1}, a_{t+1}) can therefore be calculated in a single forward pass of the neural network for a given s_{t+1}.

The network is initialized with random weights θ. In simulation, experiences are gathered, which means storing the action, reward, state and next state, <s_t, a, r, s_{t+1}>, as a tuple in a dataset. Then the approximation of the q-value is updated towards the target value Y^Q_k shown in equation 2.12, where k is the training iteration and θ refers to the weights of the network.

Y^Q_k = r + γ max_a q(s_{t+1}, a; θ_k)    (2.12)

The parameters θ_k are updated by stochastic gradient descent, minimizing the squared loss L_{DQN} (see equation 2.13):

L_{DQN} = ( q(s, a; θ_k) − Y^Q_k )^2    (2.13)

The parameters are then updated as follows:

θ_{k+1} = θ_k + α ( Y^Q_k − q(s, a; θ_k) ) ∇_{θ_k} q(s, a; θ_k)    (2.14)

where α is the learning rate. When experiences are selected from the dataset, they are usually selected in random batches [6].
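The target of equation 2.12 and the loss minimization of equations 2.13-2.14 can be sketched as follows for a batch of stored experiences. The network below is an arbitrary small Keras model, and all layer sizes and hyperparameters are illustrative; this is not necessarily the exact network or code used in the thesis.

```python
import numpy as np
from tensorflow import keras

def build_q_network(state_size, n_actions):
    """Small fully connected Q-network (illustrative sizes)."""
    model = keras.Sequential([
        keras.layers.Input(shape=(state_size,)),
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00025), loss="mse")
    return model

def train_on_batch(model, batch, gamma=0.9):
    """One DQN update: build targets Y_k (eq. 2.12) and minimize the squared
    loss (eq. 2.13) by gradient descent (eq. 2.14, performed by the optimizer)."""
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])
    q_values = model.predict(states, verbose=0)
    q_next = model.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        target = reward if done else reward + gamma * np.max(q_next[i])
        q_values[i][action] = target          # only the taken action is updated
    model.fit(states, q_values, verbose=0)
```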


3. Experiment setup

The first part of the experiment is to understand the thermal physics of the radio and convert that into mathematical terms. The second part is to create a controller using the information from the mathematical model of the system. In this experiment two controllers are developed: first an MPC controller is designed, and then an RL controller.

3.1 System identification

In system identification, measured data is used to create a model of a system, whether it is a state space model, transfer function model, polynomial model, process model or gray-box model. Several methods exist that turn measured data into models; the one used here is N4SID [13]. The data used in this experiment came from a climate chamber experiment.

This part of the experiment was done using MathWorks' Matlab and its System Identification Toolbox [11].

3.1.1 Climate chamber experiment

In Ericsson's office in Kista there is a so-called climate chamber, a chamber where the temperature can be controlled easily. In June 2018 a test was conducted in the climate chamber in which the temperature profile of a warm day in Phoenix, Arizona [15], was simulated. Inside the climate chamber a mobile communication radio transmitter was placed, and the power usage of a radio operating on a typical busy day in Hong Kong was also simulated. The heat and the power usage together created a high internal temperature inside the radio, so the temperature handler could be tested. The test lasted 24 hours. The result from the experiment is shown in figure 3.1.

3.1.2 State space model

The climate chamber test produced useful data on how ambient temperature and power usage affected the internal temperature of the radio. From this data it was possible to build a state space model.

The radio has several temperature sensors. The temperature inside the radio can vary with proximity to the power amplifiers or the periphery.


Figure 3.1. This graph shows the 24 hour climate chamber test. The PaFinal (brown) line shows the internal temperature at one temperature sensor in the radio. The simulated temperature inside the climate chamber is also shown. They follow the axis on the right side, in °C. The requested power and actual power are shown and follow the axis on the left side, measured in dB. As can be seen from the graph, the radio follows the requested power until approximately 14:00, when the internal temperature of the radio is quite high. Then the temperature controller kicks in and starts to back off, and the actual power usage becomes lower than the requested power.

The temperature sensor with the highest recorded value was chosen as the output for system identification.

The inputs to the system were the measured ambient temperature inside the climate chamber and the simulated power usage. The output was the internal temperature. The state space model therefore describes how the internal temperature changes as power usage and ambient temperature change. The N4SID method was used to acquire the state space model of the system.

Equation 2.1 shows the structure of a state space model. The N4SID approach with these inputs and output gave the following matrices:

A = [ 0.9536   0.03851
      0.00942  0.9522 ]

B = [ 0.9536   0.03851
      0.00942  0.9522 ]

C = [ 54.09   −0.729 ]

D = [ 0   0 ]
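As a sketch of how the identified model can be used in simulation, the code below iterates the discrete-time equations 2.2 with the matrices as reported above, treating the two inputs as ambient temperature and power usage; the input profile is arbitrary.

```python
import numpy as np

# Matrices as reported above (state-space model identified with N4SID).
A = np.array([[0.9536, 0.03851],
              [0.00942, 0.9522]])
B = np.array([[0.9536, 0.03851],
              [0.00942, 0.9522]])
C = np.array([[54.09, -0.729]])
D = np.array([[0.0, 0.0]])

def simulate(u_seq, x0=None):
    """Step x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k) over an input
    sequence. u_seq has shape (n_steps, 2); the columns are the two inputs.
    Returns the output (internal temperature) sequence y."""
    x = np.zeros(2) if x0 is None else np.asarray(x0, dtype=float)
    ys = []
    for u in u_seq:
        ys.append((C @ x + D @ u).item())
        x = A @ x + B @ u
    return np.array(ys)

# Arbitrary constant input profile, just to exercise the model.
u_seq = np.tile([30.0, 40.0], (100, 1))
print(simulate(u_seq)[-1])
```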


3.2 Model predictive control

This part of the experiment was also done in Matlab, using the MPC Toolbox [1].

Once the model of the thermal dynamics of the system is in place, the plant can be defined. The inputs to the plant are ambient temperature and power usage. The output is internal temperature. To fit into the MPC structure, an additional output is added: power usage. That output simply passes the input through. The reason why power usage is both an input and an output is that it is the controlled variable, u, of the controller, but it is also an input since it affects the other output of the plant, the internal temperature.

As shown in figure 3.2, the MPC controller's inputs are two measured outputs and one measured disturbance. The two measured outputs are the outputs of the plant: power usage and internal temperature. The reference signal is the requested power from the radio's users. That reference signal is the power usage the radio wants to follow, but because of overheating problems that is not possible at all times.

The controller is supposed to prevent the radio from overheating and shutting down. Shutdown occurs once the internal temperature exceeds the shut-down limit. The shut-down limit varies between temperature sensors in the radio, but in this experiment it was set to 105°C. A hard limit was therefore set on the MPC output for internal temperature, equal to the shut-down limit.

MPC has weights on inputs and outputs, which penalize deviations from the reference signals. There is also a rate weight on the input, which penalizes sharp changes. In the results section, different values for these parameters are compared.

Figure 3.2. The MPC structure

3.3 Reinforcement learning

The implementation of RL was written in the Python programming language. To build the RL environment and algorithm, OpenAI's Gym toolkit [3] was used.


3.3.1 Simulation

Since a state space model of the system had already been acquired, it was possible to build a simulator that simulates the change in internal heat of the radio. The inputs to the system are ambient temperature and power usage. There are other factors that can influence the internal temperature of the radio, such as solar radiation and wind speed, but since the data comes from the aforementioned climate chamber experiment, where only power usage and ambient temperature were varied, data only exists for how those two inputs affect the system.

Each episode is a simulated period of several hours; episode lengths of both 24 hours and 9 hours were tested. To train the agent in different scenarios, the inputs are randomly generated before each episode is played out. For the ambient temperature, a sine function is used to simulate the change in ambient temperature over the course of several hours; one period of the sine function represents 24 hours.

To simulate ambient temperature, the ambient temperature profile from Phoenix used in the climate chamber test was taken as a benchmark. If that temperature profile can be approximated with a mathematical formula, it is possible to introduce some randomness to that formula and in that way randomly generate ambient temperatures for the simulator. Temperature swings over 24 hours resemble a sine function, so a sine function is used. Equation 3.1 shows how a sine function can be transformed into something that resembles 24-hour temperature swings.

f(x) = range · sin(x) + offset + noise    (3.1)

Here

• range is set as range = (highest − lowest) / 2, where highest and lowest are the highest and lowest temperature values from the Phoenix data
• offset is set as the mean of the Phoenix data
• noise is a random value drawn from a Gaussian distribution with mean 0 and standard deviation 5

This gives a function with rather similar features to the Phoenix data. For the simulator, random factors are multiplied with offset, range and noise, and another random factor is used to shift the peak of the sine function.

Figure 3.3 shows random samples of temperature profiles generated by thesimulator from the calculations described above.
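A sketch of the kind of randomized sine-based generator described by equation 3.1 is shown below; the offset and range standing in for the Phoenix data, the ranges of the random factors, and the sample time are placeholder values.

```python
import numpy as np

def ambient_temperature(n_steps, dt=67.0, offset=35.0, temp_range=7.5,
                        noise_std=5.0, rng=None):
    """Randomized 24-hour-like ambient temperature following equation 3.1.
    offset and temp_range are placeholders standing in for the Phoenix-derived
    mean and (highest - lowest) / 2; the random-factor ranges are also made up."""
    if rng is None:
        rng = np.random.default_rng()
    t = np.arange(n_steps) * dt
    phase = rng.uniform(0.0, 2.0 * np.pi)          # random shift of the peak
    range_factor = rng.uniform(0.8, 1.2)           # random factor on the range
    offset_factor = rng.uniform(0.9, 1.1)          # random factor on the offset
    x = 2.0 * np.pi * t / 86400.0 + phase          # one sine period = 24 hours
    noise = rng.normal(0.0, noise_std, size=n_steps)
    return temp_range * range_factor * np.sin(x) + offset * offset_factor + noise

profile = ambient_temperature(n_steps=1290)        # roughly 24 hours of 67 s steps
```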

The user requested power of the radio was simulated using the same profile as in the Hong Kong test. It has two peaks, one in the morning and one in the afternoon. In the simulation a random starting point in the requested power data is chosen, so the requested power is not exactly the same for each scenario. This is shown in figure 3.4.

Figure 3.5 shows all of this combined, together with the internal temperature, simulated 30 times. It turned out that the distribution and randomness of the simulator was a very important factor in the performance of the training, so the random factors in the simulator were tuned a lot during the training process.


Figure 3.3. The ambient temperature profile. The image shows 5 randomly generated ambient temperatures from the data generator in the simulation. The x-axis is time in seconds and the graph shows 86,000 seconds, or approximately 24 hours. The y-axis is temperature measured in °C.

The temperature controller in the radio needs to be prepared for a wide variety of ambient temperatures and requested power, which is why the simulator produces different scenarios during training. This also helps prevent overfitting.

3.3.2 States

The state is the input to the neural network and is an array of values that are 'relevant' to the controller. It can sometimes be tricky to decide what is and is not relevant: having more values gives the controller more information about the problem, but at the same time means a bigger network and a longer time to train. Examples of values used in the state are:

• Current internal temperature
• Current ambient temperature
• Current requested power
• Back off
• Future predicted internal temperature
• Future predicted ambient temperature
• Future predicted requested power


Figure 3.4. Five random samples from the requested output power generator. The y-axis is power in dB and the x-axis is time in seconds, showing 86,000 seconds, or approximately 24 hours. As seen here, these are always the same series but with a different (randomly selected) starting point, clipped at the end.

3.3.3 Actions

The actions the controller could choose were as follows:

• Increase the back off by 0.5 dB
• Increase the back off by 0.1 dB
• No change in back off
• Decrease the back off by 0.1 dB
• Decrease the back off by 0.5 dB

The back off starts at 0. The controller can then choose to increase it, decrease it or keep it the same, as long as the back off stays between −5 dB and 0 dB. This limit is set to make it easier for the RL agent to learn and search for the best solution, as it makes the solution space smaller. A controller that chooses to apply more power than the requested power (positive back off) is not wanted for this problem, so it is forbidden.

3.3.4 Reward function

The reward function needs to be able to tell the agent what is a good action and what is a bad action. In this problem there are two factors that contradict each other: high temperature and power back off. Both are undesirable, so the reward function should give negative rewards for them. Balancing these two factors can, however, be tricky.


Figure 3.5. The image shows 30 randomly generated scenarios from the simulator. The series colored red at the top show the internal temperature (in °C). The green series show the ambient temperature (in °C). The blue series show the output power (in dB).

In warm conditions the controller needs to decide whether it is better to back off or not, and if so by how much. The less it backs off, the more likely it is that the internal temperature becomes higher. The reward function therefore needs to reflect a balance that minimizes the back off while keeping the temperature below a certain limit.

At each time instance the reward function considers three aspects when assigning a reward (a sketch of such a function follows the list):

• Back off: Since the controller wants to minimize the back off, the general rule is that the more back off the controller applies, the more negative the reward the agent receives.

• Lower temperature limit violation: When the internal temperature goes above a lower (soft) temperature limit, the agent receives a negative reward, and the higher above this limit the temperature is, the more negative the reward.

• Higher temperature limit violation (shut down): When the internal temperature goes above the shut-down limit, the radio shuts down. In the training process, if this happens, the episode terminates and the agent gets a big negative penalty in return.
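A sketch of a reward function with this structure is shown below; the limit values and penalty weights are placeholders, not the exact tuned values used in the experiments.

```python
def reward(back_off_db, internal_temp,
           soft_limit=95.0, shutdown_limit=105.0,
           back_off_weight=1.0, temp_weight=2.0, shutdown_penalty=-40_000.0):
    """Negative reward for back off, for exceeding a lower (soft) temperature
    limit, and a large penalty plus episode termination on shutdown.
    Returns (reward, episode_terminated); numeric values are placeholders."""
    if internal_temp >= shutdown_limit:
        return shutdown_penalty, True               # episode terminates

    r = -back_off_weight * abs(back_off_db)         # penalize applied back off
    if internal_temp > soft_limit:
        r -= temp_weight * (internal_temp - soft_limit)
    return r, False
```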


3.3.5 Training process

The neural network's weights are initialized with Xavier initialization [7]. The simulation starts with an exploration rate of 1 and gathers information into a memory array. For each action, the information stored in the memory array is:

• State
• Action chosen
• Reward received
• Next state
• A boolean which represents whether the episode has terminated or not (in that case there is no 'next state')

Once the memory is large enough, the training process begins. A random batch from the memory array is selected, and for each randomly selected tuple from the memory the Q-value is calculated according to the Bellman update (see equation 2.11).

The value of max_a Q(s_{t+1}, a) is simply found by using the next state as an input to the neural network and taking the maximum value of the output. In the beginning, when the neural network is not trained, the outputs are wrong, but studies show that in most cases (not all) the neural network converges to a good Q-value function approximator [25]. Once the Q-value Q(s_t, a_t) has been updated, back propagation is performed with the target output set to the previous Q-value array with the updated Q(s_t, a_t). The batch size was 32, which means that the agent performs an action, then a random batch of size 32 from the memory array is selected and the neural network trained, and then the agent selects the next action.

The agent’s exploration rate started as 1, meaning all actions are selectedat random. Then after each episode the exploration rate was multiplied by afactor called exploration rate decay. The value of exploration rate decayvaried and depended of the total number of episodes but was 0.95-0.9992.This means that the exploration rate becomes gradually lower as the trainingprocess goes on and the agent becomes more greedy.

The network structure was one of many settings that needed fine tuning. After several tests, the most commonly used network structure was one with 4 hidden layers and 12 nodes per layer. The input layer is the state, and different states with different state sizes were tried. The number of nodes in the output layer was equal to the number of actions. The output values represent the Q-values for each action in a certain state (the input): the first output value is the Q-value for action 0, the second value is the Q-value for action 1, and so on.


4. Results

In this chapter, results from the experiments described above are presented.

4.1 Model predictive control results

The main hyperparameters for tuning were the prediction horizon, the control horizon and the weights in the objective function, both on the output and on the input. To compare different parameter settings, the total back off was used as a performance metric. No controller violated the hard limit set on the internal temperature (105°C). The tests assume that perfect prediction is possible.

4.1.1 Comparison on different control horizons

The goal of this experiment is to see which control and prediction horizons fit this problem best. During this experiment the objective function weights were kept constant, with the output weight as 1 and the input rate weight as 0. The control horizon cannot be larger than the prediction horizon. Pairs of control and prediction horizons from 1-89 were tested on the Phoenix and Hong Kong data, and the total back off was used as the performance metric; the lower the total back off, the better the controller did. The result is shown in figure 4.1.

The lowest total back off was obtained when the prediction horizon was set to 87 or 88 and the control horizon was set to 7. Note that these values are time steps and the sample time is 67 seconds, so for example 7 time steps equals 469 seconds. The lowest total back off value was 87.75. When the control horizon was set to 1 the controller did much worse for all values, so in the heatmap in figure 4.1 that row and column are left out for better color contrast.

4.1.2 Comparison on different weights

The goal of this experiment is to find which settings for the weights in the objective function were best for this problem. During this experiment the control horizon was set to 7 and the prediction horizon to 87, as those values gave the best result in the experiment described in section 4.1.1.

For the objective function weight on the output power, 11 different values from 0 to 1 were tested (with 0.1 as the increment between values). This parameter tells the controller how important it is that the output power follows the reference value, which in this case is the requested power.


Figure 4.1. Finding the optimal setting for the prediction horizon and control horizon. The colorbar represents the total back off for that particular setting. The best value is obtained when the control horizon is 7 and the prediction horizon 87 or 88.


Figure 4.2. Finding the optimal setting for the weights in the objective function. The x-axis is the input rate weight from 0-1 and the y-axis is the output weight from 0.1-1. The value on the colorbar is the total back off for the given setting. When the output weight was set to 0 the results were much worse, so that row was removed from the heatmap for better color contrast.

For the objective function rate weight on the input, 11 different values between 0 and 1 were also tested. The input rate weight is a parameter that tells the controller whether it is bad to change the manipulated variable (output power) too rapidly.

The results are shown in figure 4.2

4.1.3 MPC controller results

With the optimal settings found, the MPC controller can be created and tested in a simulator. Figure 4.3 shows how the controller performs and how it applies back off on the Phoenix and Hong Kong data. The settings used were:

• Prediction horizon: 87 time steps
• Control horizon: 7 time steps
• Objective function output weight: 1
• Objective function input rate weight: 0

This assumes perfect prediction. The total back off was 87.75 dB.

It was also tested to add noise to the prediction. The result of that is shown in figure 4.4. The noise comes from a ramp-like disturbance model (a built-in function in Matlab [1]) with a magnitude of 0.007.


Figure 4.3. The MPC controller strategy. The x-axis is time in seconds and goes up to 24 hours. The y-axis is both temperature in °C and power in dB. The red series shows how the internal temperature changes over time (in °C). The orange series is the ambient temperature (also in °C). The requested power is the blue series and the actual power, which the controller controls, is shown in green. Both are measured in dB.


Figure 4.4. The MPC controller strategy with prediction error. The x-axis is time in seconds and goes up to 24 hours. The y-axis is both temperature in °C and power in dB. The light blue series shows how the internal temperature changes over time (in °C). The orange series is the ambient temperature (also in °C). The requested power is the light green series and the actual power, which the controller controls, is shown in pink. Both are measured in dB.

The prediction error accumulates over time, so the larger the prediction horizon, the worse the controller became. Here the prediction horizon is 10 time steps (670 seconds, or approximately 11 minutes). The total back off for this test was 105.94 dB.

MPC cannot operate with no prediction at all, but the prediction horizon was also tested at its minimum, 1 time step (67 seconds), with the same disturbance as in figure 4.4. That controller applied a total back off of 88.85 dB. Figure 4.5 shows how that controller performed.

To summarize the MPC controllers' performance, measured in total back off:

• Perfect prediction: 87.75 dB
• Faulty prediction (11 minutes): 105.94 dB
• Faulty prediction (1 minute): 88.85 dB


Figure 4.5. The MPC controller strategy with prediction error but a small prediction horizon. The x-axis is time in seconds and goes up to 24 hours. The y-axis is both temperature in °C and power in dB. The red series shows how the internal temperature changes over time (in °C). The purple series is the ambient temperature (also in °C). The requested power is the yellow series and the actual power, which the controller controls, is shown in green. Both are measured in dB.


4.2 Reinforcement learning results

Early on in the thesis work, all kinds of different reward functions, simulation distributions, network structures, episode lengths, action spaces, state spaces and other hyperparameter settings for the RL method were tried. There are a lot of different settings and set-ups to try out, and it was difficult to find a set-up that gave good results. Some settings gave a bit more promising results than others, though none gave a truly good result. The settings that seemed to give better results than others were:

• The Adam optimizer for the neural network training
• A reward function with no soft temperature limit
• An episode length of 500 steps
• An exploration rate decay of 0.997
• 1500 episodes
• A batch size of 32 for training
• A back off limit of 5 dB, that is, the controller could not back off more than 5 dB (and it was also not allowed to set a higher power than the requested power)

The shut down reward was set to −40,000. This value was chosen because if the controller applied the maximum back off of 5 dB for the entire episode (500 steps), the total episode reward would be approximately −30,000. Because shutting down is 'worse' than applying this much back off, the shut down reward was chosen lower than −30,000. The distribution, or randomness, of the randomly generated ambient temperature in the simulator was also a factor that needed a lot of tuning. The ambient temperature could not be so high that the controller could not avoid shut down even with maximum back off. Also, if too many scenarios required no back off at all, applying no back off tended to become the general strategy for all scenarios, meaning the controller applied no back off at any time, even when it was needed.

4.2.1 Hyperparameter test

A hyperparameter test was conducted with the settings described above. The parameters tested were the learning rate, the network structure (number of hidden layers and nodes per layer) and the discount factor. The learning rates tested were 0.00025, 0.00075 and 0.00125. Three different network structures were tested: 4 hidden layers with 12 nodes, 2 hidden layers with 24 nodes, and 4 hidden layers with 24 nodes. Three different discount factors were tested: 0.3, 0.6 and 0.9. That means three parameters and three test cases for each parameter, 27 combinations in total. To make a long story short, none of these combinations gave good results, at least not 'good results' compared to MPC. This made it difficult to come up with a numerical performance criterion for the combinations.
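The 27 combinations are the Cartesian product of the three values per parameter; a sketch of enumerating them, with `run_training` as a placeholder for one full training run:

```python
from itertools import product

def run_training(lr, architecture, gamma):
    """Placeholder for one full training run with the given settings."""
    ...

learning_rates = [0.00025, 0.00075, 0.00125]
architectures = [(4, 12), (2, 24), (4, 24)]   # (hidden layers, nodes per layer)
discount_factors = [0.3, 0.6, 0.9]

combos = list(product(learning_rates, architectures, discount_factors))
print(len(combos))    # 3 * 3 * 3 = 27 combinations
results = {c: run_training(*c) for c in combos}
```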


Figure 4.6. The x-axis is the number of episodes, from 0-1500. The y-axis is the total reward per episode. The blue series shows all episodes. Since the episodes are different scenarios, a validation series (orange) is added; the validation is performed every 10 episodes with no exploration and always uses the same ambient temperature and requested power. The reward graph is from training with learning rate 0.00025, discount factor 0.9, and a neural network with 4 hidden layers and 12 nodes per hidden layer. The red line shows how the MPC controller described in section 4.1.3 would perform; that controller would get 730.3 in rewards.

One of the best results came when the learning rate was 0.00025, the discount factor was 0.9 and the neural network had 4 hidden layers with 12 nodes per hidden layer. The reward graph is shown in figure 4.6, and when the training was finished the controller was tested on the climate chamber data; that result is shown in figure 4.7. Another example of training is shown in figure 4.8. Not all 27 (plus more) results from the training will be shown here.

4.2.2 Train on validation data

Because the training did not go as expected, a new RL training was performed on only the validation scenario. So instead of generating random scenarios and training on all kinds of different scenarios, training was performed on just one scenario. It would be preferable to teach the controller to deal with all kinds of scenarios.


Figure 4.7. After the training shown in figure 4.6 was finished, the controller was tested. The red series shows the internal temperature, the green series shows the requested power and the orange series shows the actual power. The blue series shows the back off. The shut down limit is 105°C and is shown as the upper red line. The x-axis is the number of time steps (sample time = 67 seconds), so the total episode length is 24 hours. The y-axis is both power (in dB) and temperature (in °C).


Figure 4.8. The x-axis is the number of episodes, from 0-1500. The y-axis is the total reward per episode. The blue series shows all episodes. Since the episodes are different scenarios, a validation series (orange) is added; the validation is performed every 10 episodes with no exploration and always uses the same ambient temperature and requested power. The reward graph is from training with learning rate 0.00025, discount factor 0.3, and a neural network with 4 hidden layers and 12 nodes per hidden layer. The red line shows how the MPC controller described in section 4.1.3 would perform; that controller would get 730.3 in rewards.


Figure 4.9. The x-axis is the number of episodes, from 0-1500. The y-axis is the total reward per episode. The blue series shows all episodes. The orange series shows the rewards from validation; the validation is performed every 10 episodes with no exploration. The upper red line shows how the MPC would perform if the same reward function were applied; the MPC would then receive 730.3 in rewards. The lower red line shows the result if the back off from the climate chamber test is evaluated with the same reward function; that controller would get 4739.9 in rewards.

But this test was performed to make things simpler, and to see whether the training would work at all, since the previous training did not go as expected.

The parameters used for this test were the same as for the training shown in figure 4.6, because that was one of the combinations that gave the best results in the hyperparameter test: a learning rate of 0.00025, a discount factor of 0.9, and a neural network with 4 hidden layers and 12 nodes per hidden layer.

The reward graph from that training is shown in figure 4.9, and the controller's performance after training is shown in figure 4.10.

4.3 Comparison on MPC and RL

Figure 4.11 shows the different methods used for creating a controller and how much they backed off in the climate chamber scenario. The controllers shown are as follows:

• MPC with faulty prediction from figure 4.3

Figure 4.10. After the training shown in figure 4.9 was finished, the controller was tested. The red series shows the internal temperature, the green series shows the requested power and the orange series shows the actual power. The blue series shows the back off. The shut down limit is 105°C and is shown as the upper red line. The x-axis is the number of time steps (sample time = 67 seconds), so the total episode length is 24 hours. The y-axis shows both power (in dB) and temperature (in °C).

Figure 4.11. Comparison between the results of the different controller methods. This graph shows how much back off the developed controllers apply in the same scenario. The RL controllers (green and purple series) apply much more back off than the MPC controllers (blue and orange series). The yellow series shows the amount of back off during the climate chamber test (see figure 3.1).

• MPC with perfect prediction from figure 4.4
• RL trained on validation from figure 4.10
• RL trained on random scenarios from figure 4.7

Also added to the figure is the back off from the climate chamber test, which used the current implementation of the temperature handler (the rule-based controller). This is not directly comparable, because that test was run on a real product while the others were run in a simulator, but it is interesting to see how much that controller backed off. The comparison should therefore not be taken too seriously.

5. Discussion

In this section the results from the thesis work are summarised and discussed. I will also suggest ways forward and things to work on in the future.

5.1 Conclusion
The results were quite clearly in favor of MPC in all comparisons. I would therefore recommend, going forward with this project, to focus development on MPC. Another drawback of RL is that it is more computationally heavy because of the training step, so if RL is to be used in radios, that computational cost needs to be considered.

The best setting for MPC was when the input rate weight is set to 0. This means that the controller is not 'punished' for changing the input too quickly, which makes sense because restricting how fast the input changes is not an important factor in this application.

The best setting for the output power weight in the objective function was 1. This means that it is important that the controller follows the reference trajectory (in this case the requested power) closely, so this weight should be high.
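
To illustrate how these two weights enter a standard quadratic MPC objective, the sketch below evaluates one such cost term in Python. The function name, the reduction to a single cost term and the example numbers are illustrative assumptions, not the exact objective used by the MPC Toolbox setup in this thesis.

import numpy as np

def mpc_cost(y_pred, y_ref, du, w_output=1.0, w_input_rate=0.0):
    # Quadratic MPC cost: tracking error weighted by w_output plus
    # input-rate changes weighted by w_input_rate. With w_input_rate = 0
    # the controller is free to change the input as fast as it likes,
    # while w_output = 1 keeps the output power close to the requested power.
    y_pred, y_ref, du = map(np.asarray, (y_pred, y_ref, du))
    return w_output * float(np.sum((y_pred - y_ref) ** 2)) + \
           w_input_rate * float(np.sum(du ** 2))

# Only the tracking error contributes when the input rate weight is 0.
print(mpc_cost([42.0, 41.5], [43.0, 43.0], du=[1.0, -0.5]))  # 3.25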

For the prediction and control horizon, the best setting was a prediction horizon of 87 time steps (5829 seconds, or about 97 minutes) and a control horizon of 7 time steps (469 seconds, or approximately 8 minutes). It is interesting, though, that there is not much deviation in the results, so a lower prediction horizon does not give much worse results. A longer prediction horizon is also more difficult to use, since predictions further into the future are harder to make and therefore less accurate. In a real product it might therefore be advisable to use a shorter prediction horizon.

By adding prediction errors, the MPC controller started to perform worse, but by reducing the prediction horizon to 1 time step the performance improved again. In a real application the controller could therefore assess how good the predictions are and tune the prediction horizon accordingly: if the predictions are good and accurate, use a long prediction horizon, but if the predictions are bad, shorten it.
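
A minimal sketch of that idea in Python is given below. The error threshold, the window of recent errors and the function name are hypothetical; the horizon lengths are taken from the results above.

def choose_prediction_horizon(recent_errors, long_horizon=87, short_horizon=1,
                              error_threshold=2.0):
    # Pick the MPC prediction horizon from a rolling estimate of prediction
    # error (absolute difference in °C between predicted and measured
    # internal temperature over a recent window).
    mean_error = sum(recent_errors) / len(recent_errors)
    return long_horizon if mean_error < error_threshold else short_horizon

# Accurate predictions keep the long horizon; poor ones shorten it.
print(choose_prediction_horizon([0.3, 0.5, 0.4]))   # 87
print(choose_prediction_horizon([3.1, 4.2, 2.8]))   # 1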

The results from the RL were poor. I do not know exactly why, but most likely I could not find the right setup or the right set of hyperparameters. Because the results were this poor, it was difficult to do a very in-depth comparison between RL and MPC (MPC is much better in all comparisons). It would, for example, have been interesting to do an in-depth comparison with different levels of accuracy in the prediction of inputs.

5.2 Future work
The work of this thesis can be extended further. What I would suggest working on is improving the system identification. This includes exploring which methods can be used for system identification and also which inputs are used. In this thesis, ambient temperature and output power were used because that data was available (from the climate chamber test), but other factors, such as solar radiation and wind speed, can also influence the internal temperature. I would suggest gathering data about those attributes, trying to estimate their importance, and then building a model using this data.

The work spent on improving the system identification is also beneficial for building a simulator, which can save a lot of the time and money spent on testing real products, as in the climate chamber test.

The RL results were poor. But even if, hypothetically, RL had done well in all tests in this thesis, there are still some things to consider. As mentioned before, training the RL agent is very computationally heavy compared to MPC or the rule-based controller, and Ericsson has many radios around the world, so training an RL controller for each radio would require a lot of computational power. Before spending more time on improving RL and searching for the right setup, some questions therefore need to be answered first, for example: is training RL agents feasible, and should they be trained in the radios, or should the data be sent from each radio and the training performed centrally?

But if Ericsson concludes that it is worth continuing to improve the RL, my suggestions are:

• Use an LSTM (long short-term memory) network as the state representation in RL [2]. Since the inputs can be interpreted as time-series data, this method could work; see the sketch after this list.

• In this thesis a simple DQN algorithm was used, but many improvements on the DQN algorithm exist, such as multi-step DQN [9], Double DQN [21], prioritized experience replay [20], the dueling network architecture [22] or model-based RL, which might be worth trying.
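
As a starting point for the LSTM suggestion above, the sketch below shows a recurrent Q-network in PyTorch. The observation dimension, hidden size, number of actions and class name are illustrative assumptions, not a design used or evaluated in this thesis.

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    # Q-network that summarises a window of past observations with an LSTM,
    # so the state fed to the agent carries time-series information.
    def __init__(self, obs_dim: int = 4, hidden_size: int = 32, n_actions: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, obs_dim), e.g. the last N samples of
        # ambient temperature, internal temperature, requested power and back off.
        _, (h_n, _) = self.lstm(obs_seq)
        return self.head(h_n[-1])  # one Q-value per action

# Q-values for a batch of one 20-step observation history.
q_net = RecurrentQNetwork()
history = torch.randn(1, 20, 4)
print(q_net(history).shape)  # torch.Size([1, 5])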

RL has some advantages. Unlike supervised learning with neural networks, RL can learn more than a human: supervised learning is limited to what it is taught, but an RL agent can become better than its developers. So if RL gave good results, Ericsson could analyse the policies the RL agent chooses and learn from the agent. Another advantage of RL is that it is easier to modify the objective function (for example the back off rewards). On the other hand, such machine learning models are black boxes, so it is difficult to understand why the agent takes a certain decision.

One other attribute that makes machine learning popular and attractive is the 'learning' part: an algorithm can learn from a specific environment or from data. This is a very desirable feature in the radios at Ericsson, as it lets the controllers learn and adapt to different environments. But this adaptivity can also be achieved without using RL.

By combining system identification methods with MPC (as suggested in this thesis), the radio can also adapt to and learn from its environment.

Prediction of the inputs, such as ambient temperature, requested power, wind speed and solar radiation, can also be assessed. Once a better system identification model is acquired, it would be interesting to see how the controllers perform with perfect prediction, bad prediction and no prediction. In this way it is possible to estimate how accurate the prediction needs to be before it stops being useful (the point where 'no prediction' starts to outperform 'bad prediction'). It would then be interesting to create prediction models, for example for the requested power, using LSTM or linear regression, and to see whether a sufficiently accurate prediction model can be obtained.
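
A minimal sketch of such a prediction model is given below, using an autoregressive linear regression in Python with scikit-learn. The synthetic requested-power series, lag length and train/test split are illustrative assumptions only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Predict requested power one step ahead from the previous k samples.
k = 4
requested_power = 40 + 3 * np.sin(np.linspace(0, 20, 400))  # stand-in data (dB)

X = np.array([requested_power[i - k:i] for i in range(k, len(requested_power))])
y = requested_power[k:]

model = LinearRegression().fit(X[:300], y[:300])            # train on the first part
one_step_ahead = model.predict(X[300:])                     # predict the rest
print(f"mean absolute error: {np.mean(np.abs(one_step_ahead - y[300:])):.3f} dB")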

Lastly, it would be interesting for future work to test these controllers in a real product, for example in a climate chamber. It would then be possible to assess whether these controllers are better than the current implementation of the temperature handler.

References

[1] Alberto Bemporad, Manfred Morari, and N. Lawrence Ricker. Model Predictive Control Toolbox: User's Guide. https://se.mathworks.com/help/pdf_doc/mpc/mpc_ug.pdf, 2019.

[2] Bram Bakker. Reinforcement learning with LSTM in non-Markovian tasks with long-term dependencies, 2001.

[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[4] Eduardo F. Camacho and Carlos Bordons Alba. Model Predictive Control. Addison-Wesley Professional, 2nd edition, 2007.

[5] Edward A. Feigenbaum, Peter Friedland, Bruce B. Johnson, H. Penny Nii, Herbert Schorr, Howard E. Shrobe, and Robert S. Engelmore. Knowledge-based systems in Japan (report of the JTEC panel). Commun. ACM, 37(1):17–19, 1994.

[6] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. CoRR, abs/1811.12560, 2018.

[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

[8] J. Hendler. Avoiding another AI winter. IEEE Intelligent Systems, 23(2):2–4, March 2008.

[9] J. Fernando Hernandez-Garcia and Richard S. Sutton. Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. CoRR, abs/1901.07510, 2019.

[10] L. Ljung. System Identification: Theory for the User. Prentice Hall Information and System Sciences Series. Prentice Hall PTR, 1999.

[11] L. Ljung. System Identification Toolbox: User's Guide. https://www.mathworks.com/help/pdf_doc/ident/ident.pdf, 2019.

[12] J.M. Maciejowski. Predictive Control: With Constraints. Prentice Hall, 2002.

[13] MathWorks. n4sid: Estimate state-space model using subspace method, 2019. [Online; accessed 5-September-2019].

[14] Manfred Morari and Jay H. Lee. Model predictive control: Past, present and future. Computers and Chemical Engineering, 23:667–682, 1997.

[15] University of Arizona. AZMET: The Arizona Meteorological Network. https://cals.arizona.edu/azmet/az-data.htm.

[16] Sasa V. Rakovic and William S. Levine. Handbook of Model Predictive Control. Birkhäuser Basel, 2018.

[17] Jacques Richalet, A. Rault, J.L. Testud, and J. Papon. Model predictive heuristic control: Applications to an industrial process. Automatica, 14:413–428, 1978.

[18] Derek Rowell. State-space representation of LTI systems. http://web.mit.edu/2.14/www/Handouts/StateSpace.pdf, October 2002.

[19] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

[20] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.

[21] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461, 2015.

[22] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015.

[23] Wikipedia. Ericsson, 2019. [Online; accessed 5-September-2019].

[24] Wikipedia. State-space representation, 2019. [Online; accessed 5-June-2019].

[25] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep Q-learning. CoRR, abs/1901.00137, 2019.
