
Thermal Neural Networks: Lumped-Parameter Thermal Modeling With State-Space Machine Learning

Wilhelm Kirchgässner, Oliver Wallscheid, Member, IEEE, and Joachim Böcker, Senior Member, IEEE

Abstract—With electric power systems becoming more compact and increasingly powerful, the relevance of thermal stress, especially during overload operation, is expected to increase ceaselessly. Whenever critical temperatures cannot be measured economically on a sensor base, a thermal model lends itself to estimate those unknown quantities. Thermal models for electric power systems are usually required to be both real-time capable and of high estimation accuracy. Moreover, ease of implementation and time to production play an increasingly important role. In this work, the thermal neural network (TNN) is introduced, which unifies both consolidated knowledge in the form of heat-transfer-based lumped-parameter models and data-driven nonlinear function approximation with supervised machine learning. A quasi-linear parameter-varying system is identified solely from empirical data, where relationships between scheduling variables and system matrices are inferred statistically and automatically. At the same time, a TNN has physically interpretable states through its state-space representation, is end-to-end trainable with automatic differentiation, similar to deep learning models, and requires neither material or geometry information nor expert knowledge for its design. Experiments on an electric motor data set show that a TNN achieves higher temperature estimation accuracy than previous white-, grey-, or black-box models, with a mean squared error of 3.18 K² and a worst-case error of 5.84 K at only 64 model parameters.

Index Terms—Machine learning, thermal management, permanent magnet synchronous motor, neural networks, temperature estimation, functional safety, hybrid ML, system identification.

I. INTRODUCTION

Lumped-parameter thermal networks (LPTNs) are systems of ordinary differential equations (ODEs) and a popular choice for thermal modeling applications throughout different industry branches. Most notably, their relatively small number of model parameters makes them well suited for real-time temperature estimation tasks in electric power systems, such as multichip power modules [1], [2], automotive-grade electric machines [3]–[5], or even in civil engineering [6]. An accurate real-time temperature estimator is often highly desirable whenever a system's control performance depends on the thermal state of parts that are not economically measurable on a sensor base, e.g., the permanent magnets in

This work is supported by the German Research Foundation (DFG) under grant BO 2535/15-1. Computing time was provided by the Paderborn Center for Parallel Computing (PC²).

W. Kirchgässner and J. Böcker are with the Department of Power Electronics and Electrical Drives, while O. Wallscheid is with the Automatic Control Department, Paderborn University, D-33095 Paderborn, Germany (e-mail: kirchgaessner, wallscheid, [email protected]).

a synchronous electric machine or semiconductors in power electronic converters. The need for such estimators is particularly urgent when overload operation is admissible, which potentially harms the system's health through thermal stress. Even with system temperatures being tracked by sensors, a redundant thermal model becomes more and more relevant against the background of increasing functional safety requirements.

Being a parameter-light model often makes the LPTN approach the only feasible choice, although established, computationally heavy thermal analysis tools like computational fluid dynamics (CFD) and finite-element modeling (FEM) based on partial differential equations (PDEs) usually give a more thorough and accurate picture of heat flows in any intricate system [7]. Apart from thermal models, there are also real-time methods in the electric machine domain based on either invasive, high-frequency signal injection [8], [9] or precise observers that indirectly assess an electric machine's thermal condition from temperature-dependent electric parameters [10]. Both methods exhibit high system parameter sensitivity as well as operation-point-dependent accuracy, which might lead to undesired temperature estimation errors [11]. What is more, a considerable amount of domain knowledge must be incorporated into each system model.

A. State-of-the-art Thermal Modeling

Likewise, an LPTN design of sufficient accuracy typically incorporates expert knowledge from the application's domain as well as geometry and material information from manufacturers. These help to identify the parameter variation of thermal resistances and capacitances, yet the geometry is often approximated with simple shapes (cylinders, spheres) to reduce the modeling complexity of intricate component shapes. What is more, power losses are usually even more difficult to grasp, and significant research effort is put into their modeling [12], [13]. In order to overcome modeling errors imposed by simplified assumptions about the system, empirical measurement data (for instance, as collected on a test bench) is increasingly utilized, e.g., in the form of look-up tables or for fine-tuning analytically found initial parameter values [3], [12], [14]. The idea is depicted in Fig. 1.

Advancing the trend of fitting models on empirical data, machine-learning-based thermal modeling was proposed recently, where artificial neural networks (ANNs) are tuned to


Fig. 1. The data-driven state modeling process is depicted. a) The power system is either upgraded with measurement equipment or operated at a test bench, and reveals information about the system's input u and measured states y (e.g., interior temperatures). b) The recorded data set is used to tune a state model to estimate y. c) The tuned model is deployed on an embedded system and informs the control unit.

estimate temperatures of interest from a parallel stream of readily available sensor data, effectively harnessing statistical correlations between the latent temperatures and measurable quantities [15]–[18]. These methods are completely detached from thermodynamic theory and rely solely on empirical data. They necessitate neither expert knowledge nor geometry or material information for their design, while reaching the same level of accuracy as data-driven LPTNs. However, they incur an increased computational demand in real time due to additional feature engineering and a substantially larger number of model parameters [16]. Moreover, the learnt parameters of an ANN are not as interpretable as those of an LPTN. This implies that other systems, albeit similar to the one observed by the black-box model, would always require, again, empirical data for an appropriate temperature estimation unit. Furthermore, initial conditions cannot be set in black-box models, as their internal states are abstract, if there are any at all.

B. Lumped-Parameter Thermal Networks

Based on heat transfer theory, LPTNs are designed to estimate the temperatures of specific, spatially separated system components by representing heat flows through an equivalent circuit diagram, where parameters stand for thermal characteristics. An appropriate LPTN structure is usually derived from the heat diffusion equation [19]:

ρcp ∂ϑ/∂t = p + ∇ · (λ∇ϑ),  (1)

where ρ denotes the mass density, cp stands for the volume's specific heat at constant pressure, ϑ is the scalar temperature field, p is the thermal energy generation from other energy sources (such as electric or chemical energy), ∇ is the del operator across spatial dimensions, and λ represents the potentially nonconstant thermal conductivity. For steady-state heat transfer, a constant thermal resistance can then be derived analytically from geometry as well as boundary and initial conditions to describe one-dimensional heat flow that depends on the absolute temperature difference, much like current flow depends on potential differences in an equivalent electric circuit. In the multi-dimensional case, converting (1) into a spatial finite-difference equation is often helpful. Fig. 2 shows a thermal element, which is an LPTN's basic building block and represents one particular subdomain in the system. This simplified diagram discretizes a system's state space and converts the governing PDEs into ODEs.

Fig. 2. A typical thermal element consisting of a thermal capacitance, a power loss source, and multiple thermal resistances leading to other thermal elements. ϑi denotes the i-th element's average temperature.

However, the parameters in Fig. 2 are commonly operation-point-dependent rather than constant for most applications. Moreover, especially convective and radiative heat transfer processes exhibit highly nonlinear thermal characteristics, making the transient definition of thermal parameters a challenging endeavor [20], [21]. Usually, thermal parameters are thus denoted in dependence on a scheduling vector ζ(t). This vector contains not only measurement quantities that are available during operation but also all thermal states themselves, which will be discussed later.

In order to avoid incorporating geometric specifications upfront, a general-purpose LPTN structure of the following form is considered as the starting point for subsequent designs: the i-th thermal element in a fully interconnected LPTN of m-th order obeys the ODE

Ci(ζ(t)) dϑi/dt = Pi(ζ(t)) + Σ_{j∈M\i} (ϑj − ϑi) / Ri,j(ζ(t)) + Σ_{j=1..n} (ϑ̌j − ϑi) / Ri,j(ζ(t)),  (2)

where C denotes the thermal capacitance, P the power loss, Ri,j the bidirectional thermal resistance between the temperatures at nodes i and j, and ϑ the actual temperature at time t, with M = {1, 2, . . . , m}. Moreover, temperatures that are measurable during operation (typically ambient and coolant temperatures) can be incorporated as n further ancillary nodes, each merely consisting of an equivalent thermal source ϑ̌.
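To make the structure concrete, a minimal NumPy sketch of evaluating the right-hand side of (2) for all m nodes at once follows; the function and argument names are illustrative, and the thermal parameters are assumed to have already been evaluated at the current ζ(t):

```python
import numpy as np

def lptn_rhs(theta, theta_anc, P, C, R):
    """Evaluate dϑ/dt of (2) for all m nodes at once (minimal sketch).

    theta     : (m,)     node temperatures ϑ
    theta_anc : (n,)     measured ancillary temperatures (thermal sources ϑ̌)
    P         : (m,)     power losses, evaluated at the current ζ(t)
    C         : (m,)     thermal capacitances, evaluated at the current ζ(t)
    R         : (m, m+n) bidirectional thermal resistances at the current ζ(t)
    """
    theta_all = np.concatenate([theta, theta_anc])   # (m+n,)
    dT = theta_all[None, :] - theta[:, None]         # (ϑ_j − ϑ_i), shape (m, m+n)
    q = dT / R                                       # heat flow through each R_{i,j}
    np.fill_diagonal(q[:, :theta.size], 0.0)         # a node exchanges no heat with itself
    return (P + q.sum(axis=1)) / C                   # dϑ_i/dt per node
```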

It will prove beneficial for the following sections to rearrange (2) into a quasi-linear parameter-varying (quasi-LPV) state-space representation encompassing all nodes in matrix form:

dx/dt = A(ζ(t)) x + B(ζ(t)) u(ζ(t)),
y = C x,  (3)

with

x = ϑ = [ϑ1 . . . ϑm]ᵀ,
u = [P1(ζ(t)) . . . Pm(ζ(t)) ϑ̌1 . . . ϑ̌n]ᵀ,
A ∈ ℝ^(m×m), B ∈ ℝ^(m×(m+n)), C ∈ ℝ^(q×m),

where A, B, and C are the system, input, and output matrices, respectively. The output matrix C is typically a masking operation allowing the system output y to encompass fewer


temperatures than are being modeled, with q ≤ m. Identification of A(ζ(t)) and B(ζ(t)) is central to thermal modeling with LPTN structures.

Furthermore, as mentioned in the previous section, determining the dependence of Pi on ζ(t) is usually a hard task: there is often only little sensor information (e.g., the electrical current at the terminals of a power electronic converter or electric machine) from which a complex nonlinear spatial power loss distribution is to be inferred. Finding this mapping is notoriously challenging in addition to identifying A(ζ(t)) and B(ζ(t)).
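For illustration, once some parameter maps are identified, simulating (3) reduces to re-evaluating them at each sample instant; `A_fn`, `B_fn`, and `u_fn` below are hypothetical placeholders for such maps, and the forward-Euler scheme is one simple choice:

```python
import numpy as np

def simulate(x0, zeta_traj, Ts, A_fn, B_fn, u_fn):
    """Forward-Euler roll-out of (3); A_fn/B_fn/u_fn are placeholder maps
    from the scheduling vector ζ to the system matrices and input."""
    x = np.asarray(x0, dtype=float)
    xs = [x]
    for zeta in zeta_traj:                       # one ζ per sample instant
        dxdt = A_fn(zeta) @ x + B_fn(zeta) @ u_fn(zeta)
        x = x + Ts * dxdt                        # first-order Euler step
        xs.append(x)
    return np.stack(xs)
```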

C. State-Space Machine Learning

Recently, a paradigm shift emerged promoting the combination of general-purpose machine learning (ML), stemming from computer science, and prior scientific knowledge in the form of PDEs commonly appearing in nature, instead of contemplating these as distinct problem sets. Leveraging the potential of automatic differentiation [22], various data-driven approaches were presented for both the discovery and solution of nonlinear PDEs [23]–[26]: a prominent idea is the physics-informed ANN (PINN), which adds any PDE of interest to a standard ANN's cost function [25]. While a fitted PINN has been shown to respect PDEs with certain boundary and initial conditions, its convergence properties are not fully understood yet, and, more notably, its architecture is unaltered compared to usual ANNs. This makes a PINN as noninterpretable as typical black-box models.

Conversely, neural ordinary differential equations denote initial value problems of the form

dx/dt = gθ(x, u, t),  (4)

with x denoting a system's state at time t, u the system's input, and gθ being a neural network function parameterized with θ [24]. Although this is a first step toward applying an outer structure to an ANN that imitates the nonlinear workings of ODEs, it relies too heavily on the universal approximation theorem [27], and prior scientific knowledge of fundamental mechanisms is not directly incorporated.

This shortcoming is addressed by universal differential equations (UDEs), which extend the ODE case (4) to

dx/dt = f(x, u, t, gθ(x, u, t)),  (5)

where f represents arbitrary known mechanisms of the underlying physical problem, in which difficult-to-model parts are approximated by any universal approximator gθ [26]. Note, however, that except for some concrete examples with very specific ANN constellations, there is no rule-based procedure proposed for how to obtain an appropriate UDE given a nonlinear system. Due to the broadly outlined definition of a UDE, most modeling problems still remain to be solved by experts in the corresponding domain.
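As a toy instance of (5) with scalar state and input, the known part f could be a first-order cooling law while a small, untrained MLP stands in for gθ; the time constant and all weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 3)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)) * 0.1, np.zeros(1)

def g_theta(x, u, t):
    """Universal approximator gθ: a tiny MLP (weights untrained here)."""
    z = np.tanh(W1 @ np.array([x, u, t]) + b1)
    return float(W2 @ z + b2)

def f(x, u, t):
    """UDE right-hand side (5): known first-order cooling toward the
    input u, with gθ covering the unmodeled remainder."""
    tau = 100.0                      # assumed known time constant
    return (u - x) / tau + g_theta(x, u, t)
```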

D. State-Space Neural Networks

The neural ODE (4) strongly resembles the state-space neural network (SSNN), which was proposed within the control theory domain over two decades ago [28]. SSNNs were an early attempt to combine ANNs with state-space representations of general nonlinear, dynamic systems, in order to leverage well-established stability and convergence theorems together with the nonlinear modeling capability of ANNs. They describe the difference equation

x[k + 1] = gθ(x, u, k),
y[k] = hθ(x, u, k),  (6)

where gθ and hθ are nonlinear functions represented by potentially the same ANN with parameters θ at discrete time k. Obviously, the only difference to neural ODEs (4) is the discrete-time difference equation form, but, again, the universal approximation theorem is relied on completely here. Following this, most contributions deriving from SSNNs employ some form of recurrent ANN (RNN)

gθ(x, u, k) = σ(Wr x[k] + Wh u[k] + b),  (7)

with W{r,h} being the recurrent and forward weight matrices, b the bias vector, and σ denoting the RNN's activation function [29], [30]. While the computer-science-based movement on ML estimation of state derivatives does not restrict the topology of the state estimator to any form (e.g., a certain number of layers, standard vs. convolutional ANNs, etc.), SSNN derivatives are often proposed with those settings fixed. This might stem from the desire to be compatible with existing control theorems, yet it could also originate from the lack of automatic differentiation tools, which render training of large ANNs possible and emerged just a decade ago during the revived interest in neural networks.

However, the idea of estimating x[k + 1] under a certain time discretization is central to this article as well, albeit here the chosen topology is more sophisticated than (7) and tailored to heat transfer processes. As will be pointed out in the next section, this contribution is motivated by (5) and informed by (3).

E. Contribution

In this work, the thermal neural network (TNN) is proposed, which facilitates physically reasoned heat transfer mechanics and can be applied to any system that exposes component temperatures of particular interest. Unifying the concepts of LPTNs and UDEs, a TNN is universally applicable across all relevant engineering domains that deal with heat transfer. In particular, it integrates the following advantages into one modeling approach:
• Physically interpretable model states are featured, which make initial condition adjustment simple.
• TNNs are end-to-end differentiable, which enables fast, parallelized, and distributed training scenarios with automatic differentiation frameworks.
• Neither expert knowledge nor material or geometry information is required for the design.
• Being completely data-driven, parasitic real-world effects are considered through observational data.
• A quasi-LPV system is identified where unknown relationships between scheduling variables and system matrices are automatically discovered.


• Even relatively small model sizes achieve high estimation accuracies, which facilitates real-time capability.

This article begins with the working principle of a TNN, followed by a demonstration of its modeling capacity through cross-validation on a representative data set. A kernel-based hyperparameter optimization is described, and opportunities for model reduction are shown. Finally, the TNN's initial condition adjustment and its recovery properties are demonstrated. To assist related work, all code is published on GitHub1.

II. THERMAL NEURAL NETWORKS

Circumventing the intricate and error-prone design effort revolving around the identification of the functions underlying the thermal equivalent parameters, the following architecture leverages the highly nonlinear modeling capacity of ANNs. It learns the existing physical relationship between temperatures and other sensor information without geometry or material information, just from measurement data and the general structure of an LPTN. Building on the concept of (5), the thermal neural network (TNN) reformulates (2) after a first-order Euler discretization to

ϑ̂i[k+1] = ϑ̂i[k] + Ts κi[k] (πi[k] + Σ_{j∈M\i} (ϑ̂j[k] − ϑ̂i[k]) γi,j[k] + Σ_{j=1..n} (ϑ̌j[k] − ϑ̂i[k]) γi,j[k]),  (8)

with ϑ̂i[k] denoting the i-th node's normalized temperature estimate at discrete time k, Ts being the sample time, and κi, πi, as well as γi,j denoting arbitrary feed-forward ANN outputs dependent on ζ[k] = [ϑ̌ᵀ ϑ̂ᵀ ξᵀ]ᵀ, which in turn consists of the ancillary temperatures ϑ̌[k], the temperature estimates ϑ̂[k], and additional observables ξ[k] ∈ ℝᵒ. More specifically, γ(ζ) approximates all thermal conductances between components of interest, π(ζ) all power losses within these components, and κ(ζ) determines all inverse thermal capacitances. Common observables are, e.g., the current and voltage vector as well as the motor speed in an electric drive train.

Without loss of generality, κ(ζ) can usually be reduced to just end-to-end trainable constants θc in the form

κ = 10^θc,  θc ∈ ℝᵐ,

whereas

π : ℝ^(m+n+o) → ℝᵐ,
γ : ℝ^(m+n+o) → ℝ^((m+n)(m+n−1)/2)

are distinct ANNs. This reduction corresponds to a simplification through the lumped capacitance method, which is feasible in many engineering applications [19]. The TNN concept is sketched in Fig. 3.
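A minimal NumPy sketch of one discretized step (8) may make the cell mechanics concrete; `gamma_fn` is assumed here to already return the conductances expanded to an (m, m+n) matrix rather than the flat vector above, and all names are illustrative:

```python
import numpy as np

def tnn_step(theta_hat, theta_anc, xi, Ts, kappa_fn, pi_fn, gamma_fn):
    """One Euler step of (8) for all m nodes (minimal sketch).

    theta_hat : (m,)  current temperature estimates ϑ̂[k] (the cell state)
    theta_anc : (n,)  measured ancillary temperatures ϑ̌[k]
    xi        : (o,)  additional observables ξ[k]
    kappa_fn, pi_fn, gamma_fn : callables mapping ζ[k] to κ, π, and the
        conductances, assumed pre-expanded to an (m, m+n) matrix here
    """
    zeta = np.concatenate([theta_anc, theta_hat, xi])   # scheduling vector ζ[k]
    theta_all = np.concatenate([theta_hat, theta_anc])
    G = gamma_fn(zeta)                                  # (m, m+n) conductances
    dT = theta_all[None, :] - theta_hat[:, None]        # ϑ_j − ϑ̂_i
    heat_in = (G * dT).sum(axis=1)                      # conductive/convective terms
    return theta_hat + Ts * kappa_fn(zeta) * (pi_fn(zeta) + heat_in)
```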

1https://github.com/wkirgsn/thermal-nn

Fig. 3. A TNN considers all input features of the current time step φ[k] to estimate the next time step's temperatures ϑ̂[k+1]. This estimate acts as cell information within the TNN to further inform the three function approximators γ, κ, and π (thermal conductances, inverse thermal capacitances, and power losses). An LPTN structure of third order with an ancillary thermal source is exemplarily depicted.

Hence, except for κ, the general algorithmic shape is described by the multilayer perceptron (MLP) algorithm fed by ϑ̂ and φ = [ϑ̌ᵀ ξᵀ]ᵀ:

h^(0)[k] = σ^(0)(Wr ϑ̂[k] + Wh^(0) φ[k] + b^(0)),
h^(l)[k] = σ^(l)(Wh^(l) h^(l−1)[k] + b^(l)),  ∀ l > 0,
gθ[k] = h^(L−1)[k],  (9)

with σ^(l)(·) denoting the nonlinear activation function at layer l, and θ = {Wr, Wh^(l), b^(l) : l ∈ [0, L − 1]} describing the trainable parameters, which exist independently for both π and γ. Note that if one finds the assumptions under the lumped capacitance method to be too simplistic for the task, lifting κ to the MLP form would be an appropriate first measure. Rearranging (8) into matrix form similar to (3) and substituting (κ, π, γ) with (9) reveals the relationship between the scheduling variables and the quasi-LPV system matrices as modeled by a TNN.
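A sketch of one such estimator branch in Keras (the framework used in Sec. III) could look as follows; feeding the concatenated [ϑ̂, φ] through one dense layer is equivalent to the separate Wr and Wh terms of (9), while the layer sizes and the biased-Elu output are illustrative choices, not the published architecture:

```python
import tensorflow as tf

def build_branch(m, n, o, out_dim, hidden=(16,), act="tanh"):
    """One thermal parameter estimator per (9); sizes are illustrative."""
    inp = tf.keras.Input(shape=(m + n + o,))     # ζ[k] = [ϑ̌ᵀ ϑ̂ᵀ ξᵀ]ᵀ
    h = inp
    for width in hidden:
        h = tf.keras.layers.Dense(width, activation=act)(h)
    out = tf.keras.layers.Dense(out_dim)(h)
    # biased Elu keeps conductances/losses nonnegative (cf. Sec. III-A)
    out = tf.keras.layers.Lambda(lambda x: tf.nn.elu(x) + 1.0)(out)
    return tf.keras.Model(inp, out)

# e.g., γ for m = 4 targets, n = 2 ancillary nodes, o = 3 observables:
gamma = build_branch(4, 2, 3, out_dim=(4 + 2) * (4 + 2 - 1) // 2)
```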

In ML terminology, the TNN is an RNN with one single layer, whose cell consists of three (arbitrarily complex) ANNs. It has a single cell state vector, being the estimated temperatures and the TNN's output for the next time step, much like an SSNN with C being the identity matrix. Dedicating three different function approximators to the thermal characteristics of an LPTN and acting on the discretized ODE rather than on the difference equation is what distinguishes the TNN from an SSNN.

1) Training Remarks: It is well known that training RNNs with the error backpropagation method suffers from the vanishing and exploding gradient problem [31]. The problem describes the phenomenon of gradients becoming either very large or very close to zero the longer the sequence over which gradients are accumulated for the subsequent weight update. This is especially severe for the arbitrarily long sequences on which


thermal models conduct estimates. Early on in the ML literature, this problem was curbed with the introduction of memory blocks, like the long short-term memory (LSTM) block, that would expand each RNN neuron [32]. However, the general-purpose topology of an LSTM is in conflict with the physically motivated TNN, such that a TNN has to overcome this hurdle with other methods, e.g., gradient normalization, clipping, and truncated backpropagation through time (TBPTT) [33].
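A minimal TensorFlow sketch of one truncated-BPTT update with global-norm gradient clipping is given below; `cell` stands for any stateful step function, such as a TNN cell, and all names and shapes are illustrative:

```python
import tensorflow as tf

opt = tf.keras.optimizers.Nadam(1e-2)

def tbptt_update(cell, params, x_seq, y_seq, state, clip_norm=1.0):
    """One truncated-BPTT update over a chunk of L_TBPTT samples."""
    with tf.GradientTape() as tape:
        loss = 0.0
        for x, y in zip(x_seq, y_seq):          # unroll the chunk only
            state = cell(x, state)
            loss += tf.reduce_mean(tf.square(state - y))
        loss /= len(x_seq)
    grads = tape.gradient(loss, params)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm)  # curb exploding grads
    opt.apply_gradients(zip(grads, params))
    return tf.stop_gradient(state)              # truncate: detach the state
```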

2) Physical Interpretation: Since no information about the geometry or material of the system is assumed to be known a priori, a fully connected LPTN denotes the starting point of the TNN design, i.e., A and B are dense matrices. For a thermal model that is modeling m+n different temperatures, this translates to (m+n)(m+n−1)/2 thermal conductances and, thus, a parameter size complexity of O((m + n)²). In practice, however, several components within a system are often sufficiently detached such that the heat transfer between them is negligible. Featuring physically interpretable states, a fitted TNN can evidence such prunable connections, which can be removed in a subsequent TNN design iteration. This parameter reduction method is presented in Sec. III-C.

3) Domain Knowledge: If domain knowledge is available, it can be incorporated in several ways: for instance, by constraining the choice of topology-defining parameters like the activation functions of the output layers of γ, κ, and π, or by means of feature engineering or elimination per thermal parameter estimator. Without loss of generality, it can be recommended to apply the ℓ1-norm to all the ANNs' outputs, as negative thermal conductances, capacitances, and power losses are rarely physically plausible.

4) Initial Condition Adjustment: As is usual for state-space models, the TNN structure asks for an initial temperature estimate. In simulations and training, these initial values can be set to the ground truth values, while in a field application the ambient temperature is a good approximation. Note that this initial conditioning is not possible for black-box models, which have completely abstract parameter structures. The ability to set the initial condition is thus to be seen as a major advantage in favor of models representing a dynamic state space. The severity of a biased initial guess is investigated in Sec. III-D.
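In code, this adjustment is a short warm start of the cell state, e.g. (values illustrative, temperatures normalized as in Sec. III):

```python
import numpy as np

m = 4                                   # target nodes: PM, SY, ST, SW
theta_ambient = 25.0 / 100.0            # measured ambient, normalized by 100 °C
theta_hat = np.full(m, theta_ambient)   # initial cell state ϑ̂[0]
```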

III. EXPERIMENTS

In order to demonstrate the TNN concept's efficacy, several experiments are conducted on a publicly available data set2, which was also utilized in [12], [15], [16], [34]. This data set represents 185 h of multivariate measurements sampled at 2 Hz from a 52 kW three-phase automotive traction permanent magnet synchronous motor (PMSM) mounted on a test bench, which is torque-controlled while its motor speed is determined by a speed-controlled load motor (210 kW), and which is fed by a 2-level IGBT inverter. All measurements were recorded by dSPACE analog-digital converters (ADCs) synchronized with the control task. Tab. I compiles the most important test bench parameters.

2https://www.kaggle.com/wkirgsn/electric-motor-temperature

The contour plot in Fig. 4 depicts the excitation within the data set, which is characterized by both constant and randomized excitation.

TABLE I
TEST BENCH PARAMETERS

Motors                      | device under test       | load motor
Type                        | PMSM                    | induction machine
Power rating (nom./max.)    | 22/52 kW                | 160/210 kW
Voltage rating              | 177 V                   | 380 V
Current rating (nom./max.)  | 110/283 A               | 293/450 A
Torque rating (nom./max.)   | 110/250 Nm              | 380/510 Nm
Pole pair number            | 8                       | 1
Cooling                     | water-glycol            | forced air
Thermocouples               | type K, class 1         | -

Inverter                    | 3× SKiiP 1242GB120-4DW
Topology                    | voltage source inverter, 2-level, IGBT
Inverter interlocking time  | 3.3 µs

Controller hardware         | dSPACE
Processor board             | DS1006MC, 4 cores, 2.8 GHz

Fig. 4. An interpolated 2D histogram of visited operation points is shown in the motor-speed-torque plane. Brighter areas were visited more often.

The measured permanent magnet surface temperature, averaged across four sensors around a single permanent magnet, is assumed to be representative of the rotor temperature and is transmitted via a telemetry unit. For all experiments, sensor information is classified in the following way:

ξ = [us is ωmech]ᵀ,
ϑ̌ = [ϑa ϑc]ᵀ,
x = y = [ϑPM ϑSY ϑST ϑSW]ᵀ,

with us and is being the voltage and current vector norms, respectively; ωmech denoting the mechanical angular frequency; and ϑ{a, c, PM, SY, ST, SW} representing the temperatures of the ambient, coolant, permanent magnet, stator yoke, stator tooth, and stator winding, respectively. An exemplary cross-section of a PMSM is sketched in Fig. 5, highlighting the sensors' geometric arrangement. ϑ̌ is assumed to be the system's boundary, i.e., no heat transfer to other environmental temperatures is contemplated.

As is usual for ANN training, all quantities are normalized: is is divided by 100 A, us by 130 V, ωmech by 2π·6000 min⁻¹, and all temperatures by 100 °C. No further feature engineering is employed, which would be common for pure black-box


Fig. 5. Exemplary and simplified PMSM cross-section with two pole pairs, showing the stator yoke, stator tooth, stator winding, permanent magnets, air gap, cooling duct, and ambient.

alternatives, where φ would effectively grow to over 100 elements [16]. The lack thereof can be seen as a computational advantage in favor of TNNs.
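The normalization itself is plain rescaling; a sketch with the divisors stated above (the speed divisor follows the text's 2π·6000 min⁻¹ literally):

```python
import numpy as np

def normalize(u_s, i_s, w_mech, temps):
    """Scale raw sensor data to roughly [0, 1] per Sec. III."""
    return (u_s / 130.0,                  # voltage vector norm, V
            i_s / 100.0,                  # current vector norm, A
            w_mech / (2 * np.pi * 6000),  # ω_mech given per minute, per the text
            temps / 100.0)                # all temperatures, °C
```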

All models are evaluated for their estimation performance by the mean squared error (MSE) between estimated and ground truth trajectories, averaged across the targeted temperatures. Proper cross-validation is ensured and described below. Throughout this work, TensorFlow 2 [35] is used to train and investigate TNNs.

A. Automated Architecture Optimization

Since a TNN only loosely defines the topological extent of its three thermal parameter estimators γ, κ, and π, there is much room for hyperparameter optimization (HPO).

1) Search Intervals: Besides topological factors, optimization-related hyperparameters are optimized as well. An overview of all hyperparameters, their search intervals, and the found optima is compiled in Tab. II. Generally, interval bounds were chosen to restrict parameter size growth into dimensions that would no longer be reasonably real-time computable.

The set of candidate activation functions comprises those common in the literature and the biased Elu σbiasedElu(·) = σelu(·) + 1, in order to produce nonnegative outputs. P|| represents a flag that signals the expansion of π into dedicated MLP branches, one per target component (compare Fig. 6). The TBPTT length LTBPTT denotes how many forward passes are accumulated and averaged over for each backward pass during training with error backpropagation. Splitting all measurement profiles into subsequences of a certain length LS, where each subsequence is treated as an independent record, is indicated by the flag S.

2) Cross-Validation Scheme: The biggest stumbling block in the endeavor to achieve high estimation accuracy in data-driven modeling may be overfitting. It describes the fact that, during training, any model of some expressivity can start to memorize noise in the data in order to further minimize the underlying loss. Appropriate cross-validation schemes need to be determined a priori in order to avoid such overly optimistic generalization error assertions.

In this work, the investigated data set consists of several measurement records of different lengths. A common approach

Fig. 6. The HPO is given the opportunity to multiply the number of model parameters in π (from (a) to (b)) by dedicating a single MLP structure to each targeted component, a technique that has proven useful in [15]. Given the higher number of thermal conductances compared to targets, this scheme is not contemplated for γ.

Fig. 7. Cross-validation strategy, where each square represents a measurement profile: the full data set is split into training, validation, and test sets across two folds, which guide the HPO, plus a disregarded generalization set that yields the final generalization error.

is to split those records into different sets dedicated to either training or validating the model. The following scheme is applied: First, no record should be present in both the training and testing sets at the same time, in order to preclude the exploitation of record-specific anomalies. Second, two so-called folds of mutually exclusive sets are defined, comprising a number of profiles that act as either validation or test set depending on the current iteration. Neither the validation nor the test set is observed by the model during training. While the validation set is evaluated after each training epoch to apply early stopping, the test set is evaluated only once after training to obtain an overall score. The eventual score is the average across both folds. Last, considering that the HPO is another source of overfitting, but with respect to hyperparameters, another set of profiles is excluded from training and validation during the HPO; it is called the generalization set, see Fig. 7. Having found optimal hyperparameters, validating a model against this last set gives true generalization errors.

Furthermore, the training outcome of ANNs is sensitive to their random initialization. The same set of hyperparameters might end training at very different performance values depending on whether the first weight values are fortunate or not. For this reason, during HPO, each training fold is repeated ten times with different random seeds, and the score is averaged across them in order to assess the mean quality of a certain hyperparameter set.
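A sketch of the described profile-wise split follows; all profile identifiers are placeholders, and the ten-seed repetition happens per returned fold:

```python
def make_folds(profile_ids, set_a, set_b, gen_set):
    """Two mutually exclusive sets swap validation/test roles across folds;
    the generalization set never enters the HPO loop."""
    held_out = set(set_a) | set(set_b) | set(gen_set)
    train = [p for p in profile_ids if p not in held_out]
    # fold 1: set_a validates (early stopping), set_b is tested once;
    # fold 2: roles swapped; the final score is the mean over both folds
    folds = [(train, list(set_a), list(set_b)),
             (train, list(set_b), list(set_a))]
    return folds, list(gen_set)
```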

3) Search Results: The HPO's sampling strategy is that of Optuna's tree-structured Parzen estimator (TPE) [36], [37]. Hyperparameter sets were independently fetched and evaluated in parallel by 60-80 workers on a high-performance computing cluster. The HPO error trend is shown in Fig. 8.
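A hedged sketch of how such a study can be set up with Optuna's TPE sampler follows; `train_and_score` is a stand-in for the actual training routine (not the authors' code), and only a small subset of the 26 hyperparameters is sampled:

```python
import optuna

def train_and_score(n_layers, n_units, lr):
    """Placeholder for: train the TNN on both folds, ten seeds each,
    and return the mean MSE in K² (dummy formula for illustration)."""
    return (n_layers - 2) ** 2 + abs(n_units - 8) * 0.1 + abs(lr - 1e-2)

def objective(trial):
    # illustrative subset of the search space of Tab. II
    n_layers = trial.suggest_int("n_layers_pi", 1, 3)
    n_units = trial.suggest_int("n_units_pi", 2, 128, log=True)
    lr = trial.suggest_float("lr", 1e-5, 1.0, log=True)
    return train_and_score(n_layers, n_units, lr)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)
```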


TABLE II
HYPERPARAMETER SEARCH INTERVALS AND RESULTS

Symbol       | Hyperparameter                              | Search interval                                 | Optimum
L_{π,γ}      | No. hidden layers in {π, γ}                 | [1, 3]                                          | {3; 3}
n^(l)_{π,γ}  | No. hidden units per layer in {π, γ}        | [2, 128]                                        | {109→3; 2→2}
σ^(l)_{π,γ}  | Activation function per layer in {π, γ}     | sigmoid, tanh, linear, ReLU, biased Elu, sinus  | {ReLU→sigmoid→sigmoid; tanh→sinus→biased Elu}
α^(l)_{π,γ}  | ℓ2 regularization rate per layer in {π, γ}  | [10⁻⁹, 10⁻¹]                                    | {10⁻⁸→10⁻⁸; 5.5·10⁻⁶→5.3·10⁻⁸}
P_||         | Dedicated branches in π                     | [True, False]                                   | False
η            | Initial learn rate                          | [10⁻⁵, 1]                                       | 11·10⁻³
L_TBPTT      | TBPTT length in samples                     | [8, 2048]                                       | 1227
S            | Split profiles into subsequences            | [True, False]                                   | False
L_S          | Subsequence length in hours                 | [0.5, 4]                                        | -
β            | Optimizer                                   | Adam, Nadam, Adamax, SGD, RMSProp               | Nadam

The overall best hyperparameter set is reported in Tab. II, and its performance on the generalization set is displayed in Fig. 9.

Comparisons to a hyperparameter-optimized MLP, an equally optimized 1-D convolutional neural network (CNN), a likewise optimized LPTN, and a simple ordinary least squares (OLS) algorithm can be found in Tab. III. All numbers are compiled from an evaluation on the generalization set. Note that for the pure ML algorithms the additional feature entries have to be computed in real time rather than being measured, which, next to the model size, represents an additional computational demand on the embedded system.

TABLE III
STATE-OF-THE-ART THERMAL MODELING PERFORMANCE METRICS AS EVALUATED ON THE GENERALIZATION SET

                   | MSE (K²) | ℓ∞ (K) | LUT entries | Feature entries | Model size
MLP [16]           | 5.58     | 14.29  | 0           | 81              | 1444
OLS [16]           | 4.47     | 9.85   | 0           | 81              | 328
CNN [15]           | 4.43     | 15.54  | 0           | 81              | 4916
LPTN [3]           | 3.64     | 7.37   | 12810       | 4               | 34
TNN (small)        | 3.18     | 5.84   | 0           | 5               | 64
TNN (HPO optimum)  | 2.87     | 6.02   | 0           | 5               | 1525

B. Model Size Pareto Front

The foregoing HPO was guided by estimation accuracy alone, without considering model size. However, in real-world applications, real-time capability and, thus, small, lightweight models play an important role, as they are predominantly deployed on best-cost hardware. In the following, a random search over a grid of a TNN's topological settings is conducted, keeping the other hyperparameters at the optimum, in order to shape an informative Pareto front. This shall give an overview of the trade-off between model size and estimation accuracy, see Fig. 10. The number of layers and the number of units per layer for both γ and π vary from 1 to 3 and from 1

Fig. 8. The HPO error trend over trials on the electric motor data set is shown (MSE in K² over the number of trials, with evaluations colored by how many trials informed them). Due to parallel execution, subsequent evaluations are not necessarily strictly linearly informed by previous evaluations. Each trial denotes the average across ten random seeds on each of two test folds. The best MSE found is 1.9 K².

to 128, respectively, and independently of each other. The number of units is varied exponentially in order to bound the combinatorial growth of the grid size.

It is evident that for very small model sizes the accuracy degrades only insignificantly. Thus, the found hyperparameters provide robust guidance for the design of TNNs that is hardly influenced by model size. A topology with 2 neurons in an intermediate layer for each thermal parameter approximator, which results in just below 100 model parameters, seems to be a good trade-off between accuracy and model size.

It might appear incoherent that the kernel-based HPO optimum does not touch the Pareto front of the grid search. This has two simple reasons: First, the foregoing HPO had to sample from a substantially larger search space of 26 hyperparameters, while the grid search sampled from just two, such that the over 2000 candidates could have easily missed better local optima. Second, the TPE sampling method was guided by the averaged estimation accuracy on two test sets, which might expose different peculiarities than the generalization set.


Fig. 9. The performance of the found optimum on the generalization set is depicted for the four target temperatures (PM: MSE 4.34 (°C)², ‖e‖∞ 6.02 °C; SY: MSE 1.94 (°C)², ‖e‖∞ 5.58 °C; ST: MSE 2.94 (°C)², ‖e‖∞ 5.60 °C; SW: MSE 2.26 (°C)², ‖e‖∞ 5.82 °C). Dashed vertical lines separate independent profiles. Selected sensor quantities (motor speed, torque, ambient temperature, coolant temperature) are depicted at the bottom for reference.

Fig. 10. The MSE on the generalization set over different topological settings is illustrated (model size on a logarithmic axis), showing Pareto fronts and individual evaluations for both the all-conductance and the pruned variant, plus the HPO optimum. Here, L_{π,γ} and n^(l)_{π,γ} are independently sampled for both γ and π.

C. Thermal Conductance Sparsification

Having iterated over many TNN design parameters, a vast source for conductance analysis is available. In an attempt to further reduce the model size, and motivated by the fact that not all components within an intricate system necessarily offer direct heat paths to each other, the median activation of each conductance is investigated. Fig. 11 shows an inter-conductance plot between all available temperatures, where line widths denote the median across a subset of all compiled γ networks' median outputs when exposed to random uniform input φ ∈ ℝⁿ⁺ᵒ on the interval [0, 1.3]. Only those models that achieved an average MSE below 5 K² across the components are considered. The connections illustrated as blue, dashed lines are candidates for pruning.

It becomes apparent that the highest thermal conductances exist between the coolant and stator temperatures, peaking between the stator yoke and coolant. This is physically plausible

Fig. 11. The median conductance values output on random input, averaged over the γ networks derived from all successful experiments (i.e., MSE < 5 K²), are visualized between the nodes SY, ST, SW, PM, ambient (a), and coolant (c). Line thickness scales linearly with conductance value. Conductances under a certain threshold are drawn as blue, dashed lines and are subject to pruning.

as those two components are spatially attached and form a substantial, convective heat transfer path (see Fig. 5). The weakest connections exist between the ambient and rotor temperatures and all others alike. The marginally significant ambient temperature conductance can be explained by its remote operating area. However, the small conductances around the permanent magnet temperature are counter-intuitive at first, but become more plausible given that temperatures in the rotor are more power-loss-driven and, thus, determined by internal heat generation rather than by conductive or convective heat transfer. This hypothesis, if not verified in literature yet, is further corroborated by this temperature's small time constant, best observed during cooling phases (refer to Fig. 9).

The conductance analysis represents one of many benefits a TNN exhibits. Without prior knowledge of the geometry or material in the system, the TNN learns the physical thermal relationships between measured nodes solely from data.

With the inferred conductance relevance, those under a certain threshold are pruned, and another random search over topological settings is conducted and shown in Fig. 10. The minimum number of model parameters reduces from 60 to 50 due to conductance pruning, which corresponds to one hidden layer of one neuron before each of the output layers of γ and π. Note that this could be reduced further through feature selection, which is out of this work's scope. Keeping the model size constant, sparsification also has a beneficial effect on accuracy, because more parameters can be used to model fewer nonlinear function behaviors.
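A sketch of the underlying pruning analysis follows; the probe dimension and threshold are placeholders, and each γ model is assumed to be callable on a batch and to return the flat conductance vector:

```python
import numpy as np

def prune_candidates(gamma_models, in_dim, n_draws=1000, thresh=0.05):
    """Probe fitted γ networks with random uniform inputs on [0, 1.3],
    take the median response per conductance and model, then the median
    across models; values below the threshold are pruning candidates."""
    rng = np.random.default_rng(0)
    probe = rng.uniform(0.0, 1.3, size=(n_draws, in_dim))
    per_model = np.stack([np.median(g(probe), axis=0) for g in gamma_models])
    medians = np.median(per_model, axis=0)   # one value per conductance
    return medians, medians < thresh         # mask of prunable connections
```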

D. Detuned Initial Conditions

Another advantage of TNNs is the possibility to set the initial condition through the cell state, which is not straightforward for usual ML methods. Determining the initial condition is trivial in simulations and approximately equal to the ambient temperature in real-world scenarios. However, if this initial guess is inaccurate, how long does it take to recover to a decent estimate?

Fig. 12 exemplifies different scenarios with varying detuned initial estimates. It becomes evident that the worst-case estimate recovers to a ±10 °C error band (red-shaded area) in under 30 min for the stator temperatures, and in under 45 min for the rotor temperature. The higher the thermal capacitance of a component, the slower the convergence is to be expected. This is acceptable for most applications considering the low risk of a severely flawed initial condition approximation. Similar observations can be made for pure LPTN models.

Finally, it can be noted that a TNN, being quasi-linear in its state-space representation, lends itself as an internal model for state observers (Kalman, Luenberger) as soon as parts of the target temperatures are (virtually) measured in a field application [38], [39]. Standard observer types for linear systems would already suffice, while the internal system model is updated by the TNN. Given the external feedback, deviations during estimation and at initialization could be reduced substantially more rapidly.

IV. CONCLUSION

A hybrid thermal model for real-time temperature estimation, the so-called thermal neural network, is presented in this work, which exhibits the benefits of both worlds: consolidated knowledge of heat transfer phenomena and nonlinear, data-driven function approximation with ML tools. Due to a TNN's outer LPTN structure, which lifts the black-box model to a quasi-LPV model, it is applicable everywhere an LPTN would be deemed appropriate, but with additional adaptivity to parasitic real-world effects as observed in measurement data. For the same reason, initial conditions can be set, and model states are physically interpretable, allowing conclusions about the thermal characteristics of the measured system. Being a composite of three general-purpose function approximators, a TNN is end-to-end differentiable such that

Fig. 12. The impact of misconfigured initial conditions on estimation trajectories is demonstrated. The ground truth trajectory is accompanied by a red-shaded ±10 °C error band.

modern automatic differentiation frameworks are employable. Neither expert knowledge nor motor data sheet information is necessary to design a TNN, but they can help streamline its structure. Having these properties combined in one model is unique throughout the thermal modeling literature to the best of the authors' knowledge.

One disadvantage a TNN shares with common ML approaches is the dependence on measurement data, although transferring knowledge from one trained TNN model to another, likewise manufactured system is indeed possible. Moreover, data is becoming more and more available nowadays, even in power system domains.

Future investigations shall consider applying the UDE and SSNN principles to other domains of energy conversion where latent states are estimated in real time, e.g., battery systems' state of charge, or induction motors' differential inductances and magnetic flux linkage. Applications to reinforcement learning controllers, especially in power systems, are obvious and promising as well.

REFERENCES

[1] M. Iachello, V. De Luca, G. Petrone, N. Testa, L. Fortuna, G. Cammarata, S. Graziani, and M. Frasca, "Lumped Parameter Modeling for Thermal Characterization of High-Power Modules," IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 4, no. 10, pp. 1613–1623, 2014.

[2] A. S. Bahman, K. Ma, P. Ghimire, F. Iannuzzo, and F. Blaabjerg, "A 3-D-Lumped Thermal Network Model for Long-Term Load Profiles Analysis in High-Power IGBT Modules," IEEE Journal of Emerging and Selected Topics in Power Electronics, vol. 4, no. 3, pp. 1050–1063, 2016.

[3] O. Wallscheid and J. Böcker, "Global Identification of a Low-Order Lumped-Parameter Thermal Network for Permanent Magnet Synchronous Motors," IEEE Transactions on Energy Conversion, vol. 31, no. 1, pp. 354–365, 2016.

[4] N. Bracikowski, M. Hecquet, P. Brochet, and S. V. Shirinskii, "Multiphysics Modeling of a Permanent Magnet Synchronous Machine by Using Lumped Models," IEEE Transactions on Industrial Electronics, vol. 59, no. 6, pp. 2426–2437, 2012.

[5] A. Boglietti, E. Carpaneto, M. Cossale, and S. Vaschetto, "Stator-Winding Thermal Models for Short-Time Thermal Transients: Definition and Validation," IEEE Transactions on Industrial Electronics, vol. 63, no. 5, pp. 2713–2721, 2016.

[6] A. P. Ramallo-González, M. E. Eames, and D. A. Coley, "Lumped parameter models for building thermal modelling: An analytic approach to simplifying complex multi-layered constructions," Energy and Buildings, vol. 60, pp. 174–184, 2013.

[7] A. Boglietti, A. Cavagnino, D. Staton, M. Shanel, M. Mueller, and C. Mejuto, "Evolution and Modern Approaches for Thermal Analysis of Electrical Machines," IEEE Transactions on Industrial Electronics, vol. 56, no. 3, pp. 871–882, 2009.

[8] D. D. Reigosa, D. Fernandez, H. Yoshida, T. Kato, and F. Briz, "Permanent-Magnet Temperature Estimation in PMSMs Using Pulsating High-Frequency Current Injection," IEEE Transactions on Industry Applications, vol. 51, no. 4, pp. 3159–3168, 2015.

[9] D. Reigosa, D. Fernandez, M. Martínez, J. M. Guerrero, A. B. Diez, and F. Briz, "Magnet Temperature Estimation in Permanent Magnet Synchronous Machines Using the High Frequency Inductance," IEEE Transactions on Industry Applications, vol. 55, no. 3, pp. 2750–2757, 2019.

[10] O. Wallscheid, A. Specht, and J. Böcker, "Observing the Permanent-Magnet Temperature of Synchronous Motors Based on Electrical Fundamental Wave Model Quantities," IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 3921–3929, 2017.

[11] O. Wallscheid, T. Huber, W. Peters, and J. Böcker, "A Critical Review of Techniques to Determine the Magnet Temperature of Permanent Magnet Synchronous Motors Under Real-Time Conditions," EPE Journal, vol. 26, pp. 1–10, 2016.

[12] E. Gedlu, O. Wallscheid, and J. Böcker, "Permanent Magnet Synchronous Machine Temperature Estimation using Low-Order Lumped-Parameter Thermal Network with Extended Iron Loss Model," TechRxiv e-prints: 11671401, 2020.

[13] F. Qi, M. Schenk, and R. W. De Doncker, "Discussing details of lumped parameter thermal modeling in electrical machines," in 7th IET International Conference on Power Electronics, Machines and Drives (PEMD 2014), 2014, pp. 1–6.

[14] D. Wöckinger, G. Bramerdorfer, S. Drexler, S. Vaschetto, A. Cavagnino, A. Tenconi, W. Amrhein, and F. Jeske, "Measurement-Based Optimization of Thermal Networks for Temperature Monitoring of Outer Rotor PM Machines," in 2020 IEEE Energy Conversion Congress and Exposition (ECCE), 2020, pp. 4261–4268.

[15] W. Kirchgässner, O. Wallscheid, and J. Böcker, "Estimating Electric Motor Temperatures with Deep Residual Machine Learning," IEEE Transactions on Power Electronics, vol. 36, no. 7, pp. 7480–7488, 2020.

[16] ——, "Data-Driven Permanent Magnet Temperature Estimation in Synchronous Motors with Supervised Machine Learning: A Benchmark," IEEE Transactions on Energy Conversion, 2021.

[17] J. Lee and J. Ha, "Temperature Estimation of PMSM Using a Difference-Estimating Feedforward Neural Network," IEEE Access, vol. 8, pp. 130855–130865, 2020.

[18] K. Zhang, A. Guliani, S. Ogrenci-Memik, G. Memik, K. Yoshii, R. Sankaran, and P. Beckman, "Machine Learning-Based Temperature Prediction for Runtime Thermal Management Across System Components," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 405–419, 2018.

[19] T. L. Bergman, F. P. Incropera, D. P. DeWitt, and A. S. Lavine, Fundamentals of Heat and Mass Transfer, 6th ed. John Wiley & Sons, 2007.

[20] A. Boglietti, A. Cavagnino, M. Parvis, and A. Vallan, "Evaluation of Radiation Thermal Resistances in Industrial Motors," IEEE Transactions on Industry Applications, vol. 42, pp. 688–693, 2006.

[21] D. A. Howey, P. R. N. Childs, and A. S. Holmes, "Air-Gap Convection in Rotating Electrical Machines," IEEE Transactions on Industrial Electronics, vol. 59, pp. 1367–1375, 2012.

[22] A. G. Baydin, B. A. Pearlmutter, A. Radul, and J. M. Siskind, "Automatic differentiation in machine learning: A survey," Journal of Machine Learning Research, vol. 18, pp. 1–43, 2018.

[23] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Data-Driven Discovery of Partial Differential Equations," Science Advances, vol. 3, no. 4, p. e1602614, 2017.

[24] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, "Neural Ordinary Differential Equations," in Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., 2018, pp. 6571–6583.

[25] M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations," Journal of Computational Physics, vol. 378, pp. 686–707, 2019.

[26] C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner, and A. Ramadhan, "Universal Differential Equations for Scientific Machine Learning," ArXiv e-prints: 2001.04385 [cs.LG], 2020.

[27] K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[28] I. Rivals and L. Personnaz, "Black-Box Modeling With State-Space Neural Networks," in Neural Adaptive Control Technology. World Scientific, 1996, pp. 237–264.

[29] J. M. Zamarreño and P. Vega, "State Space Neural Network. Properties and Application," Neural Networks, vol. 11, no. 6, pp. 1099–1112, 1998.

[30] H. Abbas and H. Werner, "Polytopic Quasi-LPV Models Based on Neural State-Space Models and Application to Air Charge Control of a SI Engine," IFAC Proceedings Volumes, vol. 41, no. 2, pp. 6466–6471, 2008.

[31] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the Exploding Gradient Problem," Proceedings of The 30th International Conference on Machine Learning, 2012.

[32] F. A. Gers and F. Cummins, "Learning to Forget: Continual Prediction with LSTM," in Ninth International Conference on Artificial Neural Networks, vol. 2, 1999, pp. 1–19.

[33] R. J. Williams and J. Peng, "An Efficient Gradient-Based Algorithm for On-line Training of Recurrent Network Trajectories," Neural Computation, vol. 2, no. 4, pp. 490–501, 1990.

[34] W. Kirchgässner, O. Wallscheid, and J. Böcker, "Empirical Evaluation of Exponentially Weighted Moving Averages for Simple Linear Thermal Modeling of Permanent Magnet Synchronous Machines," in Proceedings of the 28th International Symposium on Industrial Electronics, 2019, pp. 318–323.

[35] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and others, "TensorFlow: A System for Large-Scale Machine Learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[36] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A Next-generation Hyperparameter Optimization Framework," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[37] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for Hyper-Parameter Optimization," in 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), vol. 24. Neural Information Processing Systems Foundation, 2011.

[38] O. Wallscheid and J. Böcker, "Fusion of Direct and Indirect Temperature Estimation Techniques for Permanent Magnet Synchronous Motors," in 2017 IEEE International Electric Machines and Drives Conference (IEMDC), 2017, pp. 1–8.

[39] D. Gaona, O. Wallscheid, and J. Böcker, "Improved Fusion of Permanent Magnet Temperature Estimation Techniques for Synchronous Motors Using a Kalman Filter," IEEE Transactions on Industrial Electronics, vol. 67, no. 3, pp. 1708–1717, 2020.