


General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

You may not further distribute the material or use it for any profit-making activity or commercial gain.

You may freely distribute the URL identifying the publication in the public portal.

If you believe that this document breaches copyright, please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from orbit.dtu.dk on: Nov 30, 2021

Improved Process Diagnosis Using Fault Contribution Plots from Sparse Autoencoders

Hallgrimsson, Asgeir Daniel; Niemann, Hans Henrik; Lind, Morten

Published in: Proceedings of 21st IFAC World Congress

Publication date: 2021

Document Version: Peer reviewed version

Link back to DTU Orbit

Citation (APA): Hallgrimsson, A. D., Niemann, H. H., & Lind, M. (2021). Improved Process Diagnosis Using Fault Contribution Plots from Sparse Autoencoders. In Proceedings of 21st IFAC World Congress. International Federation of Automatic Control.


Improved Process Diagnosis Using Fault Contribution Plots from Sparse Autoencoders

Asgeir Daniel Hallgrímsson ∗ Hans Henrik Niemann ∗ Morten Lind ∗

∗ Automation and Control Group, Dept. of Electrical Engineering, Technical University of Denmark, Kgs. Lyngby, 2800, Denmark (e-mail: {asdah, hhn, mli}@elektro.dtu.dk).

Abstract: Development of model-based fault diagnosis methods is a challenge when industrial systems are large and exhibit complex process behavior. Latent projection (LP), a statistical method that extracts features of data via dimensionality reduction, is an alternative approach to diagnosis as it can be formulated to not rely on process knowledge. However, LP methods may perform poorly at identifying abnormal process variables due to a "fault smearing" effect: variables unaffected by a fault are unintentionally characterized as being abnormal. The effect occurs because data compression permits faulty and non-faulty variables to interact. This paper presents an autoencoder (AE), a nonlinear LP method based on neural networks, as a monitoring method for a simulated nonlinear triple tank process (TTP). Simulated process data was used to train the AE to generate a monitoring statistic representing the condition of the TTP. Sparsity was introduced in the AE to reduce variable interactivity. The AE's ability to detect a fault was demonstrated. The individual contributions of process variables to the AE's monitoring statistic were analyzed to reveal the process variables that were no longer consistent with normal operating conditions. The key result of this study was that sparsity reduced fault smearing onto unaffected variables and increased the contributions of actual faulty variables.

Keywords: Fault detection and isolation, machine learning, grey box modelling, learning for control, subspace methods.

1. INTRODUCTION

Effective online monitoring of process performance is integral for maintaining stable plant operation, maximizing production, and ensuring the survivability of industrial systems. In fact, abnormal events that disrupt plant performance can cause up to 8% annual loss in production profit (Bullemer et al. [2008]). Due to the increasing complexity of large-scale industrial processes, statistical methods - which can be formulated to not rely on process knowledge - are a practical alternative to more traditional and rigorous model-based fault detection methods. The relevance of statistical monitoring schemes is further supported by the current trend of industries to generate industrial big data thanks to the integration of additional sensors, computers, and other technological artifacts connected to every industrial process (Yan et al. [2017]).

This approach to quality control is known as statistical process control (SPC) (MacGregor et al. [1995]). An important component of SPC is diagnosis of a detected abnormal event and determination of its cause. Once an unintended plant upset is identified, it is typically up to the operators to decipher which statistical quality variables contain signal characteristics that help diagnose the problem. Unfortunately, industrial application of SPC-based event diagnosis is ineffective since the most common practice for monitoring the quality of a process is to observe traditional univariate control charts such as Shewhart, CUSUM, and EWMA (Brooks et al. [2014]). Their application inherently assumes that process variables are independent of one another, potentially making their use ineffective at diagnosing events that affect multiple process variables.

Multivariate quality control (MQC) methods - which produce quality variables that summarize the condition of several process variables - are a better alternative to univariate approaches for monitoring of multivariable processes (Peres et al. [2018]). Essentially, Hotelling's T² and Q statistics are paired with latent projection (LP) - dimensionality reduction methods, such as principal component analysis (PCA), that uncover the correlation structure of data - to detect out-of-control situations. A process is monitored by comparing current plant behavior with an LP model representing its "in-control" behavior. An abnormal event that changes the correlation between process variables is detected when the monitored deviation between the current process state and that predicted by the model exceeds a threshold.

Industrial applications of LP-based process monitoring tend to use linear methods, such as PCA, due to their ease of implementation. Unsurprisingly, linear methods result in high Type I and Type II error rates if the process is nonlinear (Hallgrímsson et al. [2019]; Yan et al. [2016]). Nonlinear extensions of PCA have emerged to uncover both linear and nonlinear correlations between variables. The focus of this paper is on autoencoders (AEs): a type of artificial neural network that learns salient, encoded representations via nonlinear transformations of an original data set. Dong and McAvoy [1996] show that AEs can discover principal curves, i.e., a one-dimensional curve whose shape provides a summary of the nonlinear structure of the complex data set it passes through. Kramer [1991] demonstrates significant improvement in nonlinear feature extraction by using a multi-layered AE as opposed to a single-layered AE, assuming that the dimension of the latent layers is consistent.

Recent advances in AE-based process monitoring have been made by including developments from other applications of neural networks. Yan et al. [2016] observed improved fault detectability of the Tennessee Eastman process over PCA-based process monitoring by using novel variants of AEs: denoising AEs, which reconstruct the uncorrupted version of corrupted input data, and contractive AEs, which penalize the sensitivity of hidden representations to small (noisy) perturbations around the input. Lee et al. [2019] used a variational autoencoder (VAE) to enforce the monitored data to follow a multivariate normal distribution in the latent space, facilitating appropriate use of Hotelling's T² monitoring charts for nonlinear and non-normal processes and resulting in a reduction of Type I and Type II error rates. Osmani et al. [2019] monitored the condition of a turbo-compressor using a recurrent neural network (RNN) that captured the temporal dependency of process variables, with the additional regularization constraint that activations in the reduced space follow a Bernoulli distribution. Cheng et al. [2019] combined VAEs and RNNs to produce a variational recurrent neural network for fault detection of the Tennessee Eastman process.

Contributions in the AE-based SPC literature tend to prioritize fault detection over fault isolation. Much of the subject matter focuses on reducing Type I and Type II error detection rates by: (a) increasing model sensitivity to faults; (b) obtaining more robust and complex monitoring statistics; and (c) reducing hampering effects from nominal process changes. Though AEs have been used as a pre-training step for fault-classification networks when labeled fault data is scarce (Sun et al. [2016]), few methods exist where fault isolation is performed exclusively with an AE. However, rudimentary diagnosis with PCA models can be carried out with the analysis of fault contribution plots (Joe Qin [2003]; Miller et al. [1998]). The plots indicate the contributions of process variables to an observed increase in the T² or Q statistic; variables showing large contributions are concluded to no longer follow nominal operating conditions. Operators can then apply process knowledge to determine an appropriate cause.

There are reports of fault contribution plots suffering from a property called "fault smearing": variables unaffected by the fault demonstrate a contribution and actual faulty variables are obscured (Westerhuis et al. [2000]). Smearing occurs because the compression of the input to a smaller number of latent variables and the subsequent expansion to the original space permit faulty and non-faulty variables to interact (Van den Kerkhof et al. [2013]).

Gao et al. [2016] imposed an elastic net constraint to obtain a sparse PCA model for the Tennessee Eastman process. The result was a reduction in interactivity between variables in the latent space. It subsequently led to the discovery of process knowledge, specifically the relationships among process variables.

The objective of this paper is to extend the analysis of fault contribution plots to AEs and investigate the effect reduced latent variable interactivity has on process variable contributions. Two AEs - a dense one and a sparse one - are generated to monitor a numerical simulation of a nonlinear triple tank process (TTP), a variant of the quadruple tank process (QTP) (Johansson [2000]). Their ability to detect a fault is demonstrated by inducing an abnormal bias in one of the TTP's sensors. Individual contributions of process variables to the AEs' monitoring statistics are then analyzed. The key result of this study is that sparsity reduced fault smearing onto non-faulty variables and increased the contributions of faulty variables.

This paper presents the mathematical model of the TTP in section 2. Section 3 describes how a sparse AE is obtained and subsequently used in process monitoring. The effectiveness of the sparse AE method at process monitoring and improved generation of fault contribution plots is presented in section 4.

2. THE TRIPLE TANK PROCESS

A schematic drawing of the TTP is given in Fig. 1. The upper tanks are supplied with liquid that is transported from a large sump by means of two gear pumps. Liquid flows from the upper left tank into the sump. The liquid from the upper right tank flows into the lower tank, which sequentially flows into the sump. The objective is to control the liquid levels in the upper left and lower tanks, which are monitored with two voltage-based level measurement devices. A level measurement device is also fixed to the upper right tank. A nonlinear numerical model of the TTP is derived by applying mass balances and Bernoulli's law to yield a set of differential equations that describe the evolution of the liquid level of each tank. They are:

    dh1/dt = −(a1/A1)√(2g h1) + (1/2)(k1/A1) v1(1 + η1)
    dh2/dt = −(a2/A2)√(2g h2) + (1/2)(k1/A2) v1(1 + η1) + (k2/A2) v2(1 + η2)
    dh3/dt = −(a3/A3)√(2g h3) + (a2/A3)√(2g h2)          (1)

where Ai is the cross-section of tank i and ai is the cross-section of its outlet hole. The liquid level of tank i is hi and g is the acceleration due to gravity. The voltage applied to pump i is vi and the corresponding flow is ki vi(1 + ηi), where ηi ∈ R is zero mean Gaussian noise emitted from pump i. The system is measured and actuated discretely with a sample time of Ts. The measured level signals at sample k are:

    y1[k] = kc h1[k] + w1[k]
    y2[k] = kc h2[k] + w2[k]
    y3[k] = kc h3[k] + w3[k]          (2)


where wi[k] ∈ R is zero mean measurement noise with Gaussian distribution for level signal i. For decentralized control, the error terms are:

    e1[k] = r1[k] − y1[k]
    e2[k] = r2[k] − y3[k]          (3)

where r1[k] and r2[k] are reference signals for level signals y1[k] and y3[k], respectively. The error terms are minimized by a discrete PI controller. The closed loop control laws for the process inputs are:

    K1: v1[k] = KP e1[k] + KI Σ_{i=1}^{k} e1[i] Ts
    K2: v2[k] = KP e2[k] + KI Σ_{i=1}^{k} e2[i] Ts          (4)

Here KP and KI denote the proportional and integral gains, respectively, of the PI controller. Monte Carlo simulations were performed on the TTP to generate data sets that exhibited nonlinear correlations between the process variables. The data sets were used to train, validate, and test an AE model that monitored the process. The uncertain parameters were the reference signals r1 and r2. Values for r1 and r2 were sampled from two independent uniform distributions. Process, controller, and noise parameters were based on the QTP from Johansson [2000] and are listed in Table 1.

Fig. 1. A schematic of the TTP showing the connectivity of the tanks and the location of the pumps, dual valves, and the level measurement devices. Included are the decentralized feedback loops.

Table 1. List of Parameters

    Process param.              Noise param.
    A1, A3   28 cm2             ηi ~ N(0, 0.1)
    A2       32 cm2             wi ~ N(0, 0.0005)
    a1, a3   0.071 cm2
    a2       0.057 cm2          Controller param.
    kc       1 V/cm             Ts   10
    k1       3.33 cm3/Vs        KP   20
    k2       3.35 cm3/Vs        KI   0.25
    g        981 cm/s2
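To make the data generation concrete, the following Python sketch simulates (1)-(4) under the parameters of Table 1. The forward-Euler step per sample, the initial levels, the interpretation of the noise parameters as variances, and the uniform reference ranges are assumptions, as the paper does not specify them.

```python
import numpy as np

# Parameters from Table 1.
A  = np.array([28.0, 32.0, 28.0])     # tank cross-sections A1, A2, A3 [cm^2]
a  = np.array([0.071, 0.057, 0.071])  # outlet cross-sections a1, a2, a3 [cm^2]
k1, k2, kc, g = 3.33, 3.35, 1.0, 981.0
KP, KI, Ts = 20.0, 0.25, 10.0

def simulate_ttp(r1, r2, rng):
    """Simulate (1)-(4); r1 and r2 are reference sequences of equal length."""
    h = np.full(3, 10.0)              # initial levels [cm] (assumed)
    i1 = i2 = 0.0                     # integrator states of K1, K2
    rows = []
    for k in range(len(r1)):
        w = rng.normal(0.0, np.sqrt(0.0005), 3)    # measurement noise wi, (2)
        eta = rng.normal(0.0, np.sqrt(0.1), 2)     # pump noise eta_i, (1)
        y = kc * h + w                             # level measurements, (2)
        e1, e2 = r1[k] - y[0], r2[k] - y[2]        # control errors, (3)
        i1 += e1 * Ts
        i2 += e2 * Ts
        v1, v2 = KP * e1 + KI * i1, KP * e2 + KI * i2   # PI laws, (4)
        q1, q2 = k1 * v1 * (1 + eta[0]), k2 * v2 * (1 + eta[1])
        # Mass balances (1) with Bernoulli outflow, forward-Euler discretized:
        dh = np.array([
            -a[0] / A[0] * np.sqrt(2 * g * h[0]) + 0.5 * q1 / A[0],
            -a[1] / A[1] * np.sqrt(2 * g * h[1]) + 0.5 * q1 / A[1] + q2 / A[1],
            -a[2] / A[2] * np.sqrt(2 * g * h[2]) + a[1] / A[2] * np.sqrt(2 * g * h[1]),
        ])
        h = np.maximum(h + Ts * dh, 0.0)           # levels cannot go negative
        rows.append([r1[k], r2[k], v1, v2, y[0], y[1], y[2]])
    return np.array(rows)             # columns: r1, r2, v1, v2, y1, y2, y3

# Example: random reference steps every 200 samples, as in section 4.2.
rng = np.random.default_rng(0)
steps = np.repeat(rng.uniform(5.0, 15.0, size=(10, 2)), 200, axis=0)
X_raw = simulate_ttp(steps[:, 0], steps[:, 1], rng)
```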

3. AUTOENCODERS

Process variables tend to be highly correlated with one another due to the presence of physical laws and control loops in process plants. Feature extraction can be performed on the original variable space to reveal the simplified structure that underlies it. An AE - an artificial neural network used for learning encoded representations of a set of data - is applicable when variables exhibit nonlinear correlations. Given an m × 1 vector of process variables x, the m × n reference data matrix consisting of n standardized observations is:


    X = [x[1] x[2] ⋯ x[n]]

      = ⎡ x1[1]  x1[2]  ⋯  x1[n] ⎤
        ⎢ x2[1]  x2[2]  ⋯  x2[n] ⎥
        ⎢   ⋮      ⋮    ⋱    ⋮   ⎥
        ⎣ xm[1]  xm[2]  ⋯  xm[n] ⎦  ∈ R^{m×n}          (5)
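As a minimal sketch of the standardization implied by (5), with rows as variables and columns as observations; section 4.3 standardizes the validation and fault sets with the training set's statistics, which the optional arguments support:

```python
def standardize(X, mu=None, sd=None):
    """Standardize each variable (row) of X as in (5); columns are samples.
    Pass the training mean/std to standardize validation and fault sets."""
    if mu is None:
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd, mu, sd
```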

An AE consists of two parts: an encoder and a decoder. The encoder transforms its input X into new, higher-level representative features Z ∈ R^{q×n}. The decoder then reconstructs the original data as X̂ ∈ R^{m×n} with a transformation of the features (Hinton [2006]). Modifiable interconnecting weights are introduced in the AE such that it learns in an unsupervised manner to minimize the difference between its input and its reconstruction.

The simplest form of an AE is a multilayered, feedforward, non-recurrent neural network. Nonlinear transformations occur at the layers of the network, allowing for processing of data that has inherent nonlinear properties. The encoder maps the input X ∈ R^{m×n} to the latent variables Z ∈ R^{q×n}:

    E_i = { σ^e_1(W^e_1 X + b^e_1),        for i = 1
          { σ^e_i(W^e_i E_{i−1} + b^e_i),  otherwise

    Z = σ^z(W^z E_N + b^z)          (6)

where i ∈ Z : i ∈ [1, N]. W^e_1 is the weight matrix between the input layer and the first encoder layer, W^e_i is the weight matrix between layers i − 1 and i, b^e_i is the bias at layer i, and σ^e_i is the component-wise activation function at layer i. W^z, b^z, and σ^z are defined similarly for the latent layer.

The decoder maps the latent variables Z ∈ R^{q×n} to the input reconstruction X̂ ∈ R^{m×n}:

    D_j = { σ^d_1(W^d_1 Z + b^d_1),        for j = 1
          { σ^d_j(W^d_j D_{j−1} + b^d_j),  otherwise

    X̂ = σ^x(W^x D_M + b^x)          (7)

where j ∈ Z : j ∈ [1, M]. W^d_1 is the weight matrix between the latent layer and the first decoder layer, W^d_j is the weight matrix between layers j − 1 and j, b^d_j is the bias at layer j, and σ^d_j is the component-wise activation function at layer j. W^x, b^x, and σ^x are defined similarly for the output layer. The modifiable parameters W^e_i, b^e_i, W^z, b^z, W^d_j, b^d_j, W^x, and b^x are optimized with respect to minimizing the following reconstruction loss function via stochastic gradient descent (Nielsen [2015]):

    L(X, X̂) = (1/n) ||X − X̂||²          (8)
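A minimal numpy sketch of the forward pass (6)-(7) and loss (8); the tanh latent activation and identity output activation anticipate the architecture in (14), and passing the layer parameters as lists of (W, b, sigma) tuples is an assumption of this sketch:

```python
import numpy as np

def encode(X, enc_layers, Wz, bz):
    """Encoder (6): enc_layers is a list of (W, b, sigma) tuples for E_1..E_N."""
    E = X
    for W, b, sigma in enc_layers:
        E = sigma(W @ E + b)
    return np.tanh(Wz @ E + bz)        # sigma_z = tanh, as used in (14)

def decode(Z, dec_layers, Wx, bx):
    """Decoder (7): dec_layers is a list of (W, b, sigma) tuples for D_1..D_M."""
    D = Z
    for W, b, sigma in dec_layers:
        D = sigma(W @ D + b)
    return Wx @ D + bx                 # sigma_x = identity, as used in (14)

def reconstruction_loss(X, Xhat):
    """Reconstruction loss (8): squared norm averaged over the n samples."""
    return np.sum((X - Xhat) ** 2) / X.shape[1]
```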


Fig. 2 illustrates a typical autoencoder that gradually condenses the input to the latent space and then gradually reconstructs it. The dimension q of the latent layer plays a significant role in the discovery of informative representations of the input. The traditional approach is to create a bottleneck by setting q < m, thereby forming an under-complete representation. In this case, the network pursues an effective compression that retains information about the input X. The compressed data, being sufficiently representative of the original data, allows for accurate reconstruction of the input data, albeit with a non-zero reconstruction error. An AE using linear nodal activation functions will uncover latent projections that correspond to the projection onto the subspace obtained from PCA of the covariance matrix of X (Baldi and Hornik [1989]). This occurs even if the network is composed of several layers of linear units. However, Bourlard and Kamp [1988] show that PCA-like projections can be obtained even if nonlinear functions are used, since it is possible for activations to remain in the linear regions of functions such as the sigmoid or hyperbolic tangent. This becomes unlikely if the AE is composed of several hidden layers with varying activation functions (Japkowicz et al. [2000]).

3.1 Invoking network sparsity

Further optimization constraints are introduced to obtain latent representations that generalize better and prevent over-fitting. One approach is to include the naïve elastic net weight decay penalty - a regularized regression method that linearly combines the L1 and L2 weight decay penalties of the LASSO and ridge methods (Zou and Hastie [2005]). The loss function in (8) becomes:

    L(X, X̂, W) = (1/n) ||X − X̂||² + λ1 ||W||_1 + λ2 ||W||²_2          (9)

where λ1 and λ2 control the importance of the LASSO and ridge penalties, respectively, and W is the collection of weight matrices in (6) and (7). The biases b in (6) and (7) are not included in the naïve elastic net penalty. Minimization of (9) yields an optimized AE consisting of shrunk weights that minimize its reconstruction loss. The individual contribution of each regularization term is: (a) L1 regularization shrinks weights at a constant rate towards zero, thereby establishing a small number of high-magnitude, i.e., high-importance, connections by driving redundant weights to zero; and (b) L2 regularization shrinks weights by an amount proportional to their magnitude, thus penalizing larger weights more than smaller weights. The net result is an interpretable grouping of correlated variables; L2 regularization opposes the tendency of L1 regularization to prioritize one variable from a group of correlated variables and ignore the others. Grouping of process variables is relevant for identification of control systems; Gao et al. [2016] demonstrate that a sparse principal component model can uncover the underlying process variable relations.

Fig. 2. Illustration of an under-complete AE. Labels for the encoder and decoder of the network are included. Biases are excluded from the illustration.
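A sketch of the penalized loss (9), assuming the weight matrices are collected in a list; the default coefficients are the λ1 = λ2 = 0.001 used for AE2 in section 4.3:

```python
import numpy as np

def elastic_net_loss(X, Xhat, weights, lam1=0.001, lam2=0.001):
    """Loss (9): reconstruction error plus the naive elastic net penalty.
    `weights` holds every weight matrix W in (6)-(7); biases are excluded."""
    recon = np.sum((X - Xhat) ** 2) / X.shape[1]
    l1 = sum(np.abs(W).sum() for W in weights)   # LASSO term
    l2 = sum((W ** 2).sum() for W in weights)    # ridge term
    return recon + lam1 * l1 + lam2 * l2
```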

Weight connections deemed redundant can be removed to clarify the interconnectivity of a neural network. Magnitude-based weight pruning is a technique that reduces the number of non-zero weight parameters to invoke network sparsity. Zhu and Gupta [2017] introduce a pruning algorithm that progressively trims away redundant weight connections. Weight connections are removed according to a pruning function that sets the current sparsity percentage, i.e., the ratio of the number of zero-magnitude weights to the total number of weights, of a network:

    s_t = s_f + (s_i − s_f)(1 − (t − t_0)/(nΔt))³,
    for t ∈ {t_0, t_0 + Δt, …, t_0 + nΔt}          (10)

The network is first trained for t_0 time steps to permit the weights to converge to an acceptable solution. Thereafter the initial sparsity of the network is set to s_i (usually zero). Weights are then pruned every Δt steps to gradually increase the network's sparsity while allowing it to recover from any pruning-induced loss in accuracy. The intuition behind the order of (10) is to rapidly prune the network in the beginning phase, when redundant connections are plentiful, before slowing down once fewer connections remain (Fig. 3). The algorithm operates continuously over n sparsity updates until the final sparsity value s_f is reached. Zhu and Gupta [2017] discovered that large-sparse models consistently outperformed small-dense models when the number of parameters was kept the same.

The pruning algorithm presented by Zhu and Gupta [2017] is extended in this paper. At every sparsity update s_t, each weight matrix W_i ∈ W is divided by the largest absolute value of W_i. This normalization step is done to prevent severe pruning of weight matrices whose largest absolute value is much smaller than that of the other matrices. The normalized matrices are then flattened and concatenated. The smallest weights are then masked to zero until the desired sparsity level s_t is reached. Furthermore, the pruning algorithm is stopped prematurely if the validation loss experiences a 5%-10% increase.
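The schedule (10) and the normalized magnitude pruning described above can be sketched as follows; the defaults are AE2's pruning parameters from section 4.3, and the mask-based formulation is an assumption about how the zeroing is realized:

```python
import numpy as np

def sparsity_target(t, s_i=0.7, s_f=0.9, t0=5000, dt=100, n=200):
    """Gradual pruning schedule (10); defaults are AE2's settings (section 4.3)."""
    frac = np.clip((t - t0) / (n * dt), 0.0, 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

def prune_masks(weights, s_t):
    """Normalized magnitude pruning, the paper's extension of Zhu and Gupta:
    scale each matrix by its largest absolute entry, then mask the globally
    smallest weights to zero until sparsity level s_t is reached."""
    normed = [np.abs(W) / np.abs(W).max() for W in weights]
    flat = np.sort(np.concatenate([v.ravel() for v in normed]))
    k = int(s_t * flat.size)               # number of weights to zero out
    if k == 0:
        return [np.ones_like(v, dtype=bool) for v in normed]
    threshold = flat[k - 1]
    return [v > threshold for v in normed]  # boolean keep-masks per matrix
```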

Fig. 3. Example sparsity function used for gradual pruning with s_f = 0.8, s_i = 0.0, t_0 = 3000, Δt = 100, n = 20.


Once pruning ends, the naïve elastic net weight penalty is removed from the training session. This is to relax the constraints on the AE and permit the remaining weights to maximize their capacity to reduce the loss function in (8) without the concern of any additional loss penalties.

3.2 Process Monitoring

Process monitoring consists of comparing current plant behaviour with that predicted by an "in-control" AE trained with historical data collected when the process exhibited nominal behaviour. New observations are propagated through the AE to generate the residuals e_new = x_new − x̂_new. The quality of new observations is assessed by computing the squared prediction error (SPE) (more formally known as the Q statistic) of the residuals of new observations (MacGregor et al. [1995]):

    SPE = Σ_{i=1}^{m} (x_new,i − x̂_new,i)²          (11)

An abnormal event that changes the correlation between process variables will cause the SPE to increase. Assuming that the SPE follows a chi-squared distribution, the control limit can be computed with the following approximate value (Box [1954]):

    CL_SPE = (σ²/2μ) χ²_{(2μ²/σ², α)}          (12)

where μ and σ, respectively, are the sample mean and sample standard deviation of the SPE and α is the false alarm rate. An abnormal event is deemed to have occurred if the SPE crosses the control limit. Abnormal process variables are isolated by analysing the contribution of each variable i to the SPE in (11) (Miller et al. [1998]):

    C_i = (x_new,i − x̂_new,i)²          (13)

Variables with large contributions are said to no longer be consistent with normal operating conditions. It is noted that analysis of (13) does not determine the underlying cause of a fault. Rather, it highlights the process variables containing signal characteristics of a fault. The results of (13) must be integrated with a qualitative model of the process that takes into account the causal nature of system components to decipher the actual cause.
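A sketch of the monitoring computations (11)-(13); the α = 0.01 false alarm rate is a placeholder, as the paper does not state the value used:

```python
from scipy import stats

def spe(X, Xhat):
    """SPE (11) for each observation (columns of X)."""
    return ((X - Xhat) ** 2).sum(axis=0)

def spe_control_limit(spe_nominal, alpha=0.01):
    """Control limit (12): Box's chi-squared approximation, with mu and sigma
    the sample mean and standard deviation of the nominal SPE and alpha the
    false alarm rate."""
    mu = spe_nominal.mean()
    var = spe_nominal.var()
    g, dof = var / (2 * mu), 2 * mu**2 / var
    return g * stats.chi2.ppf(1 - alpha, dof)

def contributions(x_new, xhat_new):
    """Per-variable contributions (13) to the SPE of a single observation."""
    return (x_new - xhat_new) ** 2
```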

4. RESULTS AND DISCUSSION

4.1 Derivation of influence rules

It was of practical interest to determine the influence of the reference variables r1 and r2 on the control and measurement variables; steady-state correlations uncovered by the AE can then be validated against what is implied by the data. Fig. 4 displays the time series plots obtained from inducing random step changes in a single reference variable while keeping the other constant. The plots demonstrate that: (a) a step change in r1 causes a lasting change in the steady-state values of y1, v1, and v2, while variables y2 and y3 experience a transient change that has no effect on their steady-state values; and (b) a step change in r2 has no influence on y1 and v1 yet generates a lasting change in the steady-state values of y2, y3, and v2. The correlation sets C1 = (r1, y1, v1, v2) and C2 = (r2, y2, y3, v2) are determined from the plots. They indicate which process variables observe a permanent change in their steady-state value caused by a change in a reference signal.

Fig. 4. Time series of simulated process variables where (a) r1 is changed whilst r2 is held fixed and (b) r2 is changed whilst r1 is held fixed. Red lines indicate the references for the measurements.

4.2 Data generation from TTP simulation

The TTP was simulated with random step changes in reference signals r1 and r2 occurring every 200 time steps. The training set Xt (consisting of 300,000 samples) and validation set Xv (consisting of 30,000 samples) were generated to train and validate, respectively, an AE. Fig. 5 displays the distribution of standardized samples of variables in Xt in the form of scatter and histogram plots. The scatter plots indicate the existence of nonlinear correlations between the variable pairs (r1, v1), (v1, y1), and (v2, y1). The histogram plots reveal that several variables do not follow the assumption of normality, v1 in particular.

The fault set Xf (consisting of 300 samples) was generated by simulating the TTP with a bias in sensor 1, introduced with the additive fault y1[k] = kc h1[k] + w1[k] + f with f = −0.01. The fault was introduced after 100 time steps. No reference changes occurred in r1 and r2. Fig. 6 presents time series plots of the first 200 samples of Xf and shows the fault's effect on the process variables. Deterministic results (in grey) from the same simulation case (obtained by setting ηi and wi in (1) and (2) to zero) are included to aid in interpretability. The plots demonstrate that: (a) the fault has no influence on r1 and r2; (b) the fault induces temporary changes in y1, y2, and y3 that have no influence on their steady-state values; and (c) the fault induces a permanent change in v1 and v2, which thus carry steady-state signatures that explain the presence of the fault.
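A sketch of the fault model; since the controllers react to the biased measurement, the bias must be applied inside the closed-loop simulation rather than to logged data after the fact. The function signature is hypothetical:

```python
def faulty_measurement(h, w, k, kc=1.0, f=-0.01, onset=100):
    """Fault model from section 4.2: additive bias on sensor 1 after `onset`,
    i.e., y1[k] = kc*h1[k] + w1[k] + f for k >= onset. Intended to replace the
    measurement step inside the closed-loop simulation."""
    y = kc * h + w                 # nominal measurements, (2)
    if k >= onset:
        y[0] += f                  # biased sensor 1
    return y
```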

4.3 AE model generation and testing

Two AEs, denoted AE1 and AE2, were trained with the training set Xt. Both networks were inherently the same, i.e., same number and dimension of layers, same number of latent variables, same initialization of the weights, and so on, except AE2 included the naïve elastic net weight penalty with λ1 = λ2 = 0.001 and was pruned. The pruning parameters in (10) of AE2 were s_f = 0.9, s_i = 0.7, t_0 = 5000, Δt = 100, n = 200, but early stopping resulted in a sparsity of 80.86%. Both AEs were trained for 12000 epochs using the Adam gradient-based optimization with a learning rate of 0.001 for stochastic gradient descent (Kingma and Ba [2014]). The matrices Xt, Xv, and Xf were standardized with the mean and standard deviation of Xt. The dimension of the latent layer in each model was set to q = 2. This was to see if the sparse AE2 would expose the correlation sets C1 and C2. The dimensions and activation function of each layer were specified as:

    Layer:        X    E1    E2    D1    D2    X̂
    Dimension:    7    9     9     9     9     7
    Activation:   -    tanh  tanh  tanh  tanh  I          (14)

where tanh is the hyperbolic tangent function and I is the identity function. The hyperbolic tangent transfer function was primarily used since the data is mean-centered. The design of the AE is essentially an expanded under-complete AE; setting the dimension of the encoder and decoder layers larger than the size of the input dimension allowed the AE models to generate complex, higher-dimensional features before information-retaining compression occurred (Olah [2014]). The hyperbolic tangent function was also implemented at the latent layer.
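A sketch of this architecture in Keras, matching the dimensions and activations of (14); the optimizer and learning rate follow section 4.3, while the elastic net penalty and pruning are omitted here. Note that the paper's m × n data matrices must be transposed to Keras's samples-by-features convention:

```python
import tensorflow as tf

def build_ae(m=7, q=2, hidden=9):
    """AE with the layer dimensions and activations of (14): 7-9-9-2-9-9-7,
    tanh everywhere except a linear output layer."""
    inputs = tf.keras.Input(shape=(m,))
    x = tf.keras.layers.Dense(hidden, activation="tanh")(inputs)  # E1
    x = tf.keras.layers.Dense(hidden, activation="tanh")(x)       # E2
    z = tf.keras.layers.Dense(q, activation="tanh")(x)            # latent Z
    x = tf.keras.layers.Dense(hidden, activation="tanh")(z)       # D1
    x = tf.keras.layers.Dense(hidden, activation="tanh")(x)       # D2
    outputs = tf.keras.layers.Dense(m, activation=None)(x)        # Xhat
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

# Usage: model = build_ae()
#        model.fit(Xt.T, Xt.T, epochs=12000, validation_data=(Xv.T, Xv.T))
```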

The training loss (TL) and validation loss (VL) from training AE1 and AE2 are plotted in Fig. 7. It can be seen that the TL and VL of AE2 exhibit a significant difference that recedes when pruning ends. This is because the TL includes the naïve elastic net weight penalty in (9), which is then removed once pruning stops. The figure shows that the VL of AE2 is similar to the VL of AE1 at the end of training. In fact, the VL of AE2 is only 7.2% larger despite AE2 having 80.86% fewer weight parameters than AE1.

Fig. 5. Scatter plot of standardized process variables, including a histogram along the diagonal.

Fig. 6. Influence of fault f1 at sample tf on (left) measurements and (right) control inputs. Red lines indicate the references for the measurements.

Fig. 8 portrays the connectivity between the network layers of AE2 and shows the propagation of the original variables x to the reconstructions x̂. It can be seen that the network has identified the correlations between process variables, thus eliminating potential fault smearing between uncorrelated variables. The activation of the second node in the latent layer is computed from the process variables of correlation set C1. In addition, the activation is solely responsible for the reconstruction of the same variables. The activation of the first latent node is determined by the variables of correlation set C2 with the exception of v2. However, the latent node's activation reconstructs all of the variables of C2. From this it follows that fault signatures contained in v2 cannot smear onto r2, y2, and y3. Although it provides a partial explanation for the loss in validation accuracy in comparison to AE1 (Fig. 7), AE2 has discovered a form of reconstruction redundancy: although v2 appears both in C1 and C2, it is sufficient to reconstruct it from a partial subset of process variables. It is noted that the interconnectivity of AE2 is heavily influenced by the chosen hyperparameters for the learning rate, regression coefficients λ1 and λ2, and pruning parameters; a different selection is bound to result in a different connectivity.

The contribution plots obtained from propagating Xf through AE1 and AE2 are displayed in Fig. 9. Plots from the deterministic equivalent of Xf are included to ease the analysis of the effect of network pruning on mean contributions. The fault is detected by both AEs as their SPEs cross their control limit at sample tf, i.e., the onset of the fault. Despite the model complexity of AE1 being greater than that of AE2, their SPEs are nearly identical over the fault set. This indicates that a more complex model is not necessarily more sensitive to faults.

Fig. 7. Training and validation losses during training. Right figure zooms in on epoch interval [8000, 12000] and includes final losses in its legend.


Fig. 8. Illustration of trained AE2, showing pruned weight connections (grey) and remaining connections (black). Biases have been excluded from the illustration.

Smearing onto the unaffected variables r2, y2, and y3 is less for AE2, indicated by a reduction in the variance (Fig. 9a) and mean (Fig. 9b) of their contributions. In fact, their mean contributions are zero once steady state is reached because the steady-state fault signatures retained in v1 and v2 cannot propagate to r2, y2, and y3 (Fig. 8). Even though smearing occurs onto the non-fault-carrying variables r1 and y1, invoking network sparsity guarantees that the steady-state signal characteristics of the faulty variables v1 and v2 stay within the variables of correlation set C1. In fact, network AE2 generates larger contributions for the fault-carrying variables v1 and v2 and reduces the contributions for the non-fault-carrying variables r1 and y1, indicating that network sparsity makes faulty variables more prominent.

It is reiterated that analysis of fault contribution plots does not determine the cause of a fault. Instead, the process variables containing steady-state fault signatures are inferred. An additional "causal reasoning" step must be performed that takes into consideration the causal nature of the monitored process, e.g., qualitative modeling of the relations between different components of a system, to determine the root cause of fault-contaminated process variables. The presented method makes qualitative diagnosis more effective, since the reduction of fault smearing ensures that more precise qualitative information is provided.

Fig. 9. Magnitudes of contributions to the SPE from AE1 (grey) and AE2 (black) via (a) stochastic simulations for Xf and (b) deterministic simulations for Xf. Dashed line in the SPE plot indicates the control limit obtained from Xv.


5. CONCLUSION

This study introduces the combined application of a sparsity constraint and a pruning strategy to produce a sparse AE for the purpose of diagnosing a sensor fault occurring in the TTP. The obtained AE led to the discovery of process knowledge, specifically the relationships among process variables. The solution demonstrated that a sparse AE, which inherently has fewer parameters than a fully connected AE, suffered little in its validation performance.

The results show that the proposed method improved the performance of fault contribution plots; process variables unaffected by the fault produced significantly smaller contributions due to a reduction of fault smearing. The results also demonstrated that variables that carried no fault signatures but were strongly correlated with the faulty variables observed reduced contributions. Finally, variables that contained fault signatures produced larger contributions, providing further fault isolation capabilities.

ACKNOWLEDGEMENTS

The authors would like to acknowledge the support of the Danish Hydrocarbon Research and Technology Center (DHRTC) at the Technical University of Denmark.

REFERENCES

P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.

H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291-294, 1988.

G.E. Box. Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25(2):290-302, 1954.

E.B. Brooks et al. On-the-fly massively multitemporal change detection using statistical quality control charts and Landsat data. IEEE Transactions on Geoscience and Remote Sensing, 52(6):3316-3332, 2014.

P. Bullemer et al. ASM consortium guidelines - effective operator display design. Houston, Honeywell International Inc./ASM Consortium, 2008.

F. Cheng, Q.P. He, and J. Zhao. A novel process monitoring approach based on variational recurrent autoencoder. Computers & Chemical Engineering, 129:106515, 2019.

D. Dong and T.J. McAvoy. Nonlinear principal component analysis - based on principal curves and neural networks. Computers & Chemical Engineering, 20(1):65-78, 1996.

H. Gao et al. Process knowledge discovery using sparse principal component analysis. Industrial & Engineering Chemistry Research, 55(46):12046-12059, 2016.

A.D. Hallgrímsson, H.H. Niemann, and M. Lind. Autoencoder based residual generation for fault detection of quadruple tank system. IEEE Conference on Control Technology and Applications (CCTA), p. 994-999, 2019.

G.E. Hinton. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

N. Japkowicz, S.J. Hanson, and M.A. Gluck. Nonlinear autoassociation is not equivalent to PCA. Neural Computation, 12(3):531-545, 2000.

S. Joe Qin. Statistical process monitoring: basics and beyond. Journal of Chemometrics, 17(8-9):480-502, 2003.

K.H. Johansson. The quadruple tank process: A multivariable laboratory process with an adjustable zero. IEEE Transactions on Control Systems Technology, 8(3):456-465, 2000.

D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

M.A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233-243, 1991.

S. Lee et al. Process monitoring using variational autoencoder for high-dimensional nonlinear processes. Engineering Applications of Artificial Intelligence, 83:13-27, 2019.

J.F. MacGregor and T. Kourti. Statistical process control of multivariate processes. Control Engineering Practice, 3(3):403-414, 1995.

P. Miller, R.E. Swanson, and C.E. Heckler. Contribution plots: a missing link in multivariate quality control. Applied Mathematics and Computer Science, 8(4):775-792, 1998.

M.A. Nielsen. Neural networks and deep learning. Determination Press, 2015.

C. Olah. Neural networks, manifolds, and topology. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, 2014.

A. Osmani, M. Hamidi, and S. Bouhouche. Monitoring of a dynamic system based on autoencoders. Proceedings of the 28th International Joint Conference on Artificial Intelligence, AAAI Press, p. 1836-1843, 2019.

F.A.P. Peres and F.S. Fogliatto. Variable selection methods in multivariate statistical process control: a systematic literature review. Computers & Industrial Engineering, 115:603-619, 2018.

W. Sun et al. A sparse auto-encoder-based deep neural network approach for induction motor faults classification. Measurement, 89:171-178, 2016.

P. Van den Kerkhof et al. Contribution plots for statistical process control: Analysis of the smearing-out effect. In 2013 European Control Conference (ECC), IEEE, p. 428-433, 2013.

J.A. Westerhuis, S.P. Gurden, and A.K. Smilde. Generalized contribution plots in multivariate statistical process monitoring. Chemometrics and Intelligent Laboratory Systems, 51(1):95-114, 2000.

J. Yan et al. Industrial big data in an industry 4.0 environment: Challenges, schemes, and applications for predictive maintenance. IEEE Access, 5:23484-23491, 2017.

W. Yan, P. Guo, and Z. Li. Nonlinear and robust statistical process monitoring based on variant autoencoders. Chemometrics and Intelligent Laboratory Systems, 158:31-40, 2016.

M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.