“It is not what you want, it's what you need that matters”
Anonymous
Abstract
Tropical cyclone wind-intensity and path prediction are challenging tasks, considering the drastic changes in climate patterns over the last few decades. Cyclones cause extensive damage to everything in their path; however, the destruction caused by this natural calamity could be reduced immensely with accurate and timely forecasts of cyclone track and intensity. The unpredictable nature of cyclones is difficult for statistical prediction models to learn, making efficient and timely predictions hard to achieve. Cyclones have been studied extensively, and statistical models have been used to make predictions. Time series prediction relies on past data points to make robust predictions. Recurrent neural networks are well suited to time series prediction due to their architectural properties for modeling temporal sequences. Coevolutionary recurrent neural networks have recently given very promising performance for time series prediction. This study applies the aforementioned methods to tropical cyclone wind-intensity and path prediction. The study begins with the prediction of the wind intensity of cyclones that took place in the South Pacific Ocean over the past few decades. The timespan is defined as the number of data points necessary to begin prediction with a neural architecture. To improve the prediction performance of the models, an empirical study on the minimal timespan is required. Cyclone track prediction is a two-dimensional (multivariate) time series prediction problem involving the latitudes and longitudes that define the position of a cyclone. An architecture for encoding the two-dimensional time series problem into an Elman recurrent neural network composed of a single input neuron is proposed in this thesis for cyclone path prediction. Transfer learning incorporates knowledge from a related source dataset to complement a target dataset. The additional knowledge aids learning, especially in cases where there is a lack of target data. Stacking is a form of ensemble learning focused on improving generalization performance. It has been used for transfer learning problems, an approach referred to as transfer stacking. The final contribution of the thesis uses transfer stacking as a means of studying the effects of cyclones, whereby the contribution of cyclone data from different geographic locations towards improving generalization performance was evaluated.
Acknowledgements
Foremost, I thank God for knowledge, strength and protection.
I would like to express my sincere gratitude to my advisor Dr. Rohitash Chandra
who was my supervisor initially and later became my external supervisor. He
gave me motivation and confidence that helped me in completing this thesis.
My sincere thanks to Dr. Anurag Anand Sharma, who guided me in completing this thesis during the final stage of the masters program.
My parents, Bram Deo and Antala Devi, my brothers, Ravnil Deo and Rajneel Deo, and my sisters-in-law, Karishma Devi and Marshlin Lata, have constantly been there for me, motivating, advising and encouraging me throughout my studies.
Finally, all my friends and family members who have not been mentioned, deserve
my wholehearted acknowledgement.
Contents
Abstract i
Acknowledgements ii
List of Figures v
List of Tables vi
1 Introduction 1
1.1 Premises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background and Literature Review 5
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Feedforward Neural Networks . . . . . . . . . . . . . . . . 6
2.1.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . 7
2.1.3 Elman Neural Networks . . . . . . . . . . . . . . . . . . . 7
2.2 Training in Neural Networks . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Backpropagation Based Learning . . . . . . . . . . . . . . 9
2.2.2 Backpropagation-Through-Time Based Learning . . . . . . 10
2.2.3 Coevolutionary Neuro-Evolution based Learning . . . . . . 11
2.3 Evolutionary Computation . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Generalized Generation Gap Parent Centric Crossover . . 16
2.3.3 Cooperative Coevolution . . . . . . . . . . . . . . . . . . 17
2.3.4 Problem Decompositions . . . . . . . . . . . . . . . . . . . 19
2.3.5 Cooperative Fitness Evaluations . . . . . . . . . . . . . . . 20
2.4 Ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Time Series Prediction . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 Weather Prediction . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Tropical Cyclones . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7.1 Conventional Cyclone Forecasting . . . . . . . . . . . . . 24
2.7.2 Neural Network for Cyclones . . . . . . . . . . . . . . . . 25
2.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Identification of Minimal Timespan Problem for Cyclone Wind-Intensity Prediction 28
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Problem Definition and Methodology . . . . . . . . . . . . . . . . 29
3.2.1 Problem Definition: Minimal Timespan Prediction Problem 29
3.2.2 Methodology: Recurrent Networks for Prediction . . . . . 31
3.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 Data Preprocessing and Reconstruction . . . . . . . . . . 33
3.3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . 33
3.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Cyclone Track Prediction Using Coevolutionary Recurrent Neural Networks 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 RNN Architecture for Cyclone Tracks . . . . . . . . . . . . . . . 40
4.3 Simulation and Analysis . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Data Preprocessing and Reconstruction . . . . . . . . . . 41
4.3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . 44
4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5 Stacked Transfer Learning for Tropical Cyclone Intensity Prediction 50
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Neural networks for time series prediction . . . . . . . . . 50
5.2.2 Stacked transfer learning . . . . . . . . . . . . . . . . . . . 51
5.2.3 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Conclusions and Future Work 58
6.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . 59
Appendix 60
Bibliography 61
List of Figures
2.1 Feed forward neural network [1] . . . . . . . . . . . . . . . . . . . 6
2.2 Elman style RNN [1] . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Neuron level problem decomposition for recurrent neural networks 12
2.4 Transfer learning for one source and one target data . . . . . . . . 22
3.1 Cyclone wind intensity time series showing the relation between training and testing timespan. Training data uses embedding dimension 5 while the testing dataset uses embedding dimension 3 . . . . . 29
3.2 Elman recurrent neural network trained with timespan W and tested with X, Y, Z for identifying the minimal timespan. . . . . 31
3.3 Elman recurrent neural network used for tropical cyclone wind intensity prediction. Time series data is preprocessed and embedded using Takens' theorem and fed into the Elman RNN. . . . . 32
3.4 Unfolded view of RNN. . . . . 34
3.5 Performance of CNE and BPTT in wind intensity prediction on the testing dataset (2006–2013) for tropical cyclones in the South Pacific. . . . . 35
3.6 Performance of CNE for a single experimental run . . . . . 38
4.1 Elman RNN used for prediction of cyclone latitude and longitude. Two input and output neurons are used for mapping the longitude and latitude [2]. . . . . 41
4.2 Proposed RNN architecture: a single input and output neuron Elman recurrent neural network (SIORNN) used for predicting latitude and longitude of the cyclone path. . . . . 42
4.3 Embedded data reconstructed using Takens' theorem. An embedding dimension (D) of 4 is depicted. Both (a) and (b) have the two dimensions longitude and latitude. . . . . 43
4.4 Tropical cyclone track data in the South Pacific from 1985 to 2013. (Generated using Gnuplot) . . . . . 44
4.5 Performance of SIORNN using cooperative coevolution for 6 random cyclones from the year 2006 to 2013. . . . . 45
4.6 Typical prediction performance of a single experiment (one-step-ahead prediction) given by BPTT-SIORNN for the cyclone track test dataset (2006–2013 tropical cyclones) where time is taken at six hour intervals. . . . . 46
5.1 Neural Network Ensemble Model . . . . . . . . . . . . . . . . . . 51
List of Tables
3.1 Best Performance of cooperative coevolution . . . . . . . . . . . . 36
4.1 Generalization performance of training models on cyclone track prediction for Configuration A . . . . . 45
4.2 Generalization performance of training models on cyclone track prediction for Configuration B . . . . . 47
5.1 Generalization performance . . . . . . . . . . . . . . . . . . . . . 54
5.2 Experiment 3: Performance of FNN on different categories of cyclones 54
5.3 Experiment 5: Performance of Vanilla FNN on independent decade training data . . . . . 55
To my parents, Bram Deo and Antala Devi.
Chapter 1
Introduction
1.1 Premises
Neural networks are nature-inspired computational methods that try to model biological neural systems [3]. Neural networks are characterized into feedforward and recurrent architectures, providing a mapping from inputs to outputs. In contrast to feedforward networks, recurrent neural networks are dynamical systems whose next state and output(s) depend on the present network state and input(s). Recurrent neural networks (RNNs), due to their architecture, are well-suited for modeling temporal sequences [1].
Gradient descent has been customarily used for training neural networks. Feed-
forward networks use backpropagation, whilst a slight variation of the algorithm, backpropagation-through-time [4], is used for recurrent networks. Neuro-evolution
has also been used to train neural networks [5]. Neural network methods have been shown to be robust for time series problems [6]. Amongst popular computational intelligence methods, evolutionary neural networks have shown good potential for time series prediction [5, 7].
Transfer learning utilizes knowledge learned previously from related problems in learning models, in order to achieve faster training or better generalization performance [8]. Transfer learning incorporates knowledge from a related problem (also known as the source) to complement a target problem, especially in cases where there is a lack of data or a need to speed up the learning process. The approach has seen widespread application, with challenges regarding the type of knowledge that should be transferred in order to avoid negative transfers, whereby the transferred knowledge deteriorates performance [9–11]. Transfer learning has recently been used for visual tracking and computer-aided detection [12, 13].
Transfer learning has been implemented with ensemble learning methods such as boosting and stacking [9]. In the case of transfer stacking, multiple learners previously trained on the source data are combined with a single base learner. Simple stacking, on the other hand, uses multiple base learners [9].
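As a rough sketch of the transfer-stacking idea (the data, the polynomial base learners and the linear meta-learner below are all invented for illustration): base learners are fit on a source dataset, and a meta-learner then learns how to combine their predictions on the target dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "source" and "target" series sharing the same underlying relation
x_src, x_tgt = rng.uniform(0, 1, 200), rng.uniform(0, 1, 50)
y_src, y_tgt = 2 * x_src + 0.1, 2 * x_tgt + 0.1

# Base learners trained only on the source data (polynomial fits as stand-ins)
bases = [np.poly1d(np.polyfit(x_src, y_src, deg=d)) for d in (1, 2)]

# Meta-learner: least-squares combination of base predictions on target data
def features(x):
    return np.column_stack([b(x) for b in bases] + [np.ones_like(x)])

coef, *_ = np.linalg.lstsq(features(x_tgt), y_tgt, rcond=None)

def predict(x):
    return features(x) @ coef
```

The base learners never see the target data; only the small set of meta-learner coefficients is fit on it, which is what makes the scheme attractive when target data is scarce.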
Tropical cyclones have attracted much attention due to their destructive nature
[14]. Statistical models have been previously used to forecast the movement and
intensity of the cyclones. There has been a growing interest in computational
intelligence techniques for cyclone prediction systems [15, 16]. Cyclone path and
wind intensity prediction is a multidimensional time series problem.
The forecast of the intensity and track of tropical cyclones is considered extremely important for avoiding casualties and mitigating damage to property
[17, 18]. A comprehensive review of tropical cyclone track forecasting techniques
can be found in [19]. Computational intelligence methods have established them-
selves as complementary approaches for tropical cyclone tracking and prediction
[20–22]. Amongst these methods, neural networks trained using an evolutionary
learning paradigm have shown great promise [2, 5, 7, 23].
In the past, cyclone wind-intensity [24] and track prediction [2] have been tackled
by cooperative neuro-evolution of recurrent neural networks. Track prediction
was tackled as a two-dimensional time series problem involving the latitude and longitude of the cyclone tracks [2]. The results have been promising for cyclones in the South Pacific region; however, there is room for further improvement in the accuracy of the predictions. Cyclone intensity and track predictions need to be made as soon as possible once a cyclone is identified. It is important to identify the shortest duration after which a prediction model can begin time series prediction; that is, if the cyclone data is recorded every 6 hours, an important issue is the minimal timespan required to make a prediction. The timespan is a windowed snapshot, taken at regular intervals, of the observation period of a time series [25]. The minimal timespan is an important factor when it comes to predicting the nature of cyclones in terms of track and wind intensity. Robust predictions can be vital in reducing the impact of cyclone calamities through efficient planning and management.
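The windowed-snapshot idea can be made concrete with a sliding-window (Takens-style) reconstruction; a minimal sketch with an invented window size, not the thesis preprocessing code:

```python
import numpy as np

def embed(series, window):
    """Reconstruct a scalar series into (X, y) pairs: each row of X holds
    `window` consecutive observations and y is the next value."""
    series = np.asarray(series, float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y
```

With records every 6 hours, a window of 4 corresponds to one day of observations; the minimal-timespan question asks how small this window can be while predictions remain accurate.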
1.1.1 Motivations
Elman recurrent neural networks and feedforward neural networks have been used effectively for time series prediction. More recently, the Elman recurrent neural network has shown good performance in predicting tropical cyclone wind intensity and tracks for South Pacific tropical cyclones. Due to the catastrophic nature of cyclones, predictions are very time sensitive. Thus, the minimum timespan needed to start making accurate predictions on a cyclone needs to be identified.
Cyclone track data is modeled as a two-dimensional time series problem due to the presence of two data streams: latitude and longitude. However, the track of a cyclone is essentially a single entity, which suggests modeling it as a single dimension. Therefore, an encoding scheme is proposed that allows the two-dimensional data to be modeled as a single data stream.
In order to develop robust prediction models, one needs to consider the spatial and temporal characteristics of cyclones. Transfer learning can be used as a strategy to evaluate the relationship between cyclones from different geographic regions. Note that a transfer is considered negative when the source knowledge utilized with the target data contributes to poor generalization. This can be helpful in evaluating whether cyclones from a particular region are useful as source knowledge for decision making by models in other geographic locations. Moreover, it is also important to evaluate the effect of the duration of a cyclone on the generalization performance given by the model.
1.2 Research Goals
The main research goal of this thesis is to use neural network methodologies to
predict cyclone path and wind intensity.
This thesis is based on the following research objectives:
1. Identify the minimal timespan required to start making accurate and mean-
ingful predictions with applications to tropical cyclone wind intensity fore-
casting.
2. Propose a novel architecture for encoding the two-dimensional time series path prediction problem into an Elman recurrent neural network.
3. Evaluate the performance of the standard neural network when trained with
different regions of cyclone data.
4. Evaluate the effects of the duration of cyclones and their contribution towards the neural network's generalization performance.
5. Use transfer learning via a stacked ensemble, considering one region as the source and another as the target, to study the dependency between the datasets.
1.3 Thesis Outline
The outline of the thesis is as follows:
• Chapter 1 outlines the motivations, research goals and methodology of the thesis.
• Chapter 2 discusses the general background on neural networks, training algorithms, evolutionary algorithms, ensemble learning, transfer learning, and tropical cyclone intensity and path forecasting.
• Chapter 3 identifies the minimal timespan required for making predictions of tropical cyclone wind intensity.
• Chapter 4 proposes a framework for encoding two-dimensional cyclone track data into Elman-style recurrent neural networks for efficient track forecasting.
• Chapter 5 uses a stacked ensemble of feedforward neural networks with transfer learning to predict cyclone wind intensity.
• Chapter 6 concludes the thesis with an overview of the results and analysis from Chapters 3–5 and discusses future research.
Chapter 2
Background and Literature Review
This chapter describes the general background on neural networks, training algorithms for neural networks, evolutionary algorithms, ensemble learning and transfer learning. The chapter also covers tropical cyclone intensity and path prediction.
2.1 Neural Networks
Neural networks (NNs) are nature-inspired computational methods that try to model biological neural systems [3]. A neural network comprises a group of units referred to as neurons that are logically interconnected to form a network. A neuron is a single processing unit that computes the weighted sum of its inputs. The interconnections between neurons are called synapses, and each has a weight associated with it.
These neural networks are used for modeling the relationships between input and
output values in the data [26]. They learn by training on data using algorithms
that update the weights of the synapses in order to achieve the learning objective.
The knowledge gained in training is held in a distributed set of weights. Networks learn under one of three paradigms: supervised, unsupervised and reinforcement learning. Supervised learning uses direct comparisons between desired and actual outputs [27] and is formulated using a sum-squared error. Unsupervised learning uses the correlations between inputs during training, as there is no information on the actual output [27]. Reinforcement learning is a special form of supervised learning wherein the exact expected output is not known [27]; learning is instead based on the correctness of the actual output. Neural networks have many applications, including but not restricted to pattern recognition [28, 29], control problems [30, 31] and time series prediction [32–34].
The two most commonly used architectures of neural networks are feedforward
and recurrent neural networks [35]. The neurons in a network are further grouped into three layers: input, hidden and output. The number of neurons per layer and the number of hidden layers are variable.
Figure 2.1: Feed forward neural network [1]
2.1.1 Feedforward Neural Networks
Feedforward networks are static structures that consist of an input layer, a number of hidden layers and an output layer [36]. The number of neurons present
in each layer is dependent on the type of data being modeled. Additional hidden
layers are also added for different problem domains. However, once the network is
initialized, the size of the network is left constant throughout its use. Figure 2.1
shows a feedforward neural network with the three layer architecture. The con-
nections between the different layers of the network help propagate the weighted
sum of the activations from each neuron, which later goes through a transfer function. The dynamics of a feedforward network are described in Equation 2.1, where the total net input activation value y_i of neuron i is given for N input connections.
y_i = \sum_{j=1}^{N} w_{ij} x_j + \delta_i \qquad (2.1)
where w_{ij} is the weight from the input signal x_j to neuron i and \delta_i represents the bias. The input activation y_i is regularized by the transfer function f(y_i) that computes the final output of the unit. The sigmoid transfer function is given in Equation 2.2.
f(y_i) = \frac{1}{1 + e^{-y_i}} \qquad (2.2)
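Equations 2.1 and 2.2 together define a single neuron's forward computation; a minimal sketch for illustration (not the thesis implementation):

```python
import numpy as np

def neuron_output(x, w, bias):
    """Weighted-sum activation (Eq. 2.1) passed through the sigmoid
    transfer function (Eq. 2.2)."""
    y = np.dot(w, x) + bias          # y_i = sum_j w_ij * x_j + delta_i
    return 1.0 / (1.0 + np.exp(-y))  # f(y_i) = 1 / (1 + e^(-y_i))
```

For zero weights and bias the net input is 0, so the sigmoid returns 0.5, the midpoint of its (0, 1) range.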
2.1.2 Recurrent Neural Networks
Recurrent Neural Networks (RNN) are dynamical systems as opposed to feedfor-
ward networks. RNNs have multiple intermediate states whereby the next state
and output depend on the current state and inputs. This makes them highly
successful in time series prediction, pattern classification, language learning and
control [32–34] as they are highly capable of modeling dynamical systems. Re-
current neural networks are further subdivided according to their various architectures [37]. The basic architectures of recurrent neural networks include first-order
recurrent networks, second-order recurrent networks, NARX recurrent networks,
long short term memory networks and reservoir computing.
First-order recurrent neural network was proposed by Elman and Zipser [1] and
was referred to as the Elman recurrent neural network. This thesis uses this RNN
architecture; therefore, it is discussed in detail in the next subsection. Second-order recurrent neural networks [38] have been seen to perform better than first-order networks for modeling finite-state behavior. The limitation of second-order recurrent neural networks is that they require a lot of computational resources for training. Their
architecture is such that there are a higher number of weight connections per
hidden neuron when compared to first-order networks [39]. NARX recurrent
neural networks are based on non-linear autoregressive models with exogenous
inputs [40]. They have limited feedback that only comes from the output layer.
NARX networks are as powerful as fully connected recurrent networks due to
their information retention capabilities [41]. Long short term memory networks
(LSTM) [42] were specifically designed to effectively learn long-term dependencies
in data. Reservoir computing [43] creates a recurrent neural network referred to as a reservoir. The reservoir is randomly created and the internal weights of the reservoir remain unchanged during training. The weights from the reservoir to the output neurons are updated during learning. Common approaches to reservoir computing are Liquid State Machines (LSM) [44] and Echo State Networks (ESN) [43, 45].
2.1.3 Elman Neural Networks
First-order recurrent neural networks have context neurons in addition to having
the input, hidden and output neurons. The context neurons are connected to the
Figure 2.2: Elman style RNN [1]
hidden neurons. They get input from the hidden layer neurons and feed it back to the same hidden neurons in the next time step. The recurrent neural network is a dynamic structure that grows as it unfolds in time.
Figure 2.2 shows an Elman recurrent network in its initial unfolded state. The computational capabilities of Elman recurrent neural networks have been studied in [46].
The study showed that these types of networks are able to represent any finite-state machine due to their dynamical properties. Note that the basic components
of an observed dynamical system are clearly represented in an Elman network:
the input stands for the control of the system, the contextual hidden layer stands
for the state of the system and the outputs stand for the measurement [47]. The
network is able to develop representations of unobservable states of a dynamical
system in the hidden layer through learning.
The change of the hidden state neurons’ activation in Elman style recurrent net-
works [1] is given by Equation (2.3).
y_i(t) = f\left( \sum_{k=1}^{K} v_{ik}\, y_k(t-1) + \sum_{j=1}^{J} w_{ij}\, x_j(t-1) \right) \qquad (2.3)
where y_k(t) and x_j(t) represent the output of the context state neurons and input neurons at time step t, v_{ik} and w_{ij} represent their corresponding weights, and f(·) is the sigmoid transfer function.
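Equation (2.3) can be read as one state update per time step; a small illustrative sketch (variable names are my own, not from the thesis):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x_prev, h_prev, W_in, V_rec):
    """One hidden-state update following Eq. (2.3): the new activation mixes
    the recurrent (context) contribution V_rec @ h_prev with the input
    contribution W_in @ x_prev, then applies the sigmoid."""
    return sigmoid(V_rec @ h_prev + W_in @ x_prev)

def run_sequence(xs, h0, W_in, V_rec):
    """Unfolding in time is just repeated application over a sequence."""
    h = h0
    for x in xs:
        h = elman_step(x, h, W_in, V_rec)
    return h
```

The context neurons appear here as `h_prev`: the hidden activations of the previous time step fed back into the hidden layer.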
2.2 Training in Neural Networks
Neural networks are mathematically represented by their weight vectors.
These representations form objective functions that can then be optimized by
learning algorithms. Gradient descent is by far the most widely used approach
for training neural networks. Evolutionary algorithms have also been successfully
used for training recurrent neural networks [5, 48]. Supervised learning is used by both gradient descent and evolutionary algorithms, whereby the data used in training the recurrent neural network contains both the input and the actual output. The performance of the two methods is measured using the root mean squared error (RMSE) and the mean absolute error (MAE), as given in Equations 2.4 and 2.5, respectively.
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \qquad (2.4)

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \qquad (2.5)
where y_i and \hat{y}_i are the observed and predicted data, respectively, and N is the length of the observed data.
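The two error measures translate directly into code; a straightforward sketch:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error, Eq. (2.4)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error, Eq. (2.5)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two are often reported together.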
2.2.1 Backpropagation Based Learning
The backpropagation (BP) algorithm uses gradient descent for training feedforward networks [36]. A delta-learning-based approach is used, whereby the main principle is to search the hypothesis space of weight vectors and find the weights that best fit the training dataset. The goal is to model the data as well as possible by updating the weights. BP works in two passes: a forward pass and a backward pass. The input is presented to the network at the input neurons, which then propagate the activations through the hidden layers all the way to the output layer.
The output from the network is compared with the actual output, and a cost function is used to calculate the network error. Equation 2.6 gives the sum-squared-error (SSE) cost function.
SSE = \frac{1}{2} \sum_{k=1}^{n} (y_k - t_k)^2 \qquad (2.6)
where n is the number of neurons in the output layer, y_k is the actual output and t_k is the desired or target output of the respective neurons in the output layer.
The backward pass uses gradient descent, where the gradient ∂SSE/∂w of each weight w in the network is computed by propagating the error backwards through the network. The two passes are completed for all the data points in the training
data. A complete cycle through the training data is called an epoch. The network
is trained through multiple epochs.
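As a concrete illustration of the cost and a single gradient-descent update, consider one linear output unit (a deliberately reduced example; the full backward pass propagates such gradients through every layer):

```python
import numpy as np

def sse(y, t):
    """Sum-squared-error cost of Eq. (2.6)."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return 0.5 * float(np.sum((y - t) ** 2))

def delta_update(w, x, t, lr=0.1):
    """One gradient-descent step for a single linear unit y = w.x:
    dSSE/dw = (y - t) * x, so w moves against that gradient."""
    y = float(np.dot(w, x))
    return w - lr * (y - t) * np.asarray(x, float)
```

Repeating this update over every training sample constitutes one epoch.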
2.2.2 Backpropagation-Through-Time Based Learning
Backpropagation-through-time (BPTT) [4] and real-time recurrent learning [4]
are the two basic gradient descent based algorithms used in training recurrent
neural networks. BPTT unfolds a recurrent neural network in time into a deep multilayer feedforward network and employs error backpropagation for the weight updates. When unfolded in time, the network has the same behavior
as a recurrent neural network for a finite number of time steps. Algorithm 1 shows the BPTT algorithm, which is used to train the Elman recurrent neural network.
The algorithm starts by initializing the weight vectors of the recurrent neural
network with random numbers. Since supervised learning is used, in every epoch all data samples (sets of inputs and corresponding outputs) are fed into the network. The output recorded from the transformation of the input data in each epoch forms the forward pass. The network output is compared with the actual
output from the dataset. The distance between the two outputs is calculated and
is referred to as the prediction error through which the error gradient is calculated
and used to update the weights of the synapses as errors are back propagated
Algorithm 1: Backpropagation-Through-Time for Training Elman RNNs
  Step 1: Prepare the training and testing datasets
  Step 2: Initialize the RNN weights with small random numbers in the range [-0.5, 0.5]
  for each epoch until termination do
      for each sample do
          for n time-steps do
              Forward propagate
          end for
          for n time-steps do
              i) Backpropagate errors using gradient descent
              ii) Update weights
          end for
      end for
  end for
through the entire network one layer at a time. The algorithm terminates when a fixed number of epochs or cycles has been reached.
The limitation of the BPTT algorithm is that it is unable to learn sequences with long-term dependencies [49, 50], as with such dependencies the error gradient usually approaches zero, thereby reducing the weight updates.
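The vanishing-gradient effect can be seen with a back-of-the-envelope calculation: the error signal backpropagated through T time steps is scaled by a product of per-step factors, and with sigmoid slopes bounded by 0.25 this product shrinks geometrically (a simplified scalar model, not a full analysis):

```python
def gradient_scale(T, slope=0.25, w_rec=1.0):
    """Scalar proxy for the factor multiplying an error signal
    backpropagated through T time steps: (f'(y) * w_rec) ** T.
    The sigmoid derivative f'(y) is at most 0.25."""
    return (slope * w_rec) ** T
```

After 20 steps the factor is already below 1e-12, so weight updates tied to long-term dependencies effectively vanish.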
2.2.3 Coevolutionary Neuro-Evolution based Learning
Coevolution-based training of recurrent neural networks has shown a lot of promise [5, 48].
The general cooperative coevolutionary method for training Elman recurrent neu-
ral networks is given in Algorithm 2. The recurrent neural network is decomposed
into k subcomponents using the neuron-level problem decomposition method [51],
where k is equal to the total number of hidden, context and output neurons. Each
subcomponent contains all the weight links from the previous layer connecting
to a particular neuron. Each hidden neuron also acts as a reference point for
the recurrent (state or context) weight links connected to it. Therefore, the sub-
components for a recurrent network with a single hidden layer are composed as
follows:
Figure 2.3: Neuron level problem decomposition for recurrent neural networks
1. Hidden layer subcomponents: weight links from each neuron in the hidden(t)
layer connected to all input(t) neurons and the bias of hidden(t), where t
is time.
2. State (recurrent) neuron subcomponents: weight links from each neuron in
the hidden(t) layer connected to all hidden neurons in previous time step
hidden(t− 1).
3. Output layer subcomponents: weight links from each neuron in the output(t)
layer connected to all hidden(t) neurons and the bias of output(t).
The subcomponents are implemented as sub-populations that employ the generalized generation gap genetic algorithm with the parent-centric crossover operator [52]. A cycle is completed when all the sub-populations have been evolved for a fixed
number of generations. In past work, it was seen that a search depth of one generation is suitable for this evolutionary process [51]. The algorithm halts when the
termination condition is satisfied: either a specified fitness has been achieved as
measured by the root mean squared error on the training dataset or the maximum
number of function evaluations has been reached.
Algorithm 2: Cooperative Coevolutionary Training of Elman Recurrent Networks
  Step 1: Decompose the problem into k subcomponents according to the number of hidden, state and output neurons
  Step 2: Encode each subcomponent in a sub-population in the following order:
      i) Hidden layer sub-populations
      ii) State (recurrent) neuron sub-populations
      iii) Output layer sub-populations
  Step 3: Initialize and cooperatively evaluate each sub-population
  for each cycle until termination do
      for each sub-population do
          for n generations do
              i) Select and create new offspring
              ii) Cooperatively evaluate the new offspring
              iii) Add the new offspring to the sub-population
          end for
      end for
  end for
There are other decompositions of recurrent neural networks, including synapse-level and network-level decomposition; however, neuron-level decomposition was found to outperform its competitors [2, 24]. Figure 2.3 shows the neuron-level decomposition of a recurrent neural network. Further details of the cooperative coevolutionary architecture are given in the next section on evolutionary computation.
2.3 Evolutionary Computation
Evolutionary computation focuses on solving problems that are not tractable by traditional mathematical methods. It is inspired by the process of biological evolution: it draws on Darwin's theory of evolution, preserving the notion of "survival of the fittest" [53], and mimics natural processes such as reproduction, selection, mutation, and recombination [54].
Evolutionary Algorithms (EAs) are used for mathematical optimization [55], where the goal is the maximization or minimization of one or more objective functions.
The candidate solutions to the problem being optimized are placed into a population and evolved over multiple generations. EAs have been successful as genetic optimizers due to their simple design and their ability to work without prior knowledge of the problem being solved [56, 57]. Consequently, evolutionary algorithms have frequently been used to tackle black-box optimization, job-shop scheduling, and multiobjective optimization [57–59].
However, a major limitation of evolutionary algorithms is that they do not scale well to higher dimensions. They have been found to suffer from the "curse of dimensionality" [60], whereby the performance of EAs deteriorates as the number of dimensions increases in large-scale optimization. Evolutionary computation is mainly associated with swarm intelligence and evolutionary algorithms such as genetic algorithms.
2.3.1 Genetic Algorithms
In the field of evolutionary computation, genetic algorithms have been widely used as optimizers. The genetic algorithm was initially used with binary-encoded representations of candidate solutions of optimization problems. These candidate solutions are treated as individuals, and a population of such individuals is created. Each individual has its own genetic material, or chromosome, with different traits. The population is evaluated using a fitness function that identifies the fittest individuals with the best chromosomes. These individuals are selected to be parents so that, after evolution, the best chromosomes are passed on to the next generation. In the course of evolution, new offspring (children) are created from the candidate solutions selected as parents. Genetic operators such as selection, mutation, and crossover are used to facilitate mating and evolution, which is carried out over multiple generations to attain fitter individuals.
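The generational cycle described above can be sketched as a minimal, self-contained binary GA. The OneMax task (fitness = number of 1-bits) and all parameter values below are illustrative assumptions, not settings used in the thesis:

```python
import random

def onemax_ga(n_bits=20, pop_size=30, generations=60, seed=0):
    """Minimal binary GA for OneMax with tournament selection,
    single-point crossover, bit-flip mutation, and elitism."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fitness = lambda ind: sum(ind)
    for _ in range(generations):
        def select():
            # binary tournament: pick two at random, keep the fitter
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, n_bits)        # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.05:               # occasional mutation
                i = rng.randrange(n_bits)
                child[i] ^= 1
            children.append(child)
        children[0] = max(pop, key=fitness)[:]    # elitism: keep the best
        pop = children
    return max(pop, key=fitness)

best = onemax_ga()
```

Elitism guarantees the best fitness never decreases between generations, which is why the final individual is close to the all-ones optimum.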
Real-Coded Genetic Algorithms (RCGAs) were the successor to binary-encoded GAs. RCGAs use real numbers to encode candidate solutions instead of binary values [61]. Real-valued encoding provides a more natural representation of individuals, making RCGAs well suited to tackling real-world problems. Real-coded genetic algorithms differ from one another in the types of selection and crossover they employ.
Selection strategies include rank selection [62], tournament selection [63], roulette wheel selection [63], and the elitist strategy [64]. Rank selection assigns fitness to all individuals and selects the fittest for mating [62]. Tournament selection randomly groups individuals into small pools (tournaments); fitness evaluations take place within each pool, which nominates its winner, the fittest individual, as a parent. Roulette wheel selection is priority-based: fitter individuals are given more priority than weaker ones. A spinning-wheel mechanism is used to randomly pick N parents from the population. Higher-priority individuals have a higher chance of being selected, as they occupy a larger portion of the wheel [63]; however, weaker individuals are also chosen occasionally, as they may contain valuable genes. In elitist selection, some of the strongest chromosomes are always retained in the population to ensure that the best individuals remain present [62].
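The roulette wheel mechanism described above can be sketched as follows (the population and fitness values are illustrative assumptions):

```python
import random

def roulette_select(population, fitnesses, n_parents, rng):
    """Fitness-proportionate (roulette wheel) selection.

    Fitter individuals get a larger slice of the wheel, but weaker
    ones retain a nonzero chance of being picked.
    """
    total = sum(fitnesses)
    parents = []
    for _ in range(n_parents):
        spin = rng.uniform(0, total)          # spin the wheel once
        acc = 0.0
        for ind, f in zip(population, fitnesses):
            acc += f
            if spin <= acc:                   # landed on this slice
                parents.append(ind)
                break
    return parents

rng = random.Random(42)
pop = ["a", "b", "c", "d"]
fit = [1.0, 1.0, 2.0, 6.0]   # "d" occupies 60% of the wheel
parents = roulette_select(pop, fit, 5, rng)
```

Note that the same individual may be selected more than once, which is the intended behavior of fitness-proportionate selection.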
Upon identification of the parents, reproduction is performed using one of various crossover operators. Crossover is a reproduction technique that takes two parent chromosomes and produces two child chromosomes [65]. Common crossovers for real-coded genetic algorithms include simulated binary crossover (SBX) [66], which emulates single-point crossover on binary strings in a continuous search space and assigns a greater probability for offspring to remain close to the selected parents. Unimodal normal distribution crossover (UNDX) [67] selects multiple parent chromosomes and creates offspring around the center of mass of the selected parents. The simplex crossover (SPX) [68] selects parents that mark out a search space, and the children are generated within that restricted space.
SPX has been found to outperform UNDX on test problems when used with the minimal generation gap genetic algorithm. This thesis uses parent-centric crossover (PCX), which has outperformed other crossovers on unimodal problems [52]. Other crossovers such as blend crossover (BLX) [52] and Wright's heuristic crossover [65] have also shown improved performance on a set of optimization functions.
2.3.2 Generalized Generation Gap Parent Centric Crossover
Generalized Generation Gap with Parent-Centric Crossover (G3-PCX) is a genetic optimizer based on the concept of sexual reproduction in animals. The Generalized Generation Gap (G3) model differs from standard genetic algorithms in the way it selects parents for mating: it restricts mating to a selected number of parents instead of allowing mating between all individuals of the population. In each round of evolution, a sub-population of parents and children is created and evaluated instead of evaluating the whole population.
Parent-Centric Crossover (PCX) is used during mating to create the children. PCX uses the orthogonal distance between the parents. The parents consist of male and female components, where the female parent points to the search areas and the male parent determines the extent of the search in the areas pointed to by the female.
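The geometric idea can be sketched as follows. This is a simplified parent-centric recombination for illustration only, not the exact PCX operator of [52] (which computes orthogonal distances to all parents); the parameter values are assumptions:

```python
import random

def parent_centric_child(parents, index, sigma=0.1, rng=None):
    """Simplified parent-centric recombination sketch.

    The child is sampled around a chosen ("centric") parent,
    perturbed along the direction toward the mean of the remaining
    parents plus a small isotropic noise term, so offspring stay
    close to the selected parent.
    """
    rng = rng or random.Random(0)
    n = len(parents[0])
    p = parents[index]                                # the centric parent
    others = [parents[i] for i in range(len(parents)) if i != index]
    mean = [sum(o[d] for o in others) / len(others) for d in range(n)]
    child = []
    for d in range(n):
        direction = mean[d] - p[d]
        child.append(p[d] + rng.gauss(0, sigma) * direction
                          + rng.gauss(0, sigma))
    return child

parents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
child = parent_centric_child(parents, index=0)   # child stays near parent 0
```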
Algorithm 3 describes the basic process used by G3-PCX. The algorithm starts by setting the number of parents and children that will take part in evolution. The population is created, randomly initialized, and all individuals are evaluated. The best individual and α − 1 other randomly chosen individuals are selected as parents. These parents then mate using PCX to create β children. Two parents are randomly chosen from the original population and combined with the children to form a separate sub-population. The new sub-population is then evaluated, and its two best individuals replace the two chosen parents in the original population.
Algorithm 3: G3-PCX Evolutionary Algorithm [52]
Set the number of parents (α) and children (β)
Initialize and evaluate all individuals in the population
Set up the sub-population which will contain parents and children
while not optimal solution do
  1) Select the best parent and α − 1 parents randomly from the population
  2) Create β children from α parents using PCX
  3) Choose two parents at random from the population
  4) From the combined sub-population of the two chosen parents and the β created children, choose the two best individuals and replace the chosen two parents (in Step 3) with these solutions.
end while
The PCX crossover was able to outperform the simplex crossover (SPX) and simulated binary crossover (SBX) within the generalized generation gap model [52]. The performance of the self-adaptive parent-centric crossover in terms of optimization time was also appealing when problems were scaled up to higher dimensions [52].
The G3-PCX algorithm requires a large population size in order to perform reliably. This was identified as a limitation in a study where a population size of 90 was needed to effectively solve a two-dimensional problem [69].
2.3.3 Cooperative Coevolution
The performance of evolutionary algorithms deteriorates as the dimensionality of the problem being optimized increases [70, 71]. Research has therefore focused on adapting and evolving these algorithms [53], and coevolutionary algorithms have attracted considerable interest [72].
Cooperative Coevolution (CC) is an evolutionary computation method that solves a large problem by dividing it into smaller subcomponents [72]. Cooperative coevolution aims to reduce the complexity of the problem as a whole, which accounts for much of its success [48, 72, 73].
Cooperative coevolution has been applied in many research disciplines. CC has proven very promising in solving real-parameter global optimization problems [72] as well as large-scale optimization [74–76]. Coevolution has also been utilized in neuro-evolution for time series prediction and pattern classification [5, 28, 30, 48, 77].
Algorithm 4: The General Cooperative Coevolution Algorithm
1) Decompose the problem into k subcomponents
2) Initialize and cooperatively evaluate each subcomponent represented as a sub-population
while not terminated do
  for each Sub-population do
    for n Generations do
      i) Select and build new individuals
      ii) Cooperatively evaluate the new individuals
      iii) Update sub-population
    end for
  end for
end while
Algorithm 4 depicts the basic cooperative coevolution framework. Decomposition of the problem creates the subcomponents, whose size and number depend on the type of optimization problem. Each subcomponent is represented as a sub-population and assumes the characteristics of a species in nature. Individuals are added to the sub-populations using random values, and all the species are then cooperatively evaluated. In CC, the sub-populations are evolved in isolation; the only cooperation takes place during fitness evaluation of the respective individuals in each sub-population. The evolution phase involves evolving all the sub-populations in a round-robin fashion for a depth of n generations. A CC cycle is completed when all the sub-populations have been evolved for n generations. The algorithm terminates when the maximum number of cycles is reached or the minimum error is achieved.
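The framework of Algorithm 4 can be sketched end-to-end on a toy problem. The separable sphere function, the mutation-only inner EA, and all parameter values below are assumptions chosen for a minimal runnable example, not the G3-PCX setup used in the thesis:

```python
import random

def sphere(x):
    return sum(v * v for v in x)

def cooperative_coevolution(dim=8, k=4, pop_size=10, cycles=30,
                            depth=5, seed=1):
    """Sketch of the general CC framework on the sphere function.

    The dim-dimensional problem is split into k subcomponents, each
    evolved by a trivial mutation-only EA. Cooperation: a candidate
    subvector is evaluated by inserting it into a context vector
    built from the best individual of every other sub-population.
    """
    rng = random.Random(seed)
    size = dim // k
    pops = [[[rng.uniform(-5, 5) for _ in range(size)]
             for _ in range(pop_size)] for _ in range(k)]
    best = [pop[0][:] for pop in pops]           # current representatives

    def evaluate(s, ind):
        context = [b[:] for b in best]
        context[s] = ind
        return sphere([v for part in context for v in part])

    for _ in range(cycles):
        for s in range(k):                       # round-robin over species
            for _ in range(depth):               # depth of n generations
                for i, ind in enumerate(pops[s]):
                    child = [v + rng.gauss(0, 0.3) for v in ind]
                    if evaluate(s, child) < evaluate(s, ind):
                        pops[s][i] = child       # keep improving moves
                best[s] = min(pops[s], key=lambda x: evaluate(s, x))
    return sphere([v for part in best for v in part])

final_error = cooperative_coevolution()
```

Because the sphere function is fully separable, this decomposition places no interacting variables in different subcomponents, so the round-robin evolution converges toward the optimum.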
Different variants of cooperative coevolution have been developed, with problem decomposition and separability as central issues [56]. The performance of CC algorithms on high-dimensional problems can be significantly enhanced by incorporating more advanced EAs. Fast evolutionary programming in the CC framework (FEPCC) was initially used to solve large-scale optimization problems of up to 1000 dimensions [78]. FEPCC built on the cooperative coevolutionary genetic algorithm (CCGA) framework designed by [72]. Since CCGA does not cater for variable interactions across subcomponents, FEPCC performed poorly on non-separable functions. Cooperative coevolution based on particle swarm optimization (CPSO) was later developed by [79]; CPSO decomposed the problem into m s-dimensional subcomponents, where s is the number of variables in a subcomponent. Differential grouping was used to optimize the subcomponents in [80], where each subcomponent comprised half of the optimization problem. The method was applied to problems of 100 dimensions only, as the halved subcomponents were unable to cope with higher dimensions [80].
Yang and Yao [56] proposed a cooperative coevolutionary framework (DECC-G) to address high-dimensional non-separable problems. They utilized random grouping and adaptive weighting to permit co-adaptation among subcomponents that are interdependent; interacting and non-interacting variables are grouped into separate subcomponents heuristically. Their framework performed fairly well on large-scale non-separable problems of 1000 dimensions. Amendments to DECC-G were proposed by [81]: the drawback of DECC-G is that the ability of random grouping to capture two interacting variables in one subcomponent is significantly reduced when the problem contains more than two interacting variables.
Omidvar et al. [82] later proposed a cooperative coevolutionary differential evolution to address high-dimensional problems of up to 1000 dimensions, with promising results. Their goal was to advance an earlier algorithm [83] by employing principal component analysis to reduce the dimensionality of the problem [84].
More recently, Chandra et al. presented an adaptive method known as competitive island-based cooperative coevolution (CICC) [85], where candidate solutions are grouped into islands that compete and collaborate. The best individual from the winning island is injected into the losing island to ensure fair competition in different phases of evolution for global optimization [85]. The same method was earlier used for training Elman recurrent neural networks for time series prediction [86] with promising results. Omidvar et al. [87] placed dependent variables into subcomponents based on the differential grouping method and achieved improvements in the problem decomposition strategies employed.
2.3.4 Problem Decompositions
Decomposition is the process of transforming a large problem into smaller problems referred to as subcomponents. It is difficult to effectively decompose a problem without prior knowledge of its internal structure. Divide-and-conquer is generally used to decompose large-scale complex problems in CC. Cooperative coevolution is highly sensitive to the problem decomposition, which poses major challenges in decomposing large-scale optimization problems into smaller subproblems. To successfully utilize CC in optimization, interacting variables need to be grouped into the same subcomponent. Some classes of functions, such as fully-separable, fully non-separable, or overlapping functions, have no unique decomposition [88]. Fully-separable functions have no interacting variables; they can be grouped in any fashion and optimized independently, making any decomposition viable. Conversely, non-separable functions have interdependencies between all decision variables, eliminating the possibility of independent optimization.
Placing dependent (interacting) variables into separate subcomponents significantly diminishes optimization performance [72, 89, 90]. An ideal decomposition comprises subcomponents whose parameters have minimal variable interdependencies [81]; obtaining such a decomposition is a major hurdle [90].
Various types of decomposition methods have been explored. The two major types are static and dynamic. In a static decomposition strategy, the problem decomposition remains fixed throughout the optimization process. Conversely, in a dynamic decomposition method, the problem decomposition changes, that is, the grouping is varied [91]. Dynamic decomposition methods incorporating automatic variable-interaction detection have been common solutions [75, 91–93].
2.3.5 Cooperative Fitness Evaluations
Cooperative coevolution works with multiple subcomponents, since the basis of CCEAs is to divide a large problem into smaller subcomponents. The variables of the function being optimized are therefore placed in separate subcomponents. During evolution, each sub-population is optimized independently; initially, only a certain number of individuals are evolved within a sub-population, while individuals in other sub-populations still contain the arbitrary genes assigned in the initialization phase. The fitness function requires all variables in order to be computed, hence the need for cooperation between the sub-populations. To assign fitness to one individual of the sub-population being evolved, representative individuals are chosen from the other sub-populations to complete the objective function. In the context of neuro-evolution, the fitness function is the neural network: the fitness of an individual is the inverse of the performance error of that individual on the neural network training data.
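The evaluation scheme above can be sketched as follows. The toy error function and the particular inverse form 1 / (1 + error) are assumptions for illustration (the latter avoids division by zero):

```python
def cooperative_fitness(sub_index, individual, representatives, error_fn):
    """Fitness of one individual of the sub-population being evolved.

    The full parameter vector is assembled from the individual plus
    representative individuals of every other sub-population; fitness
    is then the inverse of the resulting error.
    """
    full = []
    for s, rep in enumerate(representatives):
        full.extend(individual if s == sub_index else rep)
    return 1.0 / (1.0 + error_fn(full))

# toy error: squared distance of the assembled vector from all-ones
error = lambda w: sum((v - 1.0) ** 2 for v in w)
reps = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
f = cooperative_fitness(0, [1.0, 1.0], reps, error)   # a perfect individual
```

With this scheme an individual is rewarded only for how well it combines with the current representatives, which is the sole channel of cooperation between the otherwise isolated sub-populations.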
2.4 Ensemble learning
Ensemble learning combines multiple standalone learning methods to improve generalization performance compared to the standalone approaches [94]. The basic idea is that multiple learners working on the same problem can solve it more effectively. Ensemble learning has been implemented with groups of neural networks trained as an ensemble, with different parameter settings or initializations, for executing the same task [95, 96]. Popular methods in ensemble learning include stacking, bagging, and boosting [97–99]. Stacked generalization is a form of ensemble learning in which the predictions of the ensemble members are combined; the principle behind it is that additional learners improve performance [94]. In the literature, logistic regression has commonly been used as the combiner layer for stacking [100]. Stacking has been successfully used with both supervised and unsupervised learning [100, 101]. Recently, the approach has been applied to modifying emission kinetics of colloidal semiconductor nanoplatelets [102].
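A minimal sketch of stacked generalization with two base learners follows. A closed-form least-squares linear combiner is used here for self-containment (the literature cited above often uses logistic regression instead), and the prediction values are illustrative assumptions:

```python
def fit_combiner(base_preds, targets):
    """Fit a linear combiner y ~ w1*p1 + w2*p2 over held-out
    base-learner predictions, via the 2x2 normal equations."""
    a11 = sum(p1 * p1 for p1, _ in base_preds)
    a12 = sum(p1 * p2 for p1, p2 in base_preds)
    a22 = sum(p2 * p2 for _, p2 in base_preds)
    b1 = sum(p1 * y for (p1, _), y in zip(base_preds, targets))
    b2 = sum(p2 * y for (_, p2), y in zip(base_preds, targets))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

# base learner 1 overshoots the true signal; learner 2 adds no signal
preds = [(2.0, 0.0), (4.0, 1.0), (6.0, 2.0)]
targets = [1.0, 2.0, 3.0]
w1, w2 = fit_combiner(preds, targets)
combined = [w1 * p1 + w2 * p2 for p1, p2 in preds]
```

Here the combiner learns to downweight the uninformative learner and rescale the informative one, which is the essence of the stacking combiner layer.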
2.5 Transfer learning
Transfer learning utilizes knowledge learned previously from related problems in order to achieve faster training or better generalization performance [8]. It incorporates knowledge from a related problem (known as the source) to complement a target problem, especially where data is scarce or where the learning process needs to be sped up. Figure 2.4 shows the simple transfer learning approach: large quantities of pre-existing knowledge can be used to enhance learning on the target data. The smaller target dataset may benefit from the previously learnt knowledge of the larger source dataset in a related context. In practical terms, transfer learning mirrors human learning, where learning to drive a car is aided by pre-existing knowledge of driving tractors.
The approach has seen widespread application, with a key challenge being what type of knowledge should be transferred so as to avoid negative transfer, whereby the transferred knowledge deteriorates performance [9–11]. Transfer learning has recently been used for visual tracking and computer-aided detection [12, 13]. It has also been implemented with ensemble learning methods such as boosting and
Figure 2.4: Transfer learning for one source and one target dataset
stacking [9]. In transfer stacking, multiple learners previously trained on the source data are combined with a single base learner, whereas simple stacking uses multiple base learners [9].
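The shape of transfer stacking can be sketched as follows; the stand-in models and the combiner weights are purely illustrative assumptions (in practice the weights would be learned on held-out target data, as in stacking):

```python
def transfer_stack(source_model, target_model, combiner_w, x):
    """Transfer stacking sketch: a learner previously trained on the
    source data acts as a base learner alongside a target-trained
    learner, and a combiner weighs their predictions."""
    w_src, w_tgt = combiner_w
    return w_src * source_model(x) + w_tgt * target_model(x)

source = lambda x: 2.0 * x   # stands in for a source-trained model
target = lambda x: 2.5 * x   # stands in for a target-trained model
y = transfer_stack(source, target, (0.4, 0.6), 2.0)
```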
2.6 Time Series Prediction
A time series is a sequence of observations of events recorded over a period of time. Time series prediction involves using past and present data to make future predictions [103, 104]. Research on time series is ongoing, with applications ranging from financial prediction [105, 106] to atmospheric weather prediction [2, 15].
A hybrid Elman-NARX neural network together with an embedding theorem [107] was developed to solve chaotic time series problems; the hybrid model performed exceptionally well on benchmark datasets due to its high accuracy at capturing relationships in the data [107]. Gholipour et al. developed the locally linear neuro-fuzzy model with locally linear model tree learning (LLNF-LoLiMot) [108], which showed good performance on benchmark data. The model generalized well, as its intuitive constructive implementation prevents overfitting and also makes it computationally efficient. A radial basis function network with orthogonal least squares (RBF-OLS) [108] has been used for the prediction of noisy data and gave competitive results compared to LLNF-LoLiMot.
Predicting time series data using recurrent neural networks requires restructuring of the data. We use Takens' theorem [109] to reconstruct the time series into state-space vectors. Given an observed time series x(t), an embedded phase space Y(t) = [x(t), x(t − T), ..., x(t − (D − 1)T)] can be generated, where T is the time delay, D is the embedding dimension, t = 0, 1, 2, ..., N − DT − 1, and N is the length of the original time series.
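The reconstruction can be sketched as a sliding window (written here with forward indexing for simplicity; pairing each window with the next value as the one-step-ahead target is an assumption for illustration):

```python
def embed(series, D=3, T=1):
    """Reconstruct a scalar time series into state-space vectors.

    Each input window holds D values spaced T apart; the value that
    follows the window is used as the prediction target.
    """
    X, y = [], []
    for t in range(len(series) - D * T):
        X.append([series[t + i * T] for i in range(D)])
        y.append(series[t + D * T])
    return X, y

X, y = embed(list(range(8)), D=3, T=1)
# X[0] == [0, 1, 2], y[0] == 3
```

Each (window, target) pair then serves as one training example for the recurrent network.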
2.6.1 Weather Prediction
Weather prediction involves time series prediction for natural phenomena such as rainfall, cyclones, tornadoes, wave surges, and droughts [110, 111]. One needs to assess how quickly the prediction model can make a decision once the event occurs. If the model is trained on specific months of the rainy season over a decade, the system should be able to make a robust prediction from the beginning of the rainy season. Methods for weather prediction have been discussed in the former section.
2.7 Tropical Cyclones
A tropical cyclone is a non-frontal low-pressure system with organized convection that forms over warm tropical waters [112]. Once formed, a cyclone moves over the ocean in a direction away from the equator, lasting from a few days to sometimes 2-3 weeks [113]. During its lifetime, a cyclone can travel hundreds of kilometers, and the actual position of the cyclone's eye, recorded every six hours, defines the cyclone's track. The forecast of a cyclone comprises cyclone track, intensity, induced storm surges, rainfall, and threat to coastal areas [18]. The direction of cyclone movement and the wind intensity are the most important features of the forecast, as they help inhabitants prepare ahead of time and minimize damage to life and property. For this reason, forecasting cyclone track and intensity is considered an extremely important function by scientists and meteorological agencies around the world [17]. Various combinations of techniques are incorporated in the track and intensity prediction models to account for the variation in cyclone behavior across different ocean basins and to achieve the highest possible accuracy and reliability [17, 19].
The official tropical cyclone guidance track and intensity forecast is assigned to designated regions around the world. For the South Pacific, the Fiji Meteorological Service (FMS) in Nadi, Fiji, is a World Meteorological Organisation (WMO) recognised Regional Specialised Meteorological Centre (RSMC) responsible for the southwest Pacific Ocean 1. In addition to FMS, the Australian Bureau of Meteorology (BoM), a Tropical Cyclone Warning Centre (TCWC), is also responsible for the far southwest Pacific Ocean basin 2. While the U.S. Naval Joint Typhoon Warning Center (JTWC) 3 is not a WMO-recognised RSMC or TCWC, it also issues cyclone warnings for various ocean basins, including the northwest Pacific, North Indian, Southwest Indian, Southeast Indian/Australian, and Australian/Southwest Pacific basins.
2.7.1 Conventional Cyclone Forecasting
The satellite era introduced the ability of estimating the intensity of tropical
cyclones using satellite images [114]. Methods involving cloud features along
with conventional estimates of cyclone strengths had been proposed to estimate
the intensity of cyclones [115, 116]. Over the past few decades, there has been
steady advances in tropical cyclone track forecasting with increase in availability
of observation data and state-of-the-art numerical models [117]. However, there
has been little improvement in cyclone wind intensity prediction [117]. The
difficulty with identifying the best approximate of intensity lies in its non-trivial
dependency on the horizontal resolution at small grid spacing [118]. In recent
times, statistical models have shown to provide the best intensity forecasts [119].
Dvorak [120] used a hybrid model combining meteorological analysis of satellite imagery with a model consisting of a set of curves depicting cyclone intensity change over time, together with cloud-feature descriptions of the cyclone at intervals along the curves [120]. This technique has been used extensively in cyclone forecasting and has come to be known as the Dvorak technique.
The Statistical Hurricane Intensity Forecast (SHIFOR) [121] model was based on multiple regression and was able to make up to 72-hour forecasts of
1 www.met.gov.fj
2 www.bom.gov.au
3 http://www.usno.navy.mil/JTWC/
cyclone wind intensities. It used predictor variables including the Julian day, current storm intensity, intensity change in the past 12 hours, initial storm location (latitude and longitude), and the zonal and meridional components of storm motion. The SHIFOR model was trained with cyclones that were at least 30 nautical miles away from land [121]. The current SHIFOR5 equation, however, uses 1967-1999 cyclone data with the minimum requirement that each cyclone intensifies into a tropical storm [122], and it is independent of the location of the cyclone. Climatology and Persistence (CLIPER) is one of the computer-based forecast models, able to give predictions of cyclone intensity up to 5 days ahead [122].
The Statistical Hurricane Intensity Prediction Scheme (SHIPS) [119] makes cyclone intensity forecasts at 12-hour intervals out to 120 hours. Together with the five predictors used in SHIFOR, SHIPS uses the divergence of winds at 200 hPa, intensification potential, vertical shear of the horizontal winds between the 850-200 hPa levels, cloud-top temperature measured by the GOES satellite, average 200 hPa temperature, average 850 hPa vorticity, average 500-300 hPa layer relative humidity, and oceanic heat content from altimeter measurements. The limitation of the SHIPS model is that it is not suitable for predicting the intensities of cyclones near coasts, as the model was developed using cyclone data that did not make landfall [123].
The Southern Hemisphere Statistical Typhoon Intensity Prediction Scheme (SH STIPS) [124] model was a successor to the SHIPS model. It used a consensus-based methodology with multiple linear regression equations for each forecast time to forecast cyclone intensity. SH STIPS took advantage of environmental forecast information and used an optimal combination of factors related to climatology and persistence, vertical wind shear, intensification potential, atmospheric stability, and dynamic intensity forecasts. SH STIPS was able to outperform its predecessors and competitors in making cyclone intensity forecasts in the Southern Hemisphere [124].
2.7.2 Neural Network for Cyclones
Neural network regression models have been used for the prediction of the maximum potential intensity of cyclones [15]. The error back-propagation learning algorithm was used in a feedforward neural network with two hidden layers and binary triggers that dynamically activated the neurons based on regressions of the inputs; the proposed model provided satisfactory results on Western North Pacific tropical cyclones [15]. A model inspired by the human visual system, consisting of a multi-layered neural network architecture with bi-directional connections in the hidden layers, was introduced in [22]; the prediction of the direction of movement from previously unseen satellite images showed good performance. A hybrid neural network model was proposed that clusters input data using self-organizing maps and feeds data from the different clusters to separate networks for training and prediction [125]. The method was used for forecasting actual typhoon rainfall in Taiwan's Tanshui river basin and showed improved performance over conventional prediction methods. An investigation was carried out on the impact of varying the number of layers and the number of neurons per layer for predicting the direction and intensity of cyclones over the North Indian Ocean [20]. The study found that increasing the number of hidden layers improved the accuracy of the forecast, while the number of nodes per hidden layer had no significant effect on performance.
Deep convolutional neural networks have been used to detect tropical cyclones using image processing of wind vector and sea-level pressure colour maps [126]. The network was able to achieve 99% accuracy in predicting the occurrence of cyclones from large climate datasets such as the 20th Century Reanalysis and the NCEP-NCAR Reanalysis. An approach combining a multilayer perceptron (MLP) with a neuro-fuzzy model for the prediction of a cyclone's track and surge height on the same cyclone data showed good prediction performance [127]. Tropical cyclones reported over the Bay of Bengal and the Arabian Sea were considered by [128] to predict track and intensity using multi-layer feedforward neural networks; the proposed MLP model, referred to as neural network architecture 1 (NNA 1), gave performance comparable to existing numerical models for 6-hour-ahead forecasts. Chandra et al. [2] proposed a method for cyclone track prediction based on coevolution of Elman RNNs for the South Pacific, where the latitude and longitude were treated as separate dimensions. A similar approach was used for the prediction of wind intensities [24]; however, there is room for further improvement in prediction accuracy.
2.8 Chapter Summary
This chapter has provided background on neural networks, training algorithms, evolutionary algorithms, ensemble learning, transfer learning, and tropical cyclone prediction. It has also explored the cooperative coevolution algorithm and problem decomposition in the neuro-evolution of neural networks. An extensive review of recent developments in these areas has been provided, and their strengths and limitations have been highlighted. The ensemble learning method was also discussed; neural ensembles have proven their worth in recent work. Transfer learning tries to use existing knowledge to aid in gaining new knowledge, and neural systems using transfer learning have been giving promising results in recent times, opening up applications where further exploration with these systems could be made. The following chapters suggest improvements to existing systems and then apply the upgraded systems to tropical cyclone wind-intensity and path prediction.
Chapter 3
Identification of Minimal Timespan Problem for Cyclone Wind-Intensity Prediction
This chapter presents an empirical study on the minimal timespan required for robust prediction using Elman recurrent neural networks. Two different training methods for the Elman recurrent network are evaluated: cooperative coevolution and backpropagation-through-time. They are applied to the prediction of the wind intensity of cyclones that took place in the South Pacific over the past few decades. The results show that the minimal timespan is an important factor in the robustness of prediction performance, and that strategies should be devised for cases when only the minimal timespan is available.
3.1 Introduction
This chapter presents an empirical study on the minimal timespan required for robust prediction using Elman RNNs. We train a prediction model and test its robustness regarding the minimal timespan by training it with one timespan size and testing with different sizes. For instance, in the case of cyclone wind-intensity prediction, this could be 36 hours (6 data points recorded every 6 hours) for training and then testing with 12 hours (2 data points recorded every 6 hours). In this case, the quality of prediction must therefore be evaluated within 12 hours of a cyclone forming. We run
several different types of experiments to test the robustness of Elman RNNs using two different training methods: cooperative neuro-evolution and backpropagation-through-time.
Figure 3.1: Cyclone wind intensity time series showing the relation between training and testing timespan. The training data uses embedding dimension 5 while the testing dataset uses embedding dimension 3.
3.2 Problem Definition and Methodology
In this section, the minimal timespan prediction problem is identified and details of the models used to analyse the problem are provided. Elman RNNs are used as the prediction model, trained with two distinct algorithms: back-propagation through time and cooperative neuro-evolution.
3.2.1 Problem Definition: Minimal Timespan Prediction Problem
In a conventional time series prediction problem, a large time series dataset needs to be broken down into smaller sections or snapshots called windows, usually taken at regular intervals [129]. The size of the window is defined as the timespan. In the case of financial time series, for instance, suppose predictions are organized around monthly divisions of the stock market. When a month begins, one needs to establish how many days (data points) an effective prediction model requires before it can make an efficient prediction.
The same problem arises with cyclones: one needs to measure how many hours after a cyclone is detected the model can begin predicting its track, wind intensity or other characteristics. In the case of cyclones, predictions need to be made as quickly as possible in order to provide early warnings so that people can prepare. For example, data about tropical cyclones in the South Pacific is recorded at six-hour intervals [130]. Therefore, if a timespan of 6 data points is used, the first prediction of the wind intensity would only come 36 hours after the cyclone is detected. By that time, a lot of damage may already have been caused that could have been avoided had robust and accurate warnings been issued.
The problem with the existing models such as neural networks used for cyclones
and related problems is that the minimal time required to reach a decision about
the first prediction is still unknown. We introduce the problem of minimal times-
pan that defines the minimum duration needed for a model to effectively reach a
prediction for a given time-series.
Figure 3.1 shows a portion of the wind intensity time series of tropical cyclones in the South Pacific. The event length, represented by the differently colored portions of the time series, gives the duration of individual cyclones. The timespan is of fixed length and moves through the time-series in a windowed motion. At some points this movement causes the timespan to overlap from one event to the next at the point of transition between events (cyclones). The figure shows how we extract two different timespan values from a single time-series.
Figure 3.2 shows the experimental method we used to identify the minimal timespan. The RNN was first trained to predict cyclone wind intensity using a timespan or embedding dimension W. Each fully trained network (one per training timespan value) was then tested with multiple values of timespan (3, 4, 5, 6, 7, 8). The predictions for each combination of training and testing timespan were analyzed to identify the overall minimal timespan.
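The protocol above can be sketched as follows. All names are illustrative, and a least-squares linear predictor stands in for the Elman RNN; the real RNN accepts a shorter or longer input sequence directly, so the padding/truncation step here is purely an artifact of the stand-in.

```python
import numpy as np

def embed(series, span):
    """Slice a series into (window, next-value) pairs for a given timespan."""
    X = np.array([series[i:i + span] for i in range(len(series) - span)])
    y = np.array(series[span:])
    return X, y

def evaluate_timespans(train_series, test_series, train_span, test_spans):
    """Train with one timespan, then test with several others.
    A least-squares linear predictor stands in for the Elman RNN."""
    X_tr, y_tr = embed(train_series, train_span)
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    rmse = {}
    for span in test_spans:
        X_te, y_te = embed(test_series, span)
        if span >= train_span:
            X_te = X_te[:, -train_span:]      # keep the most recent inputs
        else:
            # zero-pad shorter test windows (stand-in artifact only)
            X_te = np.pad(X_te, ((0, 0), (train_span - span, 0)))
        pred = X_te @ w
        rmse[span] = float(np.sqrt(np.mean((pred - y_te) ** 2)))
    return rmse

series = list(np.sin(np.linspace(0, 20, 300)))
scores = evaluate_timespans(series[:200], series[200:], train_span=5,
                            test_spans=[3, 4, 5, 6, 7, 8])
```

Each entry of `scores` corresponds to one bar in an experiment such as Figure 3.5: a model trained with timespan W, evaluated at a different testing timespan.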
Figure 3.2: Elman recurrent neural network trained with timespan W and tested with X, Y, Z for identifying the minimal timespan.
3.2.2 Methodology: Recurrent Networks for Prediction
RNNs are dynamical systems that use states from previous time steps to compute
current state; they are thus well-suited for modeling temporal sequences [1]. El-
man RNNs use a context layer to compute the new state from the previous state
and current inputs. The basic components of an observed dynamical system are
represented in an Elman network using the input, context and the output layer
[47].
Figure 3.3 shows the Elman recurrent neural network used for cyclone wind intensity prediction, where D represents the embedding dimension. Input data is preprocessed and fed to the RNN one time-step at a time until the size of the timespan being used is reached, after which the wind intensity is predicted.
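This forward pass can be sketched in a few lines. The weights here are untrained random values purely for illustration, and sigmoid units are assumed (as used in the experiments later); the context layer is simply the hidden state carried between steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x_t, h_prev, W_in, W_ctx, W_out, b_h, b_o):
    """One Elman step: the context layer feeds the previous hidden
    state back in alongside the current input."""
    h_t = sigmoid(W_in @ x_t + W_ctx @ h_prev + b_h)   # new hidden state
    y_t = sigmoid(W_out @ h_t + b_o)                   # prediction in [0, 1]
    return h_t, y_t

def predict_intensity(window, params, n_hidden=3):
    """Feed a timespan of observations one step at a time; the
    prediction is read after the last step."""
    W_in, W_ctx, W_out, b_h, b_o = params
    h = np.zeros(n_hidden)                             # initial context
    for x in window:
        h, y = elman_step(np.array([x]), h, W_in, W_ctx, W_out, b_h, b_o)
    return float(y[0])

rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 1)), rng.normal(size=(3, 3)),
          rng.normal(size=(1, 3)), np.zeros(3), np.zeros(1))
y = predict_intensity([0.2, 0.3, 0.5, 0.4, 0.6], params)   # timespan of 5
```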
We employ two distinct algorithms for training the given RNN: 1) cooperative neuro-evolution and 2) back-propagation through time, which are described in detail in the background section.
3.3 Experiments and Results
This section describes the details of the experimental design and results where
RNNs are trained using cooperative neuro-evolution (CNE) and back-propagation
Figure 3.3: Elman recurrent neural network used for tropical cyclone wind intensity prediction. Time series data is preprocessed and embedded using Taken's theorem and fed into the Elman RNN.
through-time for the identified minimal timespan problem. The focus was kept on the minimal timespan for tropical cyclone wind intensity prediction as a case study.
In the testing stage, we pre-process the test dataset using different values of the timespan. In this way, we evaluate the generalization performance of the trained RNN across timespan values, of which only one was used during training.
3.3.1 Data Preprocessing and Reconstruction
We use Taken’s theorem [109] to reconstruct the time series data into a state
space vector. The RNN unfolds k steps in time which is equal to the embedding
dimension or timespan D [5, 131, 132].
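A minimal sketch of this reconstruction follows, under one common convention where each state-space vector is paired with the next observation for one-step-ahead prediction; the thesis's exact windowing (e.g. across cyclone boundaries) may differ.

```python
import numpy as np

def takens_embed(series, D, T):
    """Reconstruct a scalar series into state-space vectors with
    embedding dimension D and time lag T; each vector is paired
    with the next value for one-step-ahead prediction."""
    n = len(series) - (D - 1) * T - 1
    X = np.array([[series[i + j * T] for j in range(D)] for i in range(n)])
    y = np.array([series[i + (D - 1) * T + 1] for i in range(n)])
    return X, y

X, y = takens_embed(list(range(20)), D=4, T=2)
# the first state vector is [0, 2, 4, 6], predicting the value 7
```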
Tropical cyclone intensity data from the Southern Pacific region [130] was used
for the purpose of this experiment. The time-series data contained 6000 points in
the training set (tropical cyclones from 1985 - 2005). There were 2000 points in
the test set (tropical cyclones from 2006 - 2013) taken from the dataset. All the
cyclones in both the training and testing dataset were concatenated into a single
data stream to form the complete time series. Cyclones were placed consecutively
in the dataset based on their date of identification in ascending order.
3.3.2 Experimental Design
The sub-populations in cooperative neuro-evolution employ the generalized gen-
eration gap with parent-centric crossover (G3-PCX) evolutionary algorithm [52].
A population size of 200 with 2 parents and 2 offspring is utilized, a setting that has shown good results in the literature [5]. In the case of BPTT, a learning rate of 0.2 was employed. For cooperative neuro-evolution, results for 3 hidden neurons are provided, as this showed optimal results in the literature [24]. For BPTT, training was done with different numbers of hidden neurons, and the case that gives the best results is presented for testing the different minimal timespans.
The original dataset comprised cyclones from the past decades, where each data-point was recorded at regular six-hour intervals. The data was reconstructed
in order to test the effectiveness of the model for prediction within 18 hours
(timespan of 3) and up to 48 hours (timespan of 8). Figure 3.2 gives more details
of the experimental set up used. The neural network was trained with timespan
W and tested with timespan X,Y,Z.
The concept of weight-based learning was used to test the robustness of the RNN. The neural network was trained at different difficulty levels whereby, for the various timespans tested, the training was done in easy mode (a longer sequence of inputs provided in training), hard mode (a smaller sequence of inputs provided
Figure 3.4: Unfolded view of RNN.
in training) and normal mode (same number of inputs provided in both learning
and testing).
3.3.3 Results
Figure 3.5 shows the performance of CNE and BPTT on the testing datasets for the varying timespan values from the cyclone data. Each bar (CNE and BPTT) gives the performance of the RNN tested with a timespan ranging from 3 up to 8 in increments of 1. The 95% confidence interval of the RMSE over 30 independent experimental runs is shown as an error bar.
The sub-figures 3.5(a), 3.5(b) , 3.5(c), 3.5(d) and 3.5(e) are used to test the ro-
bustness of the CNE and BPTT training algorithms for evaluating the minimal
(a) RNN trained with Timespan = 4
(b) RNN trained with Timespan = 5
(c) RNN trained with Timespan = 6
(d) RNN trained with Timespan = 7
(e) RNN trained with Timespan = 8
Figure 3.5: Performance of CNE and BPTT in wind intensity prediction on the testing dataset (2006-2013) for tropical cyclones in the South Pacific.
Table 3.1: Best Performance of cooperative coevolution

TS (training)  TS (testing)  RMSE (Test)       MAE (Test)
4              6             0.1312 ± 0.0378   25.64 ± 7.749
5              5             0.0314 ± 0.0005   4.962 ± 0.082
6              6             0.0798 ± 0.0290   15.06 ± 6.153
7              7             0.0637 ± 0.0307   11.27 ± 6.088
8              8             0.0704 ± 0.0389   11.68 ± 6.739
timespan. We compare the performance of the training algorithms with respect to the varied timespans (TS ∈ {4, 5, 6, 7, 8}) used in training. Figure 3.5(b) shows the best performance, given by the minimum error: a testing timespan of 5 performed best when the RNN was trained with a timespan of 5. In general, the best performance was obtained when the testing timespan was the same as the training timespan. Similar trends were seen for all the other training timespans, except TS4 as shown in Figure 3.5(a); in that case there was only a 0.006 difference between testing timespans 4 and 6. Therefore, we can still generalize that the same timespan used for training and testing provides the best performance.
CNE was able to beat BPTT for the higher timespans (TS 7 and 8). CNE also showed good prediction accuracy in Figures 3.5(d) and 3.5(e) when the training timespan was the same as the timespan tested, that is, 7 and 8 respectively. For the lower timespans (TS 4, 5 and 6), BPTT showed better performance.
Figure 3.6 gives the performance for a single run of CNE together with the error in prediction. The initial 100 data points are shown for clearer visualization. The timespan of 5 is compared with timespans 6 and 7; only these were used for visualization purposes, as timespan 5 showed the most promising performance, as seen in Figure 3.5(b).
Table 3.1 summarizes the best performance of CNE. As shown by the RMSE and
MAE, the best possible value for both training and testing timespan is 5 as it has
the least error.
3.3.4 Discussion
Minimal timespan was defined as the least possible number of data points or
the smallest window size necessary for time-series prediction. The results, in
general, reveal that the minimal timespan is an important feature to test the
robustness of the prediction model and the training algorithm. Cyclone wind-
intensity prediction was used as it needs a robust prediction model; however, other applications can also be explored to identify the minimal timespan problem.
In terms of the training algorithm, CNE was able to outperform BPTT for the
higher timespan. CNE works towards dividing a larger problem into smaller
components and solving them. The neural network gets larger in size with large
timespan as the RNN unfolds longer in time to cater for the increased number of
inputs. Figure 3.4 compares the sizes of the unfolded RNN for timespans TS(4) and TS(7), where it is evident that a larger timespan unfolds into a larger network in time. Therefore, training the larger neural network is well
suited for CNE as it is an evolutionary algorithm and the weight updates are
done according to fitness of the entire network and not through gradients as in
the case of BPTT. The results demonstrated that for TS(7) and TS(8), CNE
outperformed BPTT. This is due to the difficulty of BPTT in back-propagating
errors as the size of the network that unfolds in time gets larger. As shown in the
results, in the cases of smaller timespan, (TS4, TS5, and TS6), BPTT performs
better than CNE.
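The division of labour described above can be sketched schematically. A simple mutation-based EA stands in for G3-PCX, and a toy quadratic loss stands in for the RNN prediction error, so this only illustrates the decomposition and the fitness evaluation over the entire network; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def cooperative_coevolution(loss, sizes, pop_size=20, cycles=50, sigma=0.1):
    """Schematic cooperative coevolution: the weight vector is split
    into subcomponents (e.g. one per hidden neuron), each with its own
    subpopulation evolved in round-robin. An individual is scored by
    assembling the full weight vector with the best individuals of the
    other subpopulations, so fitness reflects the entire network."""
    pops = [rng.normal(size=(pop_size, s)) for s in sizes]
    best = [p[0].copy() for p in pops]      # current best per subcomponent

    def fitness(k, cand):
        full = np.concatenate(best[:k] + [cand] + best[k + 1:])
        return loss(full)                   # fitness of the whole network

    for _ in range(cycles):
        for k, pop in enumerate(pops):      # round-robin over subpopulations
            scores = np.array([fitness(k, ind) for ind in pop])
            elite = pop[scores.argmin()]
            child = elite + rng.normal(scale=sigma, size=elite.shape)
            if fitness(k, child) < scores.max():
                pop[scores.argmax()] = child   # replace the worst individual
            best[k] = min(pop, key=lambda ind: fitness(k, ind)).copy()
    return np.concatenate(best)

# toy quadratic loss standing in for the RNN prediction error
w = cooperative_coevolution(lambda v: float(np.sum((v - 1.0) ** 2)), sizes=[3, 3, 2])
```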
We found that the training and testing timespans need to be the same for the best prediction performance. This shows that the trained RNNs were unable to generalize well across the different timespans tested, which implies that the selected training methodologies were not robust enough. This reaffirms the choice of timespan as a good measure of robustness for the training algorithms and the
prediction model. The challenge in future research is to develop a strategy that
is able to give good prediction performance regardless of the size of the timespan
in the testing dataset.
The results showed that the minimal timespan TS(5) gave the best performance. This implies that the model's prediction can take place within 30 hours of the identification of the cyclone: since readings are taken every 6 hours, a timespan of 5 corresponds to 30 hours from the beginning of the cyclone.
(a) Performance of CNE on testing dataset (b) Error in prediction by CNE
Figure 3.6: Performance of CNE for a single experimental run
3.4 Chapter Summary
In this chapter, the minimal timespan problem for robust time series prediction, with application to cyclone wind intensity, was identified. The minimal timespan has been defined as the least possible window size necessary to begin time-series prediction. Back-propagation through time and the cooperative neuro-evolution algorithm were used to train Elman RNNs to study the effect of the minimal timespan. According to the results, the minimal timespan is an important characteristic for robust time series prediction and would be useful in training RNN models that can enhance predictions in future cases of cyclones.
Since the cyclone data points were collected at six-hour intervals, we could predict cyclone wind intensity quite accurately 30 hours after the start of the cyclone. This can enable better preparation for the cyclone, thereby reducing the damage caused. The minimal timespan is of paramount
importance when it comes to problems that require faster prediction as seen
with cyclones. The problem of minimum timespan exists in a wide range of
applications, especially in engineering problems that rely on intelligent decision
making based on minimal data readings by sensors.
The next chapter uses the minimal timespan and applies it to tropical cyclone
track forecast.
Chapter 4
Cyclone Track Prediction Using Coevolutionary Recurrent Neural Networks
In this chapter, an architecture for encoding a two-dimensional time series problem into Elman recurrent neural networks composed of a single input neuron is proposed. The cooperative coevolution and back-propagation through-time algorithms were used for training. The experiments showed an improvement in prediction accuracy when compared to previous results from the literature, which used a different recurrent network architecture.
4.1 Introduction
In this chapter, Elman RNNs with a single input neuron are trained using coop-
erative coevolution and back-propagation through-time [4] for both latitude and
longitude prediction. The original two-dimensional time series was reconstructed using Taken's theorem [109], and experiments with different numbers of hidden neurons in the RNNs were designed to test scalability and robustness. The
results are compared with previous work from literature which used a similar ar-
chitecture for the same prediction problem. The main contribution of this chapter
is introducing a new architecture for encoding two-dimensional time-series data
in Elman RNNs.
4.2 RNN Architecture for Cyclone Tracks
In previous work, an Elman RNN with two input and two output neurons, for a cyclone's longitude and latitude respectively, has been deployed [2], as shown in Figure 4.1.
The latitude and longitude are separate time series which are interrelated, as both describe a single property of the cyclone: its track. An improved RNN model that combines the two dimensions of latitude and longitude into a single data stream, in an attempt to represent the direct relationship
Figure 4.1: Elman RNN used for prediction of cyclone latitude and longitude. Two input and output neurons are used for mapping the longitude and latitude [2].
between the dimensions is proposed. Figure 4.2 shows the proposed architecture.
This network architecture is similar to that of Figure 4.1 but uses a single input neuron and a single output neuron which predicts both longitude and latitude, as depicted. In this model, the single neurons represent both the longitude and latitude, thus preserving some form of correlation between the two time series.
The proposed network architecture is called single input-output neural network
(SIORNN). It will be trained using the error backpropagation through time and
the cooperative coevolution algorithm.
4.3 Simulation and Analysis
4.3.1 Data Preprocessing and Reconstruction
Taken's theorem was used to reconstruct the original time series data. The theorem was developed for single-dimension time series data; this experiment considers two dimensions (latitude and longitude), hence the theorem is applied to the two dimensions.
The reconstructed vector is used to train the RNN for one-step-ahead prediction.
In the cooperative coevolutionary recurrent network (CCRNN) architecture, two
Figure 4.2: Proposed RNN architecture: A single input and output neuron Elman recurrent neural network (SIORNN) used for predicting latitude and longitude of the cyclone path.
neurons are used in the input and the output layer to represent the latitude and longitude, as shown in Figure 4.1. In the proposed architecture (SIORNN), there is only one neuron in the input and output layer. The processed data was laid out in two layouts, as shown in Figure 4.3, in order to be handled by the different network architectures. The recurrent network unfolds k steps in time, which is equal to the embedding dimension D [5, 131, 132].
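The two layouts of Figure 4.3 can be illustrated as follows. The coordinates are made up, and consecutive values (lag 1) are used for brevity, whereas the thesis embeds with Taken's theorem and a time lag.

```python
def two_stream_windows(lat, lon, D):
    """Layout (a): separate latitude and longitude windows;
    the CCRNN sees two inputs per time step."""
    return [(([lat[i:i + D], lon[i:i + D]]), (lat[i + D], lon[i + D]))
            for i in range(len(lat) - D)]

def interleaved_windows(lat, lon, D):
    """Layout (b): successive latitude/longitude values interleaved
    into a single stream; the SIORNN sees one input per step and a
    window over D positions covers 2*D values."""
    stream = [v for pair in zip(lat, lon) for v in pair]
    span = 2 * D
    return [(stream[i:i + span], stream[i + span])
            for i in range(len(stream) - span)]

lat = [-15.0, -15.4, -16.1, -16.8, -17.2]
lon = [170.0, 170.6, 171.1, 171.9, 172.4]
windows = interleaved_windows(lat, lon, D=4)
# each window interleaves 4 (lat, lon) positions and predicts the next value
```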
The same time series data from the previous chapter is used; however, for this study, the cyclone track data was extracted from the original data. Figure 4.4 shows the actual positions of the tropical cyclones from the dataset.
All the cyclone time series were combined. Data preprocessing converted all positions into a single convention based on the Southern Hemisphere. Latitudes were multiplied by -1 to represent South in the Southern Hemisphere. Longitudes with East (E) coordinates remained unchanged, while West (W) coordinates were subtracted from 360° to define all points in terms of East coordinates for easier plotting of cyclone tracks on the spatial map.
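A direct rendering of this conversion (the function name is illustrative):

```python
def normalize_position(lat, lat_hem, lon, lon_hem):
    """Map a recorded coordinate into the single convention described
    above: southern latitudes become negative, and western longitudes
    are re-expressed as East coordinates by subtracting from 360."""
    if lat_hem == "S":
        lat = -1 * lat        # multiply by -1 for South
    if lon_hem == "W":
        lon = 360.0 - lon     # express West as an East coordinate
    return lat, lon

# e.g. 17.2°S, 178.4°W becomes roughly (-17.2, 181.6)
pos = normalize_position(17.2, "S", 178.4, "W")
```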
The dataset contained time series of the position (latitude and longitude). The
following combinations of dimension and time lag were extracted using Taken’s
theorem.
Figure 4.3: Embedded data reconstructed using Taken's theorem. Embedding dimension (D) of 4 is depicted. Both (a) and (b) have the two dimensions longitude and latitude.
- Configuration A: D = 4 and T = 2; the reconstructed dataset contains 3417 samples in the training set and 1298 samples in the test set.
- Configuration B: D = 5 and T = 3; the reconstructed dataset contains 2278 samples in the training set and 865 samples in the test set.
Figure 4.4: Tropical cyclone track data in the South Pacific from 1985 to 2013. (Generated using Gnuplot)
4.3.2 Experimental Design
We experimented with different numbers of hidden neurons in the RNN, which employed sigmoid units in the hidden and output layer. We use the implementation from Smart Bilo: An Open Source Computational Intelligence Framework in our experiments [133].
The CC-SIORNN and BPTT-SIORNN were both trained and their predictions tested for 24-hour and 30-hour advance warning. The termination condition was set at 50,000 function evaluations for CC and 2000 epochs for BPTT. The root mean squared error, given in Equation 2.4, was used to evaluate the performance of the two architectures for cyclone track prediction.
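Equation 2.4 is not reproduced in this excerpt; assuming the standard form of the root mean squared error, it can be computed as:

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: sqrt((1/N) * sum_i (y_i - yhat_i)^2),
    the standard form assumed here for Equation 2.4."""
    n = len(observed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(observed, predicted)) / n)

error = rmse([0.1, 0.2, 0.4], [0.1, 0.3, 0.2])   # about 0.129
```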
4.4 Results and Discussion
The mean and 95% confidence interval over 30 experimental runs are shown in Table 4.1 and Table 4.2. The best results, with the least RMSE for each configuration, are shown bold-faced. The robustness of the proposed architecture is tested using the different configurations, which contain varied dataset sizes, to show scalability.
Figure 4.5: Performance of SIORNN using cooperative coevolution for 6 random cyclones from the year 2006 to 2013.
Table 4.1: Generalization performance of training models on cyclone track prediction for Configuration A

Model        Hidden  RMSE (Train)      RMSE (Test)       Best
CCRNN        3       0.0508 ± 0.0010   0.0484 ± 0.0010   0.0455
CCRNN        5       0.0493 ± 0.0006   0.0471 ± 0.0006   0.0447
CCRNN        7       0.0492 ± 0.0007   0.0471 ± 0.0006   0.0448
CC-SIORNN    3       0.0252 ± 0.0003   0.0244 ± 0.0002   0.0238
CC-SIORNN    5       0.0252 ± 0.0003   0.0245 ± 0.0003   0.0238
CC-SIORNN    7       0.0266 ± 0.0033   0.0260 ± 0.0034   0.0237
BPTT-SIORNN  3       0.0265 ± 0.0002   0.0257 ± 0.0002   0.0245
BPTT-SIORNN  5       0.0260 ± 0.0003   0.0254 ± 0.0002   0.0245
BPTT-SIORNN  7       0.0256 ± 0.0003   0.0251 ± 0.0003   0.0242
The comparison of the single-neuron and multi-neuron CCRNN methods for cyclone track prediction is given in Tables 4.1 and 4.2. The best performance was achieved by CC-SIORNN, which outperformed the other methods in all cases.
The number of hidden neurons did not make any considerable difference to the results, although 5 hidden neurons gave the best performance in Configuration B and 3 hidden neurons performed better in Configuration A. CCRNN produced its best results with 7 neurons in the hidden layer. It seems that 3
(a) Performance of latitude prediction on the test dataset
(b) Performance of longitude prediction on the test dataset
Figure 4.6: Typical prediction performance of a single experiment (one-step-ahead prediction) given by BPTT-SIORNN for the cyclone track test dataset (2006-2013 tropical cyclones), where time is taken at six-hour intervals.
Table 4.2: Generalization performance of training models on cyclone track prediction for Configuration B

Model        Hidden  RMSE (Train)      RMSE (Test)       Best
CCRNN        3       0.0526 ± 0.0014   0.0481 ± 0.0013   0.0432
CCRNN        5       0.0506 ± 0.0007   0.0462 ± 0.0008   0.0430
CCRNN        7       0.0497 ± 0.0006   0.0456 ± 0.0006   0.0425
CC-SIORNN    3       0.0260 ± 0.0004   0.0242 ± 0.0004   0.0232
CC-SIORNN    5       0.0254 ± 0.0001   0.0237 ± 0.0001   0.0233
CC-SIORNN    7       0.0256 ± 0.0003   0.0241 ± 0.0003   0.0232
BPTT-SIORNN  3       0.0254 ± 0.0002   0.0242 ± 0.0002   0.0235
BPTT-SIORNN  5       0.0252 ± 0.0001   0.0239 ± 0.0001   0.0235
BPTT-SIORNN  7       0.0252 ± 0.0001   0.0239 ± 0.0001   0.0235
hidden neurons were not sufficient to represent the two-dimensional time series
problem when separate neurons handled the two dimensions.
Figure 4.5 shows the performance of cooperative coevolution using the proposed
SIORNN architecture. It shows the path of 6 selected cyclones from those that
occurred between 2006 and 2013. Figure 4.6 shows the typical prediction perfor-
mance of a single experimental run given by the BPTT-SIORNN for cyclone track
test data set (2006-2013 tropical cyclones). The prediction errors are also shown in the graphs.
4.4.1 Discussion
A cyclone track is generally viewed as a single entity but is modeled as two separate dimensions of latitude and longitude. Through this research, we found that although latitude and longitude are treated independently, there is correlation between them, and when they are encoded into a recurrent neural network as a single stream of data the prediction performance improves significantly regardless of the training algorithm.
Cooperative coevolution and error backpropagation through-time gave better per-
formance when compared to the two neuron architecture due to the adapted
network configuration. The improvement in performance could be due to the
combination of the two dimensional time series consisting of a cyclone’s longitude
and latitude into a single stream of data points as represented in Figure 4.3 (b).
A single data stream increases the chance of preserving the interdependencies be-
tween the latitude and longitude while reaffirming the correlation between the two
inputs. Therefore, SIORNN outperforms CCRNN. In the traditional approach, there is little chance of preserving interdependencies within the track attributes, as each input is encoded independently, thereby losing the correlation between latitude and longitude.
The major errors in predictions were seen at locations where there was a transition
from one cyclone to another in the data. This is due to the concatenation of
the various cyclones into a single data stream. As seen from the results of a
typical prediction given in Figure 4.6, the region where there is a switch from
one cyclone to another produces a large error in the prediction. The network
was unable to cope with the sudden change in the data due to the occurrence
of the cyclone at a different location. The concatenated data places the end of one cyclone adjacent to the beginning of the next; although these are independent events, they have been treated as joint events for training the RNNs. Further studies need to be done in order to improve
the prediction accuracy at the beginning and end of the cyclones.
4.5 Chapter Summary
This chapter investigated how the two-dimensional time series consisting of lon-
gitude and latitude is best represented for superior prediction performance when
used for training RNNs. In the first method, the latitude and longitude are presented to a recurrent neural network with two input neurons, whereas the second method combines both variables in a single data stream and employs a network with a single input neuron which interleaves successive longitude and latitude values.
The results show that a single input and output layer neuron network trained
with either of the algorithms outperforms networks trained with separate inputs
for longitude and latitude. It is evident that it is more difficult to train the recurrent neural network for both tasks in the previous method, as the number of dimensions increases along with the noise in the time series and, with it, the uncertainty. The proposed method has alleviated this weakness and produced improved results that motivate real-time implementation.
Although the results have been very promising, it may be possible to approach
the multidimensional problem as a group of single-dimensional time series prob-
lems using a mixture of computational intelligence methods for cyclone track
prediction.
There is also motivation for using additional atmospheric conditions that are
major attributes in the formation of cyclones such as the sea surface temperature,
pressure and humidity and the change of their intensity with time. Another
attribute that can be considered is the speed at which the cyclone is moving and
the geographical landscape, that is, sea and land.
The next chapter uses stacked transfer learning neural model for wind intensity
prediction.
Chapter 5
Stacked Transfer Learning for Tropical Cyclone Intensity Prediction
In this chapter, transfer stacking is used as a means of studying the effects of cyclones, whereby the contribution of cyclone data from different geographic locations towards improving generalization performance is evaluated. Conventional neural networks are used for evaluating the effect of cyclone duration on prediction performance. A strategy for evaluating the relationships between different types of cyclones through transfer learning and conventional learning methods via neural networks is established in this chapter.
5.1 Introduction
In this chapter, stacked transfer learning is used as a means of studying the effects of cyclones. We select cyclones from the last few decades in the South Pacific and South Indian Oceans. Firstly, we evaluate the performance of standard neural networks when trained with different regions of the dataset. We then evaluate the effect of cyclone duration in the South Pacific region on the neural networks' generalization performance. Finally, we use transfer stacking via ensembles, with the South Pacific region as the target model; we use the South Indian Ocean as the source data and evaluate its impact on the South Pacific Ocean. Backpropagation neural networks are used as the stacked ensembles in the transfer stacking method.
5.2 Methodology
5.2.1 Neural networks for time series prediction
Time series are data of a series of events observed over a certain time period. In
order to use neural networks for time series prediction, the original time series
is reconstructed into a state-space vector with embedding dimension (D) and
Figure 5.1: Neural Network Ensemble Model
time-lag (T) through Taken's theorem [109]. We consider the backpropagation algorithm, which employs gradient descent, for training [134]. The root mean squared error, which is generally used to test the performance of the FNN, is given in Equation 2.4.
5.2.2 Stacked transfer learning
Transfer learning is implemented via stacked ensembles: a source ensemble, a target ensemble and a combiner ensemble, all implemented using feedforward neural networks (FNNs). These correspond to Ensemble 1, Ensemble 2 and the combiner network in Figure 5.1. We refer to the transfer learning model shown in Figure 5.1 as transfer stacking hereafter. Transfer stacking is implemented in two phases; phase one
involves training the individual ensembles. The second phase trains a secondary prediction model which learns from the knowledge of the ensembles trained in phase one. Figure 5.1 shows the broader view of the
ensemble stacking method where we have two ensemble models (FNNs) feeding
knowledge into a secondary combiner network. The source and target ensembles
are implemented using FNNs with the same topology. The combiner ensemble
topology depends on the number of ensembles used as the source and target ensembles. Ensemble 1 considers the South Pacific Ocean data while Ensemble
2 considers the South Indian Ocean training data. The datasets are described in Section 5.2.3.
The combiner ensemble is a feedforward network that is trained on the knowledge
processed by the ensembles. Backpropagation is used for training the combiner
network and the respective ensembles. The processed knowledge comes from the training data of the source and target datasets: a stacked dataset is created as a direct mapping from the training data by concatenating the outputs of all the ensembles into a new stacked data file. The stacked dataset, encompassing the knowledge of the ensembles, is then used to train the combiner FNN.
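The two-phase flow can be sketched end to end. Random stand-in data and least-squares linear models replace the backpropagation-trained FNN ensembles here, so only the stacking mechanics are illustrated; all data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_linear(X, y):
    """Stand-in for one FNN ensemble: a least-squares linear model
    with a bias term."""
    Xb = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.c_[Z, np.ones(len(Z))] @ w

# Phase 1: train the source and target ensembles on their own data
X_src, y_src = rng.random((200, 4)), rng.random(200)   # South Indian stand-in
X_tgt, y_tgt = rng.random((150, 4)), rng.random(150)   # South Pacific stand-in
source = fit_linear(X_src, y_src)
target = fit_linear(X_tgt, y_tgt)

# Phase 2: concatenate ensemble outputs on the *target* training data
# into a stacked dataset and train the combiner on it
stacked_train = np.column_stack([source(X_tgt), target(X_tgt)])
combiner = fit_linear(stacked_train, y_tgt)

# Testing also passes through the stacking step first
X_test = rng.random((50, 4))
stacked_test = np.column_stack([source(X_test), target(X_test)])
pred = combiner(stacked_test)
```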
Similar to the two-step training process, the testing was done in two phases. The testing data was also passed through the stacking module to generate a stacked testing dataset, which was then used in the combiner network to measure the generalization performance of transfer stacking.
5.2.3 Data Processing
The South Pacific Ocean cyclone wind intensity data described in Section 3.3.1
and the South Indian Ocean cyclone wind intensity data for the years 1985 to
2013 were used for these experiments [130]. We divided the data into training
and testing sets. Cyclones occurring in the years 1985 to 2005 were used for
training, while the remaining data was used to test generalization
performance. The consecutive cyclones in the training and testing sets were
concatenated into a time series for effective modeling. The data was
normalized to the range [0, 1].
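The preprocessing steps above can be sketched as follows. The (year, intensities) record format is a hypothetical stand-in for the actual data files, but the min-max scaling to [0, 1] and the year-based split match the description.

```python
def min_max_normalize(series):
    """Scale a wind-intensity series linearly into the range [0, 1]."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def split_by_year(cyclones, cutoff=2005):
    """Concatenate consecutive cyclones into one time series each for
    training (years <= cutoff) and testing (later years)."""
    train, test = [], []
    for year, intensities in cyclones:  # hypothetical (year, values) records
        (train if year <= cutoff else test).extend(intensities)
    return train, test

cyclones = [(1987, [35, 40, 55]), (2001, [30, 45]), (2009, [25, 60])]
train, test = split_by_year(cyclones)
train_norm = min_max_normalize(train)  # training series now lies in [0, 1]
```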
5.3 Experiments and Results
In this section, we present the experiments used to evaluate transfer
stacking for cyclone intensity prediction, followed by the results in terms
of the root mean squared error.
5.3.1 Experiment Design
Standalone feedforward neural networks were trained using the same
backpropagation learning algorithm in all the experiments. Stochastic
gradient descent was used with a training time of 2000 epochs. Each
experiment was run 30 times in order to report mean performance. The
experiments were designed as presented in the list.
1. Experiment 1: Vanilla FNN trained on all South Pacific Ocean training data
and tested on South Pacific Ocean testing data.
2. Experiment 2: Vanilla FNN trained on all South Indian Ocean training data
and tested on South Pacific Ocean testing data.
3. Experiment 3: Six experiments with vanilla FNNs trained on subsets of the
South Pacific Ocean training data and tested twice: once with the entire
South Pacific Ocean testing data and once with a subset of the testing data.
The subsets were created by grouping cyclones of similar lengths into
classes. Each class of cyclones was trained and tested with a vanilla FNN
model. We formulated the following experiments with vanilla FNNs:
(a) [0-3] day old cyclones in the training set, tested with the full testing
set as well as [0-3] day old cyclones in the testing set.
(b) [3-5] day old cyclones in the training set, tested with the full testing
set as well as [3-5] day old cyclones in the testing set.
(c) [5-7] day old cyclones in the training set, tested with the full testing
set as well as [5-7] day old cyclones in the testing set.
(d) [7-9] day old cyclones in the training set, tested with the full testing
set as well as [7-9] day old cyclones in the testing set.
(e) [9-12] day old cyclones in the training set, tested with the full testing
set as well as [9-12] day old cyclones in the testing set.
(f) cyclones older than 12 days in the training set, tested with the full
testing set as well as cyclones older than 12 days in the testing set.
Table 5.1: Generalization performance

Experiment    RMSE
1             0.02863 ± 0.00042
2             0.03396 ± 0.00075
4             0.02802 ± 0.00039
Table 5.2: Experiment 3: Performance of FNN on different categories of cyclones

Cyclone Category   Training RMSE       Categorical Testing RMSE   Generalization RMSE
0-3 day            0.03932 ± 0.00173   0.05940 ± 0.00329          0.20569 ± 0.01339
3-5 day            0.03135 ± 0.00050   0.02580 ± 0.00044          0.05200 ± 0.00532
5-7 day            0.03265 ± 0.00027   0.02504 ± 0.00025          0.03070 ± 0.00172
7-9 day            0.02831 ± 0.00033   0.03799 ± 0.00065          0.03207 ± 0.00055
9-12 day           0.03081 ± 0.00025   0.02700 ± 0.00059          0.03154 ± 0.00047
> 12 day           0.02819 ± 0.00019   0.03579 ± 0.00037          0.02875 ± 0.00025
4. Experiment 4: Transfer learning based stacked ensemble method for
predicting South Pacific Ocean tropical cyclone intensity. The mechanics of
this method are given in Section 5.2.2.
5. Experiment 5: Two separate experiments were done:
(a) FNN trained on 1985-1995 cyclones from the South Pacific data and tested
on South Pacific Ocean testing data.
(b) FNN trained on 1995-2005 cyclones from the South Pacific data and tested
on South Pacific Ocean testing data.
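The duration-based grouping used in Experiment 3 can be sketched as below. The cyclone identifiers are hypothetical, and the half-open interval handling is an assumption, since the categories in the list above share endpoints.

```python
# Experiment 3 duration categories, in days (half-open intervals assumed).
CATEGORIES = [("0-3", 0, 3), ("3-5", 3, 5), ("5-7", 5, 7),
              ("7-9", 7, 9), ("9-12", 9, 12), (">12", 12, float("inf"))]

def categorize(duration_days):
    """Map a cyclone's lifetime to its Experiment 3 category label."""
    for label, lo, hi in CATEGORIES:
        if lo <= duration_days < hi:
            return label

# Group hypothetical cyclones by duration class:
buckets = {}
for cyclone_id, days in [("06F", 2.5), ("07F", 6.0), ("08F", 14.0)]:
    buckets.setdefault(categorize(days), []).append(cyclone_id)
```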
5.3.2 Results
We present the results of the prediction of tropical cyclone intensity in the
South Pacific from the years 2006 to 2013. The root mean squared error (RMSE)
given in equation 2.4 was used to evaluate the performance.
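Equation 2.4 is not reproduced in this chapter; the standard RMSE form is assumed below, with the mean taken over repeated runs as in the experiments. The error values used here are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over a test series (standard form)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

# Mean performance over repeated runs (30 in the experiments):
run_errors = [rmse([0.2, 0.4], [0.25, 0.35]),
              rmse([0.2, 0.4], [0.20, 0.50])]
mean_rmse = sum(run_errors) / len(run_errors)
```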
Table 5.1 gives the generalization performance of the respective methods
(Experiments 1, 2 and 4) on the testing data. The results show that all the
models have similar performance.
Table 5.2 gives the performance of the various categories of data on the two
testing datasets: category-based testing and generalized testing. Category-based testing
Table 5.3: Experiment 5: Performance of Vanilla FNN on independent decade training data

Vanilla FNN            Training RMSE       Testing RMSE
1985 - 1995 cyclones   0.04434 ± 0.00073   0.04671 ± 0.00209
1995 - 2005 cyclones   0.03635 ± 0.00062   0.05949 ± 0.00201
is done on the cyclones that belong to that particular category of the
testing data, while generalization testing was done on the entire testing
dataset. The results show that cyclones that ended in under three days were
not good predictors for the generalization data, as they had a higher testing
error. Similarly, cyclones with a duration of 3-5 days performed poorly as
well, giving a larger error, though not as large as that of their predecessor.
Cyclones with a duration of 5-12 days had very similar generalization
performance. These categories of cyclones performed better than the shorter
cyclones, giving better prediction accuracy in terms of RMSE. The final
category of cyclones gave the best generalization performance: longer
cyclones, 12 days and over, matched the prediction accuracy of the best
models, that is, the standalone FNN model of control Experiment 1 and the
periodical and spatial analysis ensemble models.
Table 5.3 shows the generalization performance achieved by the standalone FNN
model trained with the decade 1 and decade 2 training data. The
generalization performance is rather poor with either decade of data used
independently when compared to the concatenation of all the training data, as
seen in Experiment 1.
5.3.3 Discussion
The results revealed interesting details about the respective experiments.
The first two experiments considered conventional learning through neural
networks (FNNs) for predicting tropical cyclone wind intensity. Experiment 1
used training and testing datasets from the same region (South Pacific
Ocean), while Experiment 2 used training data from the South Indian Ocean and
testing data from the South Pacific Ocean. According to the results, there
was minimal difference in the generalization performance in the South Pacific
Ocean, although the training datasets
considered different regions. This implies that the cyclones in the South
Indian Ocean have similar characteristics in terms of the change of wind
intensity when compared to the South Pacific Ocean. Note that the South
Indian Ocean dataset was about three times larger than the South Pacific
Ocean dataset.
Furthermore, Experiment 3 featured an investigation into the effect of the
duration of cyclones (cyclone lifetime) on the generalization performance.
This was done only for the case of the South Pacific Ocean. Note that the
generalization performance is based on the test dataset that includes all the
different types of cyclones that occurred between 2006 and 2013. As seen in
the results, cyclones with shorter durations were not effective when
considering the generalization performance. It seems that the shorter
cyclones did not give enough information to feature essential knowledge for
the longer cyclones in the test dataset. The category with the longest
cyclones gave the best generalization performance. This implied that it had
seen all the phases of the cyclone life cycle and was thus able to
effectively predict all classes of cyclones.
Transfer stacking via neural network ensembles was done in Experiment 4,
where the training dataset from the South Indian Ocean was used as the source
dataset and the cyclones in the South Pacific Ocean were used as the target
dataset. The generalization performance here was similar to that of the
conventional neural networks in Experiments 1 and 2. This shows that the
source data (South Indian Ocean) did not make a significant contribution
towards improving the generalization performance; however, we note that there
was no negative transfer of knowledge, as the performance did not
deteriorate. Therefore, the knowledge of cyclone behavior, or the change in
wind intensity, in the South Indian Ocean is valid and applicable for the
South Pacific Ocean. Further validation can be done in future by examining
the track information of the cyclones from the respective regions. This will
add further insights into transfer learning through stacking and establish
better knowledge about the relationship of the cyclones in the respective
regions.
5.4 Chapter Summary
This chapter presented transfer stacking as a means of studying the
contribution of cyclones in different geographic locations towards improving
generalization performance in predicting tropical cyclone wind intensity for
the South Pacific Ocean. We then evaluated the effect of the duration of the
cyclones in the South Pacific region and their contribution towards the
neural networks' generalization performance. Cyclone duration was seen to be
a major contributor in the prediction of cyclone intensity.
We found that cyclones with a duration of over 12 days could be used as good
representatives for training the neural networks with competitive intensity
prediction. Furthermore, the results show that the Indian Ocean source
dataset does not significantly improve the generalization performance of the
South Pacific target problem. The contribution of the Indian Ocean data was
negligible, as the knowledge about cyclone intensity prediction was
sufficiently learned from the South Pacific data alone. The change in
geographical location was unable to provide any new knowledge that would
improve generalization.
Further work can incorporate other cyclone regions into the transfer learning
methodology to further improve the generalization performance. The approach
can be extended to the prediction of cyclone tracks in the related regions.
Recurrent neural networks (RNNs) could be used as ensemble learners to
identify any temporal sequences that the FNN was unable to learn, as RNNs
have shown better modeling performance. Hybrid ensemble models of neural
networks could also be developed that use evolutionary algorithms together
with backpropagation for training the FNN to improve the modeling.
Chapter 6
Conclusions and Future Work
In this thesis, novel methods based on neural networks were used to predict
cyclone wind intensity and path for the cyclones in the South Pacific Ocean
over the past decades.
In Chapter 7, the minimal timespan problem was identified. It was defined as
the least possible number of data points required for time series prediction.
The results indicated that the timespan can be used as a measure of the
robustness of Elman recurrent neural networks in time series prediction.
Tropical cyclone wind intensity prediction warranted the use of a timespan of
5, as it gave the best results in the experiments. Cooperative
neuro-evolution was found to outperform backpropagation through time at
larger timespans. The size of the problem increased with the timespan, due to
multiple unfoldings of the RNN over time to cater for the increased inputs,
creating a bigger network which favored CC's divide-and-conquer style of
optimization over BPTT. As a measure of robustness, it was seen that the
training and testing timespans had to be the same in order to attain the best
prediction accuracy, which indicated that the training difficulty had minimal
effect on the robustness of the recurrent neural network. According to the
results, the minimal timespan is an important characteristic for robust time
series prediction. The minimal timespan would be useful in training RNN
models that could enhance predictions for future cyclones.
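The timespan defines the input window fed to the network at each step; the windowing step can be sketched as follows (the series values are illustrative):

```python
def make_windows(series, timespan):
    """Embed a univariate series into (input window, next value) pairs;
    the timespan is the number of past points the network receives."""
    return [(series[i:i + timespan], series[i + timespan])
            for i in range(len(series) - timespan)]

intensity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
pairs = make_windows(intensity, timespan=5)  # a timespan of 5 performed best
# pairs[0] is ([0.1, 0.2, 0.3, 0.4, 0.5], 0.6)
```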
Chapter 8 proposed a new architecture for encoding two-dimensional time
series data into Elman-style recurrent neural networks. The primary
motivation behind this new architecture was to preserve the pre-existing
relationships within the separate dimensions of the data. A single input and
output neuron architecture of the RNN (SIORNN) was used to encode the
two-dimensional tropical cyclone track data comprising latitude and
longitude. SIORNN was compared to methods from the literature, and
significant improvements in performance were noted. The adaptability of
SIORNN was also studied by applying two learning algorithms, CC and BPTT.
Both learning algorithms outperformed the existing encoding method,
indicating the superiority of SIORNN; however, CC-SIORNN was found to have
the best generalization results. It was found that, with the previous method,
the recurrent neural network becomes more difficult to train for both tasks
as the number of dimensions increases, along with the noise in the time
series and with it the uncertainty. SIORNN has
alleviated this weakness and produced improved results that motivate
real-time implementation.
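One plausible reading of the single input and output neuron encoding is an interleaving of the two track coordinates into a single stream; the exact SIORNN scheme is defined in the corresponding chapter, so the sketch below is an assumption for illustration only.

```python
def interleave_track(lats, lons):
    """Flatten a two-dimensional cyclone track into one stream for a
    single-input, single-output RNN: lat(t), lon(t), lat(t+1), lon(t+1), ...
    (an assumed reading of the SIORNN encoding, for illustration)."""
    stream = []
    for lat, lon in zip(lats, lons):
        stream.extend([lat, lon])
    return stream

# Hypothetical six-hourly track positions:
track = interleave_track([-15.1, -15.4], [178.2, 177.9])
# track == [-15.1, 178.2, -15.4, 177.9]
```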
Finally, Chapter 9 evaluated the performance of standard neural networks when
trained with data from different regions: the South Pacific region and the
Indian Ocean region. It also looked at the effect of the duration of the
cyclones in the South Pacific region and their contribution towards the
neural networks' generalization performance. Cyclone duration was seen to be
a major contributor in the prediction of cyclone intensity, with longer
duration cyclones making the largest contributions to network performance. It
was found that cyclones with a duration of over 12 days could be used as good
representatives for the training data. Transfer stacking via stacked
ensembles was successfully used to predict South Pacific cyclone wind
intensity, with the South Pacific region as the target. We used the South
Indian Ocean as the source data and evaluated its impact on the South Pacific
Ocean with regard to cyclone wind intensity prediction. The additional source
knowledge, introduced through the changed geographical location of the data,
was unable to provide any new knowledge that would improve generalization.
6.1 Future Research Directions
In future work, multi-objective and multi-tasking methods could be used for
the minimal timespan problem. Further applications to other problems, such as
rainfall and those that require fast seasonal prediction at the beginning of
an event, such as earthquakes, can be explored. The versatility of SIORNN
could be extended to three or more dimensions, incorporating cyclone track
and wind intensity in a single prediction network with different training
algorithms.
The stacked transfer learning could be used with additional data from other
regions, and their contribution could be evaluated. Stacking could also be
applied to uncertainty quantification of predictions using Bayesian methods
with application to different datasets. Further comparisons of the stacked
model could be done with well-established methods such as SHIPS and CLIPER.
Appendix
Publications
The following publications have arisen during the course of my research.
R. Chandra, R. Deo, K. Bali, A. Sharma. On the relationship of degree of
separability with depth of evolution in decomposition for cooperative
coevolution. In IEEE Congress on Evolutionary Computation, pages 4823-4830,
Vancouver, Canada, July 2016.

R. Deo and R. Chandra. Identification of minimal timespan problem for
recurrent neural networks with application to cyclone wind-intensity
prediction. In IEEE International Joint Conference on Neural Networks, pages
489-496, Vancouver, Canada, July 2016.

R. Chandra, R. Deo, C. W. Omlin. An architecture for encoding two-dimensional
cyclone track prediction problem in coevolutionary recurrent neural networks.
In IEEE International Joint Conference on Neural Networks, pages 4865-4872,
Vancouver, Canada, July 2016.

R. Deo, R. Chandra, A. Sharma. Stacked transfer learning for tropical cyclone
intensity prediction. In The Pacific-Asia Conference on Knowledge Discovery
and Data Mining, Under Review, Melbourne, Australia, June 2018.

R. Deo and R. Chandra. Multi-task learning for cyclone wind intensity and
path prediction. In Geoscience and Remote Sensing, IEEE Transactions on, In
Process.
Bibliography
[1] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, pp.
179–211, 1990.
[2] R. Chandra, K. Dayal, and N. Rollings, “Application of cooperative neuro-
evolution of Elman recurrent networks for a two-dimensional cyclone track
prediction for the South Pacific region,” in International Joint Conference
on Neural Networks (IJCNN), Killarney, Ireland, July 2015, pp. 721–728.
[3] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent
in nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4,
pp. 115–133, 1943.
[4] P. J. Werbos, “Backpropagation through time: what it does and how to do
it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[5] R. Chandra and M. Zhang, “Cooperative coevolution of Elman recurrent
neural networks for chaotic time series prediction,” Neurocomputing, vol.
186, pp. 116 – 123, 2012.
[6] F. E. Tay and L. Cao, “Application of support vector machines in financial
time series forecasting,” Omega, vol. 29, no. 4, pp. 309–317, 2001.
[7] W. Du, S. Y. S. Leung, and C. K. Kwong, “Time series forecasting by
neural networks: A knee point-based multiobjective evolutionary algorithm
approach,” Expert Systems with Applications, vol. 41, no. 18, pp. 8049 –
8061, 2014.
[8] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions
on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[9] D. Pardoe and P. Stone, “Boosting for regression transfer,” in Proceedings
of the 27th international conference on Machine learning (ICML-10), 2010,
pp. 863–870.
[10] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning:
transfer learning from unlabeled data,” in Proceedings of the 24th interna-
tional conference on Machine learning. ACM, 2007, pp. 759–766.
[11] N. D. Lawrence and J. C. Platt, “Learning to learn with the informative
vector machine,” in Proceedings of the twenty-first international conference
on Machine learning. ACM, 2004, p. 65.
[12] J. Gao, H. Ling, W. Hu, J. Xing et al., “Transfer learning based visual
tracking with gaussian processes regression.” in ECCV (3), 2014, pp. 188–
203.
[13] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura,
and R. M. Summers, “Deep convolutional neural networks for computer-
aided detection: CNN architectures, dataset characteristics and transfer
learning,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1285–
1298, 2016.
[14] K. Emanuel, “Increasing destructiveness of tropical cyclones over the past
30 years,” Nature, vol. 436, no. 7051, pp. 686–688, 2005.
[15] A Neural Network Regression Model for Tropical Cyclone Forecast, 2005.
[16] L. Jin, C. Yao, and X.-Y. Huang, “A nonlinear artificial intelligence ensem-
ble prediction model for typhoon intensity,” Monthly Weather Review, vol.
136, pp. 4541–4554, 2008.
[17] “Tropical-cyclone forecasting: a worldwide summary of techniques and ver-
ification statistics,” Bulletin of American Meteorological Society, vol. 68.
[18] G. Holland, “Global guide to tropical cyclone forecasting. bureau of
meteorology research center, melbourne, australia,” http://cawcr.gov.
au/publications/BMRC archive/tcguide/globa guide intro.htm, 2009, ac-
cessed: January 21, 2015.
[19] C. Roy and R. Kovordanyi, “Tropical cyclone track forecasting techniques
- a review,” Atmospheric Research, vol. 104-105, pp. 40–69, 2012.
[20] S. Chaudhuri, D. Dutta, S. Goswami, and A. Middey, “Track and intensity
forecast of tropical cyclones over the north indian ocean with multilayer
feed forward neural nets,” Meteorological Applications, vol. 22, no. 3, pp.
563–575, 2015. [Online]. Available: http://dx.doi.org/10.1002/met.1488
[21] L. E. Carr III, R. L. Elsberry, and J. E. Peak, “Beta test of the systematic
approach expert system prototype as a tropical cyclone track forecasting
aid,” Weather and forecasting, vol. 16, no. 3, pp. 355–368, 2001.
[22] R. Kovordanyi and C. Roy, “Cyclone track forecasting based on satellite
images using artificial neural networks,” ISPRS Journal of Photogrammetry
and Remote Sensing, vol. 64, no. 6, pp. 513–521, 2009.
[23] R. Chandra, “Multi-objective cooperative neuro-evolution of recurrent neu-
ral networks for time series prediction,” in IEEE Congress on Evolutionary
Computation, Sendai, Japan, May 2015, pp. 101–108.
[24] R. Chandra and K. Dayal, “Cooperative coevolution of Elman recurrent
networks for tropical cyclone wind-intensity prediction in the South Pacific
region,” in IEEE Congress on Evolutionary Computation, Sendai, Japan,
May 2015, pp. 1784–1791.
[25] L. Sacchi, C. Larizza, C. Combi, and R. Bellazzi, “Data mining with tempo-
ral abstractions: learning rules from time series,” Data Mining and Knowl-
edge Discovery, vol. 15, no. 2, pp. 217–247, 2007.
[26] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural
computation. Basic Books, 1991, vol. 1.
[27] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE,
vol. 87, no. 9, pp. 1423–1447, Sep 1999.
[28] N. García-Pedrajas and D. Ortiz-Boyer, “A cooperative constructive
method for neural networks for pattern recognition,” Pattern Recogn.,
vol. 40, no. 1, pp. 80–98, 2007.
[29] H. Zhang, J. Guan, and G. C. Sun, “Artificial neural network-based image
pattern recognition,” in Proceedings of the 30th Annual Southeast Regional
Conference, ser. ACM-SE 30. New York, NY, USA: ACM, 1992, pp. 437–
441.
[30] F. Gomez, J. Schmidhuber, and R. Miikkulainen, “Accelerated neural evo-
lution through cooperatively coevolved synapses,” J. Mach. Learn. Res.,
vol. 9, pp. 937–965, 2008.
[31] D. Pardoe, M. Ryoo, and R. Miikkulainen, “Evolving neural network en-
sembles for control problems,” in Proceedings of the 7th Annual Conference
on Genetic and Evolutionary Computation, ser. GECCO ’05. New York,
NY, USA: ACM, 2005, pp. 1379–1384.
[32] R. Chandra and M. Zhang, “Cooperative coevolution of elman recurrent
neural networks for chaotic time series prediction,” Neurocomputing, vol. 86,
pp. 116–123, 2012.
[33] A. Emam, “Optimal artificial neural network topology for foreign exchange
forecasting,” in Proceedings of the 46th Annual Southeast Regional Confer-
ence on XX, ser. ACM-SE 46. New York, NY, USA: ACM, 2008, pp.
63–68.
[34] M. Janeski and S. Kalajdziski, “Neural network model for forecasting balkan
stock exchanges,” in Proceedings of the 7th International Conference on Ad-
vanced Intelligent Computing, ser. ICIC’11. Berlin, Heidelberg: Springer-
Verlag, 2011, pp. 17–24.
[35] S. S. Haykin, S. S. Haykin, S. S. Haykin, and S. S. Haykin, Neural networks
and learning machines. Pearson Education Upper Saddle River, 2009,
vol. 3.
[36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal
representations by error propagation,” Parallel distributed processing: ex-
plorations in the microstructure of cognition, vol. 1, pp. 318–362, 1985.
[37] A. C. Tsoi and A. D. Back, “Locally recurrent globally feedforward net-
works: a critical review of architectures,” Neural Networks, IEEE Transac-
tions on, vol. 5, no. 2, pp. 229–239, 1994.
[38] C. L. Giles, C. B. Miller, D. Chen, H.-H. Chen, G.-Z. Sun, and Y.-C. Lee,
“Learning and extracting finite state automata with second-order recurrent
neural networks,” Neural Computation, vol. 4, no. 3, pp. 393–405, 1992.
[39] M. W. Goudreau, C. L. Giles, S. T. Chakradhar, and D. Chen, “First-
order versus second-order single-layer recurrent neural networks,” Neural
Networks, IEEE Transactions on, vol. 5, no. 3, pp. 511–513, 1994.
[40] T. Lin, B. Horne, P. Tino, and C. Giles, “Learning long-term dependencies
in NARX recurrent neural networks,” IEEE Transactions on Neural Networks,
vol. 7, no. 6, pp. 1329–1338, 1996.
[41] H. T. Siegelmann, B. G. Horne, and C. L. Giles, “Computational capabil-
ities of recurrent NARX neural networks,” Systems, Man, and Cybernetics,
Part B: Cybernetics, IEEE Transactions on, vol. 27, no. 2, pp. 208–215,
1997.
[42] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural com-
putation, vol. 9, no. 8, pp. 1735–1780, 1997.
[43] H. Jaeger, “The “echo state” approach to analysing and training recurrent
neural networks-with an erratum note,” Bonn, Germany: German National
Research Center for Information Technology GMD Technical Report, vol.
148, p. 34, 2001.
[44] W. Maass, T. Natschlager, and H. Markram, “Real-time computing without
stable states: A new framework for neural computation based on perturba-
tions,” Neural computation, vol. 14, no. 11, pp. 2531–2560, 2002.
[45] M. Lukoševičius and H. Jaeger, “Reservoir computing approaches to re-
current neural network training,” Computer Science Review, vol. 3, no. 3,
pp. 127–149, 2009.
[46] S. C. Kremer, “On the computational power of elman-style recurrent net-
works,” Neural Networks, IEEE Transactions on, vol. 6, no. 4, pp. 1000–
1004, 1995.
[47] M. Samuelides, “Neural identification of controlled dynamical systems and
recurrent networks,” in Neural Networks. Springer, 2005, pp. 231–287.
[48] R. Chandra, M. Frean, and M. Zhang, “On the issue of separability for
problem decomposition in cooperative neuro-evolution,” Neurocomputing,
vol. 87, pp. 33–40, 2012.
[49] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies
with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5,
no. 2, pp. 157–166, 1994.
[50] S. Hochreiter, “The vanishing gradient problem during learning recurrent
neural nets and problem solutions,” Int. J. Uncertain. Fuzziness Knowl.-
Based Syst., vol. 6, no. 2, pp. 107–116, 1998.
[51] R. Chandra, M. Frean, M. Zhang, and C. W. Omlin, “Encoding subcom-
ponents in cooperative co-evolutionary recurrent neural networks,” Neuro-
computing, vol. 74, no. 17, pp. 3223 – 3234, 2011.
[52] K. Deb, A. Anand, and D. Joshi, “A computationally efficient evolutionary
algorithm for real-parameter optimization,” Evol. Comput., vol. 10, no. 4,
pp. 371–395, 2002.
[53] K. A. De Jong, “Evolutionary computation: a unified approach,” 2006.
[54] M. Lozano, D. Molina, and F. Herrera, “Editorial scalability of evolutionary
algorithms and other metaheuristics for large-scale continuous optimization
problems,” Soft Computing, vol. 15, no. 11, pp. 2085–2087, 2011.
[55] J. Apolloni, E. Alba et al., “Island based distributed differential evolution:
an experimental study on hybrid testbeds,” in Eighth International Con-
ference on Hybrid Intelligent Systems. IEEE, 2008, pp. 696–701.
[56] Z. Yang, K. Tang, and X. Yao, “Large scale evolutionary optimization using
cooperative coevolution,” Inf. Sci., vol. 178, no. 15, pp. 2985–2999, 2008.
[57] B. Kazimipour, M. N. Omidvar, X. Li, and A. K. Qin, “A sensitivity analysis
of contribution-based cooperative co-evolutionary algorithms,” pp. 417–424,
2015.
[58] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist mul-
tiobjective genetic algorithm: NSGA-II,” Evolutionary Computation, IEEE
Transactions on, vol. 6, no. 2, pp. 182–197, 2002.
[59] R. P. Wiegand, “An analysis of cooperative coevolutionary algorithms,”
Ph.D. dissertation, Citeseer, 2003.
[60] R. E. Bellman, Dynamic Programming, ser. Dover Books on Mathe-
matics. Princeton University Press, 1957.
[61] D. E. Goldberg, “Real-coded genetic algorithms, virtual alphabets, and
blocking,” Urbana, vol. 51, p. 61801, 1990.
[62] L. D. Whitley et al., “The genitor algorithm and selection pressure: Why
rank-based allocation of reproductive trials is best.” in ICGA, vol. 89, 1989,
pp. 116–123.
[63] J. E. Baker, “Reducing bias and inefficiency in the selection algorithm,”
in Proc. of the 2nd Intl Conf on GA. Lawrence Erlbaum Associates, Inc.
Mahwah, NJ, USA, 1987, pp. 14–21.
[64] K. A. DeJong, “An analysis of the behavior of a class of genetic adaptive
systems,” Ph.D. dissertation, 1975.
[65] A. H. Wright, “Genetic algorithms for real parameter optimization,” Foun-
dations of genetic algorithms, vol. 1, pp. 205–218, 1991.
[66] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous
search space,” Complex Systems, vol. 9, no. 3, pp. 1–15, 1994.
[67] I. Ono and S. Kobayashi, “A real coded genetic algorithm for function
optimization using unimodal normal distributed crossover.” in Proceedings
of the Seventh International Conference on Genetic Algorithms. Morgan
Kaufmann, 1997, pp. 246–253.
[68] T. Higuchi, S. Tsutsui, and M. Yamamura, “Theoretical analysis of simplex
crossover for real-coded genetic algorithms,” in Parallel Problem Solving
from Nature PPSN VI. Springer, 2000, pp. 365–374.
[69] P. Pošík, “BBOB-benchmarking the generalized generation gap model with
parent centric crossover,” in Proceedings of the 11th Annual Conference
Companion on Genetic and Evolutionary Computation Conference: Late
Breaking Papers. ACM, 2009, pp. 2321–2328.
[70] K. Tang, X. Yao, P. N. Suganthan, C. MacNish, Y. P. Chen, C. M. Chen,
and Z. Yang, “Benchmark functions for the CEC’2008 special session
and competition on large scale global optimization,” Nature Inspired Com-
putation and Applications Laboratory, USTC, China, Tech. Rep., 2007,
http://nical.ustc.edu.cn/cec08ss.php.
[71] X. Li, K. Tang, M. N. Omidvar, Z. Yang, and K. Qin, “Benchmark func-
tions for the CEC’2013 special session and competition on large-scale global
optimization,” RMIT University, Melbourne, Australia, Tech. Rep., 2013,
http://goanna.cs.rmit.edu.au/ xiaodong/cec13-lsgo.
[72] M. Potter and K. De Jong, “A cooperative coevolutionary approach to
function optimization,” in Parallel Problem Solving from Nature — PPSN
III, ser. Lecture Notes in Computer Science, Y. Davidor, H.-P. Schwefel, and
R. Manner, Eds. Springer Berlin Heidelberg, 1994, vol. 866, pp. 249–257.
[73] R. Salomon, “Re-evaluating genetic algorithm performance under coordi-
nate rotation of benchmark functions. a survey of some theoretical and
practical aspects of genetic algorithms,” Biosystems, vol. 39, no. 3, pp. 263
– 278, 1996.
[74] M. A. Potter and K. A. De Jong, “Cooperative coevolution: An architecture
for evolving coadapted subcomponents,” Evol. Comput., vol. 8, pp. 1–29,
2000.
[75] M. Omidvar, X. Li, Y. Mei, and X. Yao, “Cooperative co-evolution with dif-
ferential grouping for large scale optimization,” Evolutionary Computation,
IEEE Transactions on, vol. 18, no. 3, pp. 378–393, June 2014.
[76] X. Li and X. Yao, “Cooperatively coevolving particle swarms for large scale
optimization,” Evolutionary Computation, IEEE Transactions on, vol. 16,
no. 2, pp. 210–224, 2012.
[77] F. Gomez and R. Mikkulainen, “Incremental evolution of complex general
behavior,” Adapt. Behav., vol. 5, no. 3-4, pp. 317–342, 1997.
[78] Y. Liu, X. Yao, Q. Zhao, and T. Higuchi, “Scaling up fast evolutionary
programming with cooperative coevolution,” in Evolutionary Computation,
Proceedings of the 2001 Congress on, San Diego, CA, USA, Jun. 2001, pp.
1101–1108.
[79] F. van den Bergh and A. Engelbrecht, “A cooperative approach to particle
swarm optimization,” Evolutionary Computation, IEEE Transactions on,
vol. 8, no. 3, pp. 225–239, Jun. 2004.
[80] Y.-j. Shi, H.-f. Teng, and Z.-q. Li, “Cooperative co-evolutionary differential
evolution for function optimization,” in Advances in Natural Computation,
ser. Lecture Notes in Computer Science, L. Wang, K. Chen, and Y. S. Ong,
Eds. Springer Berlin / Heidelberg, 2005, vol. 3611, pp. 1080–1088.
[81] M. Omidvar, X. Li, and X. Yao, “Cooperative co-evolution for large scale
optimization through more frequent random grouping,” in Evolutionary
Computation (CEC), 2010 IEEE Congress on, 2010, pp. 1754–1761.
[82] M. N. Omidvar and X. Li, “Cooperative coevolutionary algorithms for large
scale optimisation,” Technical report, The Royal Melbourne Institute of
Technology (RMIT), Tech. Rep.
[83] J. Vesterstrom and R. Thomsen, “A comparative study of differential evo-
lution, particle swarm optimization, and evolutionary algorithms on numer-
ical benchmark problems,” in Proceedings of the 2004 Congress on Evolutionary
Computation (CEC2004), vol. 2. IEEE, 2004, pp. 1980–1987.
[84] K. Tang, X. Yao, P. Suganthan, C. MacNish, Y. Chen, C. Chen, and
Z. Yang, “Benchmark functions for the CEC’2008 special session and compe-
tition on large scale global optimization,” Nature Inspired Computation and
Applications Laboratory, Univ. Sci. Technol. China, Hefei, China, Tech.
Rep., 2007.
[85] R. Chandra and K. Bali, “Competitive two-island cooperative coevolu-
tion for real parameter global optimisation,” in Proceedings of the 2015 IEEE
Congress on Evolutionary Computation (CEC). IEEE, 2015, pp. 93–100.
[86] R. Chandra, “Competition and collaboration in cooperative coevolution of
Elman recurrent neural networks for time-series prediction,” IEEE Transactions
on Neural Networks and Learning Systems, 2015.
[87] M. N. Omidvar, X. Li, Y. Mei, and X. Yao, “Cooperative co-evolution with
differential grouping for large scale optimization,” IEEE Transactions on
Evolutionary Computation, vol. 18, no. 3, pp. 378–393, 2014.
[88] M. N. Omidvar, X. Li, and K. Tang, “Designing benchmark problems for
large-scale continuous optimization,” Information Sciences, 2015.
[89] Y. Liu, X. Yao, Q. Zhao, and T. Higuchi, “Scaling up fast evolutionary
programming with cooperative coevolution,” in Proceedings of the 2001
Congress on Evolutionary Computation, vol. 2. IEEE, 2001, pp. 1101–
1108.
[90] M. N. Omidvar, Y. Mei, and X. Li, “Effective decomposition of large-scale
separable continuous functions for cooperative co-evolutionary algorithms,”
in Proc. of IEEE Congress on Evolutionary Computation, 2014, pp. 1305–
1312.
[91] S. Mahdavi, M. E. Shiri, and S. Rahnamayan, “Cooperative co-evolution
with a new decomposition method for large-scale optimization,” in Proceed-
ings of the IEEE Congress on Evolutionary Computation, CEC 2014, 2014,
pp. 1285–1292.
[92] W. Chen, T. Weise, Z. Yang, and K. Tang, “Large-scale global optimization
using cooperative coevolution with variable interaction learning,” in Proc.
of International Conference on Parallel Problem Solving from Nature, ser.
Lecture Notes in Computer Science, vol. 6239. Springer Berlin / Heidel-
berg, 2011, pp. 300–309.
[93] M. N. Omidvar, X. Li, and X. Yao, “Cooperative co-evolution with delta
grouping for large scale non-separable function optimization,” in Proc. of
IEEE Congress on Evolutionary Computation, 2010, pp. 1762–1769.
[94] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp.
241–259, 1992.
[95] N. Ueda, “Optimal linear combination of neural networks for improving
classification performance,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 22, no. 2, pp. 207–215, 2000.
[96] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp.
993–1001, 1990.
[97] E. Bauer and R. Kohavi, “An empirical comparison of voting classification
algorithms: Bagging, boosting, and variants,” Machine Learning, vol. 36,
no. 1, pp. 105–139, 1999.
[98] H. Drucker, “Improving regressors using boosting techniques,” in Proceedings
of the Fourteenth International Conference on Machine Learning (ICML),
1997, pp. 107–115.
[99] G. Wang, J. Hao, J. Ma, and H. Jiang, “A comparative assessment of en-
semble learning for credit scoring,” Expert Systems with Applications, vol. 38,
no. 1, pp. 223–230, 2011.
[100] L. Breiman, “Stacked regressions,” Machine Learning, vol. 24, no. 1, pp.
49–64, 1996.
[101] P. Smyth and D. Wolpert, “Linearly combining density estimators via stack-
ing,” Machine Learning, vol. 36, no. 1, pp. 59–83, 1999.
[102] O. Erdem, M. Olutas, B. Guzelturk, Y. Kelestemur, and H. V. Demir,
“Temperature-dependent emission kinetics of colloidal semiconductor
nanoplatelets strongly modified by stacking,” The Journal of Physical Chem-
istry Letters, vol. 7, no. 3, pp. 548–554, 2016.
[103] E. N. Lorenz, “Deterministic nonperiodic flow,” Journal of the Atmospheric
Sciences, vol. 20, pp. 130–141, 1963.
[104] S. H. Kellert, In the Wake of Chaos: Unpredictable Order in Dynamical
Systems. University of Chicago Press, 1993.
[105] H. Jiang and W. He, “Grey relational grade in local support vector regres-
sion for financial time series prediction,” Expert Systems with Applications,
vol. 83, pp. 136–145, 2012.
[106] B. Wang, H. Huang, and X. Wang, “A novel text mining approach to finan-
cial time series forecasting,” Neurocomputing, vol. 83, pp. 136–145, 2012.
[107] M. Ardalani-Farsa and S. Zolfaghari, “Chaotic time series prediction with
residual analysis method using hybrid Elman–NARX neural networks,” Neu-
rocomputing, vol. 73, no. 13, pp. 2540–2553, 2010.
[108] A. Gholipour, B. N. Araabi, and C. Lucas, “Predicting chaotic time series
using neural and neurofuzzy models: A comparative study,” Neural Process.
Lett., vol. 24, pp. 217–239, 2006.
[109] F. Takens, “Detecting strange attractors in turbulence,” in Dynamical Sys-
tems and Turbulence, Warwick 1980, ser. Lecture Notes in Mathematics,
1981, pp. 366–381.
[110] E. N. Lorenz, “Empirical orthogonal functions and statistical weather pre-
diction,” 1956.
[111] L. F. Richardson, Weather prediction by numerical process. Cambridge
University Press, 2007.
[112] (2000) Bureau of Meteorology Research Centre. Accessed: January 21, 2015.
[113] S. Debsarma, “Cyclone and its warning system in Bangladesh,” National
Disaster Reduction Day, 2001.
[114] J. C. Sadler, “Tropical cyclones of the eastern North Pacific as revealed by
TIROS observations,” Journal of Applied Meteorology, vol. 3, pp. 347–366,
1964.
[115] R. W. Fett, “Upper-level structure of the formative tropical cyclone,” Monthly
Weather Review, vol. 94, pp. 9–18, 1966.
[116] S. Fritz, L. Hubert, and A. Timchalk, “Some inferences from satellite pic-
tures of tropical disturbances,” Monthly Weather Review, vol. 94, pp. 231–
236, 1966.
[117] M. DeMaria and J. Kaplan, “An operational evaluation of a statistical hurri-
cane intensity prediction scheme (SHIPS),” in 22nd Conf. on Hurricanes and
Tropical Meteorology, American Meteorological Society, 1997, pp. 280–281.
[118] M. A. Bender and I. Ginis, “Real-case simulations of hurricane–ocean in-
teraction using a high-resolution coupled model: Effects on hurricane in-
tensity,” Monthly Weather Review, vol. 128, pp. 917–946, 2000.
[119] M. DeMaria and J. Kaplan, “A statistical hurricane intensity prediction
scheme (SHIPS) for the Atlantic basin,” Weather and Forecasting, vol. 9, pp.
209–220, 1994.
[120] V. Dvorak, “Tropical cyclone intensity analysis and forecasting from satel-
lite imagery,” Monthly Weather Review, vol. 103, pp. 420–430, 1975.
[121] B. R. Jarvinen and C. J. Neumann, “Statistical forecasts of tropical cy-
clone intensity,” NOAA Technical Memorandum NWS NHC-10, 22 pp.,
1979.
[122] J. A. Knaff, M. DeMaria, C. R. Sampson, and J. M. Gross, “Statistical,
five-day tropical cyclone intensity forecasts derived from climatology and
persistence,” Weather and Forecasting, vol. 18, pp. 80–92, 2003.
[123] M. DeMaria, M. Mainelli, L. Shay, J. Knaff, and J. Kaplan, “Further im-
provements to the statistical hurricane intensity prediction scheme (SHIPS),”
Weather and Forecasting, vol. 20, pp. 531–543, 2005.
[124] J. Knaff and C. Sampson, “Southern hemisphere tropical cyclone inten-
sity forecast methods used at the Joint Typhoon Warning Center, Part
II: Statistical-dynamical forecasts,” Australian Meteorological and Oceano-
graphic Journal, vol. 58, pp. 9–18, 2009.
[125] G.-F. Lin and M.-C. Wu, “A hybrid neural network model for typhoon-
rainfall forecasting,” Journal of Hydrology, vol. 375, no. 3, pp. 450–458,
2009.
[126] Y. Liu, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel,
M. Wehner, W. Collins et al., “Application of deep convolutional neural
networks for detecting extreme weather in climate datasets,” arXiv preprint
arXiv:1605.01156, 2016.
[127] S. Chaudhuri, S. Goswami, A. Middey, D. Das, and S. Chowdhury,
“Predictability of landfall location and surge height of tropical cyclones over
North Indian Ocean (NIO),” Natural Hazards, vol. 75, no. 2, pp. 1369–1388,
2015. [Online]. Available: http://dx.doi.org/10.1007/s11069-014-1376-0
[128] S. Chaudhuri, D. Dutta, S. Goswami, and A. Middey, “Track and intensity
forecast of tropical cyclones over the North Indian Ocean with multilayer
feed forward neural nets,” Meteorological Applications, vol. 22, no. 3, pp.
563–575, 2015.
[129] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality
reduction for fast similarity search in large time series databases,”
Knowledge and Information Systems, vol. 3, no. 3, pp. 263–286, 2001.
[Online]. Available: http://dx.doi.org/10.1007/PL00011669
[130] (2015) JTWC tropical cyclone best track data site.
[131] T. Koskela, M. Lehtokangas, J. Saarinen, and K. Kaski, “Time series pre-
diction with multilayer perceptron, FIR and Elman neural networks,” in
Proceedings of the World Congress on Neural Networks, San Diego, CA,
USA, 1996, pp. 491–496.
[132] D. Mirikitani and N. Nikolaev, “Recursive Bayesian recurrent neural net-
works for time-series modeling,” IEEE Transactions on Neural Networks,
vol. 21, no. 2, pp. 262–274, Feb. 2010.
[133] “Smart bilo: Computational intelligence framework,” accessed: 02-12-2015.
[Online]. Available: http://smartbilo.aicrg.softwarefoundationfiji.org
[134] M. B. Nevel’son and R. Z. Khas’minskii, Stochastic Approximation and
Recursive Estimation. American Mathematical Society, Providence, 1976,
vol. 47.