NEURAL NETWORK METHODOLOGIES FOR CYCLONE WIND …

digilib.library.usp.ac.fj/.../499d1bce.dir/doc.pdf

“It is not what you want, it's what you need that matters”

Anonymous


Abstract

Tropical cyclone wind-intensity and path prediction are challenging tasks consid-

ering the drastic changes in climate patterns over the last few decades. Cyclones

cause extensive damage to everything in their path; however, the destruction caused

by this natural calamity could be reduced immensely with accurate and timely

forecasts of cyclone track and intensity. The unpredictable nature of cyclones is

difficult for statistical prediction models to learn and make efficient and timely

predictions. Cyclones have been studied extensively and statistical models have

been used to make predictions. Time series prediction relies on past data points

to make robust predictions. Recurrent neural networks have been suitable for

time series prediction due to their architectural properties in modeling temporal

sequences. Coevolutionary recurrent neural networks have recently given very

promising performance for time series prediction. The study applies the afore-

mentioned methods to tropical cyclone wind intensity and path prediction. The

study begins with the prediction of the wind intensity of cyclones that took place

in the South Pacific Ocean over the past few decades. The timespan is defined as the

number of data points necessary to start prediction using a neural architecture. To im-

prove the prediction performance of the models, an empirical study on minimal

timespan is required. Cyclone track prediction is a two-dimensional (mul-

tivariate) time series prediction problem that involves latitudes and longitudes

which define the position of a cyclone. An architecture for encoding the two di-

mensional time series problem into Elman recurrent neural networks composed

of a single input neuron is proposed in the thesis for cyclone path prediction.

Transfer learning incorporates knowledge from related source dataset to comple-

ment a target dataset. The additional knowledge aids learning, especially in

cases where there is a lack of target data. Stacking is a form of ensemble learning,

focused on improving generalization performance. It has been used for transfer

learning problems, where it is referred to as transfer stacking. The final contri-

bution of the thesis involves transfer stacking as a means of studying the effects

of cyclones whereby the contribution of cyclone data from different geographic

locations towards improving generalization performance was evaluated.


Acknowledgements

Foremost, I thank God for knowledge, strength and protection.

I would like to express my sincere gratitude to my advisor Dr. Rohitash Chandra

who was my supervisor initially and later became my external supervisor. He

gave me motivation and confidence that helped me in completing this thesis.

My sincere thanks to Dr. Anurag Anand Sharma, who guided me in completing

this thesis towards the final stage of the masters program.

My parents, Bram Deo and Antala Devi, my brothers, Ravnil Deo and Rajneel

Deo, and my sisters-in-law Karishma Devi and Marshlin Lata have constantly been

there for me, motivating, advising and encouraging me throughout my studies.

Finally, all my friends and family members who have not been mentioned, deserve

my wholehearted acknowledgement.


Contents

Abstract i

Acknowledgements ii

List of Figures v

List of Tables vi

1 Introduction 1

1.1 Premises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background and Literature Review 5

2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Feedforward Neural Networks . . . . . . . . . . . . . . . . 6

2.1.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . 7

2.1.3 Elman Neural Networks . . . . . . . . . . . . . . . . . . . 7

2.2 Training in Neural Networks . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Backpropagation Based Learning . . . . . . . . . . . . . . 9

2.2.2 Backpropagation-Through-Time Based Learning . . . . . . 10

2.2.3 Coevolutionary Neuro-Evolution based Learning . . . . . . 11

2.3 Evolutionary Computation . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . 14

2.3.2 Generalized Generation Gap Parent Centric Crossover . . 16

2.3.3 Cooperative Coevolution . . . . . . . . . . . . . . . . . . 17

2.3.4 Problem Decompositions . . . . . . . . . . . . . . . . . . . 19

2.3.5 Cooperative Fitness Evaluations . . . . . . . . . . . . . . . 20

2.4 Ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.5 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.6 Time Series Prediction . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.1 Weather Prediction . . . . . . . . . . . . . . . . . . . . . . 23

2.7 Tropical Cyclones . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.7.1 Conventional Cyclone Forecasting . . . . . . . . . . . . . 24

2.7.2 Neural Network for Cyclones . . . . . . . . . . . . . . . . 25

2.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Identification of Minimal Timespan Problem for Cyclone Wind-Intensity Prediction 28

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


3.2 Problem Definition and Methodology . . . . . . . . . . . . . . . . 29

3.2.1 Problem Definition: Minimal Timespan Prediction Problem 29

3.2.2 Methodology: Recurrent Networks for Prediction . . . . . 31

3.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . 31

3.3.1 Data Preprocessing and Reconstruction . . . . . . . . . . 33

3.3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . 33

3.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Cyclone Track Prediction Using Coevolutionary Recurrent Neural Networks 40

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.2 RNN Architecture for Cyclone Tracks . . . . . . . . . . . . . . . 40

4.3 Simulation and Analysis . . . . . . . . . . . . . . . . . . . . . . . 41

4.3.1 Data Preprocessing and Reconstruction . . . . . . . . . . 41

4.3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . 44

4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 44

4.4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5 Stacked Transfer Learning for Tropical Cyclone Intensity Prediction 50

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2.1 Neural networks for time series prediction . . . . . . . . . 50

5.2.2 Stacked transfer learning . . . . . . . . . . . . . . . . . . . 51

5.2.3 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 52

5.3.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . 53

5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6 Conclusions and Future Work 58

6.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . 59

Appendix 60

Bibliography 61


List of Figures

2.1 Feed forward neural network [1] . . . . . . . . . . . . . . . . . . . 6

2.2 Elman style RNN [1] . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Neuron level problem decomposition for recurrent neural networks 12

2.4 Transfer learning for one source and one target data . . . . . . . . 22

3.1 Cyclone wind intensity time series showing the relation between training and testing timespan. Training data uses embedding dimension 5 while the testing dataset uses embedding dimension 3 . . . . . . . . . . 29

3.2 Elman recurrent neural network trained with timespan W and tested with X, Y, Z for identifying the minimal timespan. . . . . . . . . . . 31

3.3 Elman recurrent neural network used for tropical cyclone wind intensity prediction. Time series data is preprocessed and embedded using Takens' theorem and fed into the Elman RNN. . . . . . . . . . . . . 32

3.4 Unfolded view of RNN. . . . . . . . . . . . . . . . . . . . . . . . . 34

3.5 Performance of CNE and BPTT in wind intensity prediction in the testing dataset (2006-2013) for tropical cyclones in the South Pacific. . 35

3.6 Performance of CNE for a single experimental run . . . . . . . . . . 38

4.1 Elman RNN used for prediction of cyclone latitude and longitude. Two input and output neurons are used for mapping the longitude and latitude [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Proposed RNN architecture: a single input and output neuron Elman recurrent neural network (SIORNN) used for predicting latitude and longitude of the cyclone path. . . . . . . . . . . . . . . . . . . . . . 42

4.3 Embedded data reconstructed using Takens' theorem. Embedding dimension (D) of 4 is depicted. Both (a) and (b) have the two dimensions longitude and latitude. . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4 Tropical cyclone track data in the South Pacific from 1985 to 2013. (Generated using Gnuplot) . . . . . . . . . . . . . . . . . . . . . . . 44

4.5 Performance of SIORNN using cooperative coevolution for 6 random cyclones from the year 2006 to 2013. . . . . . . . . . . . . . . . . . 45

4.6 Typical prediction performance of a single experiment (one-step-ahead prediction) given by BPTT-SIORNN for the cyclone track test dataset (2006-2013 tropical cyclones) where time is taken at six-hour intervals. 46

5.1 Neural Network Ensemble Model . . . . . . . . . . . . . . . . . . 51


List of Tables

3.1 Best Performance of cooperative coevolution . . . . . . . . . . . . 36

4.1 Generalization performance of training models on cyclone track prediction for Configuration A . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Generalization performance of training models on cyclone track prediction for Configuration B . . . . . . . . . . . . . . . . . . . . . . . 47

5.1 Generalization performance . . . . . . . . . . . . . . . . . . . . . 54

5.2 Experiment 3: Performance of FNN on different categories of cyclones 54

5.3 Experiment 5: Performance of Vanilla FNN on independent decade training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


To my parents, Bram Deo and Antala Devi.


Chapter 1

Introduction

1.1 Premises

Neural networks are nature-inspired computational methods that try to model

biological neural systems [3]. Neural networks are categorized into feedforward

and recurrent architectures. Feedforward networks provide a static mapping from

input to output. In contrast to feedforward networks, recurrent

neural networks are dynamical systems whose next state and output(s) depend

on the present network state and input(s). Recurrent neural networks (RNNs)

due to their architecture, are well-suited for modeling temporal sequences [1].

Gradient descent has been customarily used for training neural networks. Feed-

forward networks use backpropagation whilst a slight variation of the algorithm,

backpropagation-through-time [4] is used for recurrent networks. Neuro-evolution

has also been used to train neural networks [5]. Neural network methods have

been shown to be robust for time series problems [6]. Amongst popular com-

putational intelligence methods, evolutionary neural networks have shown good

potential for time series prediction [5, 7].

Transfer learning utilizes knowledge learned previously from related problems

into learning models in order to have faster training or better generalization per-

formance [8]. Transfer learning incorporates knowledge from a related problem

(also known as the source) to complement a target problem, especially in cases where

there is a lack of data or where there is a requirement to speed up the learn-

ing process. The approach has seen widespread application, with challenges on

what type of knowledge should be transferred in order to avoid negative transfer,

whereby the transferred knowledge deteriorates performance [9–11]. Transfer learning

has recently been used for visual tracking and computer-aided detection [12, 13].

Transfer learning has been implemented with ensemble learning methods such

as boosting and stacking [9]. The approach in the case of transfer stacking is

implementing multiple learners previously trained on the source data in order to

form a single base learner. Simple stacking, on the other hand, uses multiple base

learners [9].
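The distinction can be sketched with a deliberately small example: a base learner is first fit on plentiful source data, and a meta-learner is then fit on the target data using the source learner's predictions as its input. The linear learners, data, and function names below are illustrative assumptions, not the thesis's implementation.

```python
# Toy sketch of transfer stacking (illustrative, not the thesis implementation).
# A base learner trained on SOURCE data is combined, via a meta-learner,
# with the scarce TARGET data.

def fit_linear(xs, ys):
    """Closed-form least-squares fit of y = a*x + b (one-dimensional)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]

# Plentiful source data and scarce target data from related processes.
src_x = [float(i) for i in range(20)]
src_y = [2.0 * x + 1.0 for x in src_x]      # source relation
tgt_x = [0.0, 1.0, 2.0, 3.0]
tgt_y = [2.1 * x + 0.9 for x in tgt_x]      # similar target relation

source_model = fit_linear(src_x, src_y)     # base learner (source knowledge)

# Meta-learner: fit the target outputs as a function of the source model's
# predictions on the target inputs (the source learner acts as the base).
meta_model = fit_linear(predict(source_model, tgt_x), tgt_y)

def transfer_stack_predict(x):
    return predict(meta_model, predict(source_model, [x]))[0]

print(round(transfer_stack_predict(5.0), 2))
```

Here the source knowledge supplies the overall trend, while the meta-learner corrects it toward the few target observations.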


Tropical cyclones have aroused much attention due to their destructive nature

[14]. Statistical models have been previously used to forecast the movement and

intensity of the cyclones. There has been a growing interest in computational

intelligence techniques for cyclone prediction systems [15, 16]. Cyclone path and

wind intensity prediction is a multidimensional time series problem.

The forecast of the intensity and track of tropical cyclones is considered to be

extremely important for avoiding casualties and mitigating damage to properties

[17, 18]. A comprehensive review of tropical cyclone track forecasting techniques

can be found in [19]. Computational intelligence methods have established them-

selves as complementary approaches for tropical cyclone tracking and prediction

[20–22]. Amongst these methods, neural networks trained using an evolutionary

learning paradigm have shown great promise [2, 5, 7, 23].

In the past, cyclone wind-intensity [24] and track prediction [2] have been tackled

by cooperative neuro-evolution of recurrent neural networks. Track prediction

was tackled as a two-dimensional time series problem where the latitude and lon-

gitude of the cyclone tracks were involved [2]. The results have been promising for

cyclones in the South Pacific region; however, there is room for further improve-

ments in the accuracy of the predictions. Cyclone intensity and track prediction

need to be made as soon as possible when a cyclone is identified. It is impor-

tant to identify if a prediction model can work with the shortest duration after

which time series prediction can begin, that is, if the cyclone data is recorded

every 6 hours, an important issue is the minimal timespan required to make a

prediction. The timespan is a windowed snapshot, taken at regular intervals, of

the observation period of a time series [25]. The minimal timespan is an

important factor when it comes to predicting the nature of cyclones in terms of

track and wind intensity. Robust predictions can be vital in reducing the impact

of the calamity of cyclones through efficient planning and management.
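The idea of a timespan can be sketched with a simple windowing routine: with a timespan of T observations, the first input pattern, and hence the first prediction, becomes available only after T readings have been recorded. The wind-intensity values, function name, and window size below are illustrative assumptions, not the thesis's code.

```python
# Sliding-window reconstruction: with a timespan (window) of T observations,
# the first input pattern, and hence the first prediction, is only possible
# once T data points have been recorded (e.g. T six-hourly readings).

def windowed_patterns(series, timespan):
    """Return (input_window, next_value) pairs for one-step-ahead prediction."""
    return [
        (series[i:i + timespan], series[i + timespan])
        for i in range(len(series) - timespan)
    ]

wind_intensity = [35, 40, 45, 55, 60, 65, 60]   # illustrative knots, 6-hourly
pairs = windowed_patterns(wind_intensity, timespan=3)

print(pairs[0])    # first usable pattern: a window of 3 readings -> 4th reading
print(len(pairs))
```

A smaller timespan lets prediction start earlier in a cyclone's life, at the possible cost of giving the model less context per pattern.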

1.1.1 Motivations

Elman recurrent neural networks and feed forward neural networks have been

effectively used for time series predictions. More recently, the Elman recurrent

neural network has shown good performance in predicting tropical cyclone wind

intensity and tracks for South Pacific tropical cyclones. Due to the catastrophic


nature of cyclones, predictions on these cyclones are very time sensitive. Thus,

the minimum timespan needed to start making accurate predictions on cyclones

needs to be identified.

Cyclone track data is modeled as a two dimensional time series problem due to

the presence of two data streams: latitude and longitude. However, the track of

a cyclone is essentially a single entity which can be modeled in a single

dimension. Therefore, an encoding scheme is proposed which allows the

two-dimensional track to be modeled as a single data stream.

In order to develop robust prediction models, one needs to consider different char-

acteristics of cyclones in terms of spatial and temporal characteristics. Transfer

learning can be used as a strategy to evaluate the relationship of cyclones from

different geographic regions. Note that a negative transfer is considered when the

source knowledge, utilized with the target data, contributes to poor generaliza-

tion. This can be helpful in evaluating whether cyclones from a particular region

are helpful as source knowledge for decision making by models in other

geographic locations. Moreover, it is also important to evaluate the effect of the

duration of a cyclone on the generalization performance given by the model.

1.2 Research Goals

The main research goal of this thesis is to use neural network methodologies to

predict cyclone path and wind intensity.

This thesis is based on the following research objectives:

1. Identify the minimal timespan required to start making accurate and mean-

ingful predictions with applications to tropical cyclone wind intensity fore-

casting.

2. Propose a novel architecture for encoding the two-dimensional time series

path prediction problem in an Elman recurrent neural network.

3. Evaluate the performance of the standard neural network when trained with

different regions of cyclone data.

4. Evaluate the effects of duration of the cyclones and their contribution to-

wards the neural network's generalization performance.


5. Use transfer learning via a stacked ensemble, considering one region's data

as the source and another's as the target.

1.3 Thesis Outline

The outline of the thesis is as follows:

• Chapter 1 outlines the motivations, research goals and methodology of the thesis.

• Chapter 2 discusses the general background on neural networks, training algorithms, evolutionary algorithms, ensemble learning, transfer learning, and tropical cyclone intensity and path forecasting.

• Chapter 3 suggests the optimal timespan required for making predictions of tropical cyclone wind intensity.

• Chapter 4 proposes a framework for encoding two-dimensional cyclone track data into Elman-style recurrent neural networks for efficient track forecasting.

• Chapter 5 uses a transfer learning with feedforward neural network based ensemble stacking model to predict cyclone wind intensity.

• Chapter 6 concludes the thesis with an overview of the results and analysis from Chapters 3 - 5 and discusses future research.


Chapter 2

Background and Literature Review

This chapter describes the general background on neural networks, training

algorithms for neural networks, evolutionary algorithms, ensemble learning and

transfer learning. The chapter also covers tropical cyclone intensity and

path prediction.

2.1 Neural Networks

Neural networks (NNs) are nature-inspired computational methods which try to

model biological neural systems [3]. A network comprises a group of units re-

ferred to as neurons that are logically interconnected to form a network. A neuron

is a single processing unit that computes the weighted sum of its inputs. Inter-

connections between neurons are called synapses which have weights associated

with them.

These neural networks are used for modeling the relationships between input and

output values in the data [26]. They learn by training on data using algorithms

that update the weights of the synapses in order to achieve the learning objective.

The knowledge gained in training is held in a distributed set of weights. The net-

works learn under one of three paradigms: supervised, unsupervised and

reinforcement learning. Supervised learning uses direct comparisons between desired

and actual outputs [27] and is formulated using sum squared error. Unsupervised

learning uses the correlation between the inputs in training as there is no informa-

tion on the actual output [27]. Reinforcement learning is a special form of supervised

learning wherein the exact expected output is not known [27] and as such it is

based on the correctness of the actual output. There are multiple applications

of neural networks including but not restricted to pattern recognition [28, 29],

control problems [30, 31] and time series prediction [32–34].

The two most commonly used architectures of neural networks are feedforward

and recurrent neural networks [35]. The neurons in the network are further grouped

into three layers: input, hidden and output layer neurons. The number of neurons

per layer and the number of hidden layers are variable.


Figure 2.1: Feed forward neural network [1]

2.1.1 Feedforward Neural Networks

Feedforward networks are static structures that consist of an input layer, a num-

ber of hidden layers and an output layer [36]. The number of neurons present

in each layer is dependent on the type of data being modeled. Additional hidden

layers are also added for different problem domains. However, once the network is

initialized, the size of the network is left constant throughout its use. Figure 2.1

shows a feedforward neural network with the three layer architecture. The con-

nections between the different layers of the network help propagate the weighted

sum of the activations from each neuron, which later goes through a transfer func-

tion. The dynamics of a feedforward network are described in Equation 2.1, where

the total net input activation value y_i of neuron i is given for N input connections.

y_i = \sum_{j=1}^{N} w_{ij} x_j + \delta_i \qquad (2.1)

where w_{ij} is the weight from the input signal x_j to neuron i, and \delta_i

represents the bias. The input activation y_i is passed through the transfer function

f(y_i), which computes the final output of the unit. The sigmoid transfer function

is given in Equation 2.2.

f(y_i) = \frac{1}{1 + e^{-y_i}} \qquad (2.2)
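Equations 2.1 and 2.2 translate directly into code; the weights, inputs and bias below are arbitrary illustrative values.

```python
import math

def neuron_output(weights, inputs, bias):
    """Eq. 2.1 followed by Eq. 2.2: weighted sum plus bias, then sigmoid."""
    y = sum(w * x for w, x in zip(weights, inputs)) + bias   # Eq. 2.1
    return 1.0 / (1.0 + math.exp(-y))                        # Eq. 2.2

# Arbitrary example: two inputs into a single neuron.
print(neuron_output([0.5, -0.25], [1.0, 2.0], bias=0.0))     # net input 0 -> 0.5
```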


2.1.2 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are dynamical systems as opposed to feedfor-

ward networks. RNNs have multiple intermediate states whereby the next state

and output depend on the current state and inputs. This makes them highly

successful in time series prediction, pattern classification, language learning and

control [32–34] as they are highly capable of modeling dynamical systems. Re-

current neural networks are further subdivided due to their various architectures

[37]. The basic architectures of recurrent neural networks include: first-order

recurrent networks, second-order recurrent networks, NARX recurrent networks,

long short term memory networks and reservoir computing.

First-order recurrent neural network was proposed by Elman and Zipser [1] and

was referred to as the Elman recurrent neural network. This thesis uses this RNN

architecture; therefore, it is discussed in detail in the next subsection. Second-order

recurrent neural networks [38] have been seen to perform better than first-order

for modeling finite state behavior. The limitation of second-order recurrent neural

network is that it requires a lot of computational resources for training. Their

architecture is such that there are a higher number of weight connections per

hidden neuron when compared to first-order networks [39]. NARX recurrent

neural networks are based on non-linear autoregressive models with exogenous

inputs [40]. They have limited feedback that only comes from the output layer.

NARX networks are as powerful as fully connected recurrent networks due to

their information retention capabilities [41]. Long short term memory networks

(LSTM) [42] were specifically designed to effectively learn long-term dependencies

in data. Reservoir computation, [43] creates a recurrent neural network referred

to as a reservoir. The reservoir is randomly created and the internal weights

of the reservoir remains unchanged during the training. The weights from the

reservoir to the output neurons are updated in learning. Common approaches

to reservoir computing are Liquid State Machines (LSM) [44] and Echo State

Networks (ESN) [43, 45].

2.1.3 Elman Neural Networks

First-order recurrent neural networks have context neurons in addition to having

the input, hidden and output neurons.

Figure 2.2: Elman style RNN [1]

The context neurons are connected to the hidden neurons. They take input from the hidden-layer neurons and then recurse to feed it back to the same hidden neurons at the next time step. The

recurrent neural network is a dynamic structure that grows as it unfolds in time.

Figure 2.2 shows an Elman recurrent network in its initial unfolded state. The

computational capabilities of Elman recurrent neural networks have been studied in [46].

The study showed that these types of networks are able to represent any finite-

state machine due to their dynamical properties. Note that the basic components

of an observed dynamical system are clearly represented in an Elman network:

the input stands for the control of the system, the contextual hidden layer stands

for the state of the system and the outputs stand for the measurement [47]. The

network is able to develop representations of unobservable states of a dynamical

system in the hidden layer through learning.

The change of the hidden state neurons’ activation in Elman style recurrent net-

works [1] is given by Equation (2.3).

y_i(t) = f\left( \sum_{k=1}^{K} v_{ik}\, y_k(t-1) + \sum_{j=1}^{J} w_{ij}\, x_j(t-1) \right) \qquad (2.3)

where y_k(t) and x_j(t) represent the output of the context state neurons and input

neurons at time step t, v_{ik} and w_{ij} represent their corresponding weights, and f(·)

is the sigmoid transfer function.
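Equation (2.3) for a single hidden neuron can be sketched as follows, with arbitrary illustrative weights and a context state carried over from the previous time step.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def elman_hidden_state(v, context, w, inputs):
    """Eq. (2.3): hidden activation from the previous context state and inputs."""
    y = sum(vk * ck for vk, ck in zip(v, context)) + \
        sum(wj * xj for wj, xj in zip(w, inputs))
    return sigmoid(y)

# One time step with arbitrary weights: context from t-1 plus the current input.
h = elman_hidden_state(v=[0.4, -0.2], context=[0.5, 0.5],
                       w=[0.3], inputs=[1.0])
print(round(h, 4))
```

In the full network this value is copied back into the context layer, becoming part of the state for the next time step.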

2.2 Training in Neural Networks

Neural networks are mathematically represented as a sum of their weight vectors.

These representations form objective functions that can then be optimized by

learning algorithms. Gradient descent is by far the most widely used approach

for training neural networks. Evolutionary algorithms have also been successfully

used for training recurrent neural networks [5, 48]. Supervised learning is used

by both gradient descent and evolutionary algorithms, whereby the data used in

training the recurrent neural network contains both the input and actual output

data. The performance of the two methods is measured using the root mean

squared error (RMSE) and the mean absolute error (MAE), as given in Equations

2.4 and 2.5, respectively.

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 } \qquad (2.4)

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \qquad (2.5)

where y_i and \hat{y}_i are the observed and predicted values, respectively, and N is the length

of the observed data.
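Both error measures translate directly into code; the observed and predicted values below are arbitrary illustrative numbers.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, Eq. 2.4."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error, Eq. 2.5."""
    n = len(observed)
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / n

obs  = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]

print(rmse(obs, pred))   # about 0.3536
print(mae(obs, pred))    # -> 0.25
```

RMSE penalizes large deviations more heavily than MAE, which is why the two are often reported together.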

2.2.1 Backpropagation Based Learning

The backpropagation (BP) algorithm uses gradient descent for training feedfor-

ward networks [36]. A delta-learning based approach is used, whereby

the main principle is to search the hypothesis space of weight vectors and find the

weights that best fit the training dataset. The goal is to model the data as well as

possible by updating the weights. BP works in two passes: a forward and a backward

pass. The input is passed to the network from the input neurons which then prop-

agates the activations through the hidden layers all the way to the output layer.


The output from the network is mapped to the actual output and a cost function

is used to calculate the network error. Equation 2.6 gives the sum-squared-error

(SSE) cost function.

SSE = \frac{1}{2} \sum_{k=1}^{n} ( y_k - t_k )^2   (2.6)

where n is the number of neurons in the output layer, y_k is the actual output and t_k is the desired or target output of the respective neurons in the output layer.

The backward pass uses gradient descent, where the gradient \partial SSE / \partial w of each weight w in the network is computed by propagating the error backwards through

the network. The two passes are completed for all the data points in the training

data. A complete cycle through the training data is called an epoch. The network

is trained through multiple epochs.
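As a minimal illustration of the two passes, the following sketch performs one forward pass through a single sigmoid output neuron, computes the SSE of Equation 2.6 for one sample, and takes one gradient descent step. The inputs, weights, target, and learning rate are all hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.3])   # inputs (hypothetical)
w = np.array([0.1, 0.2])    # initial weights
t = 1.0                     # desired (target) output
lr = 0.5                    # learning rate

y = sigmoid(w @ x)          # forward pass
err = 0.5 * (y - t) ** 2    # SSE of Equation 2.6 for one sample

# backward pass: dSSE/dw via the chain rule through the sigmoid
grad = (y - t) * y * (1 - y) * x
w_new = w - lr * grad       # one gradient descent step

y_after = sigmoid(w_new @ x)
```

After the update, the output moves towards the target; repeating the two passes over all samples for many epochs is exactly the training loop described above.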

2.2.2 Backpropagation-Through-Time Based Learning

Backpropagation-through-time (BPTT) [4] and real-time recurrent learning [4]

are the two basic gradient descent based algorithms used in training recurrent

neural networks. BPTT unfolds a recurrent neural network in time into

a deep multilayer feedforward network and employs the error back-propagation

for weight update. When unfolded in time, the network has the same behavior

as a recurrent neural network for a finite number of time steps. Algorithm 1

shows the BPTT algorithm which was used to train the Elman recurrent neural

network.

The algorithm starts by initializing the weight vectors of the recurrent neural

network with random numbers. Due to the use of supervised learning, with every

epoch, all data samples (sets of inputs and corresponding outputs) are fed

into the network. The transformation of the input data into an output constitutes the forward pass. The network output is compared with the actual

output from the dataset. The distance between the two outputs is calculated and

is referred to as the prediction error through which the error gradient is calculated

and used to update the weights of the synapses as errors are back propagated


Algorithm 1: Backpropagation Through-Time for Training Elman RNNs

Step 1: Prepare the training and testing datasets
Step 2: Initialize the RNN weights with small random numbers in the range [-0.5, 0.5]
for each epoch until termination do
    for each sample do
        for n time-steps do
            Forward propagate
        end for
        for n time-steps do
            i) Backpropagate errors using gradient descent
            ii) Update weights
        end for
    end for
end for

through the entire network one layer at a time. The algorithm terminates when

a fixed number of epochs or cycles has been reached.

A limitation of the BPTT algorithm is that it is unable to learn sequences with long-term dependencies [49, 50]: over long time lags the error gradient tends towards zero, leaving the weight updates negligible.
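The vanishing-gradient effect can be sketched numerically. Assuming, hypothetically, the worst-case sigmoid derivative of 0.25 at every unrolled time step, the backpropagated error signal shrinks geometrically with depth:

```python
# Backpropagating through T unrolled time steps multiplies the error
# signal by the local derivative at each step. For the sigmoid, that
# derivative is at most 0.25, so the gradient decays geometrically.
grad = 1.0
for t in range(20):   # 20 unrolled time steps
    grad *= 0.25      # worst-case sigmoid derivative per step
print(grad)           # about 9.1e-13: effectively no weight update
```

Even twenty steps are enough to shrink the gradient by twelve orders of magnitude, which is why BPTT struggles with long-term dependencies.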

2.2.3 Coevolutionary Neuro-Evolution based Learning

Coevolution-based training of recurrent neural networks has shown considerable promise [5, 48].

The general cooperative coevolutionary method for training Elman recurrent neu-

ral networks is given in Algorithm 2. The recurrent neural network is decomposed

into k subcomponents using the neural level problem decomposition method [51],

where k is equal to the total number of hidden, context and output neurons. Each

subcomponent contains all the weight links from the previous layer connecting

to a particular neuron. Each hidden neuron also acts as a reference point for

the recurrent (state or context) weight links connected to it. Therefore, the sub-

components for a recurrent network with a single hidden layer are composed as

follows:


Figure 2.3: Neuron level problem decomposition for recurrent neural networks

1. Hidden layer subcomponents: weight links from each neuron in the hidden(t)

layer connected to all input(t) neurons and the bias of hidden(t), where t

is time.

2. State (recurrent) neuron subcomponents: weight links from each neuron in

the hidden(t) layer connected to all hidden neurons in previous time step

hidden(t− 1).

3. Output layer subcomponents: weight links from each neuron in the output(t)

layer connected to all hidden(t) neurons and the bias of output(t).

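As a rough sketch of this neuron-level decomposition, the number and sizes of the subcomponents can be counted as follows (the layer sizes are hypothetical):

```python
# Neuron-level problem decomposition: each hidden, context (state), and
# output neuron contributes one subcomponent holding its incoming weight
# links (plus a bias where applicable).
J, K, O = 4, 3, 1  # input, hidden, and output neuron counts (hypothetical)

subcomponents = []
# 1. Hidden layer subcomponents: J input weights + 1 bias per hidden neuron
for _ in range(K):
    subcomponents.append(J + 1)
# 2. State (recurrent) subcomponents: K recurrent weights per hidden neuron
for _ in range(K):
    subcomponents.append(K)
# 3. Output layer subcomponents: K hidden weights + 1 bias per output neuron
for _ in range(O):
    subcomponents.append(K + 1)

print(len(subcomponents))  # k = 2K + O = 7 sub-populations
print(sum(subcomponents))  # total weights: K(J+1) + K*K + O(K+1) = 28
```

With these sizes, k = 7 subcomponents cover all 28 weights exactly once, matching the statement that k equals the total number of hidden, context, and output neurons.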

The subcomponents are implemented as sub-populations that employ the generalized generation gap genetic algorithm with the parent-centric crossover operator [52]. A cycle is completed when all the sub-populations are evolved for a fixed

number of generations. In past work, it is seen that a search depth of one gener-

ation is suitable for this evolutionary process [51]. The algorithm halts when the

termination condition is satisfied: either a specified fitness has been achieved as

measured by the root mean squared error on the training dataset or the maximum

number of function evaluations has been reached.


Algorithm 2: Cooperative Coevolutionary Training of Elman Recurrent Networks

Step 1: Decompose the problem into k subcomponents according to the number of hidden, state, and output neurons
Step 2: Encode each subcomponent in a sub-population in the following order:
    i) Hidden layer sub-populations
    ii) State (recurrent) neuron sub-populations
    iii) Output layer sub-populations
Step 3: Initialize and cooperatively evaluate each sub-population
for each cycle until termination do
    for each sub-population do
        for n generations do
            i) Select and create new offspring
            ii) Cooperatively evaluate the new offspring
            iii) Add the new offspring to the sub-population
        end for
    end for
end for

There are other decompositions of recurrent neural networks which include synapse

level decomposition and network level decomposition, however the neuron level

decomposition was found to outperform its competitors [2, 24]. Figure 2.3 shows

the neuron level decomposition of a recurrent neural network. Further details of the cooperative coevolutionary architecture are given in the next section, which discusses evolutionary computation.

2.3 Evolutionary Computation

Evolutionary Computation focuses on solving mathematical problems that are not tractable for traditional mathematical methods. It is inspired by the process of biological evolution and links to Darwin's theory of evo-

lution wherein it preserves the notion of “survival of the fittest” [53] and mimics

natural processes such as reproduction, selection, mutation, and recombination

[54].

Evolutionary Algorithms (EAs) are used for mathematical optimization [55], whereby the goal is the maximization or minimization of one or more objective functions.


The candidate solutions to the problem being optimized are placed into a pop-

ulation and evolved over multiple generations to solve the problem. EAs have been successful as genetic optimizers owing to their simple design and their ability to work without prior knowledge of the problem being solved [56, 57]. Evolutionary algorithms have therefore been frequently used for black-box optimization, job-shop scheduling, and multiobjective optimization [57–59].

However, a major limitation of evolutionary algorithms is that they do not scale well to higher dimensions. They suffer from the “curse of dimensionality” [60], whereby the performance of EAs deteriorates as the number of dimensions increases in large-scale optimization. Evolutionary computation is

mainly associated with swarm intelligence and evolutionary algorithms such as

genetic algorithms.

2.3.1 Genetic Algorithms

In the field of evolutionary computation, Genetic Algorithms have been widely

used as optimizers. The genetic algorithm was initially used with binary encoded

representations of candidate solutions of optimization problems. These candidate

solutions were classified as individuals and a population of such individuals would

be created. The individuals have their own genetic material or chromosomes

with different traits. The population is evaluated based on some fitness function

identifying the fittest individuals with the best chromosomes. These individuals

are selected to be parents so that after evolution, the best chromosomes are passed

onto the next generation. In the course of evolution, new offspring, or children, are created from the candidate solutions classified as parents. Various genetic operators such as selection, mutation and crossover are utilized in order to facilitate mating and evolution. Evolution is carried out over multiple

generations in order to attain fitter individuals.

Real Coded Genetic Algorithms (RCGAs) were the successors to binary-encoded GAs. RCGAs use real numbers rather than binary values to encode candidate solutions [61]. Real-valued encoding enables a more natural

representation of individuals thus eliciting the potential of RCGA to be utilized


in tackling real-world problems. Real coded genetic algorithms differ from one

another in the types of selection and crossover they employ.

Some of these selection strategies include rank selection [62], tournament selection [63], roulette wheel selection [63] and the elitist strategy [64]. Rank selection ranks all individuals by fitness and selects the fittest individuals for mating [62]. Tournament selection randomly partitions the population into small groups or pools (tournaments); fitness evaluations take place within each pool, which nominates its winner, the fittest individual, as a parent. Roulette wheel selection assigns selection priority in proportion to fitness, so fitter individuals are given more priority than weaker ones. A spinning-wheel mechanism is used whereby N parents are randomly picked from the population; higher-priority individuals have a higher chance of being selected as they occupy a larger share of the wheel [63], although weaker individuals are also chosen occasionally as they may contain valuable genes. In elitist selection, some of the strongest chromosomes are always retained in the population to ensure that the best individuals remain present [62].
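A minimal sketch of fitness-proportionate (roulette wheel) selection, with hypothetical individuals and fitness values, might look like this:

```python
import random

def roulette_select(population, fitnesses, n, rng=random):
    """Pick n parents with probability proportional to fitness.

    An illustrative sketch: each individual gets a slice of the wheel
    proportional to its fitness, and each spin lands in one slice.
    """
    total = sum(fitnesses)
    wheel = []
    acc = 0.0
    for f in fitnesses:
        acc += f / total
        wheel.append(acc)      # cumulative slice boundaries
    wheel[-1] = 1.0            # guard against floating-point round-off
    parents = []
    for _ in range(n):
        spin = rng.random()
        for ind, boundary in zip(population, wheel):
            if spin <= boundary:
                parents.append(ind)
                break
    return parents

random.seed(1)
pop = ["a", "b", "c", "d"]
fit = [10.0, 1.0, 1.0, 1.0]    # "a" occupies ~77% of the wheel
picks = roulette_select(pop, fit, 1000)
print(picks.count("a"))        # roughly 770 of 1000 spins
```

The fittest individual dominates the picks, yet the weaker individuals are still selected occasionally, which is the behavior described above.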

Upon identification of the parents, reproduction is performed using one of various crossover operators. Crossover is a reproduction technique that takes two parent chromosomes and produces two child chromosomes [65]. Common crossovers for real coded genetic algorithms include simulated binary crossover (SBX) [66], which simulates the behavior of single-point crossover on binary strings within a continuous search space and assigns a greater probability for offspring to remain close to the selected parents. Unimodal normal distribution crossover (UNDX) [67] selects multiple parent chromosomes and creates offspring around the center of mass of the selected parents. The simplex crossover (SPX) [68] selects parents that mark out a search space, and the children are generated within that restricted search space.

SPX has been found to outperform UNDX on test problems when used with the minimum generation gap genetic algorithm. This thesis uses parent-centric

crossover (PCX) which has outperformed other crossovers on unimodal problems

[52]. Other crossovers such as Blend crossover (BLX) [52] and Wright’s heuristic

crossover [65] have also shown improved performance on a set of optimization

functions.


2.3.2 Generalized Generation Gap Parent Centric Crossover

Generalized Generation Gap Parent Centric Crossover (G3-PCX) is a genetic optimizer based on the concept of sexual reproduction in animals. The Generalized Generation Gap (G3) model differs from standard genetic algorithms in the way it selects parents for mating: it restricts mating to a selected number of parents instead of mating between all individuals of the population. In each round of evolution, a sub-population of parents and children is created and evaluated instead of evaluating the whole population.

The Parent Centric Crossover (PCX) is used during mating to create the children. PCX uses the orthogonal distances between the parents. The parents consist of male and female components, where the female parent points to the search areas and the male parent determines the extent of the search in the areas indicated by the female.

Algorithm 3 describes the basic process used by G3-PCX. The algorithm starts by setting the number of parents and children that will take part in evolution. The population is created, randomly initialized, and all individuals are evaluated. The best individual and α − 1 other randomly chosen individuals are selected as parents. These parents then mate using PCX to create β children. Two parents are randomly chosen from the original population and combined with the children to form a separate sub-population. The new sub-population is then evaluated, and the best individuals replace the chosen parents in the original population.

Algorithm 3: G3-PCX Evolutionary Algorithm [52]

Set the number of parents (α) and children (β)
Initialize and evaluate all individuals in the population
Set up the sub-population that will contain parents and children
while not optimal solution do
    1) Select the best parent and α − 1 parents randomly from the population
    2) Create β children from the α parents using PCX
    3) Choose two parents at random from the population
    4) From the combined sub-population of the two chosen parents and the β created children, choose the two best individuals and replace the two parents chosen in Step 3 with these solutions
end while

The PCX crossover was able to outperform the simplex crossover (SPX) and

simulated binary crossover (SBX) using the generalized generation gap model [52].


The performance of the self-adapting parent-centric crossover in terms of opti-

mization time was appealing when problems were scaled up to higher dimensions

[52].

The G3-PCX algorithm requires a large population size in order to perform reli-

ably. This was identified as a limitation in a study where a population size of 90

was needed to effectively solve a two-dimensional problem [69].

2.3.3 Cooperative Coevolution

The performance of evolutionary algorithms deteriorates as the dimensionality of the problem being optimized increases [70, 71]. Research has therefore focused on adapting and evolving these algorithms [53], and coevolutionary algorithms have attracted considerable interest from researchers [72].

Cooperative Coevolution (CC) is an evolutionary computation method that solves a large problem by dividing it into smaller subcomponents [72]. By breaking down the complexities of a problem, CC simplifies the problem as a whole, which contributes to its success [48, 72, 73].

Cooperative coevolution has been applied in many research disciplines. CC has proven very promising in solving real-parameter global optimization problems [72] as well as large-scale optimization [74–76]. Co-

evolution has also been utilized in neuro-evolution for time series prediction and

pattern classification [5, 28, 30, 48, 77].

Algorithm 4: The General Cooperative Coevolution Algorithm

1) Decompose the problem into k subcomponents
2) Initialize and cooperatively evaluate each subcomponent represented as a sub-population
while termination criterion not met do
    for each sub-population do
        for n generations do
            i) Select and build new individuals
            ii) Cooperatively evaluate the new individuals
            iii) Update the sub-population
        end for
    end for
end while


Algorithm 4 depicts the basic cooperative coevolution framework. Decomposition

of the problem creates the subcomponents. The size and number of subcompo-

nents depend on the type of the optimization problem. Each subcomponent is

represented as a sub-population and assumes the characteristics of a species in na-

ture. Individuals are added to the sub-populations using random values and all

the species are then cooperatively evaluated. In CC, the sub-populations are

evolved in isolation and the only cooperation takes place during fitness evalua-

tion for the respective individuals in each sub-population. The evolution phase

involves evolving all the sub-populations in a round-robin fashion for a depth of

n generations. A CC cycle is completed when all the sub-populations have been

evolved for n generations. The algorithm terminates when the maximum number

of cycles have been reached or the minimum error is achieved.

Different variants of cooperative coevolution have been utilized where the issue of

problem decomposition and separability has been central [56]. The performance

of CC algorithms on high dimensional problems could be significantly enhanced

by incorporating more advanced EAs. Fast evolutionary programming in the CC

framework (FEPCC) was initially used to solve large-scale optimization problems

of up to 1000 dimensions [78]. FEPCC used the cooperative coevolutionary genetic algorithm (CCGA) framework designed by [72]. CCGA does not cater for variable interactions within subcomponents; therefore, FEPCC performed poorly on non-separable functions. Cooperative coevolution based on particle swarm

optimization (CPSO) was later developed by [79]. CPSO decomposed the prob-

lem into m s-dimensional subcomponents where s is the number of variables in

a subcomponent. Differential grouping was used to optimize the subcomponents

in [80] where each subcomponent comprised half of the optimization problem.

The method was applied to problems of 100 dimensions only as the halved sub-

components were unable to cope with the higher dimensions [80].

Yang and Yao [56] proposed a cooperative co-evolutionary framework to ad-

dress high dimensional non-separable problems. They utilized random group-

ing and adaptive weighting to permit co-adaptation among subcomponents while

they are interdependent. The interacting and non-interacting variables are grouped into separate subcomponents heuristically. Their framework, DECC-G, performed fairly well on large-scale non-separable problems of 1000 dimensions. Amendments to DECC-G were proposed by [81]. The drawback of DECC-G was that the


competence of random grouping in capturing two interacting variables in a subcomponent is significantly reduced in the presence of more than two interacting variables

in the problem.

Omidvar et al. [82] later proposed a cooperative co-evolutionary differential evolution to address high-dimensional problems of up to 1000 dimensions, which

showed promising results. Their goal was to advance an earlier algorithm [83] by

employing principal component analysis to condense the dimensions of a problem

[84].

More recently, Chandra et al. presented an adaptive method known as compet-

itive island based cooperative coevolution (CICC)[85] where candidate solutions

were grouped into islands that compete and collaborate. The best individual

from the winning islands is injected into the losing island to ensure fair com-

petition in different phases of evolution for global optimization [85]. The same

method was earlier used for training Elman recurrent neural networks for time

series prediction [86] with promising results. Omidvar et al. [87] grouped dependent variables into subcomponents based on the differential grouping method and achieved improvements in the problem decomposition strategies employed.

2.3.4 Problem Decompositions

Decomposition is the process of transforming a large problem into smaller ones

referred to as subcomponents. It is quite difficult to effectively decompose a

problem without having prior knowledge about its internal structure. Divide-

and-conquer is generally used to decompose large scale complex problems in CC.

Cooperative coevolution is highly sensitive to the problem decomposition, which makes decomposing large-scale optimization problems into smaller subproblems a major challenge. To successfully utilize CC in optimization, interacting variables need to be grouped into the same subcomponent. There is no unique

decomposition for some classes of functions such as fully-separable, fully non-

separable or overlapping functions [88]. Fully-separable functions have no interacting variables; therefore, their variables can be grouped in any fashion and optimized independently, making any decomposition viable. Conversely, there is no unique decomposition for fully non-separable functions, as the interdependencies between all decision variables eliminate the possibility of independent optimization.


Placing dependent (interacting) variables into separate subcomponents significantly diminishes the optimization performance [72, 89, 90]. An ideal decomposition would comprise subcomponents whose parameters have minimal inter-dependencies [81], and obtaining such a decomposition is a major hurdle [90].

Various types of decomposition methods have been explored; the two major types are static and dynamic. In a static decomposition strategy, the problem decomposition remains fixed throughout the optimization process. Conversely, in a dynamic decomposition method, the problem decomposition changes, that is, the grouping is varied [91]. Dynamic decomposition methods that perform automatic variable interaction detection have been common solutions [75, 91–93].

2.3.5 Cooperative Fitness Evaluations

Cooperative coevolution works with multiple subcomponents, as the basis of CCEAs is to divide a large problem into smaller subcomponents. Therefore,

the variables present in the function being optimized are placed in separate sub-

components. In evolution, each sub-population is optimized independently; thus, initially, only a certain number of individuals are evolved within a sub-population

whilst individuals in other sub-populations still contain arbitrary genes as as-

signed in the initialization phase. The fitness function requires all variables in or-

der to start computation, hence, cooperation is used by the sub-populations. This

implies that in order to assign the fitness to one individual of the sub-population

being evolved, arbitrary individuals are chosen from the other sub-populations

as representatives to complete the objective function. In the context of neuro-evolution, the fitness function is defined by the neural network: the fitness of an individual is

the inverse of the performance error for that particular individual on the neural

network training data.
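A sketch of this cooperative evaluation scheme could look like the following, using a stand-in sphere objective and best-so-far representatives; the function names and values are hypothetical:

```python
import numpy as np

def sphere(x):
    # Stand-in objective to minimize (lower is better)
    return float(np.sum(np.asarray(x) ** 2))

def cooperative_fitness(individual, sub_index, representatives):
    """Score one individual of the sub-population being evolved.

    Representatives (here, the current best) from every other
    sub-population are borrowed to assemble a complete solution,
    which the objective function can then evaluate.
    """
    parts = list(representatives)   # one representative per subcomponent
    parts[sub_index] = individual   # slot in the candidate being evaluated
    full_solution = np.concatenate(parts)
    return sphere(full_solution)

reps = [np.array([0.5, 0.5]), np.array([1.0, 1.0])]  # current best per sub-pop
candidate = np.array([0.1, 0.1])                     # individual in sub-pop 0
print(cooperative_fitness(candidate, 0, reps))       # 0.01 + 0.01 + 1 + 1 = 2.02
```

The candidate is never evaluated in isolation: its score always reflects how well it cooperates with the representatives of the other subcomponents.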

2.4 Ensemble learning

Ensemble learning generally considers combination of multiple standalone learn-

ing methods to improve generalization performance when compared to standalone


approaches [94]. The basic concept of ensemble-based learning is that multiple learners working on the same problem can solve it more effectively and efficiently due to the added computational power. Ensemble learning has

been implemented with groups of neural networks that are trained as an ensemble with different parameter settings or initializations for executing the same task [95, 96]. Popular methods in ensemble learning involve stacking,

bagging and boosting [97–99]. Stacked generalization is a form of ensemble learning in which the predictions of the base learners are combined by a higher-level learner. The principle behind stacked generalization is that more learners improve performance due to the additional computational power [94]. In the literature, logistic regression has

been commonly used as the combiner layer for stacking [100]. Stacking has been

successfully used with both supervised and unsupervised learning [100, 101]. Re-

cently, the approach has been applied to modifying emission kinetics of colloidal

semiconductor nanoplatelets [102].
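A minimal sketch of stacked generalization on synthetic data, using a least-squares combiner over two hypothetical base learners, might look like this:

```python
import numpy as np

# Synthetic regression task (all data and "learners" are hypothetical)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3.0 * x + rng.normal(0, 0.05, 200)   # noisy target

pred_a = 2.5 * x                          # base learner A (underestimates)
pred_b = 3.6 * x                          # base learner B (overestimates)

# Combiner layer: least-squares weights over the base predictions
P = np.column_stack([pred_a, pred_b])
w, *_ = np.linalg.lstsq(P, y, rcond=None)
stacked = P @ w

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# On the training data, the stacked combination can be no worse than
# either base learner, since each base learner lies in the combiner's span.
print(rmse(stacked, y) <= min(rmse(pred_a, y), rmse(pred_b, y)))
```

The thesis notes that logistic regression is the common combiner in the literature; a plain least-squares combiner is used here only to keep the sketch self-contained.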

2.5 Transfer learning

Transfer learning utilizes knowledge learned previously from related problems in order to achieve faster training or better generalization performance [8]. Transfer learning incorporates knowledge from a related problem (known as the source) to complement a target problem, especially in cases where there is a lack of data or a need to speed up the learning process. Figure 2.4 shows the simple transfer learning approach. It

shows that large quantities of pre-existing knowledge can be used to enhance learning on the target data. The smaller target dataset may benefit from the knowledge previously learnt from a larger source dataset in a related context. In practical terms, transfer learning mirrors everyday learning, whereby learning to drive a car is helped by pre-existing knowledge of driving tractors.

The approach has seen widespread application, with open challenges concerning what type of knowledge should be transferred in order to avoid negative transfer, whereby the transferred knowledge deteriorates performance [9–11]. Transfer learning has recently been used for visual tracking and computer-aided detection [12, 13]. Transfer learn-

ing has been implemented with ensemble learning methods such as boosting and


Figure 2.4: Transfer learning for one source and one target data

stacking [9]. In the case of transfer stacking, multiple learners previously trained on the source data are employed to form a single base learner; simple stacking, on the other hand, uses multiple base learners [9].

2.6 Time Series Prediction

A time series is a sequence of observations of events recorded over a certain period of time.

Time series prediction involves using past and present time series data for making

future predictions [103, 104]. Research on time series is ongoing, with applications ranging from financial prediction [105, 106] to atmospheric weather

prediction [2, 15].

A hybrid Elman-NARX neural network together with embedding theorem [107]

was developed to solve chaotic time series problems. The hybrid model per-

formed exceptionally well on benchmark datasets due to its high accuracy at

capturing relationships in the data [107]. Gholipour et al. developed the locally linear neuro-fuzzy model with locally linear model tree learning (LLNF-LoLiMot) [108], which

showed good performance on benchmark data. The model generalized well, as its intuitive constructive implementation prevents overfitting and also makes it computationally efficient. The radial basis function network with orthogonal

least squares (RBF-OLS) [108] has been used for prediction of noisy data. It

gave competitive results when compared to LLNF-LoLiMot.


Predicting time series data using recurrent neural networks requires restructuring of the data. We used Takens' theorem [109] to reconstruct the time series data as state-space vectors. Given an observed time series x(t), an embedded phase space Y(t) = [x(t), x(t - T), ..., x(t - (D - 1)T)] can be generated, where T is the time delay, D is the embedding dimension, t = 0, 1, 2, ..., N - DT - 1, and N is the length of the original time series.
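A sketch of this state-space reconstruction; the function name and the toy series are illustrative:

```python
import numpy as np

def embed(series, D, T):
    """Build the state-space matrix Y(t) = [x(t), x(t-T), ..., x(t-(D-1)T)]
    from a scalar time series, per Takens' theorem as used above."""
    x = np.asarray(series)
    N = len(x)
    rows = []
    for t in range((D - 1) * T, N):        # earliest t with a full window
        rows.append([x[t - d * T] for d in range(D)])
    return np.array(rows)

x = np.arange(10, dtype=float)   # toy series 0, 1, ..., 9
Y = embed(x, D=3, T=2)
print(Y.shape)                   # (6, 3): six embedded vectors of dimension 3
print(Y[0])                      # [4., 2., 0.]: x(4), x(4-2), x(4-4)
```

Each row of Y becomes one input pattern for the recurrent network, with the next value of the series serving as the prediction target.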

2.6.1 Weather Prediction

Weather prediction involves time series prediction for natural phenomena such as rainfall, cyclones, tornadoes, wave surges and droughts [110, 111]. One needs to check how quickly the prediction model can make a decision when the event occurs. If the model is trained on the rainy-season months of a decade, the system should be able to make a robust prediction from the beginning of the rainy season. Methods for weather prediction were discussed in the preceding section.

2.7 Tropical Cyclones

A tropical cyclone is a non-frontal low pressure system with organized convection

that forms over warm tropical waters [112]. Once formed, a cyclone moves over the ocean away from the equator, lasting from a few days to sometimes 2-3 weeks [113]. During its lifetime, a cyclone can travel hundreds of kilome-

ters and the actual position of the cyclone’s eye recorded every six hours defines

the cyclone’s track. The forecast of a cyclone comprises cyclone track, intensity,

induced storm surges, rainfall and threat to coastal areas [18]. The direction of

cyclone movement and wind intensity are the most important features in the forecast, as they help inhabitants to prepare ahead of time and minimize damage to life and property. For this reason, forecasting cyclone track and intensity is considered among the most important forecast functions by scientists and meteorologi-

cal agencies around the world [17]. Various combination techniques in the track

and intensity prediction models are incorporated to account for the variation in

cyclone behavior in different ocean basins and achieve highest possible accuracy

and reliability [17, 19].


The official tropical cyclone guidance track and intensity forecast is assigned to designated regions around the world. For the South Pacific, the Fiji Meteorological Service (FMS) in Nadi, Fiji is a World Meteorological Organisation (WMO) recognised Regional Specialised Meteorological Centre (RSMC) responsible for the southwest Pacific Ocean 1. In addition to FMS, the Australian Bureau of Meteorology (BoM), a Tropical Cyclone Warning Centre (TCWC), is also responsible for the far southwest Pacific Ocean basin 2. While the U.S. Navy Joint Typhoon Warning Center (JTWC) 3 is not a WMO recognised RSMC or TCWC, it also issues cyclone warnings for various ocean basins, including the northwest Pacific, North Indian, Southwest Indian, Southeast Indian/Australian, and the Australian/Southwest Pacific basins.

2.7.1 Conventional Cyclone Forecasting

The satellite era introduced the ability to estimate the intensity of tropical cyclones using satellite images [114]. Methods involving cloud features along with conventional estimates of cyclone strength have been proposed to estimate the intensity of cyclones [115, 116]. Over the past few decades, there have been steady advances in tropical cyclone track forecasting with the increased availability of observation data and state-of-the-art numerical models [117]. However, there has been little improvement in cyclone wind-intensity prediction [117]. The difficulty in identifying the best approximation of intensity lies in its non-trivial dependency on the horizontal resolution at small grid spacing [118]. In recent times, statistical models have been shown to provide the best intensity forecasts [119].

Dvorak [120] used a hybrid model that combined meteorological analysis of satellite imagery with a model consisting of a set of curves depicting cyclone intensity change with time, together with cloud-feature descriptions of the cyclone at intervals along the curves [120]. This technique has been used extensively in cyclone forecasting and has therefore come to be known as the Dvorak technique.

The Statistical Hurricane Intensity Forecast (SHIFOR) model [121] was based on a multiple-regression statistical model which was able to make up to 72-hour forecasts of

1 www.met.gov.fj
2 www.bom.gov.au
3 http://www.usno.navy.mil/JTWC/


cyclone wind intensities. It used various predictor variables that include: Julian day, current storm intensity, intensity change in the past 12 hours, initial storm location (latitude and longitude), and the zonal and meridional components of storm motion. The SHIFOR model was trained with cyclones that were at least 30 nautical miles away from land [121]. The current SHIFOR5 equation, however, uses 1967-1999 cyclone data with the minimum requirement that each cyclone intensified into a tropical storm [122], and is independent of the location of the cyclone. Climatology and Persistence (CLIPER) is one of the computer-based forecast models, able to give cyclone intensity predictions up to five days (120 hours) ahead [122].

The Statistical Hurricane Intensity Prediction Scheme (SHIPS) [119] made cyclone intensity forecasts for 12-hour periods out to 120 hours. Together with the five predictors used in SHIFOR, SHIPS uses the divergence of winds at 200 hPa, intensification potential, vertical shear of the horizontal winds between the 850-200 hPa levels, cloud-top temperature measured by the GOES satellite, average 200 hPa temperature, average 850 hPa vorticity, average 500-300 hPa layer relative humidity and oceanic heat content from altimeter measurements. A limitation of the SHIPS model is that it is not suitable for predicting intensities of cyclones near the coast, as the model was developed using cyclone data that did not make landfall [123].

The Southern Hemisphere Statistical Typhoon Intensity Prediction Scheme (SH STIPS) [124] model was the successor to the SHIPS model. It used a consensus-based methodology built on multiple linear regression equations for each forecast time to make cyclone intensity forecasts. SH STIPS took advantage of environmental forecast information and used an optimal combination of factors related to climatology and persistence, vertical wind shear, intensification potential, atmospheric stability and dynamic intensity forecasts. SH STIPS was able to beat its predecessors and competitors for cyclone intensity forecasts in the Southern Hemisphere [124].

2.7.2 Neural Network for Cyclones

Neural network regression models have been used for the prediction of the maximum potential intensity of cyclones [15]. The error back-propagation learning


algorithm was used in a feedforward neural network with two hidden layers, with binary triggers that dynamically activated the neurons based on regressions of the inputs. The proposed model provided satisfactory results on Western North Pacific tropical cyclones [15]. A model inspired by the human visual system, consisting of a multi-layered neural network architecture with bi-directional connections in the hidden layers, was introduced in [22]. The prediction of the direction of movement from previously unseen satellite images showed good performance.

A hybrid neural network model that clusters input data using self-organizing maps and feeds data from the different clusters to separate networks for training and prediction was proposed in [125]. The method was used for forecasting actual typhoon rainfall in Taiwan's Tanshui river basin and showed improved performance over conventional prediction methods. An investigation was done on the impact of varying the number of layers and the number of neurons per layer for the prediction of the direction and intensity of cyclones over the North Indian Ocean [20]. The study found that an increase in the number of hidden layers improved the accuracy of the forecast, while the number of nodes in the hidden layer had no significant effect on performance.

Deep convolutional neural networks have been used to detect tropical cyclones using image processing of wind-vector and sea-level-pressure color maps [126]. The neural network was able to achieve 99% accuracy in predicting the occurrence of cyclones from large climate datasets such as the 20th Century Reanalysis and the NCEP-NCAR reanalysis. An approach combining a multilayer perceptron (MLP) with a neuro-fuzzy model for predicting cyclone track and surge height on the same cyclone data showed good prediction performance [127]. Tropical cyclones reported over the Bay of Bengal and the Arabian Sea were considered by [128] to predict the track and intensity of tropical cyclones using multi-layer feedforward neural networks. The proposed MLP model, referred to as neural network architecture 1 (NNA 1), was able to give performance comparable with existing numerical models for 6-hour-ahead forecasts. Chandra et al. [2] proposed a method for cyclone track prediction based on coevolution of Elman RNNs for the South Pacific, where the latitude and longitude were treated as separate dimensions. A similar approach was used for the prediction of wind intensities [24]; however, there is room for further improvements in prediction accuracy.


2.8 Chapter Summary

This chapter has provided background on neural networks, training algorithms, evolutionary algorithms, ensemble learning, transfer learning and tropical cyclone prediction. It has also explored the cooperative coevolution algorithm and problem decomposition in neuro-evolution of neural networks. An extensive review of recent developments in these areas was provided, and their strengths and limitations have been highlighted. The ensemble learning method was also discussed; neural ensembles have proven their worth in recent work. Transfer learning tries to use existing knowledge to help in gaining new knowledge, and neural systems using transfer learning have given promising results in recent times, opening up applications where further exploration could be made with these expert systems. The following chapters suggest improvements to existing systems and then apply the upgraded systems to tropical cyclone wind-intensity and path prediction.


Chapter 3

Identification of Minimal Timespan Problem for Cyclone Wind-Intensity Prediction

This chapter presents an empirical study on the minimal timespan required for robust prediction using Elman recurrent neural networks. Two different training methods are evaluated for training the Elman recurrent network: cooperative coevolution and backpropagation-through-time. They are applied to the prediction of wind intensity in cyclones that took place in the South Pacific over the past few decades. The results show that the minimal timespan is an important factor in measuring the robustness of prediction performance, and strategies should be taken in cases when the minimal timespan is needed.

3.1 Introduction

This chapter presents an empirical study on the minimal timespan required for robust prediction using Elman RNNs. We train a prediction model and test its robustness regarding the minimal timespan by training it with one size of timespan and testing it with a different size. For instance, in the case of cyclone wind-intensity prediction, this could be 36 hours (6 data points recorded every 6 hours) for training and 12 hours (2 data points recorded every 6 hours) for testing. In this case, there is a need to evaluate the quality of prediction within 12 hours of a cyclone forming. We run several different types of experiments to test the robustness of Elman RNNs using two different training methods: cooperative neuro-evolution and backpropagation-through-time.


Figure 3.1: Cyclone wind intensity time series showing the relation between training and testing timespan. The training data uses embedding dimension 5 while the testing dataset uses embedding dimension 3.

3.2 Problem Definition and Methodology

In this section, the minimal timespan prediction problem is identified and details of the models used to analyse the problem are provided. Elman RNNs are used as the prediction model, trained with two distinct algorithms: back-propagation through time and cooperative neuro-evolution.

3.2.1 Problem Definition: Minimal Timespan Prediction Problem

In a conventional time series prediction problem, a large time series dataset needs to be broken down into smaller sections or snapshots called windows, usually taken at regular intervals [129]. The size of the window is defined as the timespan. In the case of financial time series, an issue arises if a prediction is made according to the division of the stock market per month. When a month begins, one needs to evaluate how many days (data points) an effective prediction model requires in order to make an efficient prediction.

The same problem exists for cyclones: one needs to measure how many hours after a cyclone is detected the model can begin predicting the track, wind or other characteristics of the cyclone. For cyclones, predictions need to be made as quickly as possible in order to provide early warnings to people so that they can prepare. For example, data about a tropical cyclone in the South Pacific is recorded at six-hour intervals [130]. Therefore, if the timespan used is 6 data points, the first prediction of any system would come only after 36 hours. By that time, a lot of damage may already have been caused that could have been avoided had robust and accurate warnings been issued.

The problem with existing models such as the neural networks used for cyclones and related problems is that the minimal time required to reach a decision about the first prediction is still unknown. We introduce the problem of the minimal timespan, which defines the minimum duration needed for a model to effectively reach a prediction for a given time series.

Figure 3.1 shows a portion of the wind-intensity time series of tropical cyclones in the South Pacific. The event length, represented by the differently colored portions of the time series, gives the duration of a single cyclone. The timespan is of fixed length and moves through the time series in a windowed motion. At some points this movement causes the timespan to overlap from one event to the other at the point of transition between events (cyclones). The figure shows how we extract two different timespan values from a single time series.

Figure 3.2 shows the experimental method we used to identify the minimal timespan. The RNN was first trained to predict cyclone wind intensity using a timespan (embedding dimension) W. Each fully trained network (one per training timespan) was then tested with multiple timespan values (3, 4, 5, 6, 7 and 8). The predictions from every combination of training and testing timespan were analyzed to identify the overall minimal timespan.
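This protocol can be sketched as follows, with a synthetic series and a trivial window-mean predictor standing in for the trained RNN (both are illustrative assumptions, not the models used in the experiments):

```python
import numpy as np

def window(series, span):
    """Overlapping input windows of length `span` with one-step targets."""
    X = np.array([series[i:i + span] for i in range(len(series) - span)])
    return X, series[span:]

predict = lambda X: X.mean(axis=1)         # placeholder for the trained RNN

series = np.sin(np.linspace(0, 20, 500))   # synthetic "wind intensity"
test = series[400:]

# After training with timespan W (not shown), test with several timespans.
rmse = {}
for span in (3, 4, 5, 6, 7, 8):
    X, y = window(test, span)
    rmse[span] = float(np.sqrt(np.mean((predict(X) - y) ** 2)))
```

The timespan whose test error is lowest across the candidates would be reported as the minimal timespan.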


Figure 3.2: Elman recurrent neural network trained with timespan W and tested with X, Y, Z for identifying the minimal timespan.

3.2.2 Methodology: Recurrent Networks for Prediction

RNNs are dynamical systems that use states from previous time steps to compute the current state; they are thus well-suited for modeling temporal sequences [1]. Elman RNNs use a context layer to compute the new state from the previous state and the current inputs. The basic components of an observed dynamical system are represented in an Elman network using the input, context and output layers [47].

Figure 3.3 shows the Elman recurrent neural network used for cyclone wind-intensity prediction, where D represents the embedding dimension. Input data is preprocessed and fed to the RNN one time step at a time until the size of the timespan being used is reached, after which the wind intensity is predicted.
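The feeding scheme can be sketched as below; the weight names, shapes and random values are illustrative assumptions, not the trained model:

```python
import numpy as np

def elman_predict(window, W_in, W_rec, W_out, b_h, b_o):
    """Feed one embedded window into a one-input-neuron Elman RNN,
    one value per time step, then emit the wind-intensity prediction."""
    h = np.zeros(W_rec.shape[0])                  # context (hidden) state
    for x in window:                              # one time step per value
        h = np.tanh(W_in * x + W_rec @ h + b_h)   # update hidden state
    return float(W_out @ h + b_o)                 # single output neuron

# Example with 3 hidden neurons and a timespan of 5 (random weights).
rng = np.random.default_rng(0)
H = 3
pred = elman_predict(rng.random(5), rng.random(H), rng.random((H, H)),
                     rng.random(H), rng.random(H), 0.0)
```

The context state carries information across the steps of the window, which is what distinguishes this from a plain feedforward mapping of the window.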

We employ two distinct algorithms for training the given RNN: 1) cooperative neuro-evolution and 2) back-propagation through time, which are described in detail in the background section.

3.3 Experiments and Results

This section describes the details of the experimental design and results, where RNNs are trained using cooperative neuro-evolution (CNE) and back-propagation


Figure 3.3: Elman recurrent neural network used for tropical cyclone wind-intensity prediction. Time series data is preprocessed and embedded using Takens' theorem and fed into the Elman RNN.

through-time for the identified minimal timespan problem. The focus was kept on the minimal timespan for tropical cyclone wind-intensity prediction as a case study. In the testing stage, we pre-process the test dataset using different values of the timespan. In this way, we evaluate the generalization performance of the trained RNN on different values of the timespan, of which only one value was used during training.


3.3.1 Data Preprocessing and Reconstruction

We use Takens' theorem [109] to reconstruct the time series data into a state space vector. The RNN unfolds k steps in time, which is equal to the embedding dimension or timespan D [5, 131, 132].

Tropical cyclone intensity data from the Southern Pacific region [130] was used for this experiment. The time series contained 6000 points in the training set (tropical cyclones from 1985-2005) and 2000 points in the test set (tropical cyclones from 2006-2013). All the cyclones in both the training and testing datasets were concatenated into a single data stream to form the complete time series, with cyclones placed consecutively in ascending order of their date of identification.
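The concatenation step can be sketched as follows (the per-cyclone records here are hypothetical, not values from the dataset):

```python
# Hypothetical per-cyclone records: (identification date, 6-hourly winds).
cyclones = [
    ("1987-02-10", [35, 40, 50, 45]),
    ("1985-01-03", [20, 25, 30]),
    ("1986-11-21", [15, 25, 40, 55, 50]),
]

# Sort by date of identification, then join into one data stream.
stream = [w for _, winds in sorted(cyclones) for w in winds]
# stream: [20, 25, 30, 15, 25, 40, 55, 50, 35, 40, 50, 45]
```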

3.3.2 Experimental Design

The sub-populations in cooperative neuro-evolution employ the generalized generation gap with parent-centric crossover (G3-PCX) evolutionary algorithm [52]. A population size of 200 is utilized with 2 parents and 2 offspring, which has shown good results in the literature [5]. In the case of BPTT, a learning rate of 0.2 was employed. For cooperative neuro-evolution, results for 3 hidden neurons are provided, as this showed optimal results in the literature [24]. For BPTT, training was done with different numbers of hidden neurons and the case that gave the best results was used to test the different minimal timespans.

The original dataset comprised cyclones from past decades, where each data point was recorded at regular six-hour intervals. The data was reconstructed in order to test the effectiveness of the model for prediction within 18 hours (timespan of 3) and up to 48 hours (timespan of 8). Figure 3.2 gives more details of the experimental setup: the neural network was trained with timespan W and tested with timespans X, Y and Z.

The concept of weight-based learning was used to test the robustness of the RNN. The neural network was trained at different difficulty levels whereby, for the various timespans tested, training was done in easy mode (longer sequence of inputs provided in training), hard mode (smaller sequence of inputs provided


Figure 3.4: Unfolded view of RNN.

in training) and normal mode (the same number of inputs provided in both training and testing).

3.3.3 Results

Figure 3.5 shows the performance of CNE and BPTT on the testing datasets for the varying timespan values from the cyclone data. Each point in the bar graph (CNE and BPTT) gives the performance of the RNN tested with timespans ranging from 3 up to 8 in increments of 1. The error bars give the 95% confidence interval of the RMSE over 30 independent experimental runs.
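The reported quantities can be computed as follows (the normal-approximation interval is an assumption, a common choice for 30 runs):

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between predictions and observations."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def ci95_halfwidth(run_errors):
    """95% confidence half-width for the mean error over independent runs
    (normal approximation with the sample standard deviation)."""
    v = np.asarray(run_errors, float)
    return float(1.96 * v.std(ddof=1) / np.sqrt(len(v)))
```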

Sub-figures 3.5(a), 3.5(b), 3.5(c), 3.5(d) and 3.5(e) are used to test the robustness of the CNE and BPTT training algorithms for evaluating the minimal


(a) RNN trained with Timespan = 4

(b) RNN trained with Timespan = 5

(c) RNN trained with Timespan = 6

(d) RNN trained with Timespan = 7

(e) RNN trained with Timespan = 8

Figure 3.5: Performance of CNE and BPTT for wind-intensity prediction on the testing dataset (2006-2013) for tropical cyclones in the South Pacific.


Table 3.1: Best Performance of cooperative coevolution

TS (training)   TS (testing)   RMSE (Test)         MAE (Test)
4               6              0.1312 ± 0.0378     25.64 ± 7.749
5               5              0.0314 ± 0.0005     4.962 ± 0.082
6               6              0.0798 ± 0.0290     15.06 ± 6.153
7               7              0.0637 ± 0.0307     11.27 ± 6.088
8               8              0.0704 ± 0.0389     11.68 ± 6.739

timespan. We compare the performance of the training algorithms with respect to the varied timespans (TS ∈ {4, 5, 6, 7, 8}) used in training. Figure 3.5(b) achieved the best performance, given by the minimum error: a timespan of 5 showed the best performance on the testing dataset when the RNN was trained with a timespan of 5. The best performance was obtained when the timespan for the testing dataset was the same as that of the training dataset. Similar trends were seen in all the other timespan cases used in training, except for TS 4, as shown in Figure 3.5(a); in that case there was only a 0.006 difference between timespans 4 and 6. Therefore, we can still generalize that using the same timespan for training and testing provides the best performance.

CNE was able to beat BPTT for the higher timespans (TS 7 and 8). CNE also showed good prediction accuracy in Figures 3.5(d) and 3.5(e) when the training timespan was the same as the tested timespan, that is 7 and 8 respectively. For the lower timespans (TS 4, 5 and 6), BPTT showed better performance.

Figure 3.6 gives the performance of a single run of CNE together with the error in prediction. The initial 100 data points are shown for clear visualization. The timespan of 5 is compared with timespans 6 and 7; we used only timespans 5, 6 and 7 for visualization purposes, as timespan 5 showed the most promising performance, as seen in Figure 3.5(b).

Table 3.1 summarizes the best performance of CNE. As shown by the RMSE and MAE, the best value for both the training and testing timespan is 5, as it has the least error.


3.3.4 Discussion

The minimal timespan was defined as the least possible number of data points, or the smallest window size, necessary for time-series prediction. The results in general reveal that the minimal timespan is an important feature for testing the robustness of the prediction model and the training algorithm. Cyclone wind-intensity prediction was used as it needs a robust prediction model; however, other applications can also be explored to identify the minimal timespan problem.

In terms of the training algorithm, CNE was able to outperform BPTT for the higher timespans. CNE divides a larger problem into smaller components and solves them. The neural network gets larger with a larger timespan, as the RNN unfolds further in time to cater for the increased number of inputs. Figure 3.4 compares the size of the unfolded RNN for timespans TS(4) and TS(7), where it is evident that a larger timespan unfolds into a larger network in time. Training the larger neural network is well suited to CNE, since it is an evolutionary algorithm in which weight updates are made according to the fitness of the entire network rather than through gradients as in BPTT. The results demonstrated that for TS(7) and TS(8), CNE outperformed BPTT. This is due to the difficulty BPTT has in back-propagating errors as the network that unfolds in time gets larger. As shown in the results, for the smaller timespans (TS 4, 5 and 6), BPTT performs better than CNE.
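The decomposition idea behind CNE can be illustrated with a toy sketch: each subpopulation evolves one subset of weights, and a candidate is scored by assembling a full weight vector with the current representatives of the other subpopulations. This greedy best-replacement scheme is an illustrative assumption, not the G3-PCX variant used in the experiments:

```python
import random

def cne_sketch(num_sub, sub_size, pop_size, fitness, generations=30):
    """Toy cooperative neuro-evolution: subpopulations of weight
    sub-vectors, scored by the fitness of the assembled full network."""
    random.seed(0)
    pops = [[[random.gauss(0, 1) for _ in range(sub_size)]
             for _ in range(pop_size)] for _ in range(num_sub)]
    best = [p[0] for p in pops]              # one representative per subpop
    assemble = lambda i, cand: [w for j, rep in enumerate(best)
                                for w in (cand if j == i else rep)]
    for _ in range(generations):
        for i, pop in enumerate(pops):
            for cand in pop:                 # fitness of the *whole* network
                if fitness(assemble(i, cand)) > fitness(assemble(i, best[i])):
                    best[i] = cand
            # mutate the representative back into the subpopulation
            pop[random.randrange(pop_size)] = [w + random.gauss(0, 0.1)
                                               for w in best[i]]
    return [w for rep in best for w in rep]

# Toy fitness: prefer small weights (stands in for low network error).
weights = cne_sketch(2, 3, 10, lambda w: -sum(x * x for x in w))
```

The key property mirrored here is that no gradient is ever computed: a subcomponent is judged only by how well the fully assembled network performs.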

We found that the training and testing timespans need to be the same for the best prediction performance. This shows that the trained RNNs were unable to generalize well to the different timespans tested, which implies that the selected training methodologies were not robust enough. This reaffirms the choice of timespan as a good measure of robustness for the training algorithms and the prediction model. The challenge for future research is to develop a strategy that gives good prediction performance regardless of the size of the timespan in the testing dataset.

The results showed that the minimal timespan TS(5) gave the best performance. This implies that prediction can take place within 30 hours of the identification of the cyclone: since readings are taken every 6 hours, a timespan of 5 corresponds to 30 hours from the beginning of the cyclone.


(a) Performance of CNE on testing dataset (b) Error in prediction by CNE

Figure 3.6: Performance of CNE for a single experimental run


3.4 Chapter Summary

In this chapter, the minimal timespan problem for robust time series prediction, with application to cyclone wind intensity, was identified. The minimal timespan has been defined as the least possible window size necessary to begin time-series prediction. Back-propagation through time and the cooperative neuro-evolution algorithm were used to train Elman RNNs to find out the effect of the minimal timespan. According to the results, the minimal timespan is an important characteristic for robust time series prediction, and would be useful in training RNN models that enhance predictions in future cases of cyclones. Since the cyclone data points were collected at six-hour intervals, we could predict cyclone wind intensity quite accurately after 30 hours from the start of the cyclone. This can enable better preparation for the cyclone, therefore reducing the damage caused. The minimal timespan is of paramount importance for problems that require fast prediction, as seen with cyclones. The problem of the minimal timespan exists in a wide range of applications, especially in engineering problems that rely on intelligent decision making based on minimal data readings from sensors.

The next chapter uses the minimal timespan and applies it to tropical cyclone

track forecast.


Chapter 4

Cyclone Track Prediction Using Coevolutionary Recurrent Neural Networks

In this chapter, an architecture for encoding a two-dimensional time series problem into Elman recurrent neural networks composed of a single input neuron is proposed. Cooperative coevolution and back-propagation through-time algorithms were used for training. The experiments showed an improvement in prediction accuracy when compared to previous results from the literature, which used a different recurrent network architecture.

4.1 Introduction

In this chapter, Elman RNNs with a single input neuron are trained using cooperative coevolution and back-propagation through-time [4] for both latitude and longitude prediction. The original two-dimensional time series was reconstructed using Takens' theorem [109], and experiments with different numbers of hidden neurons in the RNNs were designed to test scalability and robustness. The results are compared with previous work from the literature, which used a similar architecture for the same prediction problem. The main contribution of this chapter is a new architecture for encoding two-dimensional time-series data in Elman RNNs.

4.2 RNN Architecture for Cyclone Tracks

In previous work, two input and two output neurons were used for a cyclone's longitude and latitude, respectively, in an Elman RNN [2], as shown in Figure 4.1.

The latitude and longitude are separate time series that are interrelated, as they together form one variable of the cyclone: its track. An improved RNN model that combines the two dimensions of latitude and longitude into a single data stream in an attempt to represent the direct relationship


Figure 4.1: Elman RNN used for prediction of cyclone latitude and longitude. Two input and two output neurons are used for mapping the longitude and latitude [2].

between the dimensions is proposed. Figure 4.2 shows the proposed architecture. This network architecture is similar to that of Figure 4.1 but uses a single input neuron and a single output neuron that predict both longitude and latitude, as depicted. In this model, the single neurons represent both the longitude and latitude, thus preserving some form of correlation between the two time series. The proposed network architecture is called the single input-output recurrent neural network (SIORNN). It is trained using error backpropagation through time and the cooperative coevolution algorithm.

4.3 Simulation and Analysis

4.3.1 Data Preprocessing and Reconstruction

Takens' theorem was used to reconstruct the original time series data. The theorem was developed for one-dimensional time series data; this experiment considers two dimensions (latitude and longitude), hence Takens' theorem is applied to both dimensions.

The reconstructed vector is used to train the RNN for one-step-ahead prediction.

In the cooperative coevolutionary recurrent network (CCRNN) architecture, two


Figure 4.2: Proposed RNN architecture: a single input and output neuron Elman recurrent neural network (SIORNN) used for predicting the latitude and longitude of the cyclone path.

neurons are used in the input and the output layer to represent the latitude,

longitude shown in Figure 4.1. In the proposed architecture (SIORNN), there is

only one neuron in the input and output layer. The processed data was laid out

in two layouts as shown in Figure 4.3 in order to be successfully handled by the

different network architectures. The recurrent network unfolds k steps in time

which is equal to the embedding dimension D [5, 131, 132].

The same time series data from the previous chapter is used; however, for this
study, the cyclone track data were extracted from the original data. Figure 4.4
shows the actual positions of the tropical cyclones in the dataset.

All the cyclone time series were combined. Data preprocessing was done by
considering the position in the Southern Hemisphere and converting all points
into one region. The conversion of latitude was done by multiplying the
original latitude by -1 to account for South in the Southern Hemisphere. The
longitudes with East (E) coordinates remained unchanged, while the West (W)
coordinates were subtracted from 360° to define all points in terms of East
coordinates for easier plotting of cyclone tracks on the spatial map.
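The conversion described above can be sketched as follows (the function name
is illustrative):

```python
def to_southern_east(lat, lon, ns, ew):
    """Convert a track point to the single-region convention described above.

    Southern latitudes become negative; western longitudes are expressed
    as east longitudes by subtracting from 360 degrees.
    """
    if ns == "S":
        lat = -abs(lat)
    if ew == "W":
        lon = 360.0 - lon
    return lat, lon

print(to_southern_east(17.5, 150.0, "S", "W"))  # (-17.5, 210.0)
```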

The dataset contained time series of the position (latitude and longitude).
The following combinations of embedding dimension and time lag were extracted
using Takens' theorem.


Figure 4.3: Embedded data reconstructed using Takens' theorem. An embedding
dimension (D) of 4 is depicted. Both (a) and (b) contain the two dimensions,
longitude and latitude.

- Configuration A: D = 4 and T = 2; the reconstructed dataset contains 3417
  samples in the training set and 1298 samples in the test set.

- Configuration B: D = 5 and T = 3; the reconstructed dataset contains 2278
  samples in the training set and 865 samples in the test set.


Figure 4.4: Tropical cyclone track data in the South Pacific from 1985 to 2013.
(Generated using Gnuplot)

4.3.2 Experimental Design

We experimented with different numbers of hidden neurons in the RNN, which
employed sigmoid units in the hidden and output layers. We used the
implementation from Smart Bilo, an open-source computational intelligence
framework, in our experiments [133].

Both CC-SIORNN and BPTT-SIORNN were trained and their predictions tested for
24-hour and 30-hour advance warning. The termination condition was set at
50,000 function evaluations for CC and 2000 epochs for BPTT. The root mean
squared error (RMSE), given in Equation 2.4, was used to evaluate the
performance of the two architectures for cyclone track prediction.
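Equation 2.4 is not reproduced here, but the RMSE in its standard form can be
computed as:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([0.1, 0.4, 0.7], [0.2, 0.4, 0.7]))  # ≈ 0.0577
```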

4.4 Results and Discussion

The mean and 95% confidence interval from 30 experimental runs are shown in
Table 4.1 and Table 4.2. The best results, with the lowest RMSE values for
each configuration, are shown in bold. The robustness of the proposed
architecture is tested using different configurations with varied dataset
sizes in order to show scalability.


Figure 4.5: Performance of SIORNN using cooperative coevolution for 6 random
cyclones from the years 2006 to 2013.

Table 4.1: Generalization performance of training models on cyclone track
prediction for Configuration A

Model         Hidden   RMSE (Train)       RMSE (Test)        Best
CCRNN         3        0.0508 ± 0.0010    0.0484 ± 0.0010    0.0455
CCRNN         5        0.0493 ± 0.0006    0.0471 ± 0.0006    0.0447
CCRNN         7        0.0492 ± 0.0007    0.0471 ± 0.0006    0.0448
CC-SIORNN     3        0.0252 ± 0.0003    0.0244 ± 0.0002    0.0238
CC-SIORNN     5        0.0252 ± 0.0003    0.0245 ± 0.0003    0.0238
CC-SIORNN     7        0.0266 ± 0.0033    0.0260 ± 0.0034    0.0237
BPTT-SIORNN   3        0.0265 ± 0.0002    0.0257 ± 0.0002    0.0245
BPTT-SIORNN   5        0.0260 ± 0.0003    0.0254 ± 0.0002    0.0245
BPTT-SIORNN   7        0.0256 ± 0.0003    0.0251 ± 0.0003    0.0242

The comparison of the single-neuron and multi-neuron CCRNN methods on cyclone
track prediction is given in Tables 4.1 and 4.2. The best performance was
achieved by CC-SIORNN, which outperformed the other methods in all cases.

The number of hidden neurons did not make any considerable difference to the
results, although 5 hidden neurons gave the best performance in Configuration
B and 3 hidden neurons performed better in Configuration A. CCRNN produced its
best results with 7 neurons in the hidden layer. It seems that 3


(a) Performance of latitude prediction on the test dataset
(b) Performance of longitude prediction on the test dataset

Figure 4.6: Typical prediction performance of a single experiment (one-step-ahead
prediction) given by BPTT-SIORNN for the cyclone track test dataset
(2006-2013 tropical cyclones), where time is taken at six-hour intervals.


Table 4.2: Generalization performance of training models on cyclone track
prediction for Configuration B

Model         Hidden   RMSE (Train)       RMSE (Test)        Best
CCRNN         3        0.0526 ± 0.0014    0.0481 ± 0.0013    0.0432
CCRNN         5        0.0506 ± 0.0007    0.0462 ± 0.0008    0.0430
CCRNN         7        0.0497 ± 0.0006    0.0456 ± 0.0006    0.0425
CC-SIORNN     3        0.0260 ± 0.0004    0.0242 ± 0.0004    0.0232
CC-SIORNN     5        0.0254 ± 0.0001    0.0237 ± 0.0001    0.0233
CC-SIORNN     7        0.0256 ± 0.0003    0.0241 ± 0.0003    0.0232
BPTT-SIORNN   3        0.0254 ± 0.0002    0.0242 ± 0.0002    0.0235
BPTT-SIORNN   5        0.0252 ± 0.0001    0.0239 ± 0.0001    0.0235
BPTT-SIORNN   7        0.0252 ± 0.0001    0.0239 ± 0.0001    0.0235

hidden neurons were not sufficient to represent the two-dimensional time series

problem when separate neurons handled the two dimensions.

Figure 4.5 shows the performance of cooperative coevolution using the proposed
SIORNN architecture. It shows the paths of 6 selected cyclones from those that
occurred between 2006 and 2013. Figure 4.6 shows the typical prediction
performance of a single experimental run given by BPTT-SIORNN on the cyclone
track test dataset (2006-2013 tropical cyclones). The errors in the
predictions are also given in the graphs.

4.4.1 Discussion

A cyclone track is generally viewed as a single entity, but it is modeled as
two separate dimensions of latitude and longitude. Through this research, we
found that although latitude and longitude are treated independently, there is
correlation between them, and when they are encoded into a recurrent neural
network as a single stream of data, the prediction performance improves
significantly regardless of the training algorithm.

Cooperative coevolution and error backpropagation through time both gave
better performance with the adapted network configuration than with the
two-neuron architecture. The improvement in performance could be due to the
combination of the two-dimensional time series, consisting of a cyclone's
longitude and latitude, into a single stream of data points as represented in
Figure 4.3 (b).
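This single-stream layout can be sketched minimally as follows (the helper
name `interleave` is an assumption):

```python
import numpy as np

def interleave(lat, lon):
    """Merge the latitude and longitude series into one data stream
    by alternating their values, as in the single-stream layout."""
    stream = np.empty(2 * len(lat))
    stream[0::2] = lat
    stream[1::2] = lon
    return stream

print(list(interleave([-10.0, -11.0], [170.0, 171.5])))
# [-10.0, 170.0, -11.0, 171.5]
```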


A single data stream increases the chance of preserving the interdependencies
between latitude and longitude while reaffirming the correlation between the
two inputs; therefore, SIORNN outperforms CCRNN. In the traditional approach,
there is minimal probability of preserving interdependencies within the track
attributes, as each input is encoded independently, thereby losing the
correlation between latitude and longitude.

The major errors in prediction were seen at locations where there was a
transition from one cyclone to another in the data. This is due to the
concatenation of the various cyclones into a single data stream. As seen from
the results of a typical prediction given in Figure 4.6, the region where
there is a switch from one cyclone to another produces a large error in the
prediction. The network was unable to cope with the sudden change in the data
due to the occurrence of the next cyclone at a different location. The
concatenation places the end of one cyclone adjacent to the beginning of the
next, even though these are independent events; they have effectively been
treated as joint events when training the RNNs. Further studies are needed to
improve the prediction accuracy at the beginning and end of cyclones.

4.5 Chapter Summary

This chapter investigated how the two-dimensional time series consisting of
longitude and latitude is best represented for superior prediction performance
when used for training RNNs. In the first method, the latitude and longitude
are presented to a recurrent neural network with two input neurons, whereas
the second method combines both variables into a single data stream and
employs a network with a single input neuron which interleaves successive
longitude and latitude values.

The results show that a network with a single input and output neuron, trained
with either of the algorithms, outperforms networks trained with separate
inputs for longitude and latitude. It is evident that it is more difficult to
train the recurrent neural network for both tasks given by the previous
method, as the number of dimensions increases along with the noise in the time
series and, with it, the uncertainty. The proposed method has alleviated this
weakness and produced improved results that motivate real-time implementation.


Although the results have been very promising, it may be possible to approach
the multidimensional problem as a group of single-dimensional time series
problems using a mixture of computational intelligence methods for cyclone
track prediction.

There is also motivation for using additional atmospheric conditions that are
major attributes in the formation of cyclones, such as sea surface
temperature, pressure, and humidity, and the change of their intensity with
time. Other attributes that can be considered are the speed at which the
cyclone is moving and the geographical landscape, that is, sea and land.

The next chapter uses a stacked transfer learning neural model for wind
intensity prediction.


Chapter 5

Stacked Transfer Learning for Tropical Cyclone Intensity Prediction

In this chapter, transfer stacking is used as a means of studying the effects
of cyclones, whereby the contribution of cyclone data from different
geographic locations towards improving generalization performance is
evaluated. Conventional neural networks are used to evaluate the effect of
cyclone duration on prediction performance. A strategy for evaluating the
relationships between different types of cyclones through transfer learning
and conventional learning methods via neural networks is established in this
chapter.

5.1 Introduction

In this chapter, stacked transfer learning is used as a means of studying the
effects of cyclones. We select cyclones from the last few decades in the South
Pacific and South Indian Oceans. First, we evaluate the performance of
standard neural networks when trained with different regions of the dataset.
We then evaluate the effects of the duration of cyclones in the South Pacific
region and their contribution towards the neural networks' generalization
performance. Finally, we use transfer stacking via ensembles and consider the
South Pacific region as the target model. We use the South Indian Ocean as the
source data and evaluate its impact on the South Pacific Ocean. The
backpropagation neural network is used for the stacked ensembles in the
transfer stacking method.

5.2 Methodology

5.2.1 Neural networks for time series prediction

Time series are data of a series of events observed over a certain time
period. To use neural networks for time series prediction, the original time
series is reconstructed into a state-space vector with embedding dimension (D)
and time lag (T) through Takens' theorem [109]. We consider the
backpropagation algorithm, which employs gradient descent for training [134].
The root mean squared error, generally used to test the performance of the
FNN, is given in Equation 2.4.

Figure 5.1: Neural Network Ensemble Model

5.2.2 Stacked transfer learning

Transfer learning is implemented via stacked ensembles: a source ensemble, a
target ensemble, and a combiner ensemble, all implemented using feedforward
neural networks (FNNs). These correspond to Ensemble 1, Ensemble 2, and the
combiner network in Figure 5.1. We refer to the transfer learning model shown
in Figure 5.1 as transfer stacking hereafter. Transfer stacking is implemented
in two phases: phase one involves training the individual ensembles, and the
second phase trains a secondary prediction model which learns from the
knowledge of the trained ensembles in phase one. Figure 5.1 shows the broader
view of the ensemble stacking method, where two ensemble models (FNNs) feed
knowledge into a secondary combiner network. The source and target ensembles
are implemented using FNNs with the same topology. The combiner ensemble
topology depends on the number of ensembles used as the source and target
ensembles. Ensemble 1 considers the South Pacific Ocean data while Ensemble 2
considers the South Indian Ocean training data. The datasets are described in
Section 5.2.3.

The combiner ensemble is a feedforward network that is trained on the
knowledge processed by the ensembles. Backpropagation is used for training the
combiner network and the respective ensembles. The processed knowledge comes
from the training data of the source and target datasets. Knowledge was
gathered by creating a stacked dataset that is a direct mapping from the
training data; this was achieved by concatenating the outputs of all the
ensembles into a new stacked data file. The stacked dataset, encompassing the
knowledge of the ensembles, is then used to train the combiner FNN.

Similar to the two-step training process, testing is done in two phases. The
testing data is also passed through the stacking module to generate a stacked
testing dataset, which is then used in the combiner network to measure the
generalization performance of transfer stacking.
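The stacking step can be sketched as follows; the two ensemble functions are
hypothetical stand-ins for the trained FNNs of phase one, and only the shape
of the stacked dataset is meant to be illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the trained ensembles of phase one
def ensemble_1(X):       # e.g. trained on South Pacific Ocean data
    return X.mean(axis=1)

def ensemble_2(X):       # e.g. trained on South Indian Ocean data
    return X[:, -1]

X_train = rng.random((100, 4))   # embedded input windows

# concatenate the ensemble outputs into the stacked dataset that
# the combiner FNN is then trained on
stacked = np.column_stack([ensemble_1(X_train), ensemble_2(X_train)])
print(stacked.shape)  # (100, 2): one column per ensemble
```

The stacked test set is produced by passing the test windows through the same
two functions.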

5.2.3 Data Processing

The South Pacific Ocean cyclone wind intensity data described in Section 3.3.1
and the South Indian Ocean cyclone wind intensity data for the years 1985 to
2013 were used for these experiments [130]. We divided the data into training
and testing sets. Cyclones occurring in the years 1985 to 2005 were used for
training, while the remaining data was used to test generalization
performance. The consecutive cyclones in the training and testing sets were
concatenated into a time series for effective modeling. The data was
normalized to the range [0, 1].
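Min-max normalization to [0, 1] can be sketched as:

```python
import numpy as np

def min_max_normalize(x):
    """Scale a series linearly into the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(list(min_max_normalize([40.0, 70.0, 100.0])))  # [0.0, 0.5, 1.0]
```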

5.3 Experiments and Results

In this section, we present the experiments used to test transfer stacking for
cyclone intensity prediction. We then present our results based on the root
mean squared error.


5.3.1 Experiment Design

Standalone feedforward neural networks were trained using the same
backpropagation learning algorithm in all experiments. Stochastic gradient
descent was used with a training time of 2000 epochs. Each experiment was
repeated for 30 runs in order to report mean performance. The experiments were
designed as presented in the following list.

1. Experiment 1: Vanilla FNN trained on all South Pacific Ocean training data
and tested on South Pacific Ocean testing data.

2. Experiment 2: Vanilla FNN trained on all South Indian Ocean training data
and tested on South Pacific Ocean testing data.

3. Experiment 3: 6 experiments with vanilla FNNs trained on subsets of the
South Pacific Ocean training data and tested twice: once with the entire South
Pacific Ocean testing data, followed by testing with the corresponding subset
of the testing data. The subsets were created by grouping cyclones of similar
lengths into classes. Each class of cyclones was trained and tested with a
vanilla FNN model. We formulated the following experiments with vanilla FNNs:

(a) [0-3] day old cyclones in the training set, tested with the full testing
set as well as with [0-3] day old cyclones in the testing set.

(b) [3-5] day old cyclones in the training set, tested with the full testing
set as well as with [3-5] day old cyclones in the testing set.

(c) [5-7] day old cyclones in the training set, tested with the full testing
set as well as with [5-7] day old cyclones in the testing set.

(d) [7-9] day old cyclones in the training set, tested with the full testing
set as well as with [7-9] day old cyclones in the testing set.

(e) [9-12] day old cyclones in the training set, tested with the full testing
set as well as with [9-12] day old cyclones in the testing set.

(f) Cyclones older than 12 days in the training set, tested with the full
testing set as well as with cyclones older than 12 days in the testing set.


Table 5.1: Generalization performance

Experiment   RMSE
1            0.02863 ± 0.00042
2            0.03396 ± 0.00075
4            0.02802 ± 0.00039

Table 5.2: Experiment 3: Performance of FNN on different categories of cyclones

Cyclone Category   Training RMSE        Categorical Testing RMSE   Generalization RMSE
0-3 day            0.03932 ± 0.00173    0.05940 ± 0.00329          0.20569 ± 0.01339
3-5 day            0.03135 ± 0.00050    0.02580 ± 0.00044          0.05200 ± 0.00532
5-7 day            0.03265 ± 0.00027    0.02504 ± 0.00025          0.03070 ± 0.00172
7-9 day            0.02831 ± 0.00033    0.03799 ± 0.00065          0.03207 ± 0.00055
9-12 day           0.03081 ± 0.00025    0.02700 ± 0.00059          0.03154 ± 0.00047
> 12 day           0.02819 ± 0.00019    0.03579 ± 0.00037          0.02875 ± 0.00025

4. Experiment 4: Transfer learning based stacked ensemble method for
predicting South Pacific Ocean tropical cyclone intensity. The mechanics of
this method are given in Section 5.2.2.

5. Experiment 5: Two separate experiments were done:

(a) FNN trained on 1985-1995 cyclones from the South Pacific data and tested
on South Pacific Ocean testing data.

(b) FNN trained on 1995-2005 cyclones from the South Pacific data and tested
on South Pacific Ocean testing data.

5.3.2 Results

We present the results of the prediction of tropical cyclone intensity in the
South Pacific from the year 2006 to 2013. The root mean squared error (RMSE),
as given in Equation 2.4, was used to evaluate the performance.

Table 5.1 gives the generalization performance of the respective methods on
the testing data, reporting results from Experiments 1, 2, and 4. The results
show that all the models have similar performance.

Table 5.2 gives the performance of the various categories of data on the two
testing datasets: category-based testing and generalization testing.
Category-based testing is done on cyclones which belong to that particular
category of the testing data. Generalization testing was done on the entire
testing dataset. The results show that cyclones that ended in under three days
were not good predictors for the generalization data, as they had a higher
testing error. Similarly, cyclones with a duration of 3-5 days also performed
poorly, giving a larger error, although not as large as their predecessor.

Table 5.3: Experiment 5: Performance of vanilla FNN on independent decade
training data

Vanilla FNN            Training RMSE        Testing RMSE
1985-1995 cyclones     0.04434 ± 0.00073    0.04671 ± 0.00209
1995-2005 cyclones     0.03635 ± 0.00062    0.05949 ± 0.00201

Cyclones with a duration of 5-12 days had very similar generalization
performance. These categories of cyclones performed better than the shorter
cyclones, giving better prediction accuracy in terms of RMSE. The final
category of cyclones gave the best generalization performance: cyclones of 12
days and over matched the prediction accuracy of the best models, that is, the
standalone FNN model of control Experiment 1 and the periodical and spatial
analysis ensemble models.

Table 5.3 shows the generalization performance achieved by the standalone FNN
model trained with the decade 1 and decade 2 training data. The generalization
performance is rather poor with each decade of data used independently when
compared to the concatenation of all the training data, as seen in Experiment 1.

5.3.3 Discussion

The results revealed interesting details about the respective experiments. The
first two experiments considered conventional learning through neural networks
(FNNs) for predicting tropical cyclone wind intensity. Experiment 1 considered
cyclones where the training and testing datasets came from the same region
(South Pacific Ocean). Experiment 2 used training data from the South Indian
Ocean and testing data from the South Pacific Ocean. According to the results,
there was minimal difference in the generalization performance on the South
Pacific Ocean, although the training datasets considered different regions.
This implies that cyclones in the South Indian Ocean have similar
characteristics, in terms of the change of wind intensity, to those in the
South Pacific Ocean. Note that the South Indian Ocean dataset was about three
times larger than the South Pacific Ocean dataset.

Furthermore, Experiment 3 investigated the effects of the duration of cyclones
(cyclone lifetime) on the generalization performance. This was done only for
the case of the South Pacific Ocean. Note that the generalization performance
is based on the test dataset that includes all the different types of cyclones
that occurred between 2006 and 2013. As seen in the results, cyclones with
shorter durations were not effective for generalization performance. It seems
that the shorter cyclones did not provide enough information to capture
essential knowledge about the longer cyclones in the test dataset. The
category with the longest cyclones gave the best generalization performance.
This implies that it covered all the phases of the cyclone life cycle and was
thus able to effectively predict all classes of cyclones.

Transfer stacking via neural network ensembles was done in Experiment 4, where
the training dataset from the South Indian Ocean was used as the source
dataset and cyclones in the South Pacific Ocean were used as the target
dataset. The generalization performance here was similar to that of the
conventional neural networks in Experiments 1 and 2. This shows that the
source data (South Indian Ocean) did not make a significant contribution
towards improving the generalization performance; however, we note that there
was no negative transfer of knowledge, as the performance did not deteriorate.
Therefore, the knowledge of cyclone behavior, or change in wind intensity,
from the South Indian Ocean is valid and applicable to the South Pacific
Ocean. Further validation can be done in future by examining the track
information of the cyclones from the respective regions. This will add further
insights into transfer learning through stacking and establish better
knowledge about the relationship between the cyclones in the respective
regions.


5.4 Chapter Summary

This chapter presented transfer stacking as a means of studying the
contribution of cyclones in different geographic locations towards improving
generalization performance in predicting tropical cyclone wind intensity for
the South Pacific Ocean. We then evaluated the effects of the duration of
cyclones in the South Pacific region and their contribution towards the neural
networks' generalization performance. Cyclone duration was seen to be a major
contributor in the prediction of cyclone intensity.

We found that cyclones with a duration of over 12 days could be used as a good
representative set for training the neural networks with competitive
prediction accuracy. Furthermore, the results show that the South Indian Ocean
source dataset does not significantly improve the generalization performance
on the South Pacific target problem. The contribution of the South Indian
Ocean data was negligible, as the knowledge about cyclone intensity prediction
was sufficiently learned from the South Pacific data. The change in
geographical location was unable to provide any new knowledge that would
improve generalization.

Further work can incorporate other cyclone regions into the transfer learning
methodology to further improve the generalization performance. The approach
can also be extended to the prediction of cyclone tracks in the related
regions. Recurrent neural networks (RNNs) could be used as ensemble learners
to identify temporal sequences that the FNN was unable to learn, as RNNs have
shown better modeling performance. Hybrid ensemble models of neural networks
could also be developed that use evolutionary algorithms together with
backpropagation for training the FNN to improve the modeling.


Chapter 6

Conclusions and Future Work

In this thesis, novel methods with neural networks were used to predict
cyclone wind intensity and path for cyclones in the South Pacific Ocean over
the past decades.

In Chapter 3, the minimal timespan problem was identified. It was defined as
the least possible number of data points required for time series prediction.
The results indicated the capability of the timespan to be used as a measure
of robustness of Elman recurrent neural networks in time series prediction.
Tropical cyclone wind intensity prediction warranted the use of a timespan of
5, as it gave the best results in the experiments. Cooperative neuro-evolution
was found to outperform backpropagation through time at larger timespans. The
size of the problem increased with the timespan, due to multiple unfolds of
the RNN over time to cater for the increased inputs, creating a bigger network
which favored CC's divide-and-conquer style of optimization over BPTT. As a
measure of robustness, it was seen that the training and testing timespans had
to be the same in order to attain the best prediction accuracy, which
indicated that the training difficulty had minimal effect on the robustness of
the recurrent neural network. According to the results, the minimal timespan
is an important characteristic for robust time series prediction and would be
useful in training RNN models that can enhance predictions for future
cyclones.

Chapter 4 proposed a new architecture for encoding two-dimensional time series
data into Elman-style recurrent neural networks. The primary motivation behind
this new architecture was to preserve the pre-existing relationships between
the separate dimensions of the data. A single input and output neuron
architecture of the RNN was used to encode the two-dimensional tropical
cyclone track data comprising latitude and longitude. SIORNN was compared to
the literature and significant improvements in performance were noted. The
adaptability of SIORNN was also studied by applying two learning algorithms,
CC and BPTT. Both learning algorithms outperformed the existing encoding
method, indicating the superiority of SIORNN; however, CC-SIORNN was found to
have the best generalization results. It was found that it is more difficult
to train the recurrent neural network for both tasks given by the previous
method, as the number of dimensions increases along with the noise in the time
series and, with it, the uncertainty. SIORNN has alleviated this weakness and
produced improved results that motivate real-time implementation.

Finally, Chapter 9 evaluated the performance of standard neural networks when trained with different regions of the dataset: the South Pacific region and the Indian Ocean region. It also examined the effect of cyclone duration in the South Pacific region on the networks' generalization performance. Cyclone duration was seen to be a major contributor to the prediction of cyclone intensity, with longer-duration cyclones contributing most to network performance. It was found that cyclones lasting over 12 days could serve as a good representative training set. Transfer stacking via stacked ensembles was successfully used to predict South Pacific cyclone wind intensity, with the South Pacific region as the target model. We used the South Indian Ocean as the source data and evaluated its impact on South Pacific cyclone wind-intensity prediction. The additional source knowledge introduced by data from a different geographical location was unable to provide new knowledge that would improve generalization.
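The transfer-stacking idea summarised above can be sketched as follows: a level-0 learner trained on source-region data and one trained on target-region data feed a level-1 meta-learner fitted on target data. Everything here is a hedged illustration on synthetic stand-in data, using scikit-learn models rather than the thesis's actual networks; the feature dimensions and sample sizes are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for wind-intensity windows (not the real datasets):
# source = South Indian Ocean, target = South Pacific.
def make_data(n, shift):
    X = rng.normal(size=(n, 5))
    y = X.sum(axis=1) + shift + rng.normal(scale=0.1, size=n)
    return X, y

X_src, y_src = make_data(400, shift=0.5)
X_tgt, y_tgt = make_data(200, shift=0.0)

# Level-0 learners: one trained on source data, one on target data.
src_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0).fit(X_src, y_src)
tgt_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0).fit(X_tgt[:150], y_tgt[:150])

# Level-1 meta-learner stacks both level-0 predictions on target data.
meta_X = np.column_stack([src_model.predict(X_tgt[:150]),
                          tgt_model.predict(X_tgt[:150])])
meta = LinearRegression().fit(meta_X, y_tgt[:150])

# Evaluate the stacked model on held-out target samples.
test_X = np.column_stack([src_model.predict(X_tgt[150:]),
                          tgt_model.predict(X_tgt[150:])])
rmse = np.sqrt(np.mean((meta.predict(test_X) - y_tgt[150:]) ** 2))
print(f"stacked test RMSE: {rmse:.3f}")
```

The meta-learner's coefficients then indicate how much the source-region model contributes; a near-zero source weight would mirror the chapter's finding that the changed geographical location added little transferable knowledge. (A more careful implementation would fit the meta-learner on out-of-fold level-0 predictions to avoid leakage.)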

6.1 Future Research Directions

In future work, multi-objective and multi-task methods could be applied to the minimal timespan problem. Further applications could be explored in other problems, such as rainfall, and in those that require fast seasonal prediction at the beginning of an event, such as earthquakes. The versatility of SIORNN could be extended to three or more dimensions, incorporating cyclone track and wind intensity in a single prediction network with different training algorithms.

The stacked transfer learning approach could be used with additional data from other regions and their contribution evaluated. Stacking could also be applied to uncertainty quantification of predictions using Bayesian methods, with application to different datasets. Further comparisons of the stacked model could be made with well-established methods such as SHIPS and CLIPER.
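One simple route towards the uncertainty quantification proposed above is a bootstrap ensemble, which serves as a frequentist proxy for a full Bayesian treatment: refit the model on resampled data and read a rough predictive interval from the spread of ensemble predictions. The model, data, and interval width below are illustrative assumptions, not the thesis's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Toy wind-intensity regression data (synthetic, for illustration only).
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -0.5, 0.3, 0.8]) + rng.normal(scale=0.2, size=300)

# Bootstrap ensemble: each member is fitted on a resampled dataset.
preds = []
for _ in range(100):
    idx = rng.integers(0, len(X), len(X))
    m = LinearRegression().fit(X[idx], y[idx])
    preds.append(m.predict(X[:5]))     # predictions for 5 query points
preds = np.array(preds)                # shape (100, 5)

# Approximate 95% predictive interval from the ensemble spread.
mean, std = preds.mean(axis=0), preds.std(axis=0)
lo, hi = mean - 1.96 * std, mean + 1.96 * std
print(np.all(hi > lo))
```

Replacing the bootstrap with posterior sampling (e.g. MCMC over network weights) would give the Bayesian intervals the text envisages, at higher computational cost.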

Appendix

Publications

The following publications have arisen during the course of my research.

R. Chandra, R. Deo, K. Bali, and A. Sharma. On the relationship of degree of separability with depth of evolution in decomposition for cooperative coevolution. In IEEE Congress on Evolutionary Computation, pages 4823-4830, Vancouver, Canada, July 2016.

R. Deo and R. Chandra. Identification of minimal timespan problem for recurrent neural networks with application to cyclone wind-intensity prediction. In IEEE International Joint Conference on Neural Networks, pages 489-496, Vancouver, Canada, July 2016.

R. Chandra, R. Deo, and C. W. Omlin. An architecture for encoding two-dimensional cyclone track prediction problem in coevolutionary recurrent neural networks. In IEEE International Joint Conference on Neural Networks, pages 4865-4872, Vancouver, Canada, July 2016.

R. Deo, R. Chandra, and A. Sharma. Stacked transfer learning for tropical cyclone intensity prediction. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining, under review, Melbourne, Australia, June 2018.

R. Deo and R. Chandra. Multi-task learning for cyclone wind intensity and path prediction. IEEE Transactions on Geoscience and Remote Sensing, in process.
