Chapter-4 CLASSIFICATION USING NEURAL NETWORK


4.1 Introduction

Classification is one of the most active research and application areas of neural networks, which are considered robust classifiers. This chapter summarizes some of the most important developments of neural networks in pattern classification, and specifically pattern classification using the polynomial neural network.

The field of neural networks has arisen from diverse sources, ranging from mankind's fascination with understanding and emulating the human brain to broader issues of copying human abilities such as classification, which is one of the most frequently encountered decision-making tasks of human activity. Classification is essential for separating large datasets into classes for purposes such as rule generation, decision making, pattern recognition, dimensionality reduction and data mining. Neural networks have emerged as an important tool for classification.

Extensive recent research in classification has established that neural networks are a promising alternative to various conventional classification methods. The advantage of neural networks lies in the following theoretical aspects. First, neural networks are data-driven, self-adaptive methods in that they can adjust themselves to the data without any explicit specification of functional or distributional form for the underlying model. Second, they are universal functional approximators in that neural networks can approximate any function with arbitrary accuracy [42], [77], [78]. Since any classification procedure seeks a functional relationship between the group membership and the attributes of the object, accurate identification of this underlying function is important. Third, neural networks have shown extremely good nonlinear input-output mapping, which makes them flexible in modeling complex real-world relationships. Fourth, neural networks are able to estimate the posterior probability, which provides the basis for establishing classification rules and performing statistical analysis [171]. Finally, a neural network is able to process its input variables in parallel and consequently handle large sets of data swiftly.

The principal strength of the network is its ability to extract patterns and irregularities as well as to detect multi-dimensional non-linear connections in data. The latter quality is extremely useful for modeling dynamical systems, e.g. the stock market or meteorological systems. Apart from that, neural networks are frequently used for pattern recognition tasks and non-linear regression [19]. Although significant progress has been made in classification-related areas of neural networks, a number of issues in applying neural networks remain to be solved successfully or completely. This chapter provides a summary of the most important advances of neural networks in classification and a detailed description of the polynomial neural network. For a good introductory text, see Hertz et al. [73] and Wasserman [220].

4.2 Fundamentals of Biological Neural Network

The term neural network is inspired by the functioning of the human brain; artificial networks adopt simplified models of the biological neural network [4]. The biological neural network consists of nerve cells (neurons), as in Figure-4.1. The cell body of the neuron, which includes the neuron's nucleus, is where most of the neural 'computation' takes place. Neural activity passes from one neuron to another in terms of electrical triggers which travel from one cell to the other down the neuron's axon, by means of an electrochemical process of voltage-gated ion exchange along the axon and of diffusion of neurotransmitter molecules through the membrane over the synaptic gap. The axon can be viewed as a connection wire. However, the mechanism of signal flow is not electrical conduction but charge exchange that is transported by diffusion of ions. This transportation process moves along the neuron's cell, down the axon and then through synaptic junctions at the end of the axon, via a very narrow synaptic space, to the dendrites and/or soma of the next neuron at an average rate of 3 m/sec [66].

Figure 4.1: Biological Model of Neuron [182]

Figure 4.2: Schematic representation of a mathematical model of a simple neuron

Since a neuron may have several (hundreds of) synapses, it can connect (pass its message/signal) to many (hundreds of) other neurons. Similarly, since there are many dendrites in each neuron, a single neuron can receive messages (neural signals) from many other neurons [63]. It is important to note that not all interconnections are equally weighted. Some have a higher priority (a higher weight) than others. Also, some are excitatory and some are inhibitory (serving to block transmission of a message). These differences are caused by differences in chemistry and by the existence of chemical transmitter and modulating substances inside and near the neurons, the axons and the synaptic junction. This nature of interconnection between neurons and weighting of messages is also fundamental to artificial neural networks (ANNs). A schematic diagram of a neuronal model is shown in Figure-4.2. This simple analog of the biological neuron is the common building block (neuron) of every artificial neural network, which is a network of non-linear elements interconnected through adjustable weights [66]. The cell body, dendrites, axon and synaptic junction of the biological neuron are shown in Figure-4.1.
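The weighted excitatory and inhibitory interconnections described above can be sketched as a single artificial neuron. This is only an illustration: the logistic non-linearity and the names `neuron_output` and `bias` are assumptions made here, not something the chapter prescribes.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through a logistic non-linearity (assumed)."""
    # Positive weights play the role of excitatory links, negative weights of inhibitory ones.
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

# Two excitatory inputs and one inhibitory input (illustrative values only).
out = neuron_output([1.0, 1.0, 1.0], [0.8, 0.5, -1.3])
```

Here the inhibitory weight cancels the two excitatory ones, so the neuron sits at the midpoint of its output range.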

4.3 Network Topologies

The main distinction between patterns of connections is that between feed-forward networks and recurrent networks. In a feed-forward network, the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from the outputs of units to the inputs of units in the same layer or previous layers. Recurrent networks may contain feedback connections (Figure-4.4). Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network evolves to a stable state in which these activations no longer change. In other applications, the change of the activation values of the output neurons is significant, such that the dynamical behavior constitutes the output of the network [4]. The perceptron is an example of a feed-forward network; the recurrent networks presented by Anderson, Kohonen and Hopfield are discussed in Section-4.5.

4.4 Brief History and Landmarks of Neural Network

This section focuses on a few important breakthroughs in the history of the field. In 1943 McCulloch and Pitts described brain functions by mathematical means and used their neural networks to model logical operators. In 1949 Hebb proposed that the synaptic connections inside the brain are constantly changing as a person gains experience. In other words, synapses are either strengthened or weakened depending on whether the neurons on either side of the synapse are activated simultaneously or not. In the late 1950s Rosenblatt introduced the concept of the perceptron. Basically, the perceptron, which works as a pattern classifier, is a more sophisticated version of the neuron model developed by McCulloch and Pitts. Depending on the number of neurons incorporated, the perceptron can solve classification problems with various numbers of classes, but only if the classes are linearly separable, which is a major limitation. This was shown by Minsky and Papert in 1969. Minsky and Papert also raised the issue of the credit-assignment problem related to the multi-layer perceptron. During the next decade the general interest in neural networks diminished, mainly as a direct consequence of the results reported in the late sixties. The lack of powerful experimental equipment (computers, workstations etc.) certainly also contributed to the decline.

Interest in neural networks was renewed in 1982 when Kohonen introduced the Self-Organizing Map (SOM). The SOM uses an unsupervised learning algorithm for applications in data mining, image processing and visualization in particular. As a basic description, one can say that high-dimensional data is transformed and organized in a low-dimensional output space. In the same year Hopfield built a bridge between neural computing and physics. A Hopfield network (which consists of symmetric synaptic connections and multiple feedback loops) that is initialized with random weights eventually reaches a final state of stability. From a physicist's point of view, a Hopfield network corresponds to a dynamical system falling into a state of minimal energy.

In the mid-1980s the Boltzmann machine was introduced by Hinton and Sejnowski; it is named after the physicist Ludwig Boltzmann. This neural network utilizes a stochastic learning algorithm based on properties of the Boltzmann distribution.

The backpropagation algorithm, presented by Rumelhart, Hinton and Williams in 1986, proved a crucial step for the revival of neural networks. Rumelhart, Hinton and Williams got the credit, but it later emerged that Werbos had already introduced error backpropagation in his PhD thesis in 1974. This learning algorithm is unchallenged as the most influential learning algorithm for training multi-layer perceptrons.

The Radial-Basis Function (RBF) network was brought forward by Broomhead and Lowe in 1988. The RBF network emerged as an alternative to the multi-layer perceptron in the search for a solution to the multivariate interpolation problem. By using a set of symmetric non-linear functions in the hidden units of a neural network, new properties could be explored. Work by Moody and Darken (presented in 1989) on how to estimate the parameters of the basis functions has contributed significantly to the theory.

One of the first approaches to the systematic design of nonlinear relationships is the Group Method of Data Handling (GMDH), which was proposed in the late 1960s by Ivakhnenko. The GMDH algorithm generates an optimal structure of the model through successive generations of partial descriptions (PDs) of data, which are regarded as quadratic regression polynomials with two input variables. It identifies the nonlinear relations between input and output variables. In the late 1990s this method was improved as the Polynomial Neural Network, which was proposed to alleviate the problems associated with the GMDH.

4.5 Architecture of Neural Network

It is useful to distinguish between neural network architectures based on the way in which the network is trained and classifies patterns. Such a distinction dictates the approach to a problem at a fundamental level. Neural network techniques can be divided into supervised, unsupervised and reinforcement techniques. In this study we concentrate on supervised and unsupervised techniques only.

4.5.1 Supervised Neural Network

The neural network is given the target outputs onto which it should map its inputs, i.e. it is given paired input-output data. The error arising from the discrepancy between the network output and the target is used to optimize the network parameters. Once the network has been trained, it is used to produce an output for unseen data.

A network whose neurons are arranged in layers is referred to as a perceptron. Perceptrons are trained in a supervised fashion. This means that one tries to train the net to perform specific, known functions [147]. A training set in which the target outputs are known is used to train the net.

The Widrow-Hoff rule (gradient descent or Delta rule) is the most widely known supervised learning rule. The general procedure for working with a perceptron layer is to first initialize the network with random weights and biases. Usually the random numbers are kept small and symmetrical about zero. Then an input vector is applied to the net, which generates an output. Since the net has just been initialized, the output is generally incorrect, that is to say, not equal to the training target vector. The learning rule is then applied to the layer. A simple learning rule which is widely used is the Widrow-Hoff rule [226]:

$$\Delta = d(t) - y(t)$$

$$w_i(t+1) = w_i(t) + \eta\,\Delta\,x_i(t)$$

$$d(t) = \begin{cases} +1, & \text{if input is class A} \\ 0, & \text{if input is class B} \end{cases}$$

where $0 < \eta < 1$ is a positive gain term. When the output is 1 the class is A; when the output is 0 the class is B [19].

This rule specifies a simple method of updating each weight. It tries to minimize the error between the target output d(t) and the experimentally obtained output y(t) for each neuron by calculating the error and calling it a "delta" (Δ). Each weight (W) is then adjusted by adding to it delta multiplied by the corresponding input and by an attenuation constant η (the learning rate). This process is iterated until the net error falls below some threshold.

By adding its specific error to each of the weights, we ensure that the network is moved towards the position of minimum error, and by using an attenuation constant rather than the full value of the error, we move it slowly towards this position of minimum error. When correctly trained, the perceptron exhibits some highly promising behavior. Different models based on the perceptron are described below.
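The weight-update procedure described above can be sketched for a single linear unit as follows. This is a minimal sketch under stated assumptions: targets are 1 for class A and 0 for class B, the output is thresholded at 0.5 to assign a class, weights start at zero instead of small random values so that the run is reproducible, and the names `train_delta_rule` and `classify` as well as the toy data are hypothetical.

```python
def train_delta_rule(samples, targets, eta=0.05, epochs=500):
    """Widrow-Hoff (delta) rule for one linear unit: w_i <- w_i + eta * delta * x_i."""
    n_inputs = len(samples[0])
    # Assumption: zero initial weights (the text suggests small random values).
    weights = [0.0] * (n_inputs + 1)  # the last weight acts as the bias
    for _ in range(epochs):
        for x, d in zip(samples, targets):
            xb = list(x) + [1.0]                      # append constant bias input
            y = sum(w * xi for w, xi in zip(weights, xb))
            delta = d - y                             # error between target and output
            weights = [w + eta * delta * xi for w, xi in zip(weights, xb)]
    return weights

def classify(weights, x):
    """Assign class A when the linear output is at least 0.5, otherwise class B."""
    y = sum(w * xi for w, xi in zip(weights, list(x) + [1.0]))
    return "A" if y >= 0.5 else "B"

# Toy, linearly separable data (illustrative only).
samples = [(2, 2), (3, 2), (2, 3), (0, 0), (1, 0), (0, 1)]
targets = [1, 1, 1, 0, 0, 0]
weights = train_delta_rule(samples, targets)
```

A fixed number of epochs stands in for the "error below some threshold" stopping test mentioned in the text.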

(a) Single Layer Perceptron

The perceptron learning theorem was formulated by Rosenblatt in 1961. The theorem states that a perceptron can learn (solve) anything it can represent (simulate). A single layer perceptron, as shown in Figure-4.3, is able to learn to classify objects according to their position in the n-dimensional hyperspace defined by its n inputs (which are not interconnected), provided the problem can be reduced to a linearly separable classification problem [108].

When perceptrons were first introduced, they seemed revolutionary. They offered a mathematical model of a structure which could be taught to classify points in hyperspace, not according to rules, but by being shown which points belonged in which sets. This freed humans from the necessity of determining rules by which the points should be classified. It was thought that perceptrons could be trained to solve any problem which could be set up as a classification problem in hyperspace.

In 1969, Minsky and Papert published a book in which they pointed out, as did E. B. Crane in 1965 in a less-known book, the grave limitations in the capabilities of the perceptron, as is evident from its representation theorem. They showed that, for example, the perceptron cannot solve even a 2-state Exclusive-Or (XOR) problem [$x_1 \oplus x_2$] or its complement, the 2-state contradiction problem (XNOR). A further training difficulty can be understood from the ball analogy: just as a ball released on an uneven surface can come to rest in a local depression and not find a deeper but distant hole, the net can be trained according to the Widrow-Hoff algorithm and still not find the position of the global minimum. It can "get stuck" in a local minimum [72].
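The XOR limitation can be illustrated with a small experiment. The sketch below assumes a single threshold unit trained with the error-correction form of the rule above, zero initial weights, and hypothetical helper names (`train_perceptron`, `accuracy`); it is an illustration, not the construction used by Minsky and Papert.

```python
def train_perceptron(samples, targets, eta=0.5, epochs=100):
    """Train one threshold unit with two inputs and a bias by error correction."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x1, x2), d in zip(samples, targets):
            y = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            delta = d - y
            w1 += eta * delta * x1
            w2 += eta * delta * x2
            b += eta * delta
    return w1, w2, b

def accuracy(weights, samples, targets):
    """Fraction of patterns the threshold unit classifies correctly."""
    w1, w2, b = weights
    hits = sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == d
        for (x1, x2), d in zip(samples, targets)
    )
    return hits / len(samples)

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]   # linearly separable
xor_targets = [0, 1, 1, 0]   # not linearly separable

and_acc = accuracy(train_perceptron(patterns, and_targets), patterns, and_targets)
xor_acc = accuracy(train_perceptron(patterns, xor_targets), patterns, xor_targets)
```

The unit learns AND perfectly, whereas for XOR no choice of weights can classify all four patterns correctly, however long training continues.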

Figure 4.3: Single layer Neural Network with n inputs and p outputs

(b) Multilayer Perceptron

With the discovery that the perceptron was unable to deal with linearly inseparable problems, work on neural nets largely ceased. Multilayer perceptrons (artificial neural networks), introduced in 1986 [181], made it possible to train a network to separate linearly inseparable data. Such a network consists of a large number of units (neurons) joined together in a pattern of connections (Figure-4.5). Units in a neural net are usually segregated into three classes: input units, which receive the information to be processed; output units, where the results of the processing are found; and units in between, known as hidden units. Various models are based on the multilayer perceptron; some of them are described here.

(c) Feed Forward Neural Network (FFNN)

FFNNs are a kind of multilayer neural network which allows signals to travel one way only, from input to output. First, the network is trained on a set of paired data to determine the input-output mapping. The weights of the connections between neurons are then fixed and the network is used to determine the classifications of a new set of data. During classification the signal at the input units propagates all the way through the net to determine the activation values at all the output units. Each input unit has an activation value that represents some feature external to the net. Every input unit sends its activation value to each of the hidden units to which it is connected. Each of these hidden units calculates its own activation value, and this signal is then passed on to the output units. The activation value for each receiving unit is calculated according to a simple activation function. The function sums together the contributions of all sending units, where the contribution of a unit is defined as the weight of the connection between the sending and receiving units multiplied by the sending unit's activation value. This sum is usually then further modified, for example, by adjusting the activation sum to a value between 0 and 1 and/or by setting the activation value to zero unless a threshold level for that sum is reached.
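The propagation of activation values described above can be sketched as a forward pass. The sketch assumes one hidden layer and a logistic squashing function that maps each sum to a value between 0 and 1; the function names and the weight layout (`weights[j][i]` is the weight from sending unit `i` to receiving unit `j`) are choices made for this illustration.

```python
import math

def sigmoid(a):
    """Squash an activation sum to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, weights, biases):
    """Each receiving unit sums weight * sending activation, then applies the squashing."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate input activations through the hidden layer to the output units."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)

# Two inputs, two hidden units, one output unit (illustrative weights only).
outputs = forward(
    [1.0, 0.0],
    hidden_w=[[0.5, -0.4], [0.3, 0.8]],
    hidden_b=[0.1, -0.2],
    output_w=[[1.2, -0.7]],
    output_b=[0.05],
)
```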

Figure 4.5 : Multilayer Perceptron (Feed-forward Artificial Neural Net)

Generally, properly determining the size of the hidden layer is a problem, because an underestimate of the number of neurons can lead to poor approximation and generalization capabilities, while excessive nodes can result in over-fitting and eventually make the search for the global optimum more difficult [30]. Kon et al. [102] also studied the minimum number of neurons and the number of instances necessary to program a given task into a feed-forward neural network.

Feed-forward neural networks are usually trained by the original back-propagation algorithm or by some variant. Their greatest problem is that training is too slow for most applications. One approach to speeding up training is to estimate optimal initial weights [229]. Another method for training multilayered feed-forward ANNs is the weight-elimination algorithm, which automatically derives the appropriate topology and therefore also avoids the problems with overfitting [222]. Genetic algorithms have been used to train the weights of neural networks [197] and to find the architecture of neural networks [269]. There are also Bayesian methods for training neural networks [217]. A number of other techniques have emerged recently which attempt to improve ANN training algorithms by changing the architecture of the networks as training proceeds. These techniques include pruning useless nodes or weights [32], and constructive algorithms, where extra nodes are added as required [157].

(d) Back Propagation (BP)

The BP algorithm is used for training artificial neural networks [229]. Training is usually carried out by iterative updating of weights based on the error signal. The negative gradient of a mean-squared error function is commonly used. In the output layer, the error signal is the difference between the desired and actual output values, multiplied by the slope of a sigmoidal activation function. The error signal is then back-propagated to the lower layers. BP is a descent algorithm, which attempts to minimize the error at each iteration. The weights of the network are adjusted by the algorithm such that the error is decreased along a descent direction. Traditionally, two parameters, called the learning rate and the momentum factor, are used for controlling the weight adjustment along the descent direction and for dampening oscillations.
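One update of this kind can be sketched for a small network as follows. The sketch assumes a single hidden layer, one sigmoidal output unit, a squared-error function $E = \frac{1}{2}(d - y)^2$ for one pattern, and hypothetical names (`backprop_step`, `eta` for the learning rate, `alpha` for the momentum factor); it illustrates the idea and is not the exact procedure of [229].

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, w_out):
    """Forward pass; every weight row ends with a bias weight."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0]))) for row in w_hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, h + [1.0])))
    return h, y

def backprop_step(x, d, w_hidden, w_out, prev_dw_hidden, prev_dw_out, eta=0.5, alpha=0.5):
    """One weight update with learning rate eta and momentum factor alpha."""
    h, y = forward(x, w_hidden, w_out)
    # Output error signal: (desired - actual) times the sigmoid slope y * (1 - y).
    delta_out = (d - y) * y * (1.0 - y)
    # Back-propagate the error signal to each hidden unit.
    delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # Weight change = eta * error signal * sending activation + alpha * previous change.
    dw_out = [eta * delta_out * hi + alpha * p for hi, p in zip(h + [1.0], prev_dw_out)]
    dw_hidden = [
        [eta * delta_hidden[j] * xi + alpha * p for xi, p in zip(x + [1.0], prev_dw_hidden[j])]
        for j in range(len(h))
    ]
    new_w_out = [w + dw for w, dw in zip(w_out, dw_out)]
    new_w_hidden = [[w + dw for w, dw in zip(row, drow)] for row, drow in zip(w_hidden, dw_hidden)]
    return new_w_hidden, new_w_out, dw_hidden, dw_out
```

With `alpha` set to zero the change reduces to plain gradient descent; a non-zero `alpha` adds a fraction of the previous change, which is what dampens oscillations.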

The BP algorithm is used for many applications. However, its convergence rate is relatively slow, especially for networks with more than one hidden layer. The reason for this is the saturation behavior of the activation function used for the hidden and output layers. When the output of a unit lies in the saturation area, the corresponding descent gradient takes a very small value, even if the output error is large, leading to very little progress in the weight adjustment. The selection of the learning rate and momentum factor is arbitrary, because the error surface usually consists of many flat and steep regions and behaves differently from application to application. Large values of the learning rate and momentum factor are helpful to accelerate learning. However, this increases the possibility of the weight search jumping over steep regions and moving out of the desired regions.
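The saturation effect can be made concrete by evaluating the slope of the activation function. The sketch assumes the logistic sigmoid, whose slope at output $y$ is $y(1 - y)$; `sigmoid_slope` is a name chosen for this illustration.

```python
import math

def sigmoid_slope(a):
    """Slope of the logistic sigmoid at net input a: y * (1 - y)."""
    y = 1.0 / (1.0 + math.exp(-a))
    return y * (1.0 - y)

slope_centre = sigmoid_slope(0.0)      # largest possible slope
slope_saturated = sigmoid_slope(10.0)  # unit deep in the saturation area
```

Because the error signal is multiplied by this slope, a saturated unit receives a weight change several thousand times smaller than an unsaturated one for the same output error.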

In training by the back-propagation algorithm, the target values of the output neurons are set to 1 or 0, but there are no target values for the outputs of the hidden neurons. Thus the training is not efficient. To overcome this problem, several approaches have been developed to train the network layer by layer [57], [117], [178], [212], [213], [218], [219]. In one such approach, each layer is trained using, as its objective function, the discriminant-analysis criterion that maximizes the between-class scatter while keeping the within-class scatter constant. In Wang and Chan [219], the weights between the output and hidden layers and the outputs of the previous layers are determined by minimizing the sum of squared errors. The calculated outputs are then used as the desired outputs of the hidden neurons. This method can be used both for pattern classification and function approximation [2].

Figure 4.4: Three-layer MLP Neural Network (Recurrent Network)

(e) RBF NN

The RBF network is a three-layer feed-forward network in which each hidden unit implements a radial activation function and each output unit implements a weighted sum of the hidden units' outputs. Its training procedure is usually divided into two stages. First, the centers and widths of the hidden layer are determined by clustering algorithms. Second, the weights connecting the hidden layer with the output layer are determined by Singular Value Decomposition or Least Mean Squares algorithms. The problem of selecting the appropriate number of basis functions remains a critical issue for RBF networks. The number of basis functions controls the complexity and the generalization ability of RBF networks. RBF networks with too few basis functions cannot fit the training data adequately due to limited flexibility.

To sum up, RBF ANNs have been applied to many real-world problems, but their most striking disadvantage is their inability to explain their output in a way that can be effectively communicated. For this reason many researchers have tried to improve the comprehensibility of neural networks, where the most attractive solution is to extract symbolic rules from trained neural networks. Setiono and Leow [119] divided the activation values of relevant hidden units into two subintervals and then found the set of relevant connections of those relevant units to construct rules, with the aim of making the network architecture flexible and optimized. RBF networks have also been widely applied in many science and engineering fields [173], [236].
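The structure of an RBF network (radial hidden units followed by weighted-sum output units) can be sketched as a forward pass. The sketch assumes Gaussian basis functions, which the text does not specify, and takes the centers, widths and output weights as already determined; `rbf_forward` and the numeric values are hypothetical.

```python
import math

def rbf_forward(x, centers, widths, out_weights):
    """Hidden units apply a Gaussian of the distance to their center (assumed form);
    each output unit is a weighted sum of the hidden activations."""
    hidden = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * s ** 2))
        for c, s in zip(centers, widths)
    ]
    outputs = [sum(w * h for w, h in zip(row, hidden)) for row in out_weights]
    return hidden, outputs

# Two basis functions and one output unit (illustrative values only).
hidden, outputs = rbf_forward(
    [0.0, 0.0],
    centers=[[0.0, 0.0], [3.0, 4.0]],
    widths=[1.0, 1.0],
    out_weights=[[2.0, 1.0]],
)
```

An input that coincides with a center fully activates that hidden unit, while a distant center contributes almost nothing to the output.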

(f) Group Method of Data Handling (GMDH)

Prof. A. G. Ivakhnenko developed the Group Method of Data Handling (GMDH) in the late 1960s as a means of identifying nonlinear relations between input and output variables. As described in [227], the GMDH generates successive layers with complex links that are individual terms of a polynomial equation. GMDH offers several advantages over conventional Feedback Neural Networks. Since it allows a second-order method of convergence to its memory locations, it approaches equilibrium rapidly. The memories of this network can be located anywhere in an n-dimensional space rather than being confined to the corners of a hypercube, as is the case with the Hopfield network and other networks which use sigmoidal or similar nonlinearities. The GMDH algorithms better epitomize the complex biological neurons, which are self-organized. Only by this self-organizing method can optimal non-physical models be found for small, inaccurate or noisy data samples. Non-physical models usually have a higher accuracy and a simpler model structure than physical models. In a neuronet with such neurons, we have a twofold multilayered structure: the neurons themselves are multilayered, and they are united in a multilayered way into a common matrix [172]. GMDH algorithms are examples of complex active neurons, because they choose the effective inputs and corresponding coefficients by themselves during the process of self-organization. The GMDH has been improved as the Polynomial Neural Network (PNN) [146], [149] for classification purposes.

4.5.2 Unsupervised Neural Network

This is used when we have unpaired data and want to find groupings within the data heuristically. There is therefore no input-output function to map, as there are no targets. Unsupervised learning produces groupings of the data based solely on the correlations in the data, and not on associations with external parameters (the targets in the supervised case). The Hebbian rule is the most widely known unsupervised learning rule; it is based on work by the Canadian neuro-psychologist Donald Hebb, who theorized that neuronal learning (i.e., synaptic change) is a local phenomenon which can be expressed in terms of the temporal correlation between the activation values of neurons. Specifically, the synaptic change depends on both pre-synaptic and post-synaptic activities: the change in a synaptic weight is a function of the temporal correlation between the pre-synaptic and post-synaptic activities. The value of the synaptic weight between two neurons increases whenever they are in the same state and decreases when they are in different states.
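The last statement can be written as a one-line update. The sketch assumes bipolar states (+1 for active, -1 for inactive) so that the product of the two states is positive when they agree and negative when they differ; `hebbian_update` and `eta` are names chosen for this illustration.

```python
def hebbian_update(weight, pre, post, eta=0.1):
    """Hebbian change: proportional to the product of pre- and post-synaptic states."""
    # With bipolar states (+1 / -1), equal states strengthen the synapse
    # and different states weaken it.
    return weight + eta * pre * post

w_same = hebbian_update(0.0, +1, +1)       # both active
w_both_off = hebbian_update(0.0, -1, -1)   # both inactive
w_different = hebbian_update(0.0, +1, -1)  # states differ
```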

Unsupervised classification methods can be useful in spotting natural groupings within a dataset when we have little knowledge of what the groupings could be. Examples of unsupervised networks include Kohonen's Self-Organizing Map [100], the Bayesian AutoClass system of Cheeseman [270] and the Adaptive Resonance Theory (ART) [271]. As an example of an unsupervised neural network, Kohonen's Self-Organizing Map is described below.

Kohonen Neural Network

The Self-Organizing Map [99] is a very popular artificial neural network (ANN) algorithm based on unsupervised learning. The SOM is used in various data mining and pattern recognition tasks. It provides several very beneficial properties, such as vector quantization and projection. The network consists of output units, typically arranged in a two-dimensional plane, with weights between each output unit and the input units. When an input vector is fed to the network, only one output unit, the one with the best-matching weight (i.e. whose weight vector is closest to the input vector), is selected as the 'winner'. After learning, units in the network have modified weights such that neighboring units have similar weight vectors. Hence similar inputs are linked with winner units that are located close to each other, while winner units for different inputs are located far apart in the network. Thus the feature map is created and inputs to the network are automatically classified on the map. The advantage of the feature map is that the weight of each output unit directly shows a corresponding input vector. The main property of the Kohonen network is its unsupervised learning, which makes it possible to divide the set of input vectors into clusters without prior knowledge about their similarities.
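Winner selection and the neighborhood update can be sketched for a small two-dimensional map. The sketch makes several assumptions the text leaves open: squared Euclidean distance for matching, a square neighborhood of fixed `radius`, and a fixed learning rate `eta` (practical SOMs usually shrink both over time); all names are hypothetical.

```python
def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to the input x."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )

def som_update(weights, x, eta=0.5, radius=1):
    """Move the winner and its grid neighbors towards the input vector."""
    winner = best_matching_unit(weights, x)
    for pos, w in weights.items():
        # Assumed neighborhood: units within `radius` grid steps of the winner.
        if max(abs(pos[0] - winner[0]), abs(pos[1] - winner[1])) <= radius:
            weights[pos] = [wi + eta * (xi - wi) for wi, xi in zip(w, x)]
    return winner

# A 2 x 2 map with two-dimensional weight vectors (illustrative values only).
grid = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = som_update(grid, [0.9, 0.8], eta=0.5, radius=0)
```

Repeating the update over many inputs is what pulls neighboring units towards similar weight vectors and so forms the feature map.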


4.6 PNN for Pattern Classification

PNN is a flexible neural architecture whose topology is not predetermined or fixed like that of a conventional ANN but is grown layer by layer through learning. The design is based on GMDH, which was invented by Prof. A. G. Ivakhnenko in the late 1960s [80], [81], [146], [149]. He developed GMDH as a means of identifying nonlinear relations between input and output variables. As described in [148], the GMDH generates successive layers with complex links that are individual terms of a polynomial equation. The individual terms generated in the layers are partial descriptions (PDs) of data, which are quadratic regression polynomials, as shown in Figure-3.1 and Figure-4.7 for a basic PNN model and its building blocks with all inputs. The first layer is created by computing regressions of the input variables and choosing the best ones for survival. The second layer is created by computing regressions of the values in the previous layer along with the input variables and retaining the best candidates. More layers are built until the network stops improving according to the termination criteria. The selection criterion used in this study penalizes models that become too complex, in order to prevent overtraining.

To achieve high classification accuracy with a feed-forward neural network (FNN) [156], one has to provide in advance a well-defined structure of the FNN, such as the number of input nodes, hidden neurons and output neurons, and assume a proper set of relevant features. To alleviate this drawback of ANNs, PNN can be used for classification purposes. With this approach, the PNN model generates new populations of neurons during learning, and the number of layers and the complexity of the network increase [138], [139] until a predefined criterion is met. Such models can be comprehensively described by a set of short-term polynomials, thereby yielding a PNN classifier. The coefficients of the PNN can be estimated by least squares fitting. The network architecture grows depending on the number of input features, the PNN model selected, the number of layers required, and the number of PDs preserved in each layer.

The GMDH belongs to the class of inductive, self-organizing, data-driven approaches. It requires only small data samples and is able to optimize the structure of the model objectively. The relationship between input and output variables can be well approximated by the Volterra functional series, the discrete form of which is the Kolmogorov-Gabor polynomial [81].

Let us assume that the input-output data is given in the following form:

$$(X_i, y_i) = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}, y_i),$$

where i = 1, 2, 3, ..., n; n is the number of samples and m is the number of features. In matrix form it is represented as follows:

$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} & : & y_1 \\ x_{21} & x_{22} & \cdots & x_{2m} & : & y_2 \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} & : & y_n \end{bmatrix}$$

The input-output relationship of the above data can be described by the PNN model in the following manner: $y = f(x_1, x_2, x_3, \ldots, x_m)$.

The estimated output o f variables can be approximated by

Volterra functional series, the discrete form of which is Kolmogorov-

Gabor polynomial (Madala and Ivakhnenko, 1994), i.e.

y - Co +'LCk x ki + ' £ c kikx i x k2 + Y,Ckxkikx k x k x k̂ + ..... (Eq. 4.1)k xk

Where denotes the coefficients or weights of the

Kolmogorov—Gabor polynomial and x vector is the input variable. Further

a new GMDH algorithm has also been developed by Ivakhnenko [81],

[142] which is a form of Kolmogorov—Gabor polynomial. He proved that a second order polynomial, i.e.

$$y = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2 \quad \text{(Eq. 4.2)}$$

which takes only two input variables at a time, can reconstruct the complete Kolmogorov-Gabor polynomial through an iterative procedure. The GMDH-type polynomial neural networks are multi-layered models consisting of neurons/active units/partial descriptions (PDs), whose transfer function is the short-term polynomial described in Eq. (4.2). At the first layer, L = 1, the algorithm generates the first population of PDs using all possible combinations of two inputs out of the m variables. Hence, the total number of PDs in the first layer is m(m-1)/2, and the output of each PD in layer L = 1 is computed by applying Eq. (4.2). The coefficient vector of each PD is determined by least-squares estimation. The basic architecture of a PD with two inputs is shown in Figure 4.6.
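As a minimal sketch of the first-layer construction, the snippet below evaluates the quadratic PD of Eq. (4.2) and enumerates the feature pairs that feed the first-layer PDs. The function names, the 0-based indexing and the example coefficients are illustrative assumptions, not part of the original method description.

```python
from itertools import combinations
from math import comb


def pd_output(a, xi, xj):
    """Evaluate the quadratic PD of Eq. (4.2) for one pair of inputs."""
    a0, a1, a2, a3, a4, a5 = a
    return a0 + a1 * xi + a2 * xj + a3 * xi * xj + a4 * xi ** 2 + a5 * xj ** 2


def first_layer_pairs(m):
    """All unordered feature pairs (0-based) that feed the first-layer PDs."""
    return list(combinations(range(m), 2))


if __name__ == "__main__":
    m = 4  # hypothetical number of input features
    pairs = first_layer_pairs(m)
    # The number of pairs equals m(m-1)/2, i.e. "m choose 2".
    print(len(pairs), comb(m, 2), m * (m - 1) // 2)
    # Hypothetical coefficients a0..a5, evaluated at xi = 2, xj = 1.
    print(pd_output((1.0, 2.0, 3.0, 0.5, 0.0, -1.0), 2.0, 1.0))
```

With m = 4 features the sketch yields six PDs, one per unordered pair of features.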

Let us consider the equations for the first PD of layer 1, which receives input from features 1 and 2:

$$d_1 = y_1 - (c_{11} + c_{12} x_{11} + c_{13} x_{12} + c_{14} x_{11} x_{12} + c_{15} x_{11}^2 + c_{16} x_{12}^2)$$

$$d_2 = y_2 - (c_{11} + c_{12} x_{21} + c_{13} x_{22} + c_{14} x_{21} x_{22} + c_{15} x_{21}^2 + c_{16} x_{22}^2)$$

where $d_i$ is the residual for the $i$th sample, $x_{ij}$ is the input from the $j$th feature of that sample, and $c_{11}, c_{12}, c_{13}, \ldots$ are the coefficients.

Figure 4.6 : Basic architecture of PD

Figure 4.7 : The building blocks of the PNN model (the diagram includes a block labelled "Select best-performing PDs / stopping criterion").

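Let us consider the general equation for the $j$th PD, which receives input from features $p$ and $q$, i.e.

$$d_i = y_i - (c_{j1} + c_{j2} x_{ip} + c_{j3} x_{iq} + c_{j4} x_{ip} x_{iq} + c_{j5} x_{ip}^2 + c_{j6} x_{iq}^2)$$

where $1 \le j \le k$, $k = m(m-1)/2$, and $1 \le i \le n$. The least-squares fit minimizes the sum of squared residuals

$$\sum_{i=1}^{n} d_i^2 = d_1^2 + d_2^2 + \cdots + d_n^2$$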

After obtaining the values of all the coefficients using least-squares estimation on the training dataset, we can estimate the target as:

$$\hat{y}_i = c_{j1} + c_{j2} x_{ip} + c_{j3} x_{iq} + c_{j4} x_{ip} x_{iq} + c_{j5} x_{ip}^2 + c_{j6} x_{iq}^2$$

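If the error level does not reach the desired value, we construct the next layer of the PNN by taking the outputs of the previous layer as inputs and applying the same procedure:

$$d_i = y_i - (c_{j1} + c_{j2} z_{ip} + c_{j3} z_{iq} + c_{j4} z_{ip} z_{iq} + c_{j5} z_{ip}^2 + c_{j6} z_{iq}^2)$$

where $z_{ip}$ and $z_{iq}$ are outputs of PDs in the previous layer for the $i$th sample. This process is repeated as long as the error keeps decreasing.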
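The least-squares estimation of the six PD coefficients described above can be sketched as follows. This is an illustrative implementation under stated assumptions: it solves the normal equations with plain Gaussian elimination, the helper names (`design_row`, `solve`, `fit_pd`) are hypothetical, and it assumes the data are rich enough for the system to be non-singular.

```python
def design_row(xp, xq):
    """Regressor row matching the PD terms: 1, xp, xq, xp*xq, xp^2, xq^2."""
    return [1.0, xp, xq, xp * xq, xp * xp, xq * xq]


def solve(M, b):
    """Solve M c = b by Gaussian elimination with partial pivoting.

    Assumes M is square and non-singular."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (aug[i][n] - s) / aug[i][i]
    return c


def fit_pd(xs_p, xs_q, ys):
    """Least-squares coefficients of one PD via the normal equations."""
    X = [design_row(p, q) for p, q in zip(xs_p, xs_q)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(6)] for i in range(6)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(6)]
    return solve(XtX, Xty)
```

Calling `fit_pd` with the two selected feature columns and the target column returns the coefficients in the same order as the terms of the PD. For ill-conditioned data a more robust solver (for example a QR-based one) would be preferable to the normal equations.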

4.6.1 Algorithm of PNN

The algorithm of PNN is described as the following sequence of

steps:

1. Determine the system’s input variables and if needed carry out the

normalization of input data.

2. Partition the given samples into training and testing samples. The input-output dataset is divided into two parts: a training part, denoted TR, and a test part, denoted TS. If the total number of samples is n, then n = TR + TS, where TR and TS here also stand for the sizes of the two parts. The training part is used to construct the PNN model (including estimation of the coefficients of the PDs in every layer), and the test data are used to evaluate the estimated PNN.

3. Select a structure of the PNN. The structure of the PNN is selected based on the number of input variables and the order of PDs in each layer. The PNN structures can be categorised into two types, namely, a basic PNN and a modified PNN. In the case of basic PNN the number of the input variables of PDs is the same in every layer, whereas in modified PNN the number of input variables of PDs varies from layer to layer.

4. Determine the new input variables and the order of the polynomial forming a partial description (PD) of the data. If there are m input variables, then the total number of PDs will be $m!/(r!\,(m-r)!)$ nodes, where r is the number of chosen input variables.

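5. Estimate the coefficients of the PD. The vector of coefficients $a = (a_0, a_1, a_2, a_3, a_4, a_5)$ is derived by minimizing the mean squared error between $y_i$ and $\hat{y}_{ji}$,

$$E = \frac{1}{TR} \sum_{i=1}^{TR} (y_i - \hat{y}_{ji})^2$$

where

$$\hat{y}_{ji} = a_0 + a_1 x_{ip} + a_2 x_{iq} + a_3 x_{ip} x_{iq} + a_4 x_{ip}^2 + a_5 x_{iq}^2, \quad 1 \le p, q \le m,\ p \ne q,\ j = 1, 2, \ldots, \frac{m(m-1)}{2}$$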

In order to find the coefficients, we need to minimise the error criterion E. Differentiating E with respect to each of the coefficients and setting the derivatives to zero, we get a set of linear equations. In matrix form we can write

$$Y = XA$$

Equivalently

$$X^T Y = X^T X A \;\Rightarrow\; A = (X^T X)^{-1} X^T Y$$

This procedure is repeated for all nodes of a layer and for all layers of the PNN, starting from the input layer and moving towards the output layer. Further, the following simple algorithm finds the indices of the input features for each PD.

1. Let the number of layers be 1.
2. Let k = 1.
3. FOR i = 1 to m - 1
4. FOR j = i + 1 to m
5. PD k receives input from the features
6. p = i and q = j;
7. k = k + 1;
8. END FOR
9. END FOR

6. Select PDs with the best classification accuracy. Each PD is estimated and evaluated using both the training and testing data sets. The PDs that give the best predictive performance for the output variable are chosen. Normally a cutoff value for the performance of all PDs is specified in advance; only PDs whose performance is above this cutoff are retained in the next generation.

7. Check the Stopping Criterion.

7.1 The following stopping condition indicates that an optimal PNN model has been reached at the previous layer, so the modeling can be terminated. The condition reads Ec > Ep, where Ec is the minimal identification error of the current layer and Ep is the minimal identification error that occurred at the previous layer.

7.2 The PNN algorithm terminates when the number of iterations (predetermined by the designer) is reached. When setting up a stopping (termination) criterion, one should be prudent in achieving a balance between model accuracy and the overall computational complexity associated with the development of the model.

8. Determine new input variables for the next layer. If neither of the above two stopping criteria is satisfied, the model is expanded to the next layer.
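The bookkeeping parts of the algorithm above (the index scheme for PD inputs, the cutoff-based selection of step 6 and the stopping test of step 7) can be sketched as below. All function and parameter names are hypothetical, the performance scores are assumed to be "higher is better", and the iteration limit of step 7.2 is interpreted here as a limit on the number of layers.

```python
def pd_input_indices(m):
    """Nested-loop index scheme from the text: PD k takes features (p, q)."""
    indices = {}
    k = 1
    for i in range(1, m):              # i = 1 .. m-1
        for j in range(i + 1, m + 1):  # j = i+1 .. m
            indices[k] = (i, j)        # p = i, q = j
            k += 1
    return indices


def select_pds(scores, cutoff):
    """Step 6: keep the PDs whose performance exceeds the cutoff."""
    return [k for k, s in scores.items() if s > cutoff]


def should_stop(e_current, e_previous, layer, max_layers):
    """Step 7: stop if the error got worse (Ec > Ep) or the budget is used up."""
    return e_current > e_previous or layer >= max_layers


if __name__ == "__main__":
    print(pd_input_indices(3))
    # Hypothetical accuracies for three PDs and a hypothetical cutoff.
    print(select_pds({1: 0.9, 2: 0.6, 3: 0.8}, cutoff=0.75))
    print(should_stop(e_current=0.05, e_previous=0.10, layer=2, max_layers=10))
```

When `should_stop` returns false, the outputs of the PDs returned by `select_pds` would serve as the input variables of the next layer.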

4.7 Neural Network Techniques in Pattern Classification - A Review

Data classification [54], [119], [140] is a core issue in data mining, pattern recognition and forecasting. The goal of classification is to assign a new object to a class from a given set of predefined classes based on the attribute values of the object. Furthermore, classification is based on some discovered model, which forms an important piece of knowledge about the application domain. There is a wide range of methods for the classification task. One of the popular and widely used techniques is the artificial neural network (ANN) [156]. The main idea behind an artificial neural network is to use several simple computational units, connected by weighted links through which activation values are transmitted. The units normally have a very simple way of calculating new activation values from the values received through the connections, for example summing their inputs and feeding the sum through a monotonic transfer function. In a classification task, the pattern to be classified is typically fed into the network as the activation of a set of input units. This activation is then spread through the network via the connections, finally resulting in an activation of the output units, which is interpreted as the classification result. Training the network consists of showing it the patterns of the training set and letting it adjust its connection weights to obtain the correct output.

There are a large number of different neural network architectures. One of the most popular neural network architectures used for classification is the multi-layer perceptron. The units are organized into layers, and the network is said to be feed-forward because the activation values propagate in one direction only, from the units in the input layer, through a number of hidden layers, to the output layer. The multi-layer perceptron is usually trained with the error back-propagation method [183].

Worth mentioning here is the Single Layer Perceptron which

is a simple perceptron [176] which preceded the multi-layer perceptron.

The original perceptron actually had what could be considered as a hidden

layer of randomly selected “higher order units”. The single layer

perceptron is perhaps more similar in structure to the Adaline [225]. The

single layer of weights between input and output units is trained, just as in

the multi-layer case, with a gradient descent method, which adjusts the

weights a small step in the direction which will make the classification of

the current pattern more correct. The main reason to mention this network is its limitations: the authors of [135] pointed out what could and could not be done with this simple type of architecture. Since each output unit can only compute the inner product of the input vector with its weight vector and feed the result through a monotonic transfer function, the decision boundary between any pair of output units will always be a hyperplane in the input space. This means that it can only solve classification tasks where the classes are linearly separable. The authors of [135] gave several examples of interesting tasks which have more complex decision boundaries and thus cannot be solved with this simple architecture, regardless of the method used to set the weights. These limitations can be overcome, for instance, by using one or more hidden layers in the network.
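The linear-separability limitation can be illustrated with a small experiment. The sketch below uses the classic perceptron error-correction rule with a threshold output, which is an assumption swapped in for the gradient-based training mentioned above; the function name and the choice of the AND and XOR tasks are illustrative.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron rule on 2-D inputs with a threshold output unit.

    Returns (weights, bias, converged), where converged is True if a full
    pass over the samples produced no classification error."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - out
            if delta != 0:
                # Shift the separating line towards classifying this sample.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    print("AND converged:", train_perceptron(AND)[2])
    print("XOR converged:", train_perceptron(XOR)[2])
```

AND is linearly separable, so the single-layer unit finds a separating line; XOR is not, so no setting of the weights classifies all four patterns and the training loop never completes an error-free pass.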

The idea behind the error back-propagation neural networks

is very different from the other classification methods. Rather than trying

to calculate a probability or similarity or truth value directly, they use

more of an error correcting strategy: start with random weights and adjust

them to make the results better. Still, under certain conditions the output

activities of a multi-layer perceptron, trained with error back-propagation,

can be shown to approach the conditional probabilities of the

corresponding classes [171]. This relates these neural networks to the

statistical methods. The above networks all use supervised training

methods, where the correct class label has to be given when updating the

weights. There is another kind of training called reinforcement learning

[17], in which only a global signal indicating if the answer was wrong or

right is given. This is sometimes useful, e.g. when learning to play a game, where it is only possible to know whether a whole sequence of moves was good or bad (whether it led to a win or a loss), and not exactly what should have been done in each move [133]. Turning to different architectures, there is also a large group of radial basis function (RBF) neural networks [29], [272]. Whereas the units in the hidden layer of the multi-layer perceptron each respond to inputs on one side of a hyperplane in the input space, a unit in an RBF network responds in a radially symmetric region of the input space. In one version there are as many hidden units as training samples [164], each with the center of its radially symmetric function (typically a Gaussian function) at one of the training samples. Although not usually considered RBF networks, a related group is the competitive neural networks, e.g. self-organizing maps [100], [172] and learning vector quantization [101]. Here the units correspond to prototype patterns (codebook vectors) and respond according to how close a stimulus pattern is. They are usually of the winner-take-all type, i.e. only the most active unit wins and suppresses all other units. This is similar to the principles used in generalized nearest neighbor methods.

The competitive neural networks are usually trained by moving the codebook vectors closer to the patterns they respond to, using some scheme [100], [120], [172]. Also of interest is the Competitive Selective Learning [33] method, which includes a way to remove codebook vectors from regions where they are too dense and add them in regions where they are too sparse.
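A winner-take-all update of the kind described above can be sketched in a few lines. This is a generic illustration, not the specific scheme of any cited method: the squared Euclidean distance, the learning rate `lr` and the function names are assumptions.

```python
def winner(codebook, x):
    """Index of the codebook vector closest to x (winner-take-all)."""
    def sqdist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def competitive_step(codebook, x, lr=0.5):
    """Move only the winning codebook vector a fraction lr towards x."""
    k = winner(codebook, x)
    codebook[k] = [wi + lr * (xi - wi) for wi, xi in zip(codebook[k], x)]
    return k


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical prototypes
    k = competitive_step(codebook, [2.0, 0.0])
    print(k, codebook)
```

Only the prototype nearest to the presented pattern moves; all other codebook vectors are left unchanged, which is what makes the units specialize on different regions of the input space.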

There is another type of neural network, not primarily

associated with classification. This is the class of recurrent neural

networks, i.e. with feedback connections used to feed the outputs of units

in one layer back into the units of the same or a previous layer. Rather

than sending the pattern through the network from the input units to the

output units, the signals cycle around in the network until the activity

stabilizes. One example of this is the Hopfield Network [76]. An

important concept for Hopfield networks is the energy function, a scalar function of the activity state of the network. During the recall phase of

the Hopfield network the activity pattern strives to attain as low energy as

possible, causing it to find local minima in the energy landscape,

corresponding to stable patterns of activity. It is possible to prove that the

network will always arrive at a stable state if the weight matrix is

symmetric, since every activity change in the network will decrease the

energy by a certain amount, and there is a minimum possible energy. The

problem of getting stuck in a local minimum when searching for the global minimum can be a severe obstacle for hill-climbing methods in general. In a neural network context it will typically arise either during

training of the weights with some gradient descent method like error back-

propagation, or during relaxation of the activities in a recurrent neural

network. One way to solve this is to use simulated annealing [97] which

is a method to add a stochastic component to the hill climbing, which

introduces a small probability of locally going in the “wrong direction”.

The amount of randomness introduced is regulated by the temperature. A

high temperature means a high probability of escaping from a local

minimum. If the temperature is initially set to a high value, and then

decreased slowly enough, the probability that the procedure will end up in

the global minimum can be made arbitrarily close to one. Unfortunately,

depending on the task at hand, “slowly enough” may mean that it will take an exceedingly long time. A recurrent neural network which can be used for classification tasks is the Boltzmann machine [3]. It uses the concepts of

energy function and simulated annealing to represent the probability

distribution over the domain. This is done by representing the probability

distribution in the energy function, and using a dynamics which makes the

probability of a pattern of activity proportional to the represented

probability of that pattern. This type of network will eventually learn the

correct probability distribution over the domain, but both training and recall may require a prohibitively long time.
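The role of the energy function in Hopfield recall, discussed earlier in this paragraph, can be made concrete with a small sketch. It assumes bipolar (+1/-1) units, a Hebbian outer-product rule for the weights, zero self-connections and asynchronous updates; these choices and the function names are illustrative assumptions rather than details given in the text.

```python
def hebbian_weights(patterns):
    """Symmetric weight matrix with zero diagonal (outer-product rule)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def energy(W, s):
    """Hopfield energy E = -1/2 * sum_ij W_ij s_i s_j."""
    n = len(s)
    return -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))


def recall(W, s, max_sweeps=10):
    """Asynchronous recall; returns the final state and the energy trace."""
    s = list(s)
    energies = [energy(W, s)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
                energies.append(energy(W, s))
        if not changed:
            break  # stable state reached
    return s, energies


if __name__ == "__main__":
    stored = [1, -1, 1, -1, 1, -1]
    W = hebbian_weights([stored])
    noisy = [-1, -1, 1, -1, 1, -1]  # first unit flipped
    print(recall(W, noisy))
```

Because the weight matrix is symmetric and each unit is updated one at a time, every state change in this sketch leaves the energy unchanged or lower, so the recorded energy trace never increases and the state settles into a stable pattern.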

The Bayesian neural network [105], [113], [114] is a network model in which the activities of units represent probabilities. The idea is to make the activities of the output units equal to the probabilities of the corresponding classes given the attributes represented by the stimulated input units. In its single-layer form it is related to the naive Bayesian classifier, in that the key assumption is that different input attributes are independent. The multi-layer version has a hidden layer which

compensates for dependencies between the input attributes. The kinds of

problems handled by this neural network are the same as those which can

be handled by the Bayesian belief networks or the dependency tree

method by Chow and Liu [37].
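For reference, the independence assumption mentioned above is what allows the naive Bayesian classifier to factorize the class posterior. In the standard formulation (the symbols $c$ for a class and $x_1, \ldots, x_m$ for the attribute values are notation introduced here for illustration), the attributes are assumed to be conditionally independent given the class, so that

$$P(c \mid x_1, \ldots, x_m) \;\propto\; P(c) \prod_{i=1}^{m} P(x_i \mid c)$$

and the predicted class is the one that maximizes the right-hand side. When attributes are strongly dependent, this product can misrepresent the joint evidence, which is the kind of dependency the hidden layer of the multi-layer version is meant to compensate for.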

Of all neural networks, the feed-forward neural network (FNN) is one of the most popular and widely used techniques [156]. Although FNNs can learn to classify a wide range of problem domains well, the resulting classification model is not comprehensible because of the large number of synaptic connections. To achieve high classification accuracy in the FNN framework, one has to provide a well-defined structure for the FNN, such as the number of input nodes, hidden neurons and output neurons, and assume a proper set of relevant features. A trial-and-error methodology is used to arrive at such structures, which is computationally expensive. Similarly, other methods such as rule extraction and decision trees, which provide comprehensible rules, are based on a trade-off between the complexity and the classification accuracy of the model. Recently, a lot of attention has been directed to advanced techniques of system modeling. The existing methodologies and detailed algorithms are confronted with nonlinear systems, high dimensionality of the problems, and a quest for high accuracy and generalization capabilities of the ensuing models. Nonlinear models can address some of these issues, but they require a large amount of data. The global nonlinear behavior of the model may also cause undesired effects (a well-known example is data approximation by high-order polynomials, where such approximation leads to unexpected ripples in the overall nonlinear relationship of the model). When the complexity of the system to be modeled increases, both experimental data and some prior domain knowledge (conveyed by the model developer) are of importance to

complete an efficient design procedure. It is also worth stressing that the nonlinear form of the model acts as a double-edged sword: while we gain flexibility to cope with experimental data, we are provided with an abundance of nonlinear dependencies that need to be exploited in a systematic manner.

In contrast to the FNN, which requires an optimized structure that is improved by trial and error, the group method of data handling (GMDH) is one of the first approaches to the systematic design of nonlinear relationships. It was developed to identify nonlinear relations between input and output variables. The GMDH [58], [81], [144], [150] algorithm generates an optimal structure of the model through successive generations of partial descriptions (PDs) of data, regarded as quadratic regression polynomials with two input variables. While providing a systematic design procedure, GMDH has some drawbacks. First, it tends to generate quite complex polynomials for relatively simple systems (data). Second, owing to its limited generic structure (quadratic two-variable polynomial), GMDH also tends to produce an overly complex network (model) when it comes to highly nonlinear systems. Third, if there are fewer than three input variables, the GMDH algorithm does not generate a highly versatile structure.

To alleviate the problems associated with the GMDH, polynomial neural networks (PNN) were introduced, which are useful for classification purposes. In a nutshell, these networks come with a high level of flexibility, as each node (processing element forming a PD) can have a different number of input variables as well as exploit a different order of the polynomial (linear, quadratic, cubic, etc.). In comparison to well-known neural networks, whose topologies are commonly fixed prior to all detailed (parametric) learning, the PNN architecture is not fixed in advance but becomes fully optimized (both structurally and parametrically). In particular, the number of layers of the PNN architecture can be modified, with new layers added if required. Polynomial networks have been known in the literature for many years [31], [122], [158]. Polynomials have powerful approximation properties [158] and excellent properties as classifiers [31]. Because of the Weierstrass theorem, polynomial classifiers are universal approximators to the optimal Bayes classifier [96], [50]. The Weierstrass theorem [180] states that any continuous function on a closed, bounded domain can be approximated to arbitrary accuracy by polynomials. The polynomial classifier is also known as a higher-order neural network [180] or functional link net [156]. With binomial expansion on linear subspace features, the polynomial classifier has shown superior performance to multilayer neural networks [74], [110],

[122], [196]. PNN generates layers of neurons/simulated units/partial descriptions (PDs) and then trains and selects those neurons which provide the best classification. During learning, the PNN model grows new populations of neurons and new layers, so that the complexity of the network increases [136], [139] until a predefined criterion is met. Because of the complex architecture, it requires large amounts of memory and computation time. To overcome this problem, evolutionary strategies were suggested in [137], [146], which take only one layer of the PNN model and select an optimal set from the PDs generated in the first layer, along with the input features, using the PSO technique. This optimal set of features is fed to a single perceptron-like ANN model. The weights are also optimized by PSO. We have proposed a unique scheme using GA with PNN (see Chapter-7). In this scheme GA is used for feature selection (see Chapter-5), and the PNN is used to evaluate the fitness of the GA. The proposed scheme outperformed other methods such as PNN, PNN with Gradient Descent, PNN with PSO, etc.