
CHAPTER 4

GABOR AND NEURAL BASED FACE RECOGNITION

4.1 GABOR BASED FACE RECOGNITION

In the previous chapter, a brief description of the theory and experimental results of the PCA and FLDA techniques for face recognition was presented (Duc et al 1999). In this chapter, the theory and experimentation related to Gabor and neural network based face recognition are discussed.

Although the PCA based eigenfaces method works well for reconstruction, it does not necessarily provide the best projection for recognition. Pose, scale and illumination variations are the main problems reported for eigenfaces. The FLDA algorithm, though suitable for illumination variations, is not suitable for large databases. To overcome some of these problems, wavelets are introduced into face recognition.

A wavelet is a waveform of effectively limited duration with an average value of zero. The Wavelet Transform decomposes a function into components at different frequency scales and locations. By decomposing an image using the Wavelet Transform, the resolution of the sub-images and the computational complexity are reduced.


4.1.1 Introduction

Generally, a wavelet can be viewed as a continuous wave propagating in different directions (θ) and modulated by a Gaussian envelope with different frequencies (f). Gabor filters are popular for automatic face recognition systems, motivated by their computational properties and biological relevance. Gabor filters represent a powerful tool in image processing, as spatial localization, spatial frequency and orientation selectivity are their main properties. A key characteristic of Gabor wavelets is the possibility of providing a multi-resolution analysis of the image in the form of coefficient matrices. Gabor wavelets are applied at the fiducial points on faces in order to take more features into account for best recognition (Baochang Zhang et al 2009). An image can be represented by the Gabor wavelet transform (Ranganath and Arun 1997), allowing the description of both the spatial frequency structure and spatial relations.

Gabor wavelets seem to be the optimal basis to extract local

features for pattern recognition for several reasons:

1) Biological motivation: the shapes of Gabor wavelets are

similar to the receptive fields of simple cells in the primary

visual cortex.

2) Mathematical motivation: the Gabor wavelets are optimal for

measuring local spatial frequencies.

3) Empirical motivation: Gabor wavelets have been found to

yield distortion tolerant feature spaces for a number of pattern

recognition tasks, including texture segmentation, character

recognition, and fingerprint recognition.


4.1.2 Theory and Design of Gabor Wavelets

A Gabor Wavelet is a linear filter used in image processing. The

Gabor wavelets are self-similar, i.e., all filters can be generated from one mother wavelet by dilation and rotation. 2D Gabor functions (Hossein Sahoolizadeh et al 2008) enhance edge contours, as well as valley and ridge contours of the image. This corresponds to enhancing the eye, mouth and nose edges, which are considered the most important points

on a face. Moreover, such an approach also enhances moles, dimples, scars,

etc. Hence, by using such enhanced points as feature locations, a feature

map for each facial image can be obtained and each face can be

represented with its own characteristics without any initial constraints.

Having feature maps specialized for each face makes it possible to keep

overall face information while enhancing local characteristics (Zhang B et al

2007).

Gabor wavelets are used to extract facial appearance changes as a set of multiscale and multi-orientation coefficients. This representation is shown to be robust against noise and changes in illumination for all facial patterns. The common

approach when using Gabor filters (Chengjun Liu 2002) for face recognition

is to construct a filter bank with filters of different scales and orientations and

to filter the given face image. A well-designed Gabor filter bank can capture

the relevant frequency spectrum in all directions.

This method is based on selecting peaks (high energized points) of

the Gabor wavelet responses, as feature points. Detected feature points

together with locations are stored as feature vectors. The feature vector

consists of all useful information extracted from different frequencies,

orientations and from all locations and is hence very useful for expression

recognition. Feature vectors are generated by sampling wavelet responses of

the facial images at the specific nodes.


Initially, the image of the face must be converted to wavelets

known as Jets. These jets are localized wavelets and are focused on a specific

area of an image. In this way, individual jets can be created for the eyes, nose,

mouth, and other facial features. The variation of θ changes the sensitivity to edge and texture orientations. The variation of σ changes the scale at which the image is viewed. Here, the most adequate combinations of θ, σ and f are considered to represent particular features of the face for the recognition task.

Each of these Gabor filters is convolved with the input image,

resulting in forty filtered copies of the face image. To encompass all the

features produced by the different Gabor kernels, the resulting Gabor wavelet

features are concatenated to derive an augmented Gabor feature vector. Then,

in order to reduce the dimensionality of the feature vector, both the PCA and

FLDA are implemented.

Gabor Wavelet (GW) filter works as a band pass filter for the local

spatial frequency distribution, achieving an optimal resolution in both spatial

and frequency domains. The 2D Gabor filter ψ_{f,θ}(x, y) can be represented as a complex sinusoidal signal modulated by a Gaussian kernel function as follows:

\psi_{f,\theta_n}(x, y) = \exp\left[-\frac{1}{2}\left(\frac{x_n^2}{\sigma_x^2} + \frac{y_n^2}{\sigma_y^2}\right)\right] \exp\left(j\, 2\pi f x_n\right)    (4.1)

where

x_n = x\cos\theta_n + y\sin\theta_n, \qquad y_n = -x\sin\theta_n + y\cos\theta_n

σ_x, σ_y are the standard deviations of the Gaussian envelope along the x and y dimensions, f is the central frequency of the sinusoidal plane wave and θ_n


the orientation. The rotation of the x-y plane by an angle θ_n will result in a Gabor filter at the orientation θ_n. The angle θ_n is defined by:

\theta_n = \frac{\pi (n-1)}{p} \quad \text{for } n = 1, 2, \ldots, p \text{ and } p \in \mathbb{N}    (4.2)

where p denotes the number of orientations. The design of Gabor filters is accomplished by tuning the filter to a specific band of spatial frequencies and orientations by appropriately selecting the filter parameters: the spread of the filter (σ_x, σ_y), the radial frequency f and the orientation of the filter θ_n. The important issue in the design of Gabor filters for face recognition is the choice of these filter parameters. The Gabor representation of a face image is computed by convolving the face image with the Gabor filters. Let f(x, y) be the intensity at the coordinate (x, y) in a gray scale face image; its convolution with a Gabor filter ψ_{f,θ}(x, y) is defined as:

g_{f,\theta}(x, y) = f(x, y) * \psi_{f,\theta}(x, y)    (4.3)

where * denotes the convolution operator.

The response to each Gabor kernel is a complex function with a real part R{g_{f,θ}(x, y)} and an imaginary part J{g_{f,θ}(x, y)}. The magnitude response ||g_{f,θ}(x, y)|| is expressed by

\|g_{f,\theta}(x, y)\| = \sqrt{R^2\{g_{f,\theta}(x, y)\} + J^2\{g_{f,\theta}(x, y)\}}    (4.4)

This work uses the magnitude response ||g_{f,θ}(x, y)|| to represent the features. To reduce the influence of the lighting conditions, the output of the Gabor filter for each orientation is normalized. This work organizes 40 Gabor channels consisting of eight orientation parameters, θ_1, …, θ_8 as given by equation (4.2) with p = 8, and five spatial frequencies.

The Gabor wavelets are scale invariant and the statistics of the image must

remain constant as one magnifies any local region of the image.
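The filter bank described above follows directly from equations (4.1) to (4.4). The following Python fragment is only a minimal sketch, not the implementation used in this work; the kernel size, the σ value, the five frequency values and the per-channel normalisation are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(f, theta, sigma_x, sigma_y, size=31):
    """Complex 2D Gabor kernel of equation (4.1)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotated coordinates from the 'where' clause of equation (4.1)
    x_n = x * np.cos(theta) + y * np.sin(theta)
    y_n = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (x_n ** 2 / sigma_x ** 2 + y_n ** 2 / sigma_y ** 2))
    carrier = np.exp(1j * 2 * np.pi * f * x_n)
    return envelope * carrier

def gabor_bank(frequencies, p=8, sigma=4.0):
    """40 filters: five frequencies times the eight orientations of equation (4.2)."""
    thetas = [np.pi * (n - 1) / p for n in range(1, p + 1)]
    return [gabor_kernel(f, t, sigma, sigma) for f in frequencies for t in thetas]

def magnitude_responses(image, bank):
    """Convolution with each filter (eq. 4.3) followed by the magnitude (eq. 4.4)."""
    responses = []
    for kernel in bank:
        re = convolve2d(image, kernel.real, mode='same', boundary='symm')
        im = convolve2d(image, kernel.imag, mode='same', boundary='symm')
        mag = np.sqrt(re ** 2 + im ** 2)              # eq. (4.4)
        responses.append(mag / (mag.max() + 1e-8))    # per-channel normalisation
    return responses

# Example on a random 50 x 50 window; the five frequency values are assumed
face = np.random.rand(50, 50)
bank = gabor_bank(frequencies=[0.05, 0.1, 0.2, 0.3, 0.4])
R = magnitude_responses(face, bank)                   # 40 magnitude maps
```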

Figure 4.1 illustrates the convolution result of a face image with a

Gabor filter. Here, a 2D Gabor filter is expressed as a Gaussian modulated

sinusoid in the spatial domain and as shifted Gaussian in the frequency

domain.

Figure 4.1 Network architecture of Gabor based filter

4.1.3 Gabor Wavelet Representation of Faces

The feature extraction algorithm for the proposed GW based face recognition has two main steps: feature point localization and feature vector computation (Lee 1996). In the first step, feature vectors are extracted from points with high information content on the face image. In most

feature-based methods, facial features are assumed to be the eyes, nose and

mouth (Yousra Ben Jemaa and Sana Khanfir 2009). The number of feature

vectors and their locations can vary in order to better represent diverse


facial characteristics of different faces, such as pimples, moles, etc.,

which are also the features that people might use for recognizing faces.

In this work, a face image is convolved with Gabor filters of five spatial frequencies and eight orientations, so that the whole frequency spectrum, both amplitude and phase, is captured, as shown in Figure 4.2.

Figure 4.2 Flowchart of the feature extraction stage of the facial images

From the responses of the face image to Gabor filters, peaks are

found by searching the locations in a window W0 of size (w*w) by the

following procedure:

A feature point is located at (x0, y0), if

R_j(x_0, y_0) = \max_{(x, y) \in W_0} R_j(x, y)    (4.5)

R_j(x_0, y_0) > \frac{1}{N_1 N_2} \sum_{x=1}^{N_1} \sum_{y=1}^{N_2} R_j(x, y)    (4.6)

for j = 1, …, 40, where R_j is the response of the face image to the jth Gabor filter, N_1 and N_2 are the dimensions of the face image, and the centre of the window W_0 is at (x_0, y_0). The window size w must be chosen small enough to capture the important features and large enough to avoid redundancy.
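A direct, deliberately unoptimised transcription of the peak-selection rule of equations (4.5) and (4.6) is sketched below; the window size w is an assumed parameter.

```python
import numpy as np

def feature_points(responses, w=9):
    """Peaks of the Gabor magnitude responses, following (4.5) and (4.6).

    A pixel (x0, y0) is kept for channel j if it is the maximum of R_j inside
    the w x w window centred on it AND exceeds the mean of R_j over the image.
    """
    points = set()
    half = w // 2
    for R in responses:                                # one map per Gabor channel
        n1, n2 = R.shape
        mean_resp = R.mean()                           # right-hand side of (4.6)
        for x0 in range(half, n1 - half):
            for y0 in range(half, n2 - half):
                window = R[x0 - half:x0 + half + 1, y0 - half:y0 + half + 1]
                if R[x0, y0] == window.max() and R[x0, y0] > mean_resp:
                    points.add((x0, y0))
    return sorted(points)
```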

4.1.4 Feature Vector Generation

Feature vectors are generated at the feature points as a

composition of Gabor wavelet transform coefficients. Here, the kth feature vector of the ith reference face is defined as

v_{i,k} = \left[ x_k,\ y_k,\ R_{i,j}(x_k, y_k) \right], \quad j = 1, \ldots, 40    (4.7)

The first two components in equation (4.7) represent the location of the feature point by storing its (x, y) coordinates. After the feature vectors are constructed from the test image, they are compared to the feature vectors of each reference image in the database.
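A minimal sketch of the feature vector of equation (4.7) and of the comparison step follows; the Euclidean distance used for matching is an assumption, since the text does not name the similarity measure.

```python
import numpy as np

def feature_vector(responses, point):
    """Feature vector of equation (4.7): location plus the 40 Gabor coefficients."""
    x_k, y_k = point
    coeffs = [R[x_k, y_k] for R in responses]          # R_j(x_k, y_k), j = 1..40
    return np.array([x_k, y_k] + coeffs)

def best_match(test_vectors, reference_sets):
    """Compare test feature vectors against each reference face (assumed metric)."""
    scores = []
    for ref_vectors in reference_sets:
        d = [min(np.linalg.norm(t - r) for r in ref_vectors) for t in test_vectors]
        scores.append(np.mean(d))
    return int(np.argmin(scores))                      # index of the closest face
```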

4.1.5 Algorithm for Gabor Wavelets

STEP 1: Get the image as the parameter of the function im2vec.

STEP 2: Load Gabor filters.

STEP 3: Adjust the window histogram.

STEP 4: Find features matrix.

STEP 5: Change the matrix to a vector.


Generally, it is difficult to deal with a high dimensional image space. So this GW method, which reduces the space dimension by down-sampling each G(u, v) and concatenating its rows to form a 1D feature vector, is proposed and used extensively. This algorithm is tested for the identification task using a neural network classifier, as explained in Figure 4.3.

Figure 4.3 Flow chart of Gabor based face recognition

To reduce the dimensionality of the vector space and obtain more

useful features for subsequent pattern discrimination and associative recall,

the FLDA technique is used here. The results clearly show that Gabor filters improve the performance over raw image data, particularly at operating points corresponding to a high false acceptance ratio (FAR).

(Figure 4.3 depicts the flow: input query → im2vec(image) → load Gabor filters → adjust the window histogram → find matrix features → find image vector → call the neural network program.)


4.1.6 Experimentation and Results of Gabor Wavelets

The input to the function is a 50 × 50 window, which is the resized version of the test image whose actual size is 320 × 243. At first the function adjusts the histogram of the window. Then, to convolve the window with the Gabor filters, the window in the frequency domain is multiplied by the Gabor filters. The Gabor filters are loaded and the window histogram is adjusted, with the parameters set by trial and error. The numbers in the input vector of the neural network should be between -1 and 1. For this, a feature matrix of size 45 × 48 is formed. The matrix of the image is then converted into an image vector of size 2160 × 1 by reshaping.
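The steps above (histogram adjustment, frequency-domain filtering, formation of the 45 × 48 feature matrix and reshaping to a 2160 × 1 vector in [-1, 1]) can be sketched as follows. How the 40 filter responses are reduced to a single 45 × 48 matrix is not fully specified in the text, so the averaging and down-sampling used here are assumptions that merely reproduce the stated dimensions; `bank` is a list of complex Gabor kernels such as the one built in the earlier sketch.

```python
import numpy as np

def equalize(window):
    """Simple histogram equalisation of a grey-level window (values in [0, 255])."""
    hist, bins = np.histogram(window.flatten(), bins=256, range=(0, 255))
    cdf = hist.cumsum() / hist.sum()
    return np.interp(window.flatten(), bins[:-1], cdf).reshape(window.shape)

def im2vec(window, bank, out_shape=(45, 48)):
    """Assumed analogue of the thesis' im2vec: 50 x 50 window -> 2160 x 1 vector."""
    window = equalize(window)
    W = np.fft.fft2(window)                            # frequency-domain window
    feats = []
    for kernel in bank:                                # multiply by each Gabor filter
        K = np.fft.fft2(kernel, s=window.shape)
        feats.append(np.abs(np.fft.ifft2(W * K)))
    # Reduce the 40 responses to a 45 x 48 feature matrix (reduction scheme assumed)
    stacked = np.mean(feats, axis=0)
    rows = np.linspace(0, stacked.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, stacked.shape[1] - 1, out_shape[1]).astype(int)
    matrix = stacked[np.ix_(rows, cols)]
    # Reshape to 2160 x 1 and scale to [-1, 1] for the neural network input
    vec = matrix.reshape(-1, 1)
    return 2 * (vec - vec.min()) / (vec.max() - vec.min() + 1e-8) - 1

# Usage: vec = im2vec(grey_window_50x50, bank) with bank from the earlier sketch
```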

The input query image is shown in Figure 4.4 and is resized into a matrix of size 50 × 50.

Figure 4.4 Query image applied to Gabor wavelets

For this image matrix, the image vector at the output of the Gabor filter is given by

Gabor Vector = [0.9561, 0.9028, 0.8575, …, 0.4252, 0.5379, 0.7272]^T


Gabor wavelets technique has recently been used not only for face

recognition, but also for face tracking and face position estimation. Thus this

approach not only reduces computational complexity, but also improves the

performance in the presence of occlusions. For a given input image the Gabor

filters are formed as shown in Figure 4.11(c).

4.1.7 Summary

The Gabor wavelet provides optimal resolution in both the time and frequency domains for time-frequency analysis. It preserves the neighborhood relationships between pixels and performs better than the traditional approaches in terms of recognition rate and accuracy. It is also easy to update and is invariant to homogeneous illumination changes, rotation and scale.

Despite the success of Gabor wavelet based face recognition

systems, both the feature extraction process and the huge dimension of Gabor

features extracted demand large computation and memory costs, which makes

them impractical for real applications. Also it is affected by the complex

background. Another limitation in the case of Gabor wavelets is that the time

for Gabor feature extraction is very long and its dimension is prohibitively

large.

4.2 NEURAL NETWORKS (NN)

A Neural Network is a powerful data modeling tool that is able to

capture complex input/output relationships. Neural network technology stemmed from efforts to develop an artificial system modelled on biological neurons. The network is composed of a large

number of highly interconnected processing elements, called neurons,

working in parallel to solve a specific problem.


An artificial neural network is a computing system that consists of a

collection of artificial neurons connected with each other. An artificial neuron

simulates performance of a biological neuron. The essence of this algorithm is

that various patterns are forwarded to the inputs of a simple neuron. The

neural element transforms input signals into the output signal, the latter is

compared with the expected results and if the real output does not coincide

with the expected one, the algorithm is being corrected. The samples are

forwarded to the outputs one by one until the result is satisfactory.

4.2.1 Introduction

A boosting learning process is used to reduce the feature

dimensions and make the Gabor feature extraction process substantially more

efficient (Daugman 1988). Combining optimized Gabor features with Neural Networks (Rowley et al 1996) not only reduces the computation and memory cost of the feature extraction process, but also achieves very accurate recognition performance. The training process in a neural network does not consist of a single call to a training function; instead, the network is trained several times on various noisy images (Hutchinson and Welsh 1989).

In the previous chapters, PCA and LDA based face reconstruction and discrimination were carried out effectively. However, the classification of face and non-face was not performed, and even a non-face image such as a rose is reconstructed, as shown in Figure 4.5. In this work, neural networks (Agui et al 1992) effectively classify face and non-face images using the BPNN algorithm.


Figure 4.5 Image reconstructions by PCA

4.2.2 Theory of NN for Face Recognition

Neural networks are particularly effective for predicting events

when the networks have a large database of previously stored data. Neural

networks can be used to extract patterns and detect trends that are too

complex to be noticed by either humans or other computer techniques. Neural

networks exhibit the ability (Hutchinson and Welsh 1989) of adaptive

learning, which is the ability to learn how to do tasks based on the data given

for training or initial experience.

To reduce complexity, neural network (Jahan Zeb et al 2007) is

often applied to the face recognition phase rather than to the feature extraction

phase. The network is initialized with random weights at first, and the data is

then fed into the network. As each data is tested, the result is checked. The

square of the difference between the expected and actual result is calculated,

and this data is used to adjust the weights of each connection accordingly. The

accuracy of neural networks is mostly a function of the size of their training


set rather than their complexity. The procedure for face recognition using

neural network is shown in Figure 4.6.

Figure 4.6 Face recognition using neural networks

The gradient descent with momentum and adaptive learning with

Back Propagation Neural Network (BPNN) learning algorithm has been used

to implement the supervised learning in such a way that both the inputs and

corresponding outputs are provided at the time of training the network. Thus

an inherent clustering and optimized learning of weights provide efficient and

better results.

4.2.3 Back-Propagation Neural Network

Neural Network is a good tool for classification purposes. It can

approximate almost any regularity between its input and output. The delta rule

is often utilized by the most common class of ANNs called back propagation

neural networks. The NN weights are adjusted by a supervised training

procedure called back propagation. Back propagation performs a gradient

descent within the solution's vector space towards a global minimum. The


flow chart for BPNN Algorithm to identify whether the given image is face or

non-face is as shown in Figure 4.7.

Figure 4.7 Flow chart for neural network based face recognition

Back propagation is a kind of the gradient descent method, which

searches an acceptable local minimum in the NN weight space in order to

achieve minimal error. In principle, NNs can compute any computable

function, i.e., they can do everything a normal digital computer can do

(Kurita et al 2003). Almost any mapping between vector spaces can be



approximated to arbitrary precision by feed forward neural networks

(Lin Shang-Hung et al 1997).

4.2.4 BPNN Algorithm

STEP 1: Load the new neural network using the MATLAB function.

STEP 2: Call the special function ‘sim’ by sending the new neural

network and the image vector as parameters.

STEP 3: Train the neural network.

STEP 4: And obtain the return variable in ‘result’.

STEP 5: If result is greater than 0.1, then print the given image as

a face.

STEP 6: And make F=1.

STEP 7: Else print the given image as a non-face.

STEP 8: Also make F=0

STEP 9: If F=1 call PCA Program.
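A Python analogue of this decision logic is sketched below; the thesis uses MATLAB's 'sim' on a saved network, so the plain forward pass and the weight dictionary used here are assumptions for illustration only.

```python
import numpy as np

def classify_face(net, image_vector, threshold=0.1):
    """Mirror of the steps above: the MATLAB 'sim' call is replaced by a plain
    forward pass. `net` is a dict of trained weights (an assumption; the thesis
    loads a saved MATLAB network instead)."""
    h = 1.0 / (1.0 + np.exp(-(net['W1'] @ image_vector + net['b1'])))   # hidden layer
    result = float((net['W2'] @ h + net['b2'])[0])                      # output unit
    if result > threshold:
        print('The given sample is a face')
        return 1            # F = 1: hand the vector on to the PCA/FLDA stage
    print('The given sample is a non-face')
    return 0                # F = 0

# Usage with random (untrained) weights: 2160-dim Gabor vector, 70 hidden units
rng = np.random.default_rng(0)
net = {'W1': rng.normal(size=(70, 2160)), 'b1': np.zeros(70),
       'W2': rng.normal(size=(1, 70)),    'b2': np.zeros(1)}
flag = classify_face(net, rng.normal(size=2160))
```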

The BPNN algorithm involves two phases, during the first phase,

the input vector is presented and propagated forward through the network to

compute the output value o_k for each output unit. This output is compared with its desired value, resulting in an error signal δ_k for each output unit. The

second phase involves a backward pass through the network during which the

error signal is passed to each unit in the network and appropriate weight

changes are calculated.

Learning process in back propagation requires providing pairs of

input and target vectors. The output vector ‘o’ of each input vector is

compared with target vector ‘t’. In case of difference of these two, the weights

are adjusted to minimize the difference. Initially, random weights and


thresholds are assigned to the network. These weights are updated every

iteration in order to minimize the cost function or the mean square error

between the output vector and the target vector. The BPNN algorithm applied

in face recognition is shown in Figure 4.8.

Figure 4.8 Back propagation neural networks algorithm

Input for the hidden layer is given by

net_m = \sum_{z=1}^{n} x_z w_{zm}    (4.8)

The units of the output vector of the hidden layer, after passing through the activation function, are given by

h_m = \frac{1}{1 + \exp(-net_m)}    (4.9)


In the same manner, the input for the output layer is given by

net_k = \sum_{z=1}^{m} h_z w_{zk}    (4.10)

and the units of the output vector of the output layer are given by

o_k = \frac{1}{1 + \exp(-net_k)}    (4.11)

For updating the weights, we need to calculate the error. This can be done by

E = \frac{1}{2} \sum_{i=1}^{k} (t_i - o_i)^2    (4.12)

If the error is less than a predefined limit, the training process will stop; otherwise the weights need to be updated.

Each hidden unit sums its delta inputs from the layer above, multiplied by the derivative of its activation function; it also computes its own

weight correction term and its bias correction term. Each output unit updates

its weights and bias. Each training cycle is called an epoch and the weights

are updated in each cycle. It is not analytically possible to determine where

the global minimum is. Eventually the algorithm stops in a low point, which

may just be a local minimum.

For weights between hidden layer and output layer, the change in

weights is given by

\Delta w_{ij} = \eta\, \delta_i\, h_j    (4.13)


where η is a learning rate coefficient that is restricted to the range [0.01, 1.0]. The learning coefficient η controls the size of a step against the direction of the gradient. If η is too small, learning is slow; if it is too large, the process of error minimization can be oscillatory. Here, h_j is the output of neuron j in the hidden layer and δ_i can be obtained by

\delta_i = (t_i - o_i)\, o_i (1 - o_i)    (4.14)

o_i and t_i represent the real output and the target output at neuron i in the output layer respectively. Similarly, the change of the weights between the input layer and the hidden layer is given by

\Delta w_{ij} = \eta\, \delta_{Hi}\, x_j    (4.15)

where η is a training rate coefficient that is restricted to the range [0.01, 1.0] and x_j is the output of neuron j in the input layer. A hidden unit 'h' receives a delta from each output unit o equal to the delta of that output unit weighted with the weight of the connection between those units. δ_{Hi} can be obtained by

\delta_{Hi} = x_i (1 - x_i) \sum_{j=1}^{k} \delta_j w_{ij}    (4.16)

x_i is the output at neuron i in the input layer, and the summation term represents the weighted sum of all δ_j values corresponding to neurons in the output layer. After calculating the weight changes in all layers, the weights can simply be updated by

w_{ij}(\text{new}) = w_{ij}(\text{old}) + \Delta w_{ij}    (4.17)

The process of updating the hidden units is repeated for each instance in the training set until the error for the entire system is acceptably low, or the pre-defined number of iterations is reached. The given image is identified as a face or a non-face according to the value of the error E as per equation (4.12).

An effective way to increase the learning rate is to modify the delta rule by including a momentum term:

\Delta w(N+1) = m\, \Delta w(N) - \eta\, \nabla E(w(N))    (4.18)

where m is a positive constant, 0 ≤ m < 0.9, termed the momentum constant

and this is called the generalized delta rule. The effect is that if the basic delta

rule is consistently pushing a weight in the same direction, then it gradually

gathers "momentum" in that direction. If momentum term is included, it will

have the effects of smoothening the weight changes, amplifies the learning

rate causing a faster convergence enabling to escape from small local minima

on the error surface.
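The update rules of equations (4.8) to (4.18) can be collected into one training epoch as in the following sketch. It is only an illustration: biases are omitted, the hidden deltas are computed from the hidden activations (the standard form of (4.16)), and the learning rate 0.4 and momentum 0.9 are the values quoted in Section 4.2.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpnn_epoch(X, T, W1, W2, dW1, dW2, eta=0.4, m=0.9):
    """One epoch of back-propagation with momentum, following (4.8)-(4.18).

    X: inputs (n_samples x n_inputs), T: targets (n_samples x n_outputs);
    W1, W2: weight matrices; dW1, dW2: previous weight changes (momentum state).
    """
    E_total = 0.0
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x)                          # (4.8), (4.9) hidden layer
        o = sigmoid(W2 @ h)                          # (4.10), (4.11) output layer
        E_total += 0.5 * np.sum((t - o) ** 2)        # (4.12) squared error
        delta_o = (t - o) * o * (1 - o)              # (4.14) output deltas
        delta_h = h * (1 - h) * (W2.T @ delta_o)     # (4.16) hidden deltas
        dW2 = m * dW2 + eta * np.outer(delta_o, h)   # (4.13) with momentum (4.18)
        dW1 = m * dW1 + eta * np.outer(delta_h, x)   # (4.15) with momentum (4.18)
        W2 += dW2                                    # (4.17) weight update
        W1 += dW1
    return W1, W2, dW1, dW2, E_total

# Usage: 2160-dimensional Gabor vectors, 70 hidden units, one output unit
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.01, size=(70, 2160)); dW1 = np.zeros_like(W1)
W2 = rng.normal(scale=0.01, size=(1, 70));    dW2 = np.zeros_like(W2)
X = rng.random((10, 2160)) * 2 - 1                   # ten vectors scaled to [-1, 1]
T = rng.integers(0, 2, (10, 1)).astype(float)        # face / non-face targets
W1, W2, dW1, dW2, E = bpnn_epoch(X, T, W1, W2, dW1, dW2)
```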

The feature representation vectors from PCA and LDA are then

used to train the weighting factors in the combined neural networks. Gradient descent, also called steepest descent, is one of the algorithms developed for non-linear optimization problems; the BPNN algorithm moves in the space of variables in the direction opposite to the gradient of the minimized function.

A large number of neurons in the hidden layer can give high

generalization error due to over-fitting and high variance. On the other hand, having fewer neurons gives high training error and high generalization error due to under-fitting and high statistical bias. 'Over-fitting' is the

phenomenon that in most cases a network gets worse instead of better after a

certain point during training when it is trained to as low errors as possible.


4.2.5 Experimental Results of Gabor Based BPNN

In this work, the BPNN algorithm demonstrates a strong capacity to learn from the input data. Various parameters assumed for this network

are as follows:

No. of Input unit = 1 feature vector

No. of hidden neurons = 70

No. of output unit = 1

Learning rate = 0.4

No. of epochs = 400

Optimum value of goal = 0.01

Momentum = 0.9

The output of the Gabor wavelet is an image vector of size 2160 * 1.

A new neural network is loaded using the Matlab special function ‘load

newnet’. This neural network and the input image vector are sent as

parameters to the function ‘sim’ and the index of this input image vector is

found. If this index is positive, then the given image will be declared as a face.

If the index value is negative or zero, then the given image will be declared as

a non face. Then if it is a face, FLDA program will be called.

When face and non-face images are given as in Figures 4.9 and 4.10, the results are as follows:

TrainDatabasePath =C:\Desktop\proj(2010) 22.4\PCA_pgm\TrainDatabase1

TestDatabasePath =C:\Desktop\proj(2010) 22.4\PCA_pgm\non-face


Figure 4.9 A non-face query image

Result = -0.8227, the given Sample image is a non-face.

TrainDatabasePath =C:\Desktop\Proj(2010)22.4\PCA_pgm\TrainDatabase1

TestDatabasePath = C:\Desktop\proj(2010)22.4\PCA_pgm\TestDatabase1

Figure 4.10 A Face query

Result = 1.2707, The given Sample image is a face and is 33.jpeg

Experiments are carried out on different face images of the Yale database using BPNN, and the results are presented in (a), (b), (c), (d) and (e) of Figure 4.11.


Figure 4.11 (a), (b), (c) (Continued)


Figure 4.11 (d), (e) Images of Gabor based neural network

4.2.6 Advantages, Disadvantages and Applications of Neural

Networks

High accuracy, more than 90 % recognition rate, easy to implement

and reduced execution time are the main advantages of Neural Network based

Face Recognition. Neural Networks are more flexible for solving non-linear

tasks. As a gradient based method is applied, some inherent problems such as slow convergence and difficulty in escaping from local minima are encountered here.

In practice, NNs are especially useful for classification and

approximation problems when rules such as those that might be used in an

expert system cannot easily be applied. NNs are, at least today, difficult to


apply successfully to problems that concern manipulation of symbols and

memory.

Comparisons of PCA, FLDA and NN based Face Recognition on

different databases for 400 images are presented as in Table 4.1 and

Figure 4.12.

Table 4.1 Comparison of recognition rate for PCA, FLDA and BPNN

algorithm

Figure 4.12 Comparison of recognition rate for PCA, FLDA and BPNN

algorithm

No. of images    Recognition rate (%)
                 PCA    FLDA    BPNN
      50          89     92      94
     100          86     88      90
     200          83     86      88
     300          80     83      86
     400          75     79      82


4.3 CASCADE CORRELATION NEURAL NETWORKS (CCNN)

The Cascade-Correlation learning network algorithm was

developed in an attempt to overcome the problem of time complexity in

the popular back-propagation learning algorithm. The CCNN algorithm (Fahlman and Liebiere 1990) not only trains a neural network but also dynamically builds the network architecture. In this network, the number of

hidden layers is not assigned in advance, but is determined during the process

of learning. It means that the topology of a cascade-correlation neural network

only depends on the task being solved and on the nature of data forwarded to

the network inputs.

Cascade-Correlation is a new architecture and a supervised learning

algorithm for artificial neural networks. Instead of just adjusting the weights

in a network of fixed topology, Cascade-Correlation begins with a minimal

network, then automatically trains and adds new hidden units one by one,

creating a multi-layer structure. Once a new hidden unit has been added to the

network, its input-side weights are frozen. This unit then becomes a

permanent feature-detector in the network, available for producing outputs or

for creating other more complex feature detectors.

The idea behind the cascade-correlation architecture is to build the

architecture by adding new neurons together with their connections to all the

inputs as well as to the previous hidden neurons and to learn the newly

created neuron by fitting its weights so as to minimize the residual error of the

network.

4.3.1 Cascade Correlation Network Architecture

A cascade correlation network (Fe´raud et al 2001) consists of a

cascade architecture, in which hidden neurons are added to the network one at


a time and do not change after they have been added. It is called a cascade

because the output from all neurons already in the network feed into new

neurons. As new neurons are added to the hidden layer, the learning algorithm

attempts to maximize the magnitude of the correlation between the new

neuron’s output and the residual error of the network which is to be

minimized. A cascade correlation neural network has three layers: input,

hidden and output.

Input Layer: A vector of predictor variable values (x1…xp) of the

given image is presented to the input layer. The input neurons perform no

action on the values other than distributing them to the neurons in the hidden

and output layers. In addition to the predictor variables, there is a constant

input of 1.0, called the bias that is fed to each of the hidden and output

neurons. The bias is multiplied by a weight and added to the sum going into

the hidden neuron.

Hidden Layer: Arriving at a neuron in the hidden layer, the value

from each input neuron is multiplied by a weight, and the resulting weighted

values are added together producing a combined value. The weighted sum is

fed into a transfer function, which outputs a value. The outputs from the

hidden layer are distributed to the output layer.

Output Layer: Each output neuron receives values from all of the

input neurons and all the hidden layer neurons, with the bias values. Each

value presented to the output neuron is multiplied by a weight, and the

resulting weighted values are added together producing a combined output

value. The weighted sum is fed into a transfer function, which outputs a final

value for classification. For regression problems, a linear transfer function is

used in the output neurons. But for classification problems, there is a neuron

for each category of the target variable and a sigmoid transfer function is used.
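The cascaded wiring described above (a bias and the inputs feeding every unit, each hidden unit also feeding all later hidden units and the outputs) can be sketched as a simple forward pass. This is only an illustration: sigmoid units and the tiny example weights are assumptions.

```python
import numpy as np

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade-correlation network.

    x: input vector; hidden_weights[i] has length (1 + n_inputs + i), so each
    hidden unit sees the bias, all inputs and every previously added hidden unit;
    output_weights has length 1 + n_inputs + n_hidden.
    """
    values = np.concatenate(([1.0], x))                   # bias input fixed at +1
    for w in hidden_weights:
        h = 1.0 / (1.0 + np.exp(-np.dot(w, values)))      # new cascaded hidden unit
        values = np.append(values, h)                     # feeds all later units
    return 1.0 / (1.0 + np.exp(-np.dot(output_weights, values)))   # output unit

# Example: 3 inputs and two hidden units added one after another
x = np.array([0.2, -0.5, 0.9])
hw = [np.ones(4), np.ones(5)]          # 1+3 and 1+3+1 incoming weights
ow = np.ones(6)                        # bias + 3 inputs + 2 hidden units
print(cascade_forward(x, hw, ow))
```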


4.3.2 Cascade-Correlation Learning Algorithm

Cascade-Correlation (David DeMers and Cottrell 1993) combines

two key ideas: The first is the cascade architecture, in which hidden units are

added to the network one at a time and do not change after they have been

added. The second is the learning algorithm, which creates and installs the

new hidden units. For each new hidden unit, an attempt is made to maximize

the magnitude of the correlation between the new unit’s output and the

residual error signal. The training steps for the CCNN algorithm are as follows:

Step1: Initiate a cascade correlation neural network with only

the input and output layer neurons with no hidden layer

neurons. Train the initial net until the mean square error

E reaches a minimum.

Step2: A hidden candidate node is installed. Initialize weights

and learning constants.

Step3: The hidden candidate node is trained. Stop if the

correlation between its output and the network output

error is maximized.

Step4: The hidden candidate unit is added to the main net, i.e. its weights are frozen and it is connected to the other hidden units and to the network outputs.

Step5: The main net that includes a hidden unit is trained. Stop

if the minimum mean square error is reached.

Step6: Another hidden unit is added. Repeat steps 2-5, until the

mean square error value is acceptable.
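A compact, runnable illustration of Steps 1 to 6 is sketched below for a toy problem. It deliberately simplifies the procedure: a single candidate (no pool) is trained by plain gradient ascent on the correlation C of equation (4.19), and the output weights are refitted by linear least squares instead of iterative training. The learning rate, step count and XOR data are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_values(X, hidden_w):
    """Bias, inputs and the outputs of every installed hidden unit, per pattern."""
    V = np.hstack([np.ones((X.shape[0], 1)), X])
    for w in hidden_w:                        # cascade: each unit feeds later ones
        V = np.hstack([V, sigmoid(V @ w)[:, None]])
    return V

def train_candidate(V, E, steps=2000, lr=1.0):
    """Step 3: gradient ascent on the correlation C of eq. (4.19), cf. eq. (4.21)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=1.0, size=V.shape[1])
    e_c = E - E.mean(axis=0)                  # centred residual errors
    for _ in range(steps):
        y = sigmoid(V @ w)
        delta = e_c.sum(axis=1) * y * (1 - y)       # dC/dnet_p
        w += lr * (V.T @ delta) / len(y)
    return w

def cascade_correlation(X, T, max_hidden=3):
    """Steps 1-6: grow the net one frozen hidden unit at a time (linear outputs)."""
    hidden_w = []
    for _ in range(max_hidden + 1):
        V = hidden_values(X, hidden_w)
        out_w, *_ = np.linalg.lstsq(V, T, rcond=None)   # Steps 1 / 5: fit outputs
        E = V @ out_w - T                               # residual network error
        if np.mean(E ** 2) < 1e-3 or len(hidden_w) == max_hidden:
            break
        hidden_w.append(train_candidate(V, E))          # Steps 2-4: install a unit
    return hidden_w, out_w

# Toy run: the XOR mapping, which needs at least one hidden unit
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
hw, ow = cascade_correlation(X, T)
print(len(hw), "hidden unit(s), residual MSE:",
      np.mean((hidden_values(X, hw) @ ow - T) ** 2))
```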


The cascade-correlation learning algorithm exemplifies supervised learning. While learning, it constructs a minimal network, that is, a network with the smallest possible number of hidden layers. Learning starts

when the network is minimal, i.e. when there is an input layer, an output layer

and no hidden layers. For learning, an algorithm is used that minimizes the

value of the network output error E. Every input is connected to every output

neuron by connection with an adjustable weight, as shown in Figure 4.13.

Figure 4.13 CCNN with No hidden units

The input and the output neurons are linked by a weight value.

Values on a vertical line are added together after being multiplied by their

weights. Every input is connected to every output unit by a connection with

an adjustable weight. There is also a bias input, permanently set to +1. The

output units may just produce a linear sum of their weighted inputs, or they

may employ some non-linear activation function. So each output neuron

receives a weighted sum from all of the input neurons including the bias. The

cascade architecture with one hidden unit is shown in Figure 4.14.


Figure 4.14 CCNN with one hidden unit

The output neuron sends this weighted input sum through its

transfer function to produce the final output. Even a simple cascade

correlation network with no hidden neurons has considerable predictive

power. For a fair number of problems, a cascade correlation network with just

input and output layers provides excellent predictions. After the addition of

the first hidden neuron, the network would have this structure.

The input weights for the hidden neuron are shown as square boxes

to indicate that they are fixed once the neuron has been added. Weights for

the output neurons shown as ‘x’ can be adjustable. Here is a schematic

representation of a network with two hidden neurons. The cascade

architecture with two hidden units is illustrated in Figure 4.15.

Each new hidden unit receives a connection from each of the

network’s original inputs and also from every pre-existing hidden unit. The

hidden unit’s input weights are frozen at the time the unit is added to the net;

only the output connections are trained repeatedly. Each new unit therefore

adds a new one-unit “layer” to the network, unless some of its incoming

weights happen to be zero. This leads to the creation of very powerful high-


order feature detectors; it also may lead to very deep networks and high fan-in

to the hidden units.

Figure 4.15 CCNN with two hidden units

Network learning is considered to be completed when the

convergence of the network is achieved, that is, the value of the error stops changing, or the value of the error is sufficiently small and does not exceed a previously set maximal error value. In case the error value does not meet the

above requirements, learning should be continued. For this, a new hidden

layer is added to the network. This node is called a candidate node and its

output is not activated in the main network at this stage.

After a new hidden layer is added, all patterns out of the training

sample are then passed through this node. The candidate node learns, that is, its weights are revised. The aim of the candidate node weight correction

is to maximize the value of correlation ‘C’ between the output of the

candidate node and network output error.

C = \sum_{o}\sum_{p} (y_p - \bar{y})(e_{op} - \bar{e}_o)    (4.19)


where \bar{y} and \bar{e}_o are the mean values of the outputs and the output errors over all patterns 'p' of the training sample.

After learning, the candidate-node is added to the main net. The

weights of this added node are frozen. The output of this node, in its turn, can

either be forwarded to the output of the main net or serve as one of inputs for

the hidden units. Hidden nodes added one by one thus form the cascade architecture, as shown in Figure 4.16.


Figure 4.16 Cascade architecture of CCNN

During the process of cascade-correlation network learning, gradient descent and gradient ascent are used, respectively, to minimize the value of the error E and to maximize the value of the correlation C (Fahlman and Liebiere 1990).

For the error E, the derivative is computed as

\frac{\partial E}{\partial w_i} = \sum_{p}\sum_{o} e_{op} X_{ip}    (4.20)

where e_{op} = (y_{op} - t_{op})\, f'_p


For the correlation C it is computed as

\frac{\partial C}{\partial w_i} = \sum_{p} \delta_p X_{ip}    (4.21)

where \delta_p = \sum_{o} (e_{op} - \bar{e}_o)\, f'_p

If \partial E/\partial w and \partial C/\partial w are denoted as S, the weight correction formula is as follows:

w(t+1) = w(t) + \Delta w(t)    (4.22)

where

\Delta w(t) = \varepsilon\, S(t), if \Delta w(t-1) = 0;

\Delta w(t) = \Delta w(t-1)\, S(t)/(S(t-1) - S(t)), if \Delta w(t-1) \neq 0 and S(t)/(S(t-1) - S(t)) < \mu;

\Delta w(t) = \mu\, \Delta w(t-1), in all other cases.

Here \varepsilon is the error correction step and \mu is the minimal correction step.
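The weight correction of equation (4.22) can be transcribed as a small helper, shown below. It is a literal sketch of the rule as reconstructed above; the values of ε and μ are assumptions, and the same rule is applied whether S is ∂E/∂w or ∂C/∂w.

```python
def weight_step(S_t, S_prev, dw_prev, eps=0.1, mu=1.75):
    """Weight correction of equation (4.22) for a single weight.

    S_t, S_prev: current and previous slope S (dE/dw or dC/dw);
    dw_prev: previous correction Delta w(t-1); eps and mu are assumed values.
    """
    if dw_prev == 0.0:
        return eps * S_t                                    # first case of (4.22)
    ratio = S_t / (S_prev - S_t) if S_prev != S_t else float('inf')
    if ratio < mu:
        return dw_prev * ratio                              # second case of (4.22)
    return mu * dw_prev                                     # all other cases

# w_new = w_old + weight_step(S_t, S_prev, dw_prev)
```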

Instead of a single candidate unit, it is possible to use a pool of

candidate units, each with a different set of random initial weights. All receive

the same input signals and see the same residual error for each training pattern.

Because they do not interact with one another or affect the active network during training, all of these candidate units can be trained in parallel; whenever

no further progress is being made, the candidate whose correlation score is the

best is installed.

The use of this pool of candidates is beneficial in two ways: it

greatly reduces the chance that a useless unit will be permanently installed

because an individual candidate got stuck during training and it can speed up

the training because many parts of weight-space can be explored


simultaneously. One final note on the implementation of this algorithm:

While the weights in the output layer are being trained, the other weights in

the active network are frozen. While the candidate weights are being trained,

none of the weights in the active network are changed. In a machine with

plenty of memory, it is possible to record the unit-values and the output errors

for an entire epoch, and then to use these cached values repeatedly during

training, rather than recomputing them for each training case. A reasonably

small net is built automatically. This can result in a tremendous speedup,

especially for large networks.

4.3.3 Advantages and Disadvantages of Cascade Correlation

Algorithm

Cascade-Correlation Network is useful for incremental learning, in

which new information is added to an already-trained net. Once built, a

feature detector is never cannibalized. It is available from that time on for

producing outputs or more complex features. Training on a new set of

examples may alter a network’s output weights, but these are quickly restored

on return to the original problem. At any given time, only one layer of

weights in the network can be trained. The rest of the network is not changing.

In CCNN, there is no need to guess the size, depth, and

connectivity pattern of the network in advance. It may be possible to build

networks with a mixture of nonlinear types. Cascade-Correlation learns fast.

In back-propagation, the hidden units interact in a complex way before they settle into distinct useful roles; in Cascade-Correlation, each unit sees a fixed

problem and can move decisively to solve that problem. The learning time in

epochs grows very roughly as NlogN, where N is the number of hidden units

ultimately needed to solve the problem. Cascade-Correlation can build deep

nets (high-order feature detectors) without the dramatic slowdown that is seen

in back-propagation networks with more than one or two hidden layers.


Here error signals are not propagated backwards as in BPNN, but a

single residual error signal can be broadcast to all candidates. The weighted

connections transmit signals in only one direction, eliminating one

troublesome difference between back-propagation connections and biological

synapses. The candidate units do not interact with one another, except to pick

a winner.

Cascade-correlation can converge quickly and is less likely to get trapped in local minima than multilayer perceptron networks. Cascade correlation also scales up to handle large problems far better than probabilistic or general regression networks. Training is very fast in CNN; hence it is

suitable for large training sets. Typically, cascade correlation networks are

fairly small, often having fewer than a dozen neurons in the hidden layer.

Contrast this to probabilistic neural networks which require a hidden-layer

neuron for each training case.

As with all types of models, there are some disadvantages to

cascade correlation networks. They have an extreme potential for over-fitting

the training data. Over-fitting can also take place in the presence of noisy

features. This results in excellent accuracy on the training data but poor

accuracy on new, unseen data. Cascade correlation networks usually are less

accurate than probabilistic and general regression neural networks for small to

medium size problems.

Experimental results and comparison of BPNN and CNN based on

recognition rate and execution time for ORL database are presented in Tables

4.2 and 4.3 and Figures 4.17 and 4.18 respectively.


Table 4.2 Comparison of recognition rate of BPNN and CNN

Figure 4.17 Comparison of recognition rate of BPNN and CNN

No. of images    Recognition rate (%)
                  BPNN    CNN
      50           94      95
     100           90      92
     200           88      89
     300           86      88
     400           82      84


Table 4.3 Comparison of Execution time of BPNN and CNN

Figure 4.18 Comparison of execution time of BPNN and CNN

No. of images    Execution time (sec)
                  BPNN+FLDA    CNN+FLDA
      50            30.09        25.36
     100            39.51        33.04
     200            45.26        39.41
     300            51.02        45.21
     400            60.26        53.34


4.3.4 Summary

Neural Networks (NN) have found use in a large number of

computational disciplines. The well known PCA and FLDA algorithms are

applied with BPNN to improve the performance. LDA is an algorithm robust to illumination variations. The performance of LDA with BPNN is discussed here, with various databases and in diverse environments.

BPNN enhances the classification and the performance of LDA

with BPNN resulted in more than 90 % recognition rate. CNN performs better for large numbers of database images, as it offers faster recognition. On average, the execution time of CNN is about 20% less than that of BPNN.

Neural networks are currently used prominently in voice

recognition systems, image recognition systems, industrial robotics, medical

imaging and data mining and aerospace applications.