
Computational Intelligence

for Engineering and Manufacturing

Edited by

Diego Andina, Technical University of Madrid (UPM), Spain

Duc Truong Pham, Manufacturing Engineering Centre, Cardiff University, Cardiff, United Kingdom

Page 3: Computational Intelligence: for Engineering and Manufacturing

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 0-387-37450-7 (HB)
ISBN-13 978-0-387-37450-5 (HB)
ISBN-10 0-387-37452-3 (e-book)
ISBN-13 978-0-387-37452-9 (e-book)

Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.

www.springer.com

Printed on acid-free paper

All Rights Reserved
© 2007 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.


This book is dedicated to the memory of Roberto Carranza E., who inspired in the authors the enthusiasm to jointly prepare this book.


CONTENTS

Contributing Authors

Preface

Acknowledgements

1. Soft Computing and its Applications in Engineering and Manufacture
   D. T. Pham, P. T. N. Pham, M. S. Packianather, A. A. Afify

2. Neural Networks Historical Review
   D. Andina, A. Vega-Corona, J. I. Seijas, J. Torres-García

3. Artificial Neural Networks
   D. T. Pham, M. S. Packianather, A. A. Afify

4. Application of Neural Networks
   D. Andina, A. Vega-Corona, J. I. Seijas, M. J. Alarcón

5. Radial Basis Function Networks and their Application in Communication Systems
   Ascensión Gallardo Antolín, Juan Pascual García, José Luis Sancho Gómez

6. Biological Clues for Up-to-Date Artificial Neurons
   Javier Ropero Peláez, José Roberto Castillo Piqueira

7. Support Vector Machines
   Jaime Gómez Sáenz de Tejada, Juan Seijas Martínez-Echevarría

8. Fractals as Pre-Processing Tool for Computational Intelligence Application
   Ana M. Tarquis, Valeriano Méndez, Juan B. Grau, José M. Antón, Diego Andina


CONTRIBUTING AUTHORS

D. Andina, J. I. Seijas, J. Torres-García, M. J. Alarcón, A. Tarquis, J. B. Grau and J. M. Antón work for the Technical University of Madrid (UPM), Spain, where they form the Group for Automation and Soft Computing (GASC).

D. T. Pham, P. T. N. Pham, M. S. Packianather and A. A. Afify work for Cardiff University, United Kingdom.

Javier Ropero Peláez, José Roberto Castillo Piqueira work for Escola Politecnicada Universidade de Sao Paulo Departamento de Engenharia de Telecomunicaçoese Controle, Brazil.

A. Gallardo Antolín, J. Pascual García and J. L. Sancho Gómez work forUniversity Carlos III of Madrid, Spain,

A. Vega-Corona, V. Méndez and J. Gómez Sáenz de Tejada work for Universityof Guanajuato, Mexico, Technical University of Madrid and Universidad Autónomaof Madrid, Spain, respectively.


PREFACE

This book presents a selected collection of contributions on a focused treatment of important elements of Computational Intelligence. Unlike traditional computing, Computational Intelligence (CI) is tolerant of imprecise information, partial truth and uncertainty. The principal components of CI that currently have frequent application in Engineering and Manufacturing are neural networks (NN), fuzzy logic (FL) and support vector machines (SVM). In CI, NN and SVM are concerned with learning, while FL is concerned with imprecision and reasoning.

This volume mainly covers a key element of Computational Intelligence: learning. All the contributions have a direct relevance to neural network learning, from neural computing fundamentals to advanced networks such as Multilayer Perceptrons (MLP) and Radial Basis Function Networks (RBF), and their relations with fuzzy set theory and support vector machine theory. The book also discusses different applications in Engineering and Manufacturing, areas in which CI has excellent potential for use.

Both novice and expert readers should find this book a useful reference in the field of Computational Intelligence. The editors and the authors hope to have contributed to the field by paving the way for learning paradigms to solve real-world problems.

D. Andina


ACKNOWLEDGEMENTS

This document has been produced with the financial assistance of the European Community, ALFA project II-0026-FA. The views expressed herein are those of the Authors and can therefore in no way be taken to reflect the official opinion of the European Community.

The editors wish to thank Dr A. Afify of Cardiff University and Mr A. Jevtic of the Technical University of Madrid for their support and helpful comments during the revision of this text.

The editors also wish to thank Nagib Callaos, President of the International Institute of Informatics and Systemics (IIIS), for his permission and freedom to reproduce in Chapters 2 and 4 of this book contents from the book by D. Andina and F. Ballesteros (Eds), "Recent Advances in Neural Networks", IIIS Press, ILL, USA (2000).


CHAPTER 1

SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE

D. T. PHAM, P. T. N. PHAM, M. S. PACKIANATHER, A. A. AFIFY
Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION

Soft computing is a recent term for a computing paradigm that has been in existence for almost fifty years. This chapter reviews five soft computing tools: knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. All of these tools have found many practical applications. Examples of applications in engineering and manufacture are given throughout the chapter.

1. KNOWLEDGE-BASED SYSTEMS

Knowledge-based systems, or expert systems, are computer programs embodying knowledge about a narrow domain for solving problems related to that domain. An expert system usually comprises two main elements, a knowledge base and an inference mechanism. The knowledge base contains domain knowledge which may be expressed as any combination of "IF-THEN" rules, factual statements (or assertions), frames, objects, procedures and cases. The inference mechanism is that part of an expert system which manipulates the stored knowledge to produce solutions to problems. Knowledge manipulation methods include the use of inheritance and constraints (in a frame-based or object-oriented expert system), the retrieval and adaptation of case examples (in a case-based expert system) and the application of inference rules such as modus ponens (If A Then B; A Therefore B) and modus tollens (If A Then B; NOT B Therefore NOT A) according to "forward chaining" or "backward chaining" control procedures and "depth-first" or "breadth-first" search strategies (in a rule-based expert system).

With forward chaining or data-driven inferencing, the system tries to match available facts with the IF portion of the IF-THEN rules in the knowledge base. When matching rules are found, one of them is "fired", i.e. its THEN part is made true, generating new facts and data which in turn cause other rules to "fire". Reasoning stops when no more new rules can fire. In backward chaining or goal-driven inferencing, a goal to be proved is specified. If the goal cannot be immediately satisfied by existing facts in the knowledge base, the system will examine the IF-THEN rules for rules with the goal in their THEN portion. Next, the system will determine whether there are facts that can cause any of those rules to fire. If such facts are not available they are set up as subgoals. The process continues recursively until either all the required facts are found and the goal is proved, or any one of the subgoals cannot be satisfied, in which case the original goal is disproved. Both control procedures are illustrated in Figure 1. Figure 1a shows how, given the assertion that a lathe is a machine tool and a set of rules concerning machine tools, a forward-chaining system will generate additional assertions such as "a lathe is power driven" and "a lathe has a tool holder". Figure 1b details the backward-chaining sequence producing the answer to the query "does a lathe require a power source?".

In the forward chaining example of Figure 1a, both rules R2 and R3 simultaneously qualify for firing when inferencing starts, as both their IF parts match the presented fact F1. Conflict resolution has to be performed by the expert system to decide which rule should fire. The conflict resolution method adopted in this example is "first come, first served": R2 fires as it is the first qualifying rule encountered. Other conflict resolution methods include "priority", "specificity" and "recency".

The search strategies can also be illustrated using the forward chaining example of Figure 1a. Suppose that, in addition to F1, the knowledge base also initially contains the assertion "a CNC turning centre is a machine tool". Depth-first search involves firing rules R2 and R3 with X instantiated to "lathe" (as shown in Figure 1a) before firing them again with X instantiated to "CNC turning centre". Breadth-first search will activate rule R2 with X instantiated to "lathe" and again with X instantiated to "CNC turning centre", followed by rule R3 and the same sequence of instantiations. Breadth-first search finds the shortest line of inferencing between a start position and a solution if it exists. When guided by heuristics to select the correct search path, depth-first search might produce a solution more quickly, although the search might not terminate if the search space is infinite [Jackson, 1999].
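The forward-chaining cycle described above is straightforward to mechanise. The following Python sketch is a minimal, illustrative toy interpreter for the lathe example of Figure 1a, using "first come, first served" conflict resolution; the representation (predicate strings over a single object standing in for the variable X) is an assumption of this sketch, not the chapter's:

```python
facts = {("lathe", "is a machine tool")}                      # F1

rules = [("is power driven", "requires a power source"),      # R1
         ("is a machine tool", "has a tool holder"),          # R2
         ("is a machine tool", "is power driven")]            # R3

changed = True
while changed:                                # recognise-act cycle; stop when quiet
    changed = False
    for condition, conclusion in rules:       # "first come, first served"
        for x, predicate in sorted(facts):
            if predicate == condition and (x, conclusion) not in facts:
                facts.add((x, conclusion))    # fire: assert the THEN part
                changed = True
                break                         # one firing per cycle
        if changed:
            break

print(sorted(facts))   # F1 plus the three derived assertions F2, F3 and F4
```

Depth-first versus breadth-first behaviour would be controlled by how the qualifying (rule, fact) pairs are ordered in the two loops.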

For more information on the technology of expert systems, see [Pham and Pham, 1988; Durkin, 1994; Giarratano and Riley, 1998; Darlington, 1999; Jackson, 1999; Badiru and Cheung, 2002; Nurminen et al., 2003].

Most expert systems are nowadays developed using programs known as "shells". These are essentially ready-made expert systems complete with inferencing and knowledge storage facilities but without the domain knowledge. Some sophisticated expert systems are constructed with the help of "development environments". The latter are more flexible than shells in that they also provide means for users to implement their own inferencing and knowledge representation methods. More details on expert system shells and development environments can be found in [Price, 1990].


KNOWLEDGE BASE (Initial State)
Facts:  F1 - A lathe is a machine tool
Rules:  R1 - If X is power driven Then X requires a power source
        R2 - If X is a machine tool Then X has a tool holder
        R3 - If X is a machine tool Then X is power driven

F1 & R2 match: R2 fires.

KNOWLEDGE BASE (Intermediate State)
Facts:  F1 - A lathe is a machine tool
        F2 - A lathe has a tool holder
Rules:  R1, R2, R3 (as above)

F1 & R3 match: R3 fires.

KNOWLEDGE BASE (Intermediate State)
Facts:  F1 - A lathe is a machine tool
        F2 - A lathe has a tool holder
        F3 - A lathe is power driven
Rules:  R1, R2, R3 (as above)

F3 & R1 match: R1 fires.

KNOWLEDGE BASE (Final State)
Facts:  F1 - A lathe is a machine tool
        F2 - A lathe has a tool holder
        F3 - A lathe is power driven
        F4 - A lathe requires a power source
Rules:  R1, R2, R3 (as above)

Figure 1a. An example of forward chaining


[Figure 1b traces the goal stack during backward chaining: the goal G1 "A lathe requires a power source" matches the THEN part of R1, setting up the subgoal G2 "A lathe is power driven"; G2 matches R3, setting up the subgoal G3 "A lathe is a machine tool"; G3 is satisfied by fact F1, so G2 and then G1 are satisfied in turn, with the facts "A lathe is power driven" and "A lathe requires a power source" added to the knowledge base.]

Figure 1b. An example of backward chaining


Among the five tools considered in this chapter, expert systems are probably the most mature, with many commercial shells and development tools available to facilitate their construction. Consequently, once the domain knowledge to be incorporated in an expert system has been extracted, the process of building the system is relatively simple. The ease with which expert systems can be developed has led to a large number of applications of the tool. In engineering, applications can be found for a variety of tasks including selection of materials, machine elements, tools, equipment and processes, signal interpreting, condition monitoring, fault diagnosis, machine and process control, machine design, process planning, production scheduling and system configuring. Some recent examples of specific tasks undertaken by expert systems are:
• identifying and planning inspection schedules for critical components of an offshore structure [Peers et al., 1994];
• automating the evaluation of manufacturability in CAD systems [Venkatachalam, 1994];
• choosing an optimal robot for a particular task [Kamrani et al., 1995];
• monitoring the technical and organisational problems of vehicle maintenance in coal mining [Streichfuss and Burgwinkel, 1995];
• configuring paper feeding mechanisms [Koo and Han, 1996];
• training technical personnel in the design and evaluation of energy cogeneration plants [Lara Rosano et al., 1996];
• storing, retrieving and adapting planar linkage designs [Bose et al., 1997];
• designing additive formulae for engine oil products [Shi et al., 1997];
• carrying out automatic remeshing during a finite-elements analysis of forging deformation [Yano et al., 1997];
• designing products and their assembly processes [Zha et al., 1998];
• modelling and control of combustion processes [Kalogirou, 2003];
• optimising the transient performances in the adaptive control of a planar robot [De La Sen et al., 2004].

2. FUZZY LOGIC

A disadvantage of ordinary rule-based expert systems is that they cannot handle new situations not covered explicitly in their knowledge bases (that is, situations not fitting exactly those described in the "IF" parts of the rules). These rule-based systems are completely unable to produce conclusions when such situations are encountered. They are therefore regarded as shallow systems which fail in a "brittle" manner, rather than exhibit a gradual reduction in performance when faced with increasingly unfamiliar problems, as human experts would.

The use of fuzzy logic [Zadeh, 1965], which reflects the qualitative and inexact nature of human reasoning, can enable expert systems to be more resilient. With fuzzy logic, the precise value of a variable is replaced by a linguistic description, the meaning of which is represented by a fuzzy set, and inferencing is carried out based on this representation. Fuzzy set theory may be considered an extension of classical set theory. While classical set theory is about "crisp" sets with sharp boundaries, fuzzy set theory is concerned with "fuzzy" sets whose boundaries are "grey".

In classical set theory, an element u_i can either belong or not belong to a set A, i.e. the degree to which element u_i belongs to set A is either 1 or 0. However, in fuzzy set theory, the degree of belonging of an element u_i to a fuzzy set A is a real number between 0 and 1. This is denoted by μ_A(u_i), the grade of membership of u_i in A. Fuzzy set A is a fuzzy set in U, the "universe of discourse" or "universe", which includes all objects to be discussed. μ_A(u_i) is 1 when u_i is definitely a member of A and μ_A(u_i) is 0 when u_i is definitely not a member of A. For instance, a fuzzy set defining the term "normal room temperature" might be:

(1) normal room temperature ≡ 0.0/below 10 °C + 0.3/10 °C–16 °C + 0.8/16 °C–18 °C + 1.0/18 °C–22 °C + 0.8/22 °C–24 °C + 0.3/24 °C–30 °C + 0.0/above 30 °C

The values 0.0, 0.3, 0.8 and 1.0 are the grades of membership to the given fuzzy set of the temperature ranges below 10 °C (above 30 °C), between 10 °C and 16 °C (24 °C–30 °C), between 16 °C and 18 °C (22 °C–24 °C) and between 18 °C and 22 °C. Figure 2(a) shows a plot of the grades of membership for "normal room temperature". For comparison, Figure 2(b) depicts the grades of membership for a crisp set defining room temperatures in the normal range.

Knowledge in an expert system employing fuzzy logic can be expressed as qualitative statements (or fuzzy rules) such as "If the room temperature is normal, then set the heat input to normal", where "normal room temperature" and "normal heat input" are both fuzzy sets.

A fuzzy rule relating two fuzzy sets A and B is effectively the Cartesian product A × B, which can be represented by a relation matrix [R]. Element R_ij of [R] is the membership to A × B of the pair (u_i, v_j), u_i ∈ A and v_j ∈ B. R_ij is given by:

(2) R_ij = min(μ_A(u_i), μ_B(v_j))

For example, with "normal room temperature" defined as before and "normal heat input" described by:

(3) normal heat input ≡ 0.2/1 kW + 0.9/2 kW + 0.2/3 kW


Figure 2. (a) Fuzzy set of "normal temperature" (b) Crisp set of "normal temperature". Both panels plot the grade of membership μ against temperature (°C).

[R] can be computed as:

(4) R = ⎡ 0.0  0.0  0.0 ⎤
        ⎢ 0.2  0.3  0.2 ⎥
        ⎢ 0.2  0.8  0.2 ⎥
        ⎢ 0.2  0.9  0.2 ⎥
        ⎢ 0.2  0.8  0.2 ⎥
        ⎢ 0.2  0.3  0.2 ⎥
        ⎣ 0.0  0.0  0.0 ⎦

A reasoning procedure known as the compositional rule of inference, which is the equivalent of the modus ponens rule in rule-based expert systems, enables conclusions to be drawn by generalisation (extrapolation or interpolation) from the qualitative information stored in the knowledge base. For instance, when the room temperature is detected to be "slightly below normal", a temperature-controlling fuzzy expert system might deduce that the heat input should be set to "slightly above normal". Note that this conclusion might not be contained in any of the fuzzy rules stored in the system. A well-known compositional rule of inference is the max-min rule. Let [R] represent the fuzzy rule "If A Then B" and a ≡ Σ_i α_i/u_i a fuzzy assertion. A and a are fuzzy sets in the same universe of discourse. The max-min rule enables a fuzzy conclusion b ≡ Σ_j β_j/v_j to be inferred from a and [R] as follows:

(5) b = a ∘ [R]

(6) β_j = max_i [min(α_i, R_ij)]

For example, given the fuzzy rule "If the room temperature is normal, then set the heat input to normal", where "normal room temperature" and "normal heat input" are as defined previously, and a fuzzy temperature measurement of

(7) temperature ≡ 0.0/below 10 °C + 0.4/10 °C–16 °C + 0.8/16 °C–18 °C + 0.8/18 °C–22 °C + 0.2/22 °C–24 °C + 0.0/24 °C–30 °C + 0.0/above 30 °C

the heat input will be deduced as:

(8) heat input = temperature ∘ [R] = 0.2/1 kW + 0.8/2 kW + 0.2/3 kW
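The numbers in Equations (4) and (8) can be checked with a few lines of code. The following Python sketch is illustrative only (the variable names and list encodings are assumptions, not from the chapter); it builds the relation matrix of Equation (2) and applies the max-min composition of Equations (5)-(6) to the measurement of Equation (7):

```python
# Membership vectors over the universes of Equations (1) and (3) (encoding assumed)
normal_temp = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]   # seven temperature ranges
normal_heat = [0.2, 0.9, 0.2]                        # 1 kW, 2 kW, 3 kW

# Equation (2): R_ij = min(mu_A(u_i), mu_B(v_j))
R = [[min(a, b) for b in normal_heat] for a in normal_temp]

def max_min_compose(a, R):
    """Equations (5)-(6): beta_j = max_i min(alpha_i, R_ij)."""
    return [max(min(ai, row[j]) for ai, row in zip(a, R))
            for j in range(len(R[0]))]

measured = [0.0, 0.4, 0.8, 0.8, 0.2, 0.0, 0.0]       # Equation (7)
print(max_min_compose(measured, R))                  # -> [0.2, 0.8, 0.2], Equation (8)
```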

For further information on fuzzy logic, see [Kaufmann, 1975; Klir and Yuan, 1995; 1996; Ross, 1995; Zimmermann, 1996; Dubois and Prade, 1998].

Fuzzy logic potentially has many applications in engineering, where the domain knowledge is usually imprecise. Notable successes have been achieved in the area of process and machine control, although other sectors have also benefited from this tool. Recent examples of engineering applications include:
• controlling the height of the arc in a welding process [Bigand et al., 1994];
• controlling the rolling motion of an aircraft [Ferreiro Garcia, 1994];
• controlling a multi-fingered robot hand [Bas and Erkmen, 1995];
• analysing the chemical composition of minerals [Da Rocha Fernandes and Cid Bastos, 1996];
• monitoring of tool-breakage in end-milling operations [Chen and Black, 1997];
• modelling of the set-up and bend sequencing process for sheet metal bending [Ong et al., 1997];
• determining the optimal formation of manufacturing cells [Szwarc et al., 1997; Zülal and Arikan, 2000];
• classifying discharge pulses in electrical discharge machining [Tarng et al., 1997];
• modelling an electrical drive system [Costa Branco and Dente, 1998];
• improving the performance of hard disk drive final assembly [Zhao and De Souza, 1998; 2001];
• analysing chatter occurring during a machine tool cutting process [Kong et al., 1999];
• addressing the relationships between customer needs and design requirements [Sohen and Choi, 2001; Vanegas and Labib, 2001; Karsak, 2004];
• assessing and selecting advanced manufacturing systems [Karsak and Kuzgunkaya, 2002; Bozdag et al., 2003; Beskese et al., 2004; Kulak and Kahraman, 2004];
• evaluating cutting force uncertainty in turning [Wang et al., 2002];
• reducing defects in automotive coating operations [Lou and Huang, 2003].

3. INDUCTIVE LEARNING

The acquisition of domain knowledge to build into the knowledge base of an expert system is generally a major task. In some cases, it has proved a bottleneck in the construction of an expert system. Automatic knowledge acquisition techniques have been developed to address this problem. Inductive learning is an automatic technique for knowledge acquisition. The inductive approach produces a structured representation of knowledge as the outcome of learning. Induction involves generalising a set of examples to yield a selected representation, which can be in terms of a set of rules, concepts or logical inferences, or a decision tree.

An inductive learning program usually requires as input a set of examples. Each example is characterised by the values of a number of attributes and the class to which it belongs. In one approach to inductive learning, through a process of "dividing-and-conquering" where attributes are chosen according to some strategy (for example, to maximise the information gain) to divide the original example set into subsets, the inductive learning program builds a decision tree that correctly classifies the given example set. The tree represents the knowledge generalised from the specific examples in the set. This can subsequently be used to handle situations not explicitly covered by the example set.
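For the dividing-and-conquering strategy just described, a common choice of attribute-selection criterion is entropy-based information gain. The sketch below is a generic, illustrative formulation (the chapter gives no code; the dict-based example encoding and toy data are assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label):
    """Reduction in class entropy obtained by splitting `examples` on `attribute`.
    `examples` is a list of dicts; `label` names the class attribute."""
    base = entropy([e[label] for e in examples])
    n = len(examples)
    partitions = {}
    for e in examples:
        partitions.setdefault(e[attribute], []).append(e[label])
    return base - sum(len(v) / n * entropy(v) for v in partitions.values())

# Toy usage: an attribute perfectly correlated with the class gains 1 bit.
data = [{"colour": "red", "ok": "yes"}, {"colour": "blue", "ok": "no"},
        {"colour": "red", "ok": "yes"}, {"colour": "blue", "ok": "no"}]
print(information_gain(data, "colour", "ok"))   # -> 1.0
```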

In another approach known as the "covering approach", the inductive learning program attempts to find groups of attributes uniquely shared by examples in given classes and forms rules with the IF part as conjunctions of those attributes and the THEN part as the classes. The program removes correctly classified examples from consideration and stops when rules have been formed to classify all examples in the given set.

A new approach to inductive learning, "inductive logic programming", is a combination of induction and logic programming. Unlike conventional inductive learning, which uses propositional logic to describe examples and represent new concepts, inductive logic programming (ILP) employs the more powerful predicate logic to represent training examples and background knowledge and to express new concepts. Predicate logic permits the use of different forms of training examples and background knowledge. It enables the results of the induction process, that is the induced concepts, to be described as general first-order clauses with variables and not just as zero-order propositional clauses made up of attribute-value pairs. There are two main types of ILP systems: the first is based on the top-down generalisation/specialisation method, and the second on the principle of inverse resolution [Muggleton, 1992; Lavrac, 1994].

A number of inductive learning programs have been developed. Some of the well known programs are CART [Breiman et al., 1998], ID3 and its descendants C4.5 and C5.0 [Quinlan, 1983; 1986; 1993; ISL, 1998; RuleQuest, 2000], which are divide-and-conquer programs, the AQ family of programs [Michalski, 1969; 1990; Michalski et al., 1986; Cervone et al., 2001; Michalski and Kaufman, 2001], which follow the covering approach, the FOIL program [Quinlan, 1990; Quinlan and Cameron-Jones, 1995], which is an ILP system adopting the generalisation/specialisation method, and the GOLEM program [Muggleton and Feng, 1990], which is an ILP system based on inverse resolution. Although most programs only generate crisp decision rules, algorithms have also been developed to produce fuzzy rules [Wang and Mendel, 1992; Janikow, 1998; Hang and Chen, 2000; Baldwin and Martin, 2001; Wang et al., 2001; Baldwin and Karale, 2003; Wang et al., 2003].

Figure 3 shows the main steps in RULES-3 Plus, an induction algorithm in the covering category [Pham and Dimov, 1997] belonging to the RULES family of rule extraction systems [Pham and Aksoy, 1994; 1995a; 1995b; Pham et al., 2000; Pham et al., 2003; Pham and Afify, 2005a]. The simple problem of detecting the state of a metal cutting tool is used to explain the operation of RULES-3 Plus. Three sensors are employed to monitor the cutting process and, according to the signals obtained from them (1 or 0 for sensors 1 and 3; −1, 0, or 1 for sensor 2), the tool is inferred as being "normal" or "worn". Thus, this problem involves three attributes, which are the states of sensors 1, 2 and 3, and the signals that they emit constitute the values of those attributes. The example set for the problem is given in Table 1.

Table 1. Training set for the Cutting Tool problem

Example   Sensor_1   Sensor_2   Sensor_3   Tool State
1         0          −1         0          Normal
2         1          0          0          Normal
3         1          −1         1          Worn
4         1          0          1          Normal
5         0          0          1          Normal
6         1          1          1          Worn
7         1          −1         0          Normal
8         0          −1         1          Worn


Step 1. Take an unclassified example and form array SETAV.

Step 2. Initialise arrays PRSET and T_PRSET (PRSET and T_PRSET will consist of mPRSET expressions with null conditions and zero H measures) and set nco = 0.

Step 3. IF nco < na THEN nco = nco + 1 and set m = 0; ELSE the example itself is taken as a rule and STOP.

Step 4. DO m = m + 1;
        Specialise expression m in PRSET by appending to it a condition from SETAV that differs from the conditions already included in the expression;
        Compute the H measure for the expression;
        IF its H measure is higher than the H measure of any expression in T_PRSET THEN replace the expression having the lowest H measure with the newly formed expression;
        ELSE discard the new expression;
        WHILE m < mPRSET.

Step 5. IF there are consistent expressions in T_PRSET THEN choose as a rule the expression that has the highest H measure and discard the others; ELSE copy T_PRSET into PRSET, initialise T_PRSET and go to step 3.

Figure 3. Rule forming procedure of RULES-3 Plus

Notes: nco – number of conditions; na – number of attributes; mPRSET – number of expressions stored in PRSET (mPRSET is user-provided); T_PRSET – a temporary array of partial rules of the same dimension as PRSET.

In step 1, example 1 is used to form the attribute-value array SETAV, which will contain the following attribute-value pairs: [Sensor_1 = 0], [Sensor_2 = −1] and [Sensor_3 = 0].

In step 2, the partial rule set PRSET and T_PRSET, the temporary version of PRSET used for storing partial rules in the process of rule construction, are initialised. This creates for each of these sets three expressions having null conditions and zero H measures. The H measure for an expression is defined as:

(9) H = √(E^c/E) [2 − 2√((E^c_i/E^c)(E_i/E)) − 2√((1 − E^c_i/E^c)(1 − E_i/E))]

where E^c is the number of examples covered by the expression (the total number of examples correctly classified and misclassified by a given rule), E is the total number of examples, E^c_i is the number of examples covered by the expression and belonging to the target class i (the number of examples correctly classified by a given rule), and E_i is the number of examples in the training set belonging to the target class i. In Equation (9), the first term

(10) G = √(E^c/E)

relates to the generality of the rule and the second term

(11) A = 2 − 2√((E^c_i/E^c)(E_i/E)) − 2√((1 − E^c_i/E^c)(1 − E_i/E))

indicates its accuracy.

In steps 3 and 4, by specialising PRSET using the conditions stored in SETAV, the following expressions are formed and stored in T_PRSET:

1: [Sensor_3 = 0] ⇒ [Alarm = OFF], H = 0.2565
2: [Sensor_2 = −1] ⇒ [Alarm = OFF], H = 0.0113
3: [Sensor_1 = 0] ⇒ [Alarm = OFF], H = 0.0012

In step 5, a rule is produced as the first expression in T_PRSET applies to only one class:

Rule 1: IF [Sensor_3 = 0] THEN [Alarm = OFF], H = 0.2565

Rule 1 can classify examples 2 and 7 in addition to example 1. Therefore, these examples are marked as classified and the induction proceeds.

In the second iteration, example 3 is considered. T_PRSET, formed in step 4 after specialising the initial PRSET, now consists of the following expressions:

1: [Sensor_3 = 1] ⇒ [Alarm = ON], H = 0.0406
2: [Sensor_2 = −1] ⇒ [Alarm = ON], H = 0.0079
3: [Sensor_1 = 1] ⇒ [Alarm = ON], H = 0.0005

As none of the expressions covers only one class, T_PRSET is copied into PRSET (step 5) and the new PRSET has to be specialised further by appending the existing expressions with conditions from SETAV. Therefore the procedure returns to step 3 for a new pass. The new T_PRSET formed at the end of step 4 contains the following three expressions:

1: [Sensor_2 = −1][Sensor_3 = 1] ⇒ [Alarm = ON], H = 0.3876
2: [Sensor_1 = 1][Sensor_3 = 1] ⇒ [Alarm = ON], H = 0.0534
3: [Sensor_1 = 1][Sensor_2 = −1] ⇒ [Alarm = ON], H = 0.0008


As the first expression applies to only one class, the following rule is obtained:

Rule 2: IF [Sensor_2 = −1] AND [Sensor_3 = 1] THEN [Alarm = ON], H = 0.3876

Rule 2 can classify examples 3 and 8, which again are marked as classified. In the third iteration, example 4 is used to obtain the next rule:

Rule 3: IF [Sensor_2 = 0] THEN [Alarm = OFF], H = 0.2565

This rule can classify examples 4 and 5 and so they are also marked as classified. In iteration 4, the last unclassified example 6 is employed for rule extraction, yielding:

Rule 4: IF [Sensor_2 = 1] THEN [Alarm = ON], H = 0.2741

There are no remaining unclassified examples in the example set and the procedure terminates at this point.
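The H values quoted in this walkthrough follow directly from Equation (9) and Table 1. The Python sketch below is an illustrative reconstruction, not code from the chapter; note that the classes written Alarm = OFF/ON above correspond to the tool states Normal/Worn in Table 1:

```python
import math

# Training set of Table 1: (Sensor_1, Sensor_2, Sensor_3, tool state)
examples = [(0, -1, 0, "Normal"), (1, 0, 0, "Normal"), (1, -1, 1, "Worn"),
            (1, 0, 1, "Normal"),  (0, 0, 1, "Normal"), (1, 1, 1, "Worn"),
            (1, -1, 0, "Normal"), (0, -1, 1, "Worn")]

def h_measure(attribute, value, target):
    """Equation (9): H for the rule IF [attribute = value] THEN [target]."""
    E = len(examples)
    covered = [e for e in examples if e[attribute] == value]
    Ec = len(covered)                                # E^c: examples covered
    Eci = sum(1 for e in covered if e[3] == target)  # E^c_i: covered and in class i
    Ei = sum(1 for e in examples if e[3] == target)  # E_i: class i examples
    G = math.sqrt(Ec / E)                            # generality, Eq. (10)
    A = (2 - 2 * math.sqrt((Eci / Ec) * (Ei / E))    # accuracy, Eq. (11)
           - 2 * math.sqrt((1 - Eci / Ec) * (1 - Ei / E)))
    return G * A

print(round(h_measure(2, 0, "Normal"), 4))    # Sensor_3 = 0  -> 0.2565
print(round(h_measure(1, -1, "Normal"), 4))   # Sensor_2 = -1 -> 0.0113
print(round(h_measure(0, 0, "Normal"), 4))    # Sensor_1 = 0  -> 0.0012
```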

Due to its requirement for a set of examples in a rigid format (with known attributes and of known classes), inductive learning has found rather limited applications in engineering, as not many engineering problems can be described in terms of such a set of examples. Another reason for the paucity of applications is that inductive learning is generally more suitable for problems where attributes have discrete or symbolic values than for those with continuous-valued attributes, as in many engineering problems. Some recent examples of applications of inductive learning are:
• controlling a laser cutting robot [Luzeaux, 1994];
• controlling the functional electrical stimulation of spinally-injured humans [Kostov et al., 1995];
• modelling job complexity in clothing production systems [Hui et al., 1997];
• analysing the constructability of a beam in a reinforced-concrete frame [Skibniewski et al., 1997];
• analysing the results of tests on portable electronic products to discover useful design knowledge [Zhou, 2001];
• accelerating rotogravure printing [Evans and Fisher, 2002];
• predicting JIT factory performance from past data that includes both good and poor factory performance [Mathieu et al., 2002];
• developing an intelligent monitoring system for improving the reliability of a manufacturing process [Peng, 2004];
• analysing data in a steel bar manufacturing company to help intelligent decision making [Pham et al., 2004].

More information on inductive learning techniques and their applications in engineering and manufacture can be found in [Pham et al., 2002; Pham and Afify, 2005b].


4. NEURAL NETWORKS

Like inductive learning programs, neural networks can capture domain knowledge from examples. However, they do not archive the acquired knowledge in an explicit form such as rules or decision trees, and they can readily handle both continuous and discrete data. They also have a good generalisation capability, as with fuzzy expert systems.

A neural network is a computational model of the brain. Neural network models usually assume that computation is distributed over several simple units called neurons, which are interconnected and which operate in parallel (hence, neural networks are also called parallel-distributed-processing systems or connectionist systems). Figure 4 illustrates a typical model of a neuron. Output signal y_j is a function f of the sum of weighted input signals x_i. The activation function f can be a linear, simple threshold, sigmoidal, hyperbolic tangent or radial basis function. Instead of being deterministic, f can be a probabilistic function, in which case y_j will be a binary quantity, for example, +1 or −1. The net input to such a stochastic neuron, that is, the sum of weighted input signals x_i, will then give the probability of y_j being +1 or −1.

How the inter-neuron connections are arranged and the nature of the connections determine the structure of a network. How the strengths of the connections are adjusted or trained to achieve a desired overall behaviour of the network is governed by its learning algorithm. Neural networks can be classified according to their structures and learning algorithms.

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks. Feedforward networks can perform a static mapping between an input space and an output space: the output at a given instant is a function only of the input at that instant. The most popular feedforward neural network is the multi-layer perceptron (MLP): all signals flow in a single direction from the input to the output of the network. Figure 5 shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals x_i to neurons in the hidden layer. Each neuron j in the hidden layer operates according to the model of Figure 4. That is, its output y_j is given by:

(12) y_j = f(Σ_i w_ji x_i)

Figure 4. Model of a neuron: inputs x_1, …, x_i, …, x_n are weighted by w_j1, …, w_ji, …, w_jn, summed and passed through the activation function f(·) to give the output y_j


Figure 5. A multi-layer perceptron: an input layer (x_1, x_2, …, x_m), a hidden layer reached through weights w_11, w_12, …, w_1m, and an output layer (y_1, …, y_n)

The outputs of neurons in the output layer are computed similarly. Other feedforward networks [Pham and Liu, 1999] include the learning vector quantisation (LVQ) network, the cerebellar model articulation control (CMAC) network and the group-method of data handling (GMDH) network.

Recurrent networks are networks where the outputs of some neurons are fed back to the same neurons or to neurons in layers before them. Thus signals can flow in both forward and backward directions. Recurrent networks are said to have a dynamic memory: the output of such networks at a given instant reflects the current input as well as previous inputs and outputs. Examples of recurrent networks [Pham and Liu, 1999] include the Hopfield network, the Elman network and the Jordan network. Figure 6 shows a well-known, simple recurrent neural network, the Grossberg and Carpenter ART-1 network. The network has two layers, an input layer and an output layer. The two layers are fully interconnected; the connections are in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector W_i of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the W_i vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter being the neuron whose W_i vector is most similar to the current input pattern. The vector V_i of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors V_i form the short-term memory of the network. V_i and W_i are related in that W_i is a normalised copy of V_i, viz.

(13) W_i = V_i / (ε + Σ_j V_ji)


Figure 6. An ART-1 network: an input layer and an output layer, fully interconnected by bottom-up weights W and top-down weights V

where ε is a small constant and V_ji is the jth component of V_i (i.e. the weight of the connection from output neuron i to input neuron j).

Implicit "knowledge" is built into a neural network by training it. Neural networks are trained and categorised according to two main types of learning algorithms: supervised and unsupervised. In addition, there is a third type, reinforcement learning, which is a special case of supervised learning. In supervised training, the neural network can be trained by being presented with typical input patterns and the corresponding expected output patterns. The error between the actual and expected outputs is used to modify the strengths, or weights, of the connections between the neurons. The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm. It gives the change Δw_ji in the weight of a connection between neurons i and j as follows:

(14) Δw_ji = η δ_j x_i

where η is a parameter called the learning rate and δ_j is a factor depending on whether neuron j is an output neuron or a hidden neuron. For output neurons,

(15) δ_j = (∂f/∂net_j) (y_j^(t) − y_j)

and for hidden neurons,

(16) δ_j = (∂f/∂net_j) Σ_q w_qj δ_q


In Equation (15), net_j is the total weighted sum of input signals to neuron j and y_j^(t) is the target output for neuron j.

As there are no target outputs for hidden neurons, in Equation (16), the difference between the target and actual output of a hidden neuron j is replaced by the weighted sum of the δ_q terms already obtained for neurons q connected to the output of j. Thus, iteratively, beginning with the output layer, the δ term is computed for neurons in all layers and weight updates determined for all connections. The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP.

For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a "momentum" term to Equation (14), which effectively lets the previous weight change influence the new weight change, viz:

(17) Δw_ji(k+1) = η δ_j x_i + μ Δw_ji(k)

where Δw_ji(k+1) and Δw_ji(k) are the weight changes in epochs (k+1) and (k) respectively and μ is the "momentum" coefficient.
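Equations (12) and (14)-(17) map directly onto code. The following Python sketch is a minimal illustration under assumed choices (one hidden layer of three neurons, sigmoid activations, pattern-based training on the XOR problem, η = 0.5, μ = 0.9); it is not the chapter's implementation:

```python
import math, random

random.seed(0)
sig = lambda net: 1.0 / (1.0 + math.exp(-net))   # sigmoidal activation f

# XOR training patterns: (inputs, target output)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

H = 3                                             # hidden neurons (assumed)
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]  # incl. bias
w_o = [random.uniform(-1, 1) for _ in range(H + 1)]
dw_h = [[0.0] * 3 for _ in range(H)]              # previous changes, for Eq. (17)
dw_o = [0.0] * (H + 1)
eta, mu = 0.5, 0.9                                # learning rate, momentum

for epoch in range(5000):                         # pattern-based training
    for x, t in data:
        xb = x + [1]                              # inputs plus bias
        h = [sig(sum(w * xi for w, xi in zip(ws, xb))) for ws in w_h]  # Eq. (12)
        hb = h + [1]
        y = sig(sum(w * hi for w, hi in zip(w_o, hb)))
        d_o = y * (1 - y) * (t - y)               # Eq. (15); f'(net) = y(1 - y)
        d_h = [h[j] * (1 - h[j]) * w_o[j] * d_o for j in range(H)]     # Eq. (16)
        for i in range(H + 1):                    # Eq. (17): delta rule + momentum
            dw_o[i] = eta * d_o * hb[i] + mu * dw_o[i]
            w_o[i] += dw_o[i]
        for j in range(H):
            for i in range(3):
                dw_h[j][i] = eta * d_h[j] * xb[i] + mu * dw_h[j][i]
                w_h[j][i] += dw_h[j][i]
# After training, the network output for each pattern should approximate its target.
```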

Some neural networks are trained in an unsupervised mode, where only the input patterns are provided during training and the networks learn automatically to cluster them in groups with similar features. For example, training an ART-1 network involves the following steps:

(i) initialising the exemplar and vigilance vectors W_i and V_i for all output neurons by setting all the components of each V_i to 1 and computing W_i according to Equation (13). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron, in the sense that it is not assigned to represent any pattern classes;

(ii) presenting a new input pattern x;

(iii) enabling all output neurons so that they can participate in the competition for activation;

(iv) finding the winning output neuron among the competing neurons, i.e. the neuron for which x · W_i is largest; a winning neuron can be an uncommitted neuron, as is the case at the beginning of training or if there are no better output neurons;

(v) testing whether the input pattern x is sufficiently similar to the vigilance vector V_i of the winning neuron. Similarity is measured by the fraction r of bits in x that are also in V_i, viz.

(18) r = (x · V_i) / (Σ_i x_i)

x is deemed to be sufficiently similar to V_i if r is at least equal to a vigilance threshold ρ (0 < ρ ≤ 1);


(vi) going to step (vii) if r ≥ ρ (i.e. there is resonance); else disabling the winning neuron temporarily from further competition and going to step (iv), repeating this procedure until there are no further enabled neurons;

(vii) adjusting the vigilance vector V_i of the most recent winning neuron by logically ANDing it with x, thus deleting bits in V_i that are not also in x; computing the bottom-up exemplar vector W_i using the new V_i according to Equation (13); activating the winning output neuron;

(viii) going to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).
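Steps (i)-(viii) condense into a short loop. The Python sketch below is one illustrative reading of the procedure for binary input patterns, with assumed values for the small constant ε of Equation (13) and the vigilance threshold ρ; it is not code from the chapter:

```python
EPS, RHO = 0.5, 0.7        # assumed: constant of Eq. (13), vigilance threshold

def train_art1(patterns, n_outputs):
    n = len(patterns[0])
    V = [[1] * n for _ in range(n_outputs)]       # step (i): all vigilance weights 1
    W = [[v / (EPS + sum(Vi)) for v in Vi] for Vi in V]       # Eq. (13)
    for x in patterns:                            # step (ii): present a pattern
        enabled = set(range(n_outputs))           # step (iii): enable all neurons
        while enabled:
            # step (iv): winner = enabled neuron with the largest x . W_i
            i = max(enabled, key=lambda k: sum(w * b for w, b in zip(W[k], x)))
            r = sum(v * b for v, b in zip(V[i], x)) / sum(x)  # Eq. (18)
            if r >= RHO:                          # step (vi): resonance
                V[i] = [v & b for v, b in zip(V[i], x)]       # step (vii): AND with x
                W[i] = [v / (EPS + sum(V[i])) for v in V[i]]  # Eq. (13) again
                break
            enabled.discard(i)                    # disable winner, re-run competition
    return V, W

V, W = train_art1([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]], n_outputs=3)
```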

In reinforcement learning, instead of requiring a teacher to give target outputs and using the differences between the target and actual outputs directly to modify the weights of a neural network, the learning algorithm employs a critic only to evaluate the appropriateness of the neural network output corresponding to a given input. According to the performance of the network on a given input vector, the critic will issue a positive or negative reinforcement signal. If the network has produced an appropriate output, the reinforcement signal will be positive (a reward). Otherwise, it will be negative (a penalty). The intention of this is to strengthen the tendency to produce appropriate outputs and to weaken the propensity for generating inappropriate outputs. Reinforcement learning is a trial-and-error operation designed to maximise the average value of the reinforcement signal for a set of training input vectors. An example of a simple reinforcement learning algorithm is a variation of the associative reward-penalty algorithm [Hassoun, 1995]. Consider a single stochastic neuron j with inputs (x_1, x_2, x_3, …, x_n). The reinforcement rule may be written as [Hassoun, 1995]:

(19) w_ji(k+1) = w_ji(k) + l r(k) [y_j(k) − E(y_j(k))] x_i(k)

where w_ji is the weight of the connection between input i and neuron j, l is the learning coefficient, r (which is +1 or −1) is the reinforcement signal, y_j is the output of neuron j, E(y_j) is the expected value of the output, and x_i(k) is the ith component of the kth input vector in the training set. When learning converges, w_ji(k+1) = w_ji(k) and so E(y_j(k)) = y_j(k) = +1 or −1. Thus, the neuron effectively becomes deterministic. Reinforcement learning is typically slower than supervised learning. It is more applicable to small neural networks used as controllers, where it is difficult to determine the target network output.
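A single update of Equation (19) might be coded as follows. This is a schematic sketch only: the stochastic neuron's output probability and the critic are stubbed with assumed forms, which the chapter does not specify:

```python
import math, random

def arp_step(w, x, l, critic):
    """One associative reward-penalty update, Equation (19) (illustrative stub)."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-2.0 * net))   # assumed form of P(y = +1)
    y = 1 if random.random() < p else -1     # stochastic binary output
    e_y = 2.0 * p - 1.0                      # E(y) = (+1)p + (-1)(1 - p)
    r = critic(x, y)                         # reinforcement signal: +1 or -1
    return [wi + l * r * (y - e_y) * xi for wi, xi in zip(w, x)]
```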

For more information on neural networks, see [Michie et al., 1994; Hassoun, 1995; Pham and Liu, 1999; Yao, 1999; Jiang et al., 2002; Duch et al., 2004].

Neural networks can be employed as mapping devices, pattern classifiers or pattern completers (auto-associative content addressable memories and pattern associators). Like expert systems, they have found a wide spectrum of applications in almost all areas of engineering, addressing problems ranging from modelling, prediction, control, classification and pattern recognition, to data association, clustering, signal processing and optimisation. Some recent examples of such applications are:
• predicting the tensile strength of composite laminates [Teti and Caprino, 1994];
• controlling a flexible assembly operation [Majors and Richards, 1995];
• choosing sheet metal working conditions [Lin and Chang, 1996];
• determining suitable cutting conditions in operation planning [Park et al., 1996; Schultz et al., 1997];
• recognising control chart patterns [Pham and Oztemel, 1996];
• analysing vibration spectra [Smith et al., 1996];
• deducing velocity vectors in uniform and rotating flows by tracking the movement of groups of particles [Jambunathan et al., 1997];
• setting the number of kanbans in a dynamic JIT factory [Wray et al., 1997; Markham et al., 2000];
• generating knowledge for scheduling a flexible manufacturing system [Kim et al., 1998; Priore et al., 2003];
• modelling and controlling dynamic systems including robot arms [Pham and Liu, 1999];
• acquiring and refining operational knowledge in industrial processes [Shigaki and Narazaki, 1999];
• improving yield in a semiconductor manufacturing company [Shin and Park, 2000];
• identifying arbitrary geometric and manufacturing categories in CAD databases [Ip et al., 2003];
• minimising the makespan in a flowshop scheduling problem [Akyol, 2004].

5. GENETIC ALGORITHMS

Conventional search techniques, such as hill-climbing, are often incapable of optimising non-linear or multimodal functions. In such cases, a random search method is generally required. However, undirected search techniques are extremely inefficient for large domains. A genetic algorithm (GA) is a directed random search technique, invented by Holland [Holland, 1975], which can find the global optimal solution in complex multi-dimensional search spaces. A GA is modelled on natural evolution in that the operators it employs are inspired by the natural evolution process. These operators, known as genetic operators, manipulate individuals in a population over several generations to improve their fitness gradually. Individuals in a population are likened to chromosomes and are usually represented as strings of binary numbers.

The evolution of a population is described by the "schema theorem" [Holland, 1975; Goldberg, 1989]. A schema represents a set of individuals, i.e. a subset of the population, in terms of the similarity of bits at certain positions of those individuals. For example, the schema 1∗0∗ describes the set of individuals whose first and third bits are 1 and 0, respectively. Here, the symbol ∗ means any value would be acceptable. In other words, the values of bits at positions marked ∗ could be either 0 or 1. A schema is characterised by two parameters: defining length and order. The defining length is the length between the first and last bits with fixed values. The order of a schema is the number of bits with specified values. According to the schema theorem, the distribution of a schema through the population from one generation to the next depends on its order, defining length and fitness.
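Both schema parameters are easy to compute mechanically. A small illustrative Python sketch (function names are assumptions), with '∗' written as '*' and marking the unspecified positions:

```python
def schema_order(schema):
    """Order: the number of bits with specified (non-'*') values."""
    return sum(1 for c in schema if c != '*')

def defining_length(schema):
    """Defining length: distance between the first and last fixed bits."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def matches(schema, individual):
    """True if the bit string belongs to the set the schema describes."""
    return all(s in ('*', b) for s, b in zip(schema, individual))

print(schema_order('1*0*'), defining_length('1*0*'))     # -> 2 2
print(matches('1*0*', '1101'), matches('1*0*', '1110'))  # -> True False
```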

GAs do not use much knowledge about the optimisation problem under study and do not deal directly with the parameters of the problem. They work with codes which represent the parameters. Thus, the first issue in a GA application is how to code the problem, i.e. how to represent its parameters. As already mentioned, GAs operate with a population of possible solutions. The second issue is the creation of a set of possible solutions at the start of the optimisation process as the initial population. The third issue in a GA application is how to select or devise a suitable set of genetic operators. Finally, as with other search algorithms, GAs have to know the quality of the solutions already found to improve them further. An interface between the problem environment and the GA is needed to provide this information. The design of this interface is the fourth issue.

5.1 Representation

The parameters to be optimised are usually represented in a string form, since this type of representation is suitable for genetic operators. The method of representation has a major impact on the performance of the GA. Different representation schemes might cause different performances in terms of accuracy and computation time.

There are two common representation methods for numerical optimisation problems [Blickle and Thiele, 1995; Michalewicz, 1996]. The preferred method is the binary string representation method. The reason for this method being popular is that the binary alphabet offers the maximum number of schemata per bit compared to other coding techniques. Various binary coding schemes can be found in the literature, for example, uniform coding and Gray coding. The second representation method is to use a vector of integers or real numbers, with each integer or real number representing a single parameter.

When a binary representation scheme is employed, an important step is to decide the number of bits to encode the parameters to be optimised. Each parameter should be encoded with the optimal number of bits covering all possible solutions in the solution space. When too few or too many bits are used, the performance can be adversely affected.

5.2 Creation of Initial Population

At the start of optimisation, a GA requires a group of initial solutions. There are two ways of forming this initial population. The first consists of using randomly produced solutions created by a random number generator, for example. This method is preferred for problems about which no a priori knowledge exists or for assessing the performance of an algorithm.

The second method employs a priori knowledge about the given optimisation problem. Using this knowledge, a set of requirements is obtained and solutions which satisfy those requirements are collected to form an initial population. In this case, the GA starts the optimisation with a set of approximately known solutions and therefore convergence to an optimal solution can take less time than with the previous method.

5.3 Genetic Operators

The flowchart of a simple GA is given in Figure 7. There are basically four genetic operators: selection, crossover, mutation and inversion. Some of these operators were inspired by nature. In the literature, many versions of these operators can be found. It is not necessary to employ all of these operators in a GA because each operates independently of the others. The choice or design of operators depends on the problem and the representation scheme employed. For instance, operators designed for binary strings cannot be directly used on strings coded with integers or real numbers.

Figure 7. Flowchart of a basic genetic algorithm (Initial Population → Evaluation → Selection → Crossover → Mutation → Inversion)

5.3.1 Selection

The aim of the selection procedure is to reproduce more copies of individuals whose fitness values are high than of those whose fitness values are low. The selection procedure has a significant influence on driving the search towards a promising area and finding good solutions in a short time. However, the diversity of the population must be maintained to avoid premature convergence and to reach the global optimal solution. In GAs there are mainly two selection procedures: proportional selection, also called stochastic selection, and ranking-based selection [Whitely, 1989].

Proportional selection is usually called "Roulette Wheel" selection, since its mechanism is reminiscent of the operation of a Roulette Wheel. Fitness values of individuals represent the widths of slots on the wheel. After a random spinning of the wheel to select an individual for the next generation, slots with large widths, representing individuals with high fitness values, will have a higher chance to be selected.

One way to prevent premature convergence is to control the range of trials allocated to any single individual, so that no individual produces too many offspring. The ranking system is one such alternative selection algorithm. In this algorithm, each individual generates an expected number of offspring which is based on the rank of its performance and not on the magnitude [Baker, 1985].
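A roulette-wheel spin amounts to a single walk along the cumulative fitness values. A minimal Python sketch (illustrative; it assumes non-negative fitness values):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness."""
    spin = random.uniform(0, sum(fitnesses))   # random spin of the wheel
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness                  # slot width = fitness value
        if spin <= cumulative:
            return individual
    return population[-1]                      # guard against rounding at the edge
```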

5.3.2 Crossover

This operation is considered the one that makes the GA different from other algorithms, such as dynamic programming. It is used to create two new individuals (children) from two existing individuals (parents) picked from the current population by the selection operation. There are several ways of doing this. Some common crossover operations are one-point crossover, two-point crossover, cycle crossover and uniform crossover.

One-point crossover is the simplest crossover operation. Two individuals are randomly selected as parents from the pool of individuals formed by the selection procedure and cut at a randomly selected point. The tails, which are the parts after the cutting point, are swapped and two new individuals (children) are produced. Note that this operation does not change the values of bits. An example of one-point crossover is shown in Figure 8.

5.3.3 Mutation

In this procedure, all individuals in the population are checked bit by bit and the bit values are randomly reversed according to a specified rate. Unlike crossover, this is a monadic operation: a child string is produced from a single parent string. The mutation operator forces the algorithm to search new areas. Eventually, it helps the GA to avoid premature convergence and find the global optimal solution. An example is given in Figure 9.

Old string 1   1 0 0 | 0 | 1 0 1 1 1 0 1

New string 1   1 0 0 | 1 | 1 0 1 1 1 0 1

Figure 9. Mutation
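A bit-flip mutation of this kind might be sketched as follows (the function name and the default rate are illustrative assumptions):

```python
import random

def mutate(individual, mutation_rate=0.01):
    """Check every bit and reverse it with probability mutation_rate."""
    return [1 - bit if random.random() < mutation_rate else bit
            for bit in individual]
```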

5.3.4 Inversion

This operator is employed for a group of problems, such as the cell placement problem, the layout problem and the travelling salesman problem. It also operates on one individual at a time. Two points are randomly selected from an individual and the part of the string between those two points is reversed (see Figure 10).

Old string 1 0 | 1 1 0 0 | 1 1 1 0 1

New string 1 0 | 0 0 1 1 | 1 1 1 0 1

Figure 10. Inversion of a binary string segment
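A sketch of the inversion operator (names are illustrative; the two points are drawn at random):

```python
import random

def invert(individual):
    """Pick two random points and reverse the segment between them."""
    i, j = sorted(random.sample(range(len(individual) + 1), 2))
    return individual[:i] + individual[i:j][::-1] + individual[j:]

# For the old string of Figure 10, cutting around positions 2 and 6
# reverses the segment [1, 1, 0, 0] into [0, 0, 1, 1], as in the figure.
```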

5.4 Control Parameters

Important control parameters of a simple GA include the population size (number of individuals in the population), crossover rate, mutation rate and inversion rate. Several researchers have studied the effect of these parameters on the performance of a GA [Schaffer et al., 1989; Grefenstette, 1986; Fogarty, 1989; Mahfoud, 1995; Smith and Fogarty, 1997]. The main conclusions are as follows. A large population size means the simultaneous handling of many solutions and increases the computation time per iteration; however, since many samples from the search space are used, the probability of convergence to a global optimal solution is higher than with a small population size.

The crossover rate determines the frequency of the crossover operation. It is useful at the start of optimisation to discover promising regions in the search space. A low crossover frequency decreases the speed of convergence to such areas. If the frequency is too high, it can lead to saturation around one solution. The mutation operation is controlled by the mutation rate. A high mutation rate introduces high diversity in the population and might cause instability. On the other hand, it is usually very difficult for a GA to find a global optimal solution with too low a mutation rate.
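The sketch below shows where these control parameters typically enter a simple generational GA; it reuses the selection, crossover and mutation sketches above, and the function names and parameter values are illustrative assumptions rather than part of the original text:

```python
import random

def run_ga(fitness_fn, new_individual, population_size=50,
           crossover_rate=0.7, mutation_rate=0.01, generations=100):
    """Minimal generational GA loop showing the role of each control parameter."""
    population = [new_individual() for _ in range(population_size)]
    for _ in range(generations):
        fitnesses = [fitness_fn(ind) for ind in population]
        offspring = []
        while len(offspring) < population_size:
            p1 = roulette_wheel_selection(population, fitnesses)
            p2 = roulette_wheel_selection(population, fitnesses)
            if random.random() < crossover_rate:      # crossover rate acts here
                c1, c2 = one_point_crossover(p1, p2)
            else:
                c1, c2 = p1[:], p2[:]
            offspring.append(mutate(c1, mutation_rate))  # mutation rate acts here
            offspring.append(mutate(c2, mutation_rate))
        population = offspring[:population_size]
    return max(population, key=fitness_fn)
```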



5.5 Fitness Evaluation Function

The fitness evaluation unit in a GA acts as an interface between the GA and the optimisation problem. The GA assesses solutions for their quality according to the information produced by this unit and not by directly using information about their structure. In engineering design problems, functional requirements are specified to the designer, who has to produce a structure which performs the desired functions within predetermined constraints. The quality of a proposed solution is usually calculated depending on how well the solution performs the desired functions and satisfies the given constraints. In the case of a GA, this calculation must be automatic, and the problem is how to devise a procedure which computes the quality of solutions.

Fitness evaluation functions might be complex or simple depending on the optimisation problem at hand. Where a mathematical equation cannot be formulated for this task, a rule-based procedure can be constructed for use as a fitness function, or in some cases both can be combined. Where some constraints are very important and cannot be violated, the structures or solutions which do so can be eliminated in advance by appropriately designing the representation scheme. Alternatively, they can be given low probabilities by using special penalty functions.
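For instance, a penalty-based fitness function might look like the hypothetical sketch below; `measure_performance` and `count_violations` stand in for problem-specific routines not given in the text, and the penalty weight is an assumption:

```python
PENALTY_WEIGHT = 100.0   # illustrative weight of the penalty term

def evaluate_fitness(solution):
    """Reward how well the solution performs the desired functions and
    penalise it for every constraint it violates."""
    quality = measure_performance(solution)    # hypothetical helper
    violations = count_violations(solution)    # hypothetical helper
    return quality - PENALTY_WEIGHT * violations
```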

For further information on genetic algorithms, see [Holland, 1975; Goldberg, 1989; Davis, 1991; Mitchell, 1996; Pham and Karaboga, 2000; Freitas, 2002].

Genetic algorithms have found applications in engineering problems involving complex combinatorial or multi-parameter optimisation. Some recent examples of those applications are:
• configuring transmission systems [Pham and Yang, 1993];
• designing the knowledge base of fuzzy logic controllers [Pham and Karaboga, 1994];
• generating hardware description language programs for high-level specification of the function of programmable logic devices [Seals and Whapshott, 1994];
• planning collision-free paths for mobile and redundant robots [Ashiru et al., 1995; Wilde and Shellwat, 1997; Nearchou and Aspragathos, 1997];
• scheduling the operations of a job shop [Cho et al., 1996; Drake and Choudhry, 1997; Lee et al., 1997; Chryssolouris and Subramaniam, 2001; Pérez et al., 2003];
• generating dynamic schedules for the operation and control of a flexible manufacturing cell [Jawahar et al., 1998];
• optimising the performance of an industrially designed inventory control system [Disney, 2000];
• forming manufacturing cells and determining machine layout information for cellular manufacturing [Wu et al., 2002];
• optimising assembly process plans to improve productivity [Li et al., 2003];
• improving the convergence speed and reducing the computational complexity of neural networks [Öztürk and Öztürk, 2004].


6. SOME APPLICATIONS IN ENGINEERING AND MANUFACTURE

This section briefly reviews five engineering applications of the aforementioned soft computing tools.

6.1 Expert Statistical Process Control

Statistical process control (SPC) is a technique for improving the quality of processes and products through closely monitoring data collected from those processes and products and using statistically-based tools such as control charts.

XPC is an expert system for facilitating and enhancing the implementation of statistical process control [Pham and Oztemel, 1996]. A commercially available shell was employed to build XPC. The shell allows a hybrid rule-based and pseudo object-oriented method of representing the standard SPC knowledge and process-specific diagnostic knowledge embedded in XPC. The amount of knowledge involved is extensive, which justifies the adoption of a knowledge-based systems approach.

XPC comprises four main modules. The construction module is used to set up a control chart. The capability analysis module is for calculating process capability indices. The on-line interpretation and diagnosis module assesses whether the process is in control and determines the causes of possible out-of-control situations. It also provides advice on how to remedy such situations. The modification module updates the parameters of a control chart to maintain true control over a time-varying process. XPC has been applied to the control of temperature in an injection moulding machine producing rubber seals. It has recently been enhanced by integrating a neural network module with the expert system modules to detect abnormal patterns in the control chart (see Figure 11).

Figure 11. XPC output screen (range and mean control charts with their control limits, an out-of-control warning, and control chart pattern classification results)

6.2 Fuzzy Modelling of a Vibratory Sensor for Part Location

Figure 12 shows a six-degree-of-freedom vibratory sensor for determining the coordinates of the centre of mass (xG, yG) and orientation θ of bulky rigid parts. The sensor is designed to enable a robot to pick up parts accurately for machine feeding or assembly tasks. The sensor consists of a rigid platform (P) mounted on a flexible column (C). The platform supports one object (O) to be located at a time. O is held firmly with respect to P. The static deflections of C under the weight of O and the natural frequencies of vibration of the dynamic system comprising O, P and C are measured and processed using a mathematical model of the system to determine xG, yG and θ for O. In practice, the frequency measurements have low repeatability, which leads to inconsistent location information. The problem worsens when θ is in the region 80°–90° relative to a reference axis of the sensor, because the mathematical model becomes ill-conditioned.

Figure 12. Schematic diagram of a vibratory sensor mounted on a robot wrist (platform P carrying object O on a flexible column C attached to the end of the robot arm)

In this "ill-conditioning" region, an alternative to using a mathematical model to compute θ is to adopt an experimentally derived fuzzy model. Such a fuzzy model has to be obtained for each specific object through calibration. A possible calibration procedure involves placing the object at different positions (xG, yG) and orientations θ and recording the periods of vibration T of the sensor. Following calibration, fuzzy rules relating xG, yG and T to θ could be constructed to form a fuzzy model of the behaviour of the sensor for the given object. A simpler fuzzy model is achieved by observing that xG and yG only affect the reference level of T and that, if xG and yG are employed to define that level, the trend in the relationship between T and θ is the same regardless of the position of the object. Thus, a simplified fuzzy model of the sensor consists of rules such as "IF (T − Tref) is small THEN (θ − θref) is small", where Tref is the value of T when the object is at position (xG, yG) and orientation θref. θref could be chosen as 80°, the point at which the fuzzy model is to replace the mathematical model. Tref could be either measured experimentally or computed from the mathematical model. To counteract the effects of the poor repeatability of period measurements, which are particularly noticeable in the "ill-conditioning" region, the fuzzy rules are modified so that they take into account the variance in T. An example of a modified fuzzy rule is:

"IF (T − Tref) is small AND σT is small, THEN (θ − θref) is small."

In the above rule, σT denotes the standard deviation in the measurement of T. Fuzzy modelling of the vibratory sensor is detailed in Pham and Hafeez (1992).

Using a fuzzy model, the orientation θ can be determined to ±2° accuracy in the region 80°–90°. The adoption of fuzzy logic in this application has produced a compact and transparent model from a large amount of noisy experimental data.
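As an illustration only (the membership shapes, their widths and the use of min as the AND operator are assumptions, not details given by the authors), the modified rule could be evaluated as:

```python
def is_small(value, width):
    """Triangular membership function for 'small', peaking at zero."""
    return max(0.0, 1.0 - abs(value) / width)

def modified_rule_strength(T, T_ref, sigma_T, width_T=0.05, width_sigma=0.01):
    """Degree to which 'IF (T - T_ref) is small AND sigma_T is small' holds,
    i.e. the support for concluding that (theta - theta_ref) is small."""
    return min(is_small(T - T_ref, width_T), is_small(sigma_T, width_sigma))
```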

6.3 Induction of Feature Recognition Rules in a Geometric Reasoning System for Analysing 3D Assembly Models

Pham et al. (1999) have described a concurrent engineering approach involving generating assembly strategies for a product directly from its 3D CAD model.

Page 36: Computational Intelligence: for Engineering and Manufacturing

28 CHAPTER 1

A feature-based CAD system is used to create assembly models of products. A geometric reasoning module extracts assembly-oriented data for a product from the CAD system after creating a virtual assembly tree that identifies the components and sub-assemblies making up the given product (Figure 13a). The assembly information extracted by the module includes: placement constraints and dimensions used to specify the relevant position of a given component or sub-assembly; geometric entities (edges, surfaces, etc.) used to constrain the component or sub-assembly; and the parents and children of each entity employed as a placement constraint. An example of the information extracted is shown in Figure 13b.

Feature recognition is applied to the extracted information to identify each feature used to constrain a component or sub-assembly. The rule-based feature recognition process has three possible outcomes:
1. The feature is recognised as belonging to a unique class.
2. The feature shares attributes with more than one class (see Figure 13c).
3. The feature does not belong to any known class.

Cases 2 and 3 require the user to decide the correct class of the feature and the rule base to be updated. The updating is implemented via a rule induction program. The program employs RULES-3 Plus, which automatically extracts new feature recognition rules from examples provided to it in the form of characteristic vectors representing different features and their respective class labels.

Rule induction is very suitable for this application because of the complexity of the characteristic vectors and the difficulty of defining feature classes manually.

Figure 13a. An assembly model

Page 37: Computational Intelligence: for Engineering and Manufacturing

SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE 29

Bolt:
• Child of Block
• Placement constraints:
  1: alignment of two axes
  2: mating of the bottom surface of the bolt head and the upper surface of the block
• No child part in the assembly hierarchy

Block:
• No parents
• No constraints (root component)
• Next part in the assembly: Bolt

Figure 13b. An example of assembly information

Figure 13c. An example of feature recognition: a new form feature whose detected similar feature classes are Rectangular Non-through Slot (BSL_1) and Partial Round Non-through Slot (BSL_2)

6.4 Neural-network-based Automotive Product Inspection

Figure 14 depicts an intelligent inspection system for engine valve stem seals [Pham and Oztemel, 1996]. The system comprises four CCD cameras connected to a computer that implements neural-network-based algorithms for detecting and classifying defects in the seal lips. Faults on the lip aperture are classified by a multilayer perceptron. The inputs to the network are a 20-component vector, where the value of each component is the number of times a particular geometric feature is found on the aperture being inspected. The outputs of the network indicate the type of defect on the seal lip aperture. A similar neural network is used to classify defects on the seal lip surface. The accuracy of defect classification in both perimeter and surface inspection is in excess of 80%. Note that this figure is not the same as that for the accuracy in detecting defective seals, that is, differentiating between good and defective seals. The latter task is also implemented using a neural network, which achieves an accuracy of almost 100%. Neural networks are necessary for this application because of the difficulty of describing precisely the various types of defects and the differences between good and defective seals. The neural networks are able to learn the classification task automatically from examples.

Figure 14. Valve stem seal inspection system (bowl feeder, indexing machine, four 512 x 512 resolution CCD cameras with a lighting ring, and a host PC that routes seals to good, rework or reject chutes)

6.5 GA-based Conceptual Design

TRADES is a system using GA techniques to produce conceptual designs of transmission units [Pham and Yang, 1993]. The system has a set of basic building blocks, such as gear pairs, belt drives and mechanical linkages, and generates conceptual designs to satisfy given specifications by assembling the building blocks into different configurations. The crossover, mutation and inversion operators of the GA are employed to create new configurations from an existing population of configurations. Configurations are evaluated for their compliance with the design specifications. Potential solutions should provide the required speed reduction ratio and motion transformation while not containing incompatible building blocks or exceeding specified limits on the number of building blocks to be adopted. A fitness function codifies the degree of compliance of each configuration. The maximum fitness value is assigned to configurations that satisfy all functional requirements without violating any constraints. As in a standard GA, information concerning the fitness of solutions is employed to select solutions for reproduction, thus guiding the process towards increasingly fitter designs as the population evolves. In addition to the usual GA operators, TRADES incorporates new operators to avert premature convergence to non-optimal solutions and facilitate the generation of a variety of design concepts. Essentially, these operators reduce the chances of any one configuration or family of configurations dominating the solution population by avoiding crowding around very fit configurations and preventing multiple copies of a configuration, particularly after it has been identified as a potential solution.

TRADES is able to produce design concepts from building blocks without requiring much additional a priori knowledge. The manipulation of the building blocks to generate new concepts is carried out by the GA in a stochastic but guided manner. This enables good conceptual designs to be found without the need to search the design space exhaustively.

Due to the very large size of the design space and the quasi-random operation of the GA, novel solutions not immediately evident to a human designer are sometimes generated by TRADES. On the other hand, impractical configurations could also arise. TRADES incorporates a number of heuristics to filter out such design proposals.

7. CONCLUSION

Over the past fifty years, the field of soft computing has produced a number of powerful tools. This chapter has reviewed five of those tools, namely knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic algorithms. Applications of the tools in engineering and manufacture have become more widespread due to the power and affordability of present-day computers. It is anticipated that many new applications will emerge and that, for demanding tasks, greater use will be made of hybrid tools combining the strengths of two or more of the tools reviewed here [Michalski and Tecuci, 1994; Medsker, 1995]. Other technological developments in soft computing that will have an impact in engineering include data mining, or the extraction of information and knowledge from large databases [Limb and Meggs, 1994; Witten and Frank, 2000; Braha, 2001; Han and Kamber, 2001; Pham and Afify, 2002; Klösgen and Zytkow, 2002; Giudici, 2003], and multi-agent systems, or distributed self-organising systems employing entities that function autonomously in an unpredictable environment concurrently with other entities and processes [Wooldridge and Jennings, 1994; Rzevski, 1995; Márkus et al., 1996; Tharumarajah et al., 1996; Bento and Feijó, 1997; Monostori, 2002]. The appropriate deployment of these new soft computing tools and of the tools presented in this chapter will contribute to the creation of more competitive engineering systems.


8. ACKNOWLEDGEMENTS

This work was carried out within the ALFA project "Novel Intelligent Automation and Control Systems II" (NIACS II), the ERDF (Objective One) projects "Innovation in Manufacturing Centre" (IMC), "Innovative Technologies for Effective Enterprises" (ITEE) and "Supporting Innovative Product Engineering and Responsive Manufacturing" (SUPERMAN), and within the project "Innovative Production Machines and Systems" (I*PROMS).

REFERENCES

Akyol D E, (2004), "Application of neural networks to heuristic scheduling algorithms", Computers Ind. Engng, 46, 679–696.
Ashiru I, Czanecki C and Routen T, (1995), "Intelligent operators and optimal genetic-based path planning for mobile robots", Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 1018–1023.
Badiru A B and Cheung J Y, (2002), Fuzzy Engineering Expert Systems with Neural Network Applications, John Wiley & Sons, New York.
Baker J E, (1985), "Adaptive selection methods for genetic algorithms", Proc. 1st Int. Conf. on Genetic Algorithms and Their Applications, Pittsburgh, PA, 101–111.
Baldwin J F and Karale S B, (2003), "New concepts for fuzzy partitioning, defuzzification and derivation of probabilistic fuzzy decision trees", Proc. 22nd Int. Conf. of the North American Fuzzy Information Processing Society (NAFIPS-03), Chicago, Illinois, USA, 484–487.
Baldwin J F and Martin T P, (2001), "Towards inductive support logic programming", Proc. Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, Canada, 4, 1875–1880.
Bas K and Erkmen A M, (1995), "Fuzzy preshape and reshape control of Anthrobot-III 5-fingered robot hand", Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 673–677.
Bento J and Feijó B, (1997), "An agent-based paradigm for building intelligent CAD systems", Artificial Intelligence in Engineering, 11 (3), 231–244.
Beskese A, Kahraman C and Irani Z, (2004), "Quantification of flexibility in advanced manufacturing systems using fuzzy concepts", Int. J. Production Economics, 89 (1), 45–56.
Bigand A, Goureau P and Kalemkarian J, (1994), "Fuzzy control of a welding process", Proc. IMACS Int. Symp. on Signal Processing, Robotics and Neural Networks (SPRANN 94), Villeneuve d'Ascq, France, 379–342.
Blickle T and Thiele L, (1995), "A comparison of selection schemes used in genetic algorithms", Computer Engineering and Communication Networks Lab (TIK) Report, No. 11, Version 1.1b, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.
Bose A, Gini M and Riley D, (1997), "A case-based approach to planar linkage design", Artificial Intelligence in Engineering, 11 (2), 107–119.
Bozdag C E, Kahraman C and Ruan D, (2003), "Fuzzy group decision making for selection among computer integrated manufacturing systems", Computers in Industry, 15 (1), 13–29.
Braha D, (2001), Data Mining for Design and Manufacturing: Methods and Applications, Kluwer Academic Publishers, Boston.
Breiman L, Friedman J H, Olshen R A and Stone C J, (1984), Classification and Regression Trees, Wadsworth, Belmont.
Cervone G, Panait L A and Michalski R S, (2001), "The development of the AQ20 learning system and initial experiments", Proc. 10th Inter. Symposium on Intelligent Information Systems, Poland.
Chen J C and Black J T, (1997), "A fuzzy-nets in-process (FNIP) system for tool-breakage monitoring in end-milling operations", Int. J. Machine Tools Manufacturing, 37 (6), 783–800.
Cho B J, Hong S C and Okoma S, (1996), "Job shop scheduling using genetic algorithm", Proc. 3rd World Congress on Expert Systems, Seoul, Korea, 351–358.


Chryssolouris G and Subramaniam V, (2001), "Dynamic scheduling of manufacturing job shops using genetic algorithms", J. Intelligent Manufacturing, 12, 281–293.
Costa Branco P J and Dente J A, (1998), "An experiment in automatic modelling an electrical drive system using fuzzy logic", IEEE Trans on Systems, Man, and Cybernetics, 28 (2), 254–262.
Da Rocha Fernandes A M and Cid Bastos R, (1996), "Fuzzy expert systems for qualitative analysis of minerals", Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 673–680.
Darlington K W, (1999), The Essence of Expert Systems, Prentice Hall.
Davis L, (1991), Handbook of Genetic Algorithms, Van Nostrand, New York, NY.
De La Sen M, Miñambres J J, Garrido A J, Almansa A and Soto J C, (2004), "Basic theoretical results for expert systems: Application to the supervision of adaptation transients in planar robots", Artificial Intelligence, 152 (2), 173–211.
Disney S M, Naim M M and Towill D R, (2000), "Genetic algorithm optimisation of a class of inventory control systems", Inter. J. Production Economics, 68, 259–278.
Drake P R and Choudhry I A, (1997), "From apes to schedules", Manufacturing Engineer, 76 (1), 43–45.
Dubois D and Prade H, (1998), "An introduction to fuzzy systems", Clinica Chimica Acta, 270 (1), 3–29.
Duch W, Setiono R and Zurada J M, (2004), "Computational intelligence methods for rule-based data understanding", Proc. IEEE, 92 (5), 771–805.
Durkin J, (1994), Expert Systems Design and Development, Macmillan, New York.
Evans B and Fisher D, (2002), "Using decision tree induction to minimize process delays in printing industry", In: Handbook of Data Mining and Knowledge Discovery (W. Klösgen and J. M. Zytkow (Eds.)), Oxford University Press.
Ferreiro Garcia R, (1994), "FAM rule as basis for poles shifting applied to the roll control of an aircraft", SPRANN 94 (ibid), 375–378.
Fogarty T C, (1989), "Varying the probability of mutation in the genetic algorithm", Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 104–109.
Freitas A A, (2002), Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer-Verlag, Berlin, New York.
Giarratano J C and Riley G D, (1998), Expert Systems: Principles and Programming, 3rd Edition, PWS Publishing Company, Boston, MA.
Giudici P, (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons, England.
Goldberg D E, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Addison-Wesley, Reading, MA.
Grefenstette J J, (1986), "Optimization of control parameters for genetic algorithms", IEEE Trans on Systems, Man and Cybernetics, 16 (1), 122–128.
Han J and Kamber M, (2001), Data Mining: Concepts and Techniques, Academic Press, USA.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI.
Hong T P and Chen J B, (2000), "Processing individual fuzzy attributes for fuzzy rule induction", Fuzzy Sets and Systems, 112 (1), 127–140.
Hui P C L, Chan K C K and Yeung K W, (1997), "Modelling job complexity in garment manufacturing by inductive learning", Inter. J. Clothing Science and Technology, 9 (1), 34–44.
Ip C Y, Regli W C, Sieger L and Shokoufandeh A, (2003), "Automated learning of model classification", Proc. 8th ACM Symposium on Solid Modeling and Applications, Seattle, Washington, USA, ACM Press, 322–327.
ISL, (1998), Clementine Data Mining Package, SPSS UK Ltd., 1st Floor, St. Andrew's House, West Street, Woking, Surrey GU21 1EB, United Kingdom.
Jackson P, (1999), Introduction to Expert Systems, 3rd Edition, Addison-Wesley, Harlow, Essex.


Jambunathan K, Fontama V N, Hartle S L and Ashforth-Frost S, (1997), "Using ART 2 networks to deduce flow velocities", Artificial Intelligence in Engineering, 11 (2), 135–141.
Janikow C Z, (1998), "Fuzzy decision trees: Issues and methods", IEEE Trans on Systems, Man, and Cybernetics, 28 (1), 1–14.
Jawahar N, Aravindan P, Ponnambalam S G and Karthikeyan A A, (1998), "A genetic algorithm-based scheduler for setup-constrained FMC", Computers in Industry, 35, 291–310.
Jiang Y, Zhou Z-H and Chen Z-Q, (2002), "Rule learning based on neural network ensemble", Proc. Inter. Joint Conf. on Neural Networks, Honolulu, HI, 1416–1420.
Kalogirou S A, (2003), "Artificial intelligence for modelling and control of combustion processes: A review", Progress in Energy and Combustion Science, 29 (6), 515–566.
Kamrani A K, Shashikumar S and Patel S, (1995), "An intelligent knowledge-based system for robotic cell design", Computers Ind. Engng, 29 (1–4), 141–145.
Karsak E E, (2004), "Fuzzy multiple objective programming framework to prioritize design requirements in quality function deployment", Computers Ind. Engng, (Submitted and accepted).
Karsak E E and Kuzgunkaya O, (2002), "A fuzzy multiple objective programming approach for the selection of a flexible manufacturing system", Int. J. Production Economics, 79 (2), 101–111.
Kaufmann A, (1975), Introduction to the Theory of Fuzzy Subsets, Vol. 1, Academic Press, New York.
Kim C-O, Min Y-D and Yih Y, (1998), "Integration of inductive learning and neural networks for multi-objective FMS scheduling", Inter. J. Production Research, 36 (9), 2497–2509.
Klir G J and Yuan B, (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall, Upper Saddle River, NJ.
Klir G J and Yuan B, (Eds.), (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems – Selected Papers by L A Zadeh, World Scientific, Singapore.
Klösgen W and Zytkow J M, (2002), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York.
Kong F, Yu J and Zhou X, (1999), "Analysis of fuzzy dynamic characteristics of machine cutting process: Fuzzy stability analysis in regenerative-type-chatter", Int. J. Machine Tools and Manufacture, 39 (8), 1299–1309.
Koo D Y and Han S H, (1996), "Application of the configuration design methods to a design expert system for paper feeding mechanism", Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 49–56.
Kostov A, Andrews B, Stein R B, Popovic D and Armstrong W W, (1995), "Machine learning in control of functional electrical stimulation systems for locomotion", IEEE Trans on Biomedical Engineering, 44 (6), 541–551.
Kulak O and Kahraman C, (2004), "Multi-attribute comparison of advanced manufacturing systems using fuzzy vs. crisp axiomatic design approach", Int. J. Production Economics, (Submitted and accepted).
Lara Rosano F, Kemper Valverde N, De La Paz Alva C and Alcántara Zavala J, (1996), "Tutorial expert system for the design of energy cogeneration plants", Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 300–305.
Lavrac N and Dzeroski S, (1994), Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York.
Lee C-Y, Piramuthu S and Tsai Y-K, (1997), "Job shop scheduling with a genetic algorithm and machine learning", Inter. J. Production Research, 35 (4), 1171–1191.
Li J R, Khoo L P and Tor S B, (2003), "A Tabu-enhanced genetic algorithm approach for assembly process planning", J. Intelligent Manufacturing, 14, 197–208.
Limb P R and Meggs G J, (1994), "Data mining tools and techniques", British Telecom Technology Journal, 12 (4), 32–41.
Lin Z-C and Chang D-Y, (1996), "Application of a neural network machine learning model in the selection system of sheet metal bending tooling", Artificial Intelligence in Engineering, 10, 21–37.
Lou H H and Huang Y L, (2003), "Hierarchical decision making for proactive quality control: System development for defect reduction in automotive coating operations", Engineering Applications of Artificial Intelligence, 16, 237–250.
Luzeaux D, (1994), "Process control and machine learning: rule-based incremental control", IEEE Trans on Automatic Control, 39 (6), 1166–1171.


Mahfoud S W, (1995), Niching Methods for Genetic Algorithms, Ph.D. Thesis, Department of General Engineering, University of Illinois at Urbana-Champaign.
Majors M D and Richards R J, (1995), "A topologically-evolving neural network for robotic flexible assembly control", Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, August, 894–899.
Markham I S, Mathieu R G and Wray B A, (2000), "Kanban setting through artificial intelligence: A comparative study of artificial neural networks and decision trees", Integrated Manufacturing Systems, 11 (4), 239–246.
Márkus A, Kis T, Váncza J and Monostori L, (1996), "A market approach to holonic manufacturing", CIRP Annals, 45 (1), 433–436.
Mathieu R G, Wray B A and Markham I S, (2002), "An approach to learning from both good and poor factory performance in a kanban-based just-in-time production system", Production Planning & Control, 13 (8), 715–724.
Medsker L R, (1995), Hybrid Intelligent Systems, Kluwer Academic Publishers, Boston, 298 pp.
Michalewicz Z, (1996), Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition, Springer-Verlag, Berlin.
Michalski R S, (1990), "A theory and methodology of inductive learning", in Readings in Machine Learning, Eds. Shavlik J W and Dietterich T G, Kaufmann, San Mateo, CA, 70–95.
Michalski R S and Kaufman K A, (2001), "The AQ19 system for machine learning and pattern discovery: A general description and user guide", Reports of the Machine Learning and Inference Laboratory, MLI 01-2, George Mason University, Fairfax, VA, USA.
Michalski R S and Larson J B, (1983), "Incremental generation of VL1 hypotheses: The underlying methodology and the descriptions of program AQ11", ISG 83-5, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Michalski R S, Mozetic I, Hong J and Lavrac N, (1986), "The multi-purpose incremental learning system AQ15 and its testing application to three medical domains", American Association of Artificial Intelligence, Los Altos, CA, Morgan Kaufmann, 1041–1045.
Michalski R and Tecuci G, (1994), Machine Learning: A Multistrategy Approach, 4, Morgan Kaufmann Publishers, San Francisco, CA, USA.
Michie D, Spiegelhalter D J and Taylor C C, (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York.
Mitchell M, (1996), An Introduction to Genetic Algorithms, MIT Press.
Monostori L, (2002), "AI and machine learning techniques for managing complexity, changes and uncertainties in manufacturing", Proc. 15th Triennial World Congress, Barcelona, Spain, 119–130.
Muggleton S (Ed.), (1992), Inductive Logic Programming, Academic Press, London, 565 pp.
Muggleton S and Feng C, (1990), "Efficient induction of logic programs", Proc. 1st Conf. on Algorithmic Learning Theory, Tokyo, Japan, 368–381.
Nearchou A C and Aspragathos N A, (1997), "A genetic path planning algorithm for redundant articulated robots", Robotica, 15 (2), 213–224.
Nurminen J K, Karonen O and Hatonen K, (2003), "What makes expert systems survive over 10 years – empirical evaluation of several engineering applications", Expert Systems with Applications, 24 (2), 199–211.
Ong S K, De Vin L J, Nee A Y C and Kals H J J, (1997), "Fuzzy set theory applied to bend sequencing for sheet metal bending", Int. J. Materials Processing Technology, 69, 29–36.
Öztürk N and Öztürk F, (2004), "Hybrid neural network and genetic algorithm based machining feature recognition", J. Intelligent Manufacturing, 15, 278–298.
Park M-W, Rho H-M and Park B-T, (1996), "Generation of modified cutting conditions using neural networks for an operation planning system", Annals of the CIRP, 45 (1), 475–478.
Peers S M C, Tang M X and Dharmavasan S, (1994), "A knowledge-based scheduling system for offshore structure inspection", Artificial Intelligence in Engineering IX (AIEng 9), Eds. Rzevski G, Adey R A and Russell D W, Computational Mechanics, Southampton, 181–188.
Peng Y, (2004), "Intelligent condition monitoring using fuzzy inductive learning", J. Intelligent Manufacturing, 15 (3), 373–380.


Pérez E, Herrera F and Hernández C, (2003), "Finding multiple solutions in job shop scheduling by niching genetic algorithms", J. Intelligent Manufacturing, 14, 323–339.
Pham D T and Afify A A, (2002), "Machine learning: Techniques and trends", Proc. 9th Inter. Workshop on Systems, Signals and Image Processing (IWSSIP-02), Manchester Town Hall, UK, World Scientific, 12–36.
Pham D T and Afify A A, (2005a), "RULES-6: A simple rule induction algorithm for handling large data sets", Proc. of the Institution of Mechanical Engineers, Part C, 219 (10), 1119–1137.
Pham D T and Afify A A, (2005b), "Machine learning techniques and their applications in manufacturing", Proc. of the Institution of Mechanical Engineers, Part B, 219 (5), 395–412.
Pham D T, Afify A A and Dimov S S, (2002), "Machine learning in manufacturing", Proc. 3rd CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME 2002), Ischia, Italy, III–XII.
Pham D T and Aksoy M S, (1994), "An algorithm for automatic rule induction", Artificial Intelligence in Engineering, 8, 277–282.
Pham D T and Aksoy M S, (1995a), "RULES: A rule extraction system", Expert Systems with Applications, 8, 59–65.
Pham D T and Aksoy M S, (1995b), "A new algorithm for inductive learning", Journal of Systems Engineering, 5, 115–122.
Pham D T, Bigot S and Dimov S S, (2003), "RULES-5: A rule induction algorithm for problems involving continuous attributes", Proc. of the Institution of Mechanical Engineers, 217 (Part C), 1273–1286.
Pham D T and Dimov S S, (1997), "An efficient algorithm for automatic knowledge acquisition", Pattern Recognition, 30 (7), 1137–1143.
Pham D T, Dimov S S and Salem Z, (2000), "Technique for selecting examples in inductive learning", ESIT 2000 European Symposium on Intelligent Techniques, Erudit, Aachen, Germany, 119–127.
Pham D T, Dimov S S and Setchi R M, (1999), "Concurrent engineering: a tool for collaborative working", Human Systems Management, 18, 213–224.
Pham D T and Hafeez K, (1992), "Fuzzy qualitative model of a robot sensor for locating three-dimensional objects", Robotica, 10, 555–562.
Pham D T and Karaboga D, (1994), "Some variable mutation rate strategies for genetic algorithms", SPRANN 94 (ibid), 73–96.
Pham D T and Karaboga D, (2000), Intelligent Optimisation Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks, Springer-Verlag, London, Berlin and Heidelberg, 2nd printing, 302 pp.
Pham D T and Liu X, (1999), Neural Networks for Identification, Prediction and Control, Springer-Verlag, London, Berlin and Heidelberg, 4th printing, 238 pp.
Pham D T and Oztemel E, (1996), Intelligent Quality Systems, Springer-Verlag, London, Berlin and Heidelberg, 201 pp.
Pham D T, Packianather M S, Dimov S, Soroka A J, Girard T, Bigot S and Salem Z, (2004), "An application of data mining and machine learning techniques in the metal industry", Proc. 4th CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME-04), Sorrento (Naples), Italy.
Pham D T and Pham P T N, (1988), "Expert systems in mechanical and manufacturing engineering", Int. J. Adv. Manufacturing Technology, Special Issue on Knowledge Based Systems, 3 (3), 3–21.
Pham D T and Yang Y, (1993), "A genetic algorithm based preliminary design system", Proc. IMechE, Part D: J. Automobile Engineering, 207, 127–133.
Price C J, (1990), Knowledge Engineering Toolkits, Ellis Horwood, Chichester.
Priore P, Fuente D, Pino R and Puente J, (2003), "Dynamic scheduling of manufacturing systems using neural networks and inductive learning", Integrated Manufacturing Systems, 14 (2), 160–168.
Quinlan J R, (1983), "Learning efficient classification procedures and their application to chess endgames", In: Machine Learning: An Artificial Intelligence Approach (Michalski R S, Carbonell J G and Mitchell T M (Eds.)), I, Tioga Publishing Co., 463–482.
Quinlan J R, (1986), "Induction of decision trees", Machine Learning, 1, 81–106.


Quinlan J R, (1990), "Learning logical definitions from relations", Machine Learning, 5, 239–266.
Quinlan J R, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Quinlan J R and Cameron-Jones R M, (1995), "Induction of logic programs: FOIL and related systems", New Generation Computing, 13, 287–312.
Ross T J, (1995), Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.
RuleQuest, (2001), Data Mining Tools C5.0, RuleQuest Research Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia. Available from: http://www.rulequest.com/see5-info.html.
Rzevski G, (1995), "Artificial intelligence in engineering: past, present and future", Artificial Intelligence in Engineering X, Eds. Rzevski G, Adey R A and Tasso C, Computational Mechanics, Southampton, 3–16.
Schaffer J D, Caruana R A, Eshelman L J and Das R, (1989), "A study of control parameters affecting on-line performance of genetic algorithms for function optimisation", Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 51–61.
Schultz G, Fichtner D, Nestler A and Hoffmann J, (1997), "An intelligent tool for determination of cutting values based on neural networks", Proc. 2nd World Congress on Intelligent Manufacturing Processes and Systems, Budapest, Hungary, 66–71.
Seals R C and Whapshott G F, (1994), "Design of HDL programmes for digital systems using genetic algorithms", AI Eng 9 (ibid), 331–338.
Shi Z Z, Zhou H and Wang J, (1997), "Applying case-based reasoning to engine oil design", Artificial Intelligence in Engineering, 11 (2), 167–172.
Shigaki I and Narazaki H, (1999), "A machine-learning approach for a sintering process using a neural network", Production Planning & Control, 10 (8), 727–734.
Shin C K and Park S C, (2000), "A machine learning approach to yield management in semiconductor manufacturing", Inter. J. Production Research, 38 (17), 4261–4271.
Skibniewski M, Arciszewski T and Lueprasert K, (1997), "Constructability analysis: machine learning approach", ASCE J. of Computing in Civil Engineering, 12 (1), 8–16.
Smith J E and Fogarty T C, (1997), "Operator and parameter adaptation in genetic algorithms", Soft Computing, 1 (2), 81–87.
Smith P, MacIntyre J and Husein S, (1996), "The application of neural networks in the power industry", Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 321–326.
Sohen S Y and Choi I S, (2001), "Fuzzy QFD for supply chain management with reliability consideration", Reliability Eng. and Systems Safety, 72, 327–334.
Streichfuss M and Burgwinkel P, (1995), "An expert-system-based machine monitoring and maintenance management system", Control Eng. Practice, 3 (7), 1023–1027.
Szwarc D, Rajamani D and Bector C R, (1997), "Cell formation considering fuzzy demand and machine capacity", Int. J. Advanced Manufacturing Technology, 13 (2), 134–147.
Tarng Y S, Tseng C M and Chung L K, (1997), "A fuzzy pulse discriminating system for electrical discharge machining", Int. J. Machine Tools and Manufacture, 37 (4), 511–522.
Teti R and Caprino G, (1994), "Prediction of composite laminate residual strength based on a neural network approach", AI Eng 9 (ibid), 81–88.
Tharumarajah A, Wells A J and Nemes L, (1996), "Comparison of the bionic, fractal and holonic manufacturing system concepts", Int. J. Computer Integrated Manufacturing, 9 (3), 217–226.
Vanegas L V and Labib A W, (2001), "A fuzzy quality function deployment (FQFD) model for deriving optimum targets", Int. J. Production Research, 39 (1), 99–120.
Venkatachalam A R, (1994), "Automating manufacturability evaluation in CAD systems through expert systems approaches", Expert Systems with Applications, 7 (4), 495–506.
Wang C-H, Tsai C-J, Hong T-P and Tseng S-S, (2003), "Fuzzy inductive learning strategies", Applied Intelligence, 18 (2), 179–193.
Wang L X and Mendel M, (1992), "Generating fuzzy rules by learning from examples", IEEE Trans on Systems, Man and Cybernetics, 22 (6), 1414–1427.
Wang W P, Peng Y H and Li X Y, (2002), "Fuzzy-grey prediction of cutting force uncertainty in turning", J. Materials Processing Technology, 129, 663–666.


Wang X Z, Wang Y D, Xu X F, Ling W D and Yeung D S, (2001), "A new approach to fuzzy rule generation: Fuzzy extension matrix", Fuzzy Sets and Systems, 123 (3), 291–306.
Whitely D, (1989), "The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best", Proc. Third Int. Conf. on Genetic Algorithms and Their Applications, George Mason University, 116–123.
Wilde P and Shellwat H, (1997), "Implementation of a genetic algorithm for routing an autonomous robot", Robotica, 15 (2), 207–211.
Witten I H and Frank E, (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, USA.
Wooldridge M J and Jennings N R, (1994), "Agent theories, architectures and languages: a survey", Proc. ECAI 94 Workshop on Agent Theories, Architectures and Languages, Amsterdam, 1–32.
Wray B A, Rakes T R and Rees L, (1997), "Neural network identification of critical factors in dynamic just-in-time kanban environment", J. Intelligent Manufacturing, 8, 83–96.
Wu X, Chu C-H, Wang Y and Yan W, (2002), "A genetic algorithm for integrated cell formation and layout decisions", Proc. of the 2002 Congress on Evolutionary Computation (CEC-02), 2, 1866–1871.
Yano H, Akashi T, Matsuoka N, Nakanishi K, Takata O and Horinouchi N, (1997), "An expert system to assist automatic remeshing in rigid plastic analysis", Toyota Technical Review, 46 (2), 87–92.
Yao X, (1999), "Evolving artificial neural networks", Proceedings of the IEEE, 87 (9), 1423–1447.
Zadeh L A, (1965), "Fuzzy sets", Information Control, 8, 338–353.
Zha X F, Lim S Y E and Fok S C, (1998), "Integrated knowledge-based assembly sequence planning", Int. J. Adv. Manufacturing Technology, 12 (3), 211–237.
Zha X F, Lim S Y E and Fok S C, (1999), "Integrated knowledge-based approach and system for product design and assembly", Int. J. Computer Integrated Manufacturing, 14, 50–64.
Zhao Z Y and De Souza R, (1998), "On improving the performance of hard disk drive final assembly via knowledge intensive simulation", J. Electronics Manufacturing, 1, 23–25.
Zhao Z Y and De Souza R, (2001), "Fuzzy rule learning during simulation of manufacturing resources", Fuzzy Sets and Systems, 122, 469–485.
Zhou C, Nelson P C, Xiao W, Tirpak T M and Lane S A, (2001), "An intelligent data mining system for drop test analysis of electronic products", IEEE Trans on Electronics Packaging Manufacturing, 24 (3), 222–231.
Zimmermann H-J, (1996), Fuzzy Set Theory and its Applications, 3rd Edition, Kluwer Academic Publishers, Boston.
Zülal G and Arikan F, (2000), "Application of fuzzy decision making in part-machine grouping", Int. J. Production Economics, 63, 181–193.


CHAPTER 2

NEURAL NETWORKS HISTORICAL REVIEW

D. ANDINA 1, A. VEGA-CORONA 2, J. I. SEIJAS 3, J. TORRES-GARCÍA 1
1 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]
2 Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México. [email protected]
3 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]

Abstract: This chapter starts with a historical summary of the evolution of Neural Networks, from the first models, which were very limited in application capabilities, to the present ones, which make it possible to apply automatic processing to tasks that had formerly been reserved to human intelligence. After the historical review, Neural Networks are dealt with from a computational point of view. This perspective helps to compare Neural Systems with classical Computing Systems and leads to a formal and common presentation that will be used throughout the book

INTRODUCTION

Computers used nowadays can perform a great variety of tasks (provided they are well defined) at a higher speed and with more reliability than those reached by human beings. None of us will be, for example, able to solve complex mathematical equations at the speed that a personal computer will. Nevertheless, the mental capacity of human beings is still higher than that of machines in a wide variety of tasks.

No artificial system of image recognition is able to compete with the capacity of a human being to discern between objects of diverse forms and orientations; in fact, it would not even be able to compete with the capacity of an insect. In the same way, whereas a computer requires an enormous amount of computation and restrictive conditions to recognize, for example, phonemes, an adult human recognizes without effort words pronounced by different people, at different speeds, accents and intonations, even in the presence of environmental noise.

It is observed that, by means of rules learned from experience, the human being is much more effective than computers in the resolution of imprecise (ambiguous) problems, or problems that require a great amount of information. Our brain reaches these objectives by means of thousands of millions of simple cells, called neurons, which are interconnected with each other.

However, it is estimated that operational amplifiers and logic gates can perform operations several orders of magnitude faster than neurons. If the processing technique of biological elements were implemented with operational amplifiers and logic gates, one could construct machines that are relatively cheap and able to process at least as much information as a biological brain does. Of course, we are still far from knowing whether such machines will ever be constructed.

Therefore, there are strong reasons to consider the viability of tackling certain problems by means of parallel systems that process information and learn by means of principles taken from the brain systems of living beings. Such systems are called Artificial Neural Networks, connectionist models or distributed parallel processing models. Artificial Neural Networks (ANNs or, simply, NNs) thus arise from man's intention of simulating the biological brain system in an artificial way.

1. HISTORICAL PERSPECTIVE

The science of Artificial Neural Networks made its first significant appearance during the 1940s. Researchers who tried to emulate the functions of the human brain developed physical models (later, simulations by means of programs) of biological neurons and their interconnections. As neurobiologists deepened their knowledge of the human neural system, these first models came to be considered more and more rudimentary approximations. Nevertheless, some of the results obtained in those early times were impressive, which encouraged future research and the development of sophisticated and powerful Artificial Neural Networks.

1.1 First Computational Model of Nervous Activity: The Model of McCulloch and Pitts

McCulloch and Pitts published the first systematic studies of artificial neural networks [McCulloch and Pitts, 1943] [Pitts and McCulloch, 1947]. These studies appeared in terms of a computational model of the nervous activity of the cells of the human nervous system.

Most of their work focused on the behavior of a single neuron, whose mathematical or computational model is shown in Figure 1. Inside the artificial neuron, the sum of each input xi multiplied by a scale factor (or weight wi) is computed. The inputs emulate the excitations received by biological neurons. The weights represent the strength of the synaptic union: a positive weight represents an excitatory effect, and a negative weight an inhibitory effect. If the result of the sum is higher than a certain threshold value or bias (represented by the weight w0), the cell activates, providing a positive output (normally +1); in the opposite case, the output presents a negative value (usually −1) or zero. Therefore, it is a binary output.


Figure 1. Artificial model [McCulloch and Pitts, 1943] of a biological neuron. As can be observed, the relation between input and output follows a nonlinear function called the activation function. In this first model, the activation is a hard threshold function

In general, the model follows the neurobiological behavior: nervous cells produce nonlinear responses when excited by a certain input. In particular, McCulloch and Pitts proposed an activation function that represents the nonlinearity of the model, called the hard threshold function (see Figure 1).
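As a minimal illustration of this model (a sketch, not code from the original text; the function name and the example weights are assumptions), the hard-threshold neuron can be written as:

```python
def mcculloch_pitts_neuron(x, w, w0):
    """Hard-threshold neuron: fires (+1) when the weighted sum of the
    inputs plus the bias weight w0 exceeds zero; otherwise outputs -1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if z > 0 else -1

# Example: with weights [1, 1] and bias -1.5 the neuron acts as an AND gate:
# mcculloch_pitts_neuron([1, 1], [1, 1], -1.5) -> 1
# mcculloch_pitts_neuron([1, 0], [1, 1], -1.5) -> -1
```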

Although this first model can only perform very simple tasks, as will be described below, the potential of neural systems lies essentially in the interconnection between neurons to form networks. This interconnection is normally arranged in layers of nodes (artificial neurons). This kind of neural network is called a Multi-Layer Perceptron (MLP). In general, it is possible to speak of Feed Forward Neural Networks, in which information is always transmitted in the direction from input layer to output layer, and of Feedback Neural Networks, where information can be transmitted in both directions; that is, connections between nodes of higher layers and nodes of lower layers are allowed. Figure 2 shows a Feed Forward Neural Network of two layers: a hidden layer (located right after the input layer) and the output layer. The input layer is usually not considered to be properly a layer of the network. Each component of the input vector $\mathbf{x} = [1, x_1, \ldots, x_M]^T$ is connected to all the nodes of the first hidden layer. The strengths of these connections are determined by the weight associated with each one of them. When the same philosophy is applied to the rest of the network's layers, the network is said to be fully connected.
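To make the layered arrangement concrete, the sketch below (an illustrative assumption using NumPy and hard-threshold activations; it is not taken from the original text) computes the output of a fully connected two-layer feed-forward network:

```python
import numpy as np

def two_layer_forward(x, W1, b1, w2, b2):
    """Forward pass of a two-layer feed-forward network.

    x  : input vector of shape (M,)
    W1 : hidden-layer weights of shape (H, M); b1 : hidden biases (H,)
    w2 : output weights of shape (H,);         b2 : output bias (scalar)
    Hard-threshold activations are used in both layers.
    """
    hidden = np.where(W1 @ x + b1 > 0, 1, -1)   # hidden-layer outputs
    return 1 if w2 @ hidden + b2 > 0 else -1    # network output
```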

Trying to proceed chronologically, we will leave Multilayer Neural Networks (MLPs) for the moment. The first artificial neural network proposals were networks of a single layer, like the one shown in Figure 2 but without the output layer.

Figure 2. Two-layer Feed Forward Neural Network


This joint parallel disposition of the first model of the neuron (see Figure 1) was suggested in order to overcome the limitations of a neuron acting alone. It is easy to verify that the model of McCulloch and Pitts divides the input space into two parts by means of the hyperplane described by the equation

(1) $h(\mathbf{x}) = \sum_{j=1}^{M} w_j x_j + w_0 = 0$

This effect can be observed in Figure 3, which shows this hyperplane for the particular case of M = 2.

A simple neuron can solve two-class classification problems of M-dimensional data, assuming that the classes are linearly separable. That is, it can assign an output equal to 1 to all the data of class "A" (which fall on one side of the hyperplane), whereas it assigns a value equal to −1 to the rest of the data, which fall on the opposite side. Mathematically, we can express this classification as

(2) $\sum_{j=1}^{M} w_j x_j \underset{C_B}{\overset{C_A}{\gtrless}} -w_0$

where $C_A$ and $C_B$ denote class A and class B, respectively.

We now have a very simple model of neuron behavior, one that does not consider many of the biological characteristics it tries to emulate. For example, it does not consider the real delays present in all inter-neural transmissions (which have an important effect on the system dynamics), nor, more importantly, does it include effects of synchronism or frequency modulation, features considered crucial by many researchers.

Figure 3. The hyperplane determined by the McCulloch and Pitts neuron model for the case of two-dimensional inputs. The hyperplane depends on the neuron's parameters (weights $w_j$ and threshold value $w_0$) according to the expression $\sum_{j=1}^{M} w_j x_j + w_0 = 0$


In spite of their limitations, networks designed in this way have characteristics classically restricted to biological systems. Perhaps researchers have been able to capture the main operations of the biological neuron in this model, or perhaps the similarities in some applications are mere coincidence. Only the time needed to continue this research will settle this question.

1.2 Training of Neural Networks: Hebbian Learning

The equation of the hyperplane boundary that characterizes the operation of the artificial neuron depends on the synaptic weights $w_1, \ldots, w_M$ and on the threshold value $w_0$, which is normally treated as just another weight of the network. The remaining problem is how to choose, determine or search for the appropriate values of these weights to solve the problem at hand. This task is called learning or training of the network.

From a historical point of view, Hebbian learning is the oldest and one of the most studied learning procedures. Hebb [Hebb, 1961] proposed a learning model that has given rise to many of the learning systems that exist nowadays for training neural networks. Hebb proposed that the value of a synaptic union should be increased whenever the input and the output of a neuron are simultaneously activated. In this way, the network connections used most frequently are reinforced, emulating the biological phenomena of habit and learning by repetition.

A neural network is said to use Hebbian learning when it increases the value of its weights in accordance with the product of the excitation levels of the source and destination neurons.

Hebbian learning of the network is performed by means of successive iterations using only the information of the network input and output; it never uses the desired output or target. For this reason, this type of learning is called unsupervised learning, which distinguishes it from the models of learning that use the additional information of the desired output values, as a teacher would, and that we will present next.
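Hebb's rule can be stated compactly as an update proportional to the product of pre- and post-synaptic activity. The following Python sketch shows one common formulation under stated assumptions (the function name, learning rate and random data are ours, not from the text); in practice, pure Hebbian growth is often complemented by a normalization term to keep the weights bounded.

```python
import numpy as np

# Sketch of a Hebbian update: weights grow with the product of the
# activations of the source (x) and destination (y) neurons.
# The learning rate eta and the random data are illustrative assumptions.

def hebbian_update(w, x, eta=0.1):
    y = np.dot(w, x)              # level of excitation of the destination neuron
    return w + eta * y * x        # reinforce connections activated together

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)
for _ in range(10):
    x = rng.normal(size=3)        # unsupervised: inputs only, no targets
    w = hebbian_update(w, x)
print(w)
```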

1.3 Supervised Learning: Rosenblatt and Widrow

Although many learning methods following the Hebbian model have been developed, it seems logical to expect that the most efficient results can be achieved by those methods that use information about the network output (supervised learning). Learning is thus guided so as to realize a given function. Around 1960, Rosenblatt [Rosenblatt, 1962] dedicated his efforts to developing supervised learning algorithms to train a neural network that he called the perceptron. A perceptron is a Feed Forward neural network like the one shown in Figure 2, where the nonlinearities of the neurons are of the hard-threshold type. Some of the common functions used as alternatives to hard-threshold functions will be shown later on. In this sense, the McCulloch and Pitts model can be considered the simplest kind of hard-threshold perceptron.


Concretely, Rosenblatt showed that a one-layer perceptron is able to learn many practical functions. He proposed a learning rule for the perceptron called the perceptron rule. Let us consider the simplest case of a one-layer perceptron composed of a single neuron, that is, the model proposed by McCulloch and Pitts. If a set of pairs of inputs and corresponding desired outputs is known, $D_N = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_N, d_N)\}$, then, for a given input pattern $\mathbf{x}_k$ of the input data set, the perceptron rule updates the network weights $\mathbf{w} = [w_0, w_1, \ldots, w_M]^T$ in the following way

(3) $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,(d_k - o_k)\,\mathbf{x}_k$

The parameter $\eta$ controls the magnitude of the updates, and thus the speed of the algorithm's convergence. It is called the learning rate and usually takes values in the range between 0 and 1. The set $D_N$ is called the learning set and, as it includes the values of the desired outputs, it is of the supervised type.

If the training data set is linearly separable, Rosenblatt showed that the algorithm always converges in a finite number of steps, independently of the value of $\eta$. On the contrary, if the problem is not linearly separable, the algorithm has to be forced to stop, as there will always be at least one erroneously classified pattern.

Usually, training starts by giving small random values to the perceptron weights. In each step of the algorithm, a new input $\mathbf{x}_k$ is applied to the network, the corresponding output $o_k$ is calculated, and the weights are updated only if the error $(d_k - o_k)$ is not equal to zero. It is interesting to note that if the learning rate has a value close to 0, the weights vary little with each new input and learning is slow; if the value is close to 1, there can be large differences between the weight values of one iteration and the next, reducing the influence of past iterations, and the algorithm may not converge. This problem is called instability. Therefore, the gain rate should be adapted to the distribution changes of the input patterns, balancing the conflict between training time and stable updating of the weights.
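The perceptron rule of Equation (3) and the training loop just described can be sketched as follows; the toy data set (the logical AND with ±1 coding), the learning rate and the fixed number of epochs are illustrative assumptions.

```python
import numpy as np

# Sketch of the perceptron rule of Equation (3):
#   w(k+1) = w(k) + eta * (d_k - o_k) * x_k
# Data, eta and the epoch count are illustrative assumptions.

def train_perceptron(X, d, eta=0.5, epochs=20):
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend 1 so w[0] acts as w0
    w = np.random.default_rng(1).normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        for xk, dk in zip(X, d):
            ok = 1 if np.dot(w, xk) >= 0 else -1  # hard-threshold output
            if dk != ok:                          # update only on error
                w = w + eta * (dk - ok) * xk
    return w

# Linearly separable example (logical AND with +/-1 coding):
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
d = np.array([-1, -1, -1, 1])
print(train_perceptron(X, d))
```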

Also in the early 1960s, Widrow and Hoff [Widrow and Hoff, 1960] performed several demonstrations on perceptron-like systems that they called ADALINEs ("ADAptive LINear Elements"), proposing a learning rule called the LMS ("Least Mean Square") or Widrow-Hoff algorithm. This rule minimizes the sum of squared errors (SSE) between the desired output and the output given by the perceptron before the hard-threshold activation function. That is, it minimizes the error function

(4) $E(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N} (d_j - z_j)^2$

through a gradient algorithm. The linear output $z$ can be observed in Figure 1. When the gradient of Equation (4) with respect to $\mathbf{w}$ is computed and the weights are updated in the direction opposite to that gradient, the LMS rule is obtained:

(5) $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta \sum_{j=1}^{N} (d_j - z_j(k))\,\mathbf{x}_j$

where $z_j(k) = \mathbf{w}^T(k)\,\mathbf{x}_j$. This "block" version of the LMS (in the sense that it uses all training patterns in each iteration) is usually replaced by a "stochastic approximation" (pattern-by-pattern) version, as shown in the equation

(6) $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,(d_k - z_k)\,\mathbf{x}_k$

Unlike the perceptron rule, the application of LMS delivers reasonable results (thebest that can be achieved through a linear discriminator in the SSE sense) when thetraining set is not linearly separable.
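The stochastic LMS rule of Equation (6) admits an equally short sketch. Note that the update uses the linear output $z$, before any hard threshold; the data set (the XOR patterns) and the learning rate below are illustrative assumptions.

```python
import numpy as np

# Sketch of the stochastic LMS rule of Equation (6). The update uses the
# *linear* output z = w^T x, before any hard threshold. Data and eta are
# illustrative assumptions.

def train_lms(X, d, eta=0.05, epochs=50):
    X = np.hstack([np.ones((len(X), 1)), X])   # bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xk, dk in zip(X, d):
            zk = np.dot(w, xk)                 # linear output of Figure 1
            w = w + eta * (dk - zk) * xk       # pattern-by-pattern update
    return w

# Even on a non-separable set (XOR), LMS settles on the best linear fit
# in the SSE sense instead of oscillating like the perceptron rule:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
d = np.array([1, -1, -1, 1])
print(train_lms(X, d))   # weights near zero: no linear solution exists
```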

During those years, researchers all around the world became enthusiastic about the application possibilities that these systems promised.

1.4 Partial Eclipse of Neural Networks: Minsky and Papert

The initial euphoria of the early sixties gave way to disappointment when Minsky and Papert [Minsky and Papert, 1969] rigorously analyzed the problem and showed that there exist severe restrictions on the class of functions that a perceptron can realize. One of their results shows that a one-layer perceptron with two inputs and one output is unable to realize a function as simple as the exclusive-or (XOR). The inputs of this function take the values 1 or −1, the output being −1 when the two inputs are different and 1 when they are equal. This problem is illustrated in Figure 4: it can be observed that a linear discriminator is unable to separate the patterns of the two classes.

This limitation was well known by the end of the sixties, and it was also known that the problem could be solved by adding more layers to the system. As an example, let us analyze a two-layer perceptron.

Figure 4. The exclusive-or (XOR) problem: the patterns (1,1) and (−1,−1) of one class and the patterns (1,−1) and (−1,1) of the other cannot be separated by a single straight line in the $(x_1, x_2)$ plane


The first layer can classify input vectors separated by hyperplanes. The second layer can implement the logical functions AND and OR, because both problems are linearly separable. In this way, a perceptron like the one shown in Figure 5(a) can implement boundaries like the one shown in Figure 5(b) and thus solve the XOR problem. In the general case, it can be shown that a two-layer perceptron can implement convex, connected regions (a region is said to be convex if any straight line that joins two points of its boundary passes only through points included in the region limited by the boundary). Convex regions are limited by the (hyper)planes realized by each node in the first layer, and they can be open or closed.

It has to be noted that the possibilities of Multi-Layer Perceptrons rely on the nonlinearities of their neurons. If the activation function performed by the neurons were linear, the capabilities of the MLP would be the same as those of the single-layer perceptron. For example, consider a two-layer perceptron with zero threshold values, $w_0 = 0$, and with a linear activation function, $f(z) = z$ (see Figure 1). In this case, the outputs of the first layer can be expressed through a matrix product, $O_1 = W_1^T X$, and those of the second layer as $O_2 = W_2^T O_1$. Then, the output as a function of the input is obtained as

(7) $O_2 = W_2^T O_1 = W_2^T W_1^T X = W_{total}^T X$

Figure 5. (a) Two-layer perceptron able to solve the XOR problem; (b) the boundary it implements: the decision boundaries of nodes 1 and 2 jointly separate class A from class B in the $(x_1, x_2)$ plane


This function could be realized by a single-layer perceptron whose weights were $W_{total}$. Therefore, if the nodes are linear elements, the performance of the structure is not improved by adding new layers, as an equivalent one-layer perceptron can always be found.
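Equation (7) can be checked numerically; in the following sketch the layer shapes and random values are arbitrary assumptions.

```python
import numpy as np

# Numerical check of Equation (7): with linear activations and zero
# thresholds, two layers collapse into one. Shapes are illustrative.

rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))          # first-layer weights (4 inputs, 3 nodes)
W2 = rng.normal(size=(3, 2))          # second-layer weights (3 inputs, 2 nodes)
X = rng.normal(size=(4, 5))           # 5 input vectors of dimension 4

O2 = W2.T @ (W1.T @ X)                # two linear layers
W_total = W1 @ W2                     # equivalent single-layer weights
print(np.allclose(O2, W_total.T @ X)) # True: one layer suffices
```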

In spite of the possibilities opened by the MLP, Minsky and Papert, prestigious scientists of their time, emphasized that algorithms to train such structures were not known and expressed their scepticism that they would ever be developed. Their book [Minsky and Papert, 1969], which presented critical examples of the disadvantages of NNs versus classical computers in terms of their capability for storing information, was a heavy blow to the enthusiasm for NN research, eclipsing its development for the next twenty years.

1.5 Backpropagation Algorithm: Werbos, Rumelhart et al. and Parker

It is true that the single-layer perceptron has the limitation of being a simple discriminator; there are reasons to claim that it is only able to solve "toy" problems. Although these limitations diminish as the number of layers grows, it is difficult to find adequate weights to solve a given problem.

This problem was solved by incorporating "soft", differentiable nonlinearities in the neurons, in place of the classical hard threshold. In particular, the sigmoidal function (Figure 6) is very appropriate.

Among others, there exists an especially relevant theorem on the capabilities of the MLP with soft activation functions, Cybenko's Theorem [Cybenko, 1989]: a two-layer perceptron whose first-layer nodes (in unrestricted number) realize sigmoidal activation functions is sufficient to establish any correspondence between $\mathbb{R}^{N_0}$ and $[-1, 1]^{N_L}$ (and therefore also any classification). For a first review of the capabilities of perceptrons as "approximators" in the case of soft nonlinearities, it is worth mentioning the work of Hornik et al. [Hornik et al., 1989; Hornik et al., 1990].

But, again, we must come back to the question of how to train the network weights. In a form completely analogous to the LMS algorithm previously described, the backpropagation algorithm updates the network weights (in this case, of an MLP) in the direction opposite to the gradient of the error function to be minimized (e.g. the SSE). For that purpose, the chain rule is applied as many times as required

Figure 6. Multilayer Perceptron with sigmoidal nonlinearities (diagram: inputs $x_1, \ldots, x_m$ feed a layer of sigmoidal nodes whose outputs feed the output node)


to calculate that gradient for all the weights in the network. As the output is a differentiable function, this calculation is relatively easy [Haykin, 1994].

The backpropagation algorithm was proposed independently by Werbos [Werbos, 1974], Parker [Parker, 1985] and Rumelhart et al. [Rumelhart et al., 1986]. It can be said that the pessimism aroused by the book of Minsky and Papert found its counterpart twenty years later in the development of the backpropagation algorithm.

2. NEURAL NETWORKS VS CLASSICAL COMPUTERS

Classical digital computers process information at two basic levels: hardware and software. The computations performed are algorithmic and sequential. Each problem is solved through an algorithm coded in a program, physically located in the computer memory. Problems are solved one after the other. Algorithms are executed as many times as needed, with the same reliability and at electronic speed. Nevertheless, there are many real problems where computers cannot yet be successfully applied. For example, consider a little mosquito finding its way to survive in the world: such a problem remains an unsolved challenge for any automatic device. The difference probably lies in the fact that living beings do not follow the computer processing scheme.

Biological brains process information in a massively parallel, non-sequential way. Problems are solved by the cooperative participation of millions of highly interconnected elemental processors, called neurons. The neurons do not need to be programmed: from the stimuli they receive from other neurons, they are able to modify, adapt or learn their functioning. The system does not need a central processing unit to control its activities. It is interesting to note that biological neural systems work at a speed several orders of magnitude lower than electronic systems.

The brain is therefore an adaptive, non-linear, sophisticated processing system. Knowledge is distributed in the activation states of the neurons, and memory is not addressed through fixed labels. Artificial neural networks try to emulate these basic neural features of the brain and are designed by learning from examples. They could be defined as networks of massively connected simple units (usually adaptive units), hierarchically organized, that try to interact with the objects of the real world in the fashion of biological systems.

The advantages of NNs over classical computers are:
1. Adaptive learning: they are able to learn and perform tasks through an adaptive procedure.
2. Self-organization: they are able to build their own internal organization or representation of the information provided in a learning phase.
3. Fault tolerance: the ability to perform the same function despite partial destruction of the network.
4. Real-time operation: their hardware architecture is oriented to massively parallel processing of the information.


5. Simplicity of integration with present technology: these systems can be easily simulated on present computers and are also implemented in specific neural hardware, which allows their modular integration in present systems.

3. BIOLOGICAL AND ARTIFICIAL NEURONS

3.1 The Biological Neuron

The biological neuron, whose basic operation is not yet completely known or understood, is composed of a cellular body and a series of ramifications, called dendrites. Among all these branches, one is particularly long and receives the name of axon; it starts from the cellular body and ends in another series of dendrites. These nervous terminals are used by the neurons to contact each other by means of the synaptic connections. When a cell receives signals from other cells (these can be excitatory or inhibitory signals) and the global effect is an excitation that exceeds a certain threshold value, it responds by transmitting a nervous signal through the axon to the adjacent cells by means of the synapses of the nervous terminations.

The human nervous system is made up of these cells and is of a fascinating complexity. It is estimated that $10^{11}$ neurons participate in more than $10^{15}$ interconnections over channels that can measure more than a meter. Studies on human brain anatomy conclude that there are more than 1000 synapses at the input and output of each neuron. It is important to note that, although the commutation time of a neuron (a few milliseconds) is almost a million times slower than that of current computer elements, biological neurons have a much higher connectivity (by thousands of times) than current supercomputers.

Neurons are composed of the cell core (soma) and several branches called the axon and the dendrites. The dendrites of different neurons are connected at what are called synapses, which play the role of establishing the connection with the neighboring neurons in order to make communication among them possible.

Each neuron has two basic states: activation and rest. When a neuron is activated it emits through the axon a train of electrical impulses, at different frequencies depending on its level of activation. Information is coded in the frequency of generation, not in the amplitude.

The signal produced in the neuron body propagates from the axon to other neurons through chemical interchanges that take place at the synapses of the dendrites.

The chemical compounds liberated by the dendrites are called neurotransmitters and contribute to increasing or inhibiting the activation level of the receiving neuron. Due to the action of the neurotransmitters (which are basically chemical signals), ionic channels are opened in the receiving neuron and electrically charged ions are received, contributing to the overall electrical charge, or excitation level, of the neuron.

When the excitation level surpasses a certain activation level, the neuron is activated. The efficiency of the synapse depends on several factors: the number of neurotransmitter glands, the concentration in the membrane of the neighboring neuron, the efficiency of the ionic channels, and other physical and chemical variables.

As for the learning procedure, recent discoveries suggest that it is also of an electrochemical nature, taking place among neighboring neurons, hierarchically close in a layered structure. The chemical liberated in the learning process seems to be nitric oxide (NO). Its molecules are able to pass through the membrane and travel to the neighboring neurons, controlling the efficiency of the connection through reactions with other chemicals in the receiving neuron.

This regulation of the efficiency of the electrochemical connection among neighboring neurons is responsible for the learning procedure.

3.2 The Artificial Neuron

The simplest model of an artificial neuron, as presented in Figure 1, is obtained by approximating the action of all the neuron inputs by a linear function. Let us call this function the Base Function, $u(\cdot)$. In this case, the Base Function is a weighted sum

$u = w_0 + \sum_{j=1}^{n_i} w_j x_j$,

where $w_0$ is a threshold and the $w_j$ are the synaptic weights, which correspond to the effect of the inputs on the activation function.

The output function of an artificial neuron can then be expressed as

$y = f(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{n_i} w_j x_j\right)$

In an artificial neuron, this function is computed in two steps: first, the calculation of the value of the base function $u(\cdot)$ as the sum of the input values $x_j$ weighted by the synaptic weights $w_j$, plus the threshold value $w_0$; second, the application of a non-linear activation function $f(u)$.

Typical activation functions are illustrated in Figure 7:
• Step function

$u(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \geq 0 \end{cases}$

• Sign function

$\mathrm{sgn}(t) = \begin{cases} -1 & \text{if } t < 0 \\ 1 & \text{if } t \geq 0 \end{cases}$

• Gaussian function

$f(x) = a\, e^{-x^2/\sigma^2}$


Figure 7. Some typical activation functions: the sign function, $f(x-a) = 1$ if $x \geq a$ and $-1$ if $x \leq a$, and the hyperbolic function, $f(x) = \tanh(\beta x)$, $\beta > 0$

• Exponential function

$f(x) = \dfrac{1}{1 + e^{-\beta x}}, \quad \beta > 0$

• Hyperbolic function

$f(x) = \tanh(\beta x), \quad \beta > 0$

Hyperbolic and exponential functions are classified as sigmoids or sigmoidal functions. They are real-valued functions, bounded and monotonic $(f'(x) > 0)$.

In the case of sigmoidal functions, the value of the slope at the origin is called the gain, and it represents a measure of the steepness of the transition. Therefore, if the gain tends to infinity, the sigmoid tends to a sign function. According to this, the exponential and hyperbolic functions have gains of $\beta/4$ and $\beta$, respectively.

As assumed in the previous point, the activation function of a neuron is non-linear. If the function $f(u)$ is linear, $f(u) = u$, then the artificial neuron is called a Linear Neuron or Linear Node of the NN.
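The activation functions listed above translate directly into code. The following Python sketch mirrors the formulas in the text; the parameter names and the numerical check of the gain behaviour are our own illustrative choices.

```python
import numpy as np

# Sketch of the activation functions listed above; beta plays the role of
# the gain parameter described in the text.

def step(t):
    return np.where(t < 0, 0.0, 1.0)

def sign(t):
    return np.where(t < 0, -1.0, 1.0)

def gaussian(x, a=1.0, sigma=1.0):
    return a * np.exp(-x**2 / sigma**2)

def exponential(x, beta=1.0):          # sigmoid with gain beta/4 at the origin
    return 1.0 / (1.0 + np.exp(-beta * x))

def hyperbolic(x, beta=1.0):           # sigmoid with gain beta at the origin
    return np.tanh(beta * x)

x = np.linspace(-4, 4, 9)
# As beta grows, the sigmoids approach the sign function:
print(np.round(hyperbolic(x, beta=50.0), 3))
```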

4. NEURAL NETWORKS: CHARACTERISTICS AND TAXONOMY

A Neural Network can be represented as an ordered pair $(G, E)$, composed of a set $G$ of nodes or basic processing elements (also called processing units or artificial neurons) and a set $E$ of interconnections among them. The node set $G$ is partitioned into different subsets called layers.

Each processing unit may also have a local memory, and always has a transfer function. The output $y$ is computed from this function of the weighted input values and of the values stored in the local memory.

There are four main aspects that characterize all NNs:
a) Data Representation. According to the input-output form, ANNs can be classified as continuous-type NNs, digital NNs or hybrid NNs. In the continuous type, input-output data are of an analog nature: their values are real and continuous. In digital NNs, input-output data are of a digital nature. In the hybrid case, inputs are analog and outputs are binary.


b) Topology. The architecture or topology of a NN refers to the way the nodes are physically arranged in the network. The nodes form layers, or groups of nodes, that share a common input and feed their outputs to common nodes. Only the neurons in the input and output layers interact with external systems; the rest of the nodes in the network present internal connections, forming what are called hidden layers. Therefore, the topology of a NN is characterized by the number of layers, the number of neurons in each layer, the connectivity degree and the type of connections among the nodes.

c) Input-Output Association. With respect to the type of input-output association, NNs can be classified as heteroassociative or autoassociative. Heteroassociative NNs implement a certain function, frequently one of difficult analytical expression, associating a set of inputs with a set of outputs in such a way that each input has a corresponding output. Autoassociative networks have the purpose of rebuilding input information that has been corrupted, associating each input with the most similar stored data.

d) Learning Procedure. All the connections or synapses of the nodes in a NN have an associated synaptic weight, or efficiency factor. Each connection or synapse between node $i$ and node $j$ is weighted by $w_{ji}$. These weights are responsible for the learning of the neural network. In the learning phase, the NN modifies its weights as a result of new input information. Weights are modified following a convergent algorithm; when all the weight values have stabilized and the learning phase ends, the NN is said to have "learnt". For the learning process it is crucial to establish the weight-updating algorithm that allows the NN to correctly learn the new input information. According to the learning criteria, NNs can be classified as supervised-learning or unsupervised-learning neural networks. Figure 8 represents the most common classification of NNs.

5. FEED FORWARD NEURAL NETWORKS: THE PERCEPTRON

First presented in Section 1.1, Feed Forward Neural Networks are generally defined as those networks composed of one or more layers whose nodes are connected in such a way that their inputs come only from nodes in the previous layer and their outputs connect exclusively to neurons of the following layer. Their name comes from the fact that the output of each layer feeds the units of the following layer.

Of all feed forward NNs, the most popular is the Multilayer Perceptron, developed as an extension of the Perceptron proposed by Rosenblatt in 1962 [Rosenblatt, 1962].

In this type of network, the learning is supervised, because it uses information about the output that the network must provide for the current input.


Figure 8. Neural Networks basic taxonomy

The learning or training phase consists in presenting to the network input-output pairs, called training patterns,

$D_N = \{(\mathbf{x}_1, \mathbf{d}_1), (\mathbf{x}_2, \mathbf{d}_2), \ldots, (\mathbf{x}_N, \mathbf{d}_N)\}$

where $\mathbf{x}_i \in \mathbb{R}^p$ and $\mathbf{d}_i \in \mathbb{R}^k$ $(i = 1, 2, \ldots, N)$, in such a way that the weights are adjusted. Once the training phase is completed, the network is designed and ready to work in what is called the direct mode phase. In this phase, the network classifies the inputs by the following binary decision rule

$g = \begin{cases} 1 & \text{if } \phi(\mathbf{x}, \mathbf{w}) > 0 \\ 0 & \text{if } \phi(\mathbf{x}, \mathbf{w}) < 0 \end{cases}$

where $\phi(\mathbf{x}, \mathbf{w})$ is the discriminating function; that is, the space $\mathbb{R}^p$ is divided into two regions by the decision boundary $\phi(\mathbf{x}, \mathbf{w}) = 0$.

Logically, the choice of the discriminating function $\phi(\mathbf{x}, \mathbf{w})$ depends on the distribution of the training patterns.

5.1 One Layer Perceptron

It basically consists of a set of nodes whose activation is produced by the weighted sum of the input values; consequently, the discriminating function takes the form

(8) $\phi(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{p} w_i x_i + \theta = 0$

Also, if we set $\theta = w_0$ and consider the inputs in the space $\mathbb{R}^{p+1}$, so that $\mathbf{x} = (x_1, x_2, \ldots, x_p, 1)$ and $\mathbf{w} = (w_1, w_2, \ldots, w_p, w_0)$, Equation (8) can be expressed as

$\phi(\mathbf{x}, \mathbf{w}) = \mathbf{w}\,\mathbf{x}^T = 0$

Among other things, the one-layer perceptron serves to perform pattern classification tasks, through a discriminating function of the form [Karayiannis and Venetsanopoulos, 1993; Hush and Horne, 1993]:

$u_k(\mathbf{x}_n) = \sum_{j=0}^{N} w_{kj}\, x_{nj}$

The classification rule is based on assigning class $k$ to the input pattern if the $k$-th network output is the highest of all outputs. The network must be trained, following an appropriate algorithm, to produce the desired output for each pattern:

$u_k(\mathbf{x}_n) \geq u_j(\mathbf{x}_n)\ \ \forall j \neq k \implies \mathbf{x}_n \in W_k$

This decision rule is sometimes replaced by a binary decision rule with a decision threshold. The Perceptron is a system that operates in such a way. After the learning or training, the Perceptron structure can separate the classification space into regions, one region for each class.

The decision boundaries are composed of hyperplane segments defined by

$u_k(\mathbf{x}_n) - u_j(\mathbf{x}_n) = 0$
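The winner-take-all rule above amounts to an argmax over the outputs $u_k$. A minimal sketch follows, with an assumed illustrative weight matrix (one row per class, inputs extended with a leading 1 for the threshold term):

```python
import numpy as np

# Sketch of the one-layer multi-class rule: class k is assigned when the
# k-th output u_k(x) is the largest. W and x below are illustrative.

def classify(W, x):
    u = W @ x                  # u_k(x) = sum_j w_kj * x_j for every class k
    return int(np.argmax(u))   # index of the winning output

W = np.array([[0.2, 1.0, -0.5],    # weights of class 0
              [-0.1, -1.0, 0.8]])  # weights of class 1
x = np.array([1.0, 0.3, 0.9])      # [1, x1, x2]
print(classify(W, x))
```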

The Perceptron was initially proposed by Rosenblatt and a group of his students, whose work showed the Perceptron's versatility. Unfortunately, the problem of linear separability pushed its use out of interest.


5.1.1 Perceptron Training

It can be summarized in five steps:
1. Weights and threshold initialization. Each one of the weights $(w_i)$ is initialized to small random values $(w_0 = \theta)$.
2. Presentation of the training pattern. A new input/output training pair is presented, composed of a new input $X_p = (x_1, x_2, \ldots, x_N)$ and its corresponding desired output $d(t)$.
3. Computation of the present output:

$y_i(t) = f\left(\sum_{j=1}^{M} w_{ij}\, x_j(t) - \theta_i\right) = f(\mathrm{Net}_i)$

4. Weight adaptation: $\Delta W_i = \eta\,(d(t) - y(t))\,x_i(t)$, where $\eta$ is the learning rate $(0 < \eta < 1)$, $d(t)$ is the desired output and $y(t)$ the present output. This process is repeated until the error $e(t) = d(t) - y(t)$ for each one of the patterns is zero or less than a preset value.
5. Back to step 2.

The convergence of perceptron training is established by the following theorem: if the training set of a multiple classification problem is linearly separable, then the perceptron training algorithm converges to a correct solution in a finite number of iterations. The mathematical proof of this theorem can be found in [Rosenblatt, 1962]; its significance lies in the fact that a multiple-class problem can be reduced to binary classifications. Two typical examples of this situation are shown in Figure 9.

6. LMS LEARNING RULE

Nevertheless, even with the simple Perceptron structure, a reasonable solution can be achieved for a set that does not satisfy the linear separability property, by using the Least Mean Square (LMS) convergence algorithm to update the NN weights during the learning phase.

Figure 9. Logical functions OR and AND reduced to a binary classification problem (truth tables over the inputs $x_1$, $x_2$)


In general, the error function of Equation (4), also called the cost function or objective function, to be minimized by the LMS algorithm can be expressed as follows [Hush and Horne, 1993]:

$E = \sum_{k=1}^{M} \sum_{\mathbf{x}_n \in W_k} \left\| u(\mathbf{x}_n) - \delta_k \right\|^2$

where $\delta_k$ is a vector whose components are all zero except the $k$-th one, which corresponds to the correct classification.

Therefore, for a given training set $D_N$, if $d_k$ represents the desired value and $y_k$ the computed output for the $k$-th input vector, the Mean Square Error (MSE) corresponding to the input-output pairs is given by

$\langle \varepsilon_k^2 \rangle = \frac{1}{N} \sum_{k=1}^{N} \varepsilon_k^2 = \frac{1}{N} \sum_{k=1}^{N} (d_k - y_k)^2$

or, in vector notation,

$\langle \varepsilon_k^2 \rangle = \langle d_k^2 \rangle - 2\,\langle d_k \mathbf{x}_k^T \rangle\,\mathbf{w} + \mathbf{w}^T \langle \mathbf{x}_k \mathbf{x}_k^T \rangle\,\mathbf{w}$

The minimum square error corresponds to the weight vector $\mathbf{w}$ that satisfies the equation

$\dfrac{\partial \langle \varepsilon^2 \rangle}{\partial \mathbf{w}} = 0$

In the case $N = 2$, the equation describes an error paraboloid, as shown in Figure 10, from which it can be observed that the optimum value for the network weights is the one that makes the gradient null.

A possible search procedure is steepest descent. The gradient direction is perpendicular to the contour lines at each point of the error surface. From the algorithm's starting point, the weight vector does not head straight for the minimum except in the case of spherical level curves. The weight updates at each iteration step must be small, or the weight vector could wander over the hypersurface without ever reaching the sought minimum.

6.1 The Multilayer Perceptron

A Perceptron of $n$ layers is composed of $n+1$ layers $L_l$ $(l = 0, 1, \ldots, n)$ of several processing units each, where $L_0$ corresponds to the input layer, $L_n$ to the output layer and $L_l$ $(l = 1, \ldots, n-1)$ to the hidden layers.

The nodes in the hidden and output layers are individual processing units. The output of each is obtained by adding all its weighted inputs and passing the result through a non-linear function of sigmoidal type (see Figure 6).


Figure 10. Error paraboloid of the LMS learning (error surface over the weight plane)

Usually, in a Multilayer Perceptron, the nodes in each layer are fully interconnected with the neurons of the adjacent layers. This pattern is repeated layer by layer throughout the whole network.

6.1.1 Learning Algorithm (“Backpropagation”)

Before detailing the learning algorithm, let us introduce the following nomenclature:

$u_{l,j}$: output of the $j$-th node in layer $l$.
$w_{l,j,i}$: weight that connects the $i$-th node in layer $l-1$ to the $j$-th node in layer $l$.
$\mathbf{x}_p$: $p$-th training pattern.
$u_{0,i}$: $i$-th component of the input vector.
$d_j(\mathbf{x}_p)$: desired output of the $j$-th node in the output layer when the $p$-th pattern is presented at the network input.
$N_L$: number of nodes in a given layer.
$L$: number of layers.
$P$: number of training patterns.

Obviously, in a Perceptron-like structure, the outputs depend on the synaptic weights that connect the neurons in the different layers. Such weights are updated in the following way:
1. Associating a set of input patterns with a set of desired outputs. In a pattern classification problem, this amounts to a primary classification made by the designer (supervised training).


2. Presenting all training patterns to the network. The network processes all patterns and presents an output. The classification offered by the network may be erroneous, and in that case the error is easily quantified.

3. Defining an objective function. For example, the Mean Square Error (MSE)between the desired and real outputs of the units in the output layer [Hush andHorne, 1993]:

$J_p(\mathbf{w}) = \frac{1}{2} \sum_{q=1}^{N_L} \left( u_{L,q}(\mathbf{x}_p) - d_q(\mathbf{x}_p) \right)^2$

This objective function represents an error surface in a parametric hyperspace. Training or learning then consists in searching for the minimum of that surface through a descent algorithm that moves in the direction opposite to the surface gradient, seeking a set of weights that minimizes the error.

Each weight is modified or adapted at each iteration step by an amount proportional to the partial derivative of the function with respect to that weight

(9) $w_{l,j,i}(k+1) = w_{l,j,i}(k) - \eta\, \dfrac{\partial J_p(\mathbf{w})}{\partial w_{l,j,i}}$

In Equation (9), the constant $\eta$ is the learning rate. The speed of convergence of the algorithm depends on $\eta$, because the amount by which a weight is modified at each iteration step is proportional to the gradient in that weight's direction, weighted by the constant value of the learning rate. At this point, the training algorithm can be designed if we know how to calculate the partial derivative with respect to each weight of the network. This derivative can be easily calculated using the chain rule:

$\dfrac{\partial J_p(\mathbf{w})}{\partial w_{l,j,i}} = \dfrac{\partial J_p(\mathbf{w})}{\partial u_{l,j}} \cdot \dfrac{\partial u_{l,j}}{\partial w_{l,j,i}}$

that is,

$\dfrac{\partial J_p(\mathbf{w})}{\partial w_{l,j,i}} = \dfrac{\partial J_p(\mathbf{w})}{\partial u_{l,j}}\; f'\!\left( \sum_{m=0}^{N_{l-1}-1} w_{l,j,m}\, u_{l-1,m} \right) u_{l-1,i}$

where $f(\cdot)$ represents the sigmoidal function previously defined. This function has a very simple derivative:

$f'(\sigma) = \dfrac{d f(\sigma)}{d\sigma} = f(\sigma)\,(1 - f(\sigma))$

when the gain parameter has unit value.

In this expression we can observe that the sensitivity of the objective function to each weight depends on the sensitivity of this function to the output of the neuron fed by that synaptic weight.
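The following Python sketch makes these chain-rule computations concrete for a two-layer MLP with sigmoid nodes; the network sizes, data and learning rate are illustrative assumptions, and the threshold terms are omitted for brevity.

```python
import numpy as np

# Minimal sketch of the chain-rule gradient computations above for a
# two-layer MLP with sigmoid nodes and objective J = 1/2 * sum (u_L - d)^2.
# Sizes, data and eta are illustrative; thresholds omitted for brevity.

def f(s):
    return 1.0 / (1.0 + np.exp(-s))   # sigmoid, with f' = f * (1 - f)

def backprop_step(W1, W2, x, d, eta=0.5):
    u1 = f(W1 @ x)                              # hidden-layer outputs
    u2 = f(W2 @ u1)                             # output-layer outputs
    delta2 = (u2 - d) * u2 * (1 - u2)           # "error" at the output layer
    delta1 = (W2.T @ delta2) * u1 * (1 - u1)    # backpropagated to hidden layer
    W2 = W2 - eta * np.outer(delta2, u1)        # gradient step, Equation (9)
    W1 = W1 - eta * np.outer(delta1, x)
    return W1, W2

rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.5, size=(3, 2))
W2 = rng.normal(scale=0.5, size=(1, 3))
x, d = np.array([0.2, 0.7]), np.array([1.0])
for _ in range(100):
    W1, W2 = backprop_step(W1, W2, x, d)
print(f(W2 @ f(W1 @ x)))   # output approaches the target d
```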


This last sensitivity can in its turn be calculated from the sensitivities of the objective function with respect to the node outputs of the following layer, and so on [Hush and Horne, 1993]; this process is repeated until the output layer is reached.

The sensitivity of the objective function to each node output can thus be calculated recursively, starting from the output layer. The sensitivities to the outputs of nodes in hidden layers are also called "errors", although, strictly speaking, they do not represent a real error. In order to calculate the error in the hidden layers, the error in the output layer must be computed and backpropagated to the previous layers; that is what the Backpropagation algorithm performs. In this algorithm, training usually starts with small random values of the synaptic weights, in order to provide a safe starting point. Once the structure of the network is chosen, the key parameter to be controlled is the learning rate. Too small a value will slow the learning process; too high a value will accelerate learning but can overshoot the minimum of the error surface. To find the optimal value of this parameter, an empirical method has to be used. Once learning has started, it should continue until a minimum error is found, or until no variation in the weight values is achieved; at that point, the network is said to have finished learning. It is not always practical to wait for this point, and several other stopping criteria are adopted, among them:
1. When the value of the gradient of the error surface is sufficiently small, meaning that the gradient learning algorithm has found a set of weight values at a local minimum of the error surface.
2. When the error between the real network output and the desired one is below a certain tolerable value for the application. Obviously, this criterion requires knowing the maximum tolerable error for the given application.
3. In pattern classification problems, when all the learning patterns have been correctly classified, the training procedure can be stopped.
4. Training can be stopped after a fixed number of iterations.
5. Finally, a more refined procedure is to train the network with one set of patterns and monitor the error over a different set, called the test set; the training phase is stopped when a minimum error on the test set is found.
This last method prevents the overspecialization of the network on the training set, a phenomenon that occurs when the error on the training set is lower than the error over other sets of patterns of the same application, showing that the network has lost generalization capabilities. The method requires twice as many patterns, which can be expensive or even impossible. Therefore, in order to efficiently apply neural networks to real problems, it is very important to have a sufficient number of patterns.

6.2 Acceleration of the training procedure

The training procedure described in the previous section presents two main problems: on the one hand, the convergence or training phase is very slow; on the other hand, it is not easy to choose the appropriate learning rate precisely.


A simple way to accelerate network training is the use of second-order methods, which exploit the information contained in the matrix of second derivatives (the Hessian). These methods reduce the number of iterations needed in the training phase to reach a local or global minimum of the error surface. Nevertheless, they require a greater amount of computation per iteration, which increases the training time; for this reason, usually only the diagonal of the Hessian is used.

Another solution is to augment the weight update with a term that is a fraction of the past weight changes. This term, usually known as the momentum term, is weighted by a new constant value, usually designated $\alpha$:

$w_{k,j}(k+1) = w_{k,j}(k) - \eta\, \dfrac{\partial J(\mathbf{w})}{\partial w_{k,j}} + \alpha\, \Delta w_{k,j}(k)$

This term tends to smooth the changes in the weights, increasing the learning speed by avoiding divergent learning fluctuations.
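As a minimal sketch of this update, assuming a generic gradient function and illustrative values of the learning rate and momentum coefficient:

```python
# Sketch of the momentum update above: the previous weight change, scaled
# by alpha, is added to the gradient step. grad_J is any function returning
# the gradient of the objective; all names and values are illustrative.

def momentum_step(w, dw_prev, grad_J, eta=0.1, alpha=0.9):
    dw = -eta * grad_J(w) + alpha * dw_prev   # smooth successive changes
    return w + dw, dw

# Usage on a 1-D quadratic J(w) = w^2, whose gradient is 2w:
w, dw = 5.0, 0.0
for _ in range(200):
    w, dw = momentum_step(w, dw, lambda v: 2.0 * v)
print(round(w, 6))   # converges towards the minimum at w = 0
```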

It has been shown that adding noise to the training patterns decreases the training time and helps to avoid local minima in the learning process.

Another way to decrease the training time is to use alternative transfer functions in the network nodes. When the function is allowed to take positive and negative values over a symmetric dynamic range, it is probable that several activations will be close to zero and their corresponding weights will not need to be updated. An example of this type of activation function is the hyperbolic one.

Table 1 summarizes typical parameters of this kind of network and their influence on the processing.

6.3 On-Line and Off-Line training

During training, the weight update can be carried out in two different ways [Bourlard and Morgan, 1994]:
• Off-line or "block" training: in this case, the modifications of the weights over the whole training set are accumulated, and the weights are modified only once all the training patterns have been presented to the network.

Table 1. Design Properties of NNs

  Transfer function                 | Sign                     | Exponential (β = 1)                | Hyperbolic (β = 1)
  Derivative of transfer function   | (not defined)            | f'(x) = f(x)(1 − f(x))             | f'(x) = 1 − f²(x)
  Learning rate                     | η = 1                    | η = 0.1                            | η = 0.01
  Effects on the NN                 | Learning not guaranteed  | Quick but not precise convergence  | Precise and slow convergence

  Momentum: with a small value, the weight-increment vectors take very divergent directions; with a big value, they take similar directions, helping the convergence of the training.


• On-line training: the network weights are modified each time a new training pattern is presented to the network. It can be proved that this method leads to the same result as off-line training [Widrow and Stearns, 1985]. In practice, this method shows some advantages that make it much more attractive: it converges much more quickly to a minimum of the error surface and usually avoids local minima. A possible explanation is that on-line training introduces some "noise" over the set of training patterns. A sketch contrasting the two modes is given below.
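The two modes can be contrasted in a few lines; the LMS rule, data and learning rate in this sketch are illustrative assumptions.

```python
import numpy as np

# Sketch contrasting the two update modes for the LMS rule. In off-line
# (block) training the corrections are accumulated over the whole set; in
# on-line training each pattern updates the weights immediately.

def epoch_offline(w, X, d, eta=0.05):
    dw = sum(eta * (dk - w @ xk) * xk for xk, dk in zip(X, d))
    return w + dw                          # one update per epoch

def epoch_online(w, X, d, eta=0.05):
    for xk, dk in zip(X, d):
        w = w + eta * (dk - w @ xk) * xk   # one update per pattern
    return w

X = np.array([[1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])   # leading 1 = bias input
d = np.array([-1.0, 1.0, 1.0])
w_off = w_on = np.zeros(2)
for _ in range(100):
    w_off, w_on = epoch_offline(w_off, X, d), epoch_online(w_on, X, d)
print(w_off, w_on)   # both approach the same least-squares solution
```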

6.4 Selection of the Network size

The selection of the appropriate network size is a task of the utmost importance: if the network is too small, it will not be able to achieve an efficient solution for the problem it is representing, while if its size is too big, the network may represent many solutions that fit the training patterns, none of which is optimal for the application problem.

If there is no prior experience, dimensioning the network is a trial-and-error problem. One option is to start with a small network and increase its size progressively in order to find an efficient dimension for the network. The other option is to start with a big network and reduce its size progressively, removing the nodes or weights that have no significant effect on the overall output of the NN.

Several studies have established size limits that should not be exceeded. In this sense, one proposal is that the number of nodes in the hidden layer should not exceed the number of training patterns. In practice, this is always accomplished, as the number of nodes will always be much lower than the number of training patterns. In fact, big networks may be able to memorize the whole training set, losing generalization capabilities.

7. KOHONEN NETWORKS

A main principle of biological brain organization is that neurons group in such a way that those that are physically close collaborate on the same stimulus being processed. This is how nerve connections are organized. For example, at each level of the auditory pathway, nerve cells and fibers are arranged in relation to the frequency that elicits the highest output from each neuron [Lippmann, 1987]. Therefore, the physical disposition of the neurons in the brain structure is somehow related to the function they perform.

Kohonen [Kohonen, 1984] proposed an algorithm to adjust the weights of a network whose input is a vector of $N$ components and whose output is another vector of different dimension $M$ $(M < N)$. In this way, the dimension of the input subspace is reduced, physically grouping the data.

Vectors defined over continuous variables are used as input to the network. The network is trained without supervision, in such a way that the network itself establishes the grouping criteria for the input data, extracting regularities and correlations. When a sufficient number of input vectors has been presented, the weights self-organize so that, topologically speaking, close nodes are sensitive to similar inputs, while physically distant nodes remain completely inactive. Clusters that have their topological equivalent in the network are produced. For this reason, this kind of network is known as a Self-Organizing Feature Map (SOFM).

The algorithm that assigns values to the synaptic weight connections is based on the concepts of neighborhood and competitive learning. The distance between the input and the weights of all the nodes is computed, and the closest one is established as the winner node. The weight update is performed for this node and its neighboring nodes; the rest are not updated, favoring a concrete physical organization.

This kind of network always has two layers: the input layer and the output layer. The dimension of the input vector establishes the number of nodes of the input layer: one node for each component of the input vector. The input neurons drive the input vectors to the output layer through the connection weights. In this type of network it is very important to establish a neighborhood and a distance measure over the network.

In the example of Figure 11, the nodes are configured in a bidimensional structure. The algorithm used to compute the output is designed in such a way that only one output neuron is activated when an input vector is applied to the network. The fired node corresponds to the classification category of the input vector. Similar input vectors activate the same output, while different vectors activate different neurons; only the neuron with the minimum difference between the input vector and its weight vector is activated. When the training algorithm starts, the adjustment is done over a wide zone surrounding the fired, or winner, node. As training progresses, the neighborhood area is progressively reduced. Through this gradual adjustment, the network follows any systematic change in the input vectors: the network self-organizes. Therefore, this algorithm behaves as a vector quantizer when the number of desired clusters can be specified a priori and a sufficient amount of data relative to the number of desired clusters is known.

Figure 11. Structure of the Kohonen Network (an input layer fully connected to a grid of output nodes)


However, the results depend on the order of presentation of the input data, especially when the amount of input data is small.

7.1 Training

Training of the SOFM network can be summarized in the following steps:
1. Weights initialization: the network structure has $N$ input nodes and $M$ output nodes. Random values are assigned to each of the weight connections $(w_{ij})$, and an initial neighborhood radius is fixed for the neighborhood mask.
2. Presentation of a new input pattern: $X_p(t) = (x_1(t), x_2(t), \ldots, x_N(t))$ is presented at the input.
3. Computation of the distance $d_j$ between the input and each one of the output nodes:

$d_j = \sum_{i=0}^{N-1} \left( x_i(t) - w_{ij}(t) \right)^2$,

where $x_i(t)$ is the input to node $i$ at iteration $t$, and $w_{ij}(t)$ is the weight from input $i$ to output $j$ at iteration $t$.
4. Selection of the output node with the minimum distance: node $j^*$ is selected as the node with the minimum distance $d_j$.
5. Updating of node $j^*$ and its neighbors: the weights are updated for node $j^*$ and all its neighbors in the vicinity defined by $NE_{j^*}(t)$. The new weights are

$w_{ij}(t+1) = w_{ij}(t) + \eta(t)\,(x_i(t) - w_{ij}(t))$,

for $j \in NE_{j^*}(t)$, $0 \leq i \leq N-1$. The term $\eta(t)$ is a gain term $(0 < \eta(t) < 1)$ that decreases with time.
6. Back to step 2.

A standard example introduced by Kohonen illustrates the capacity of self-organizing networks to learn random distributions of the input vectors presented to the network. For example, if the input is a two-component vector with uniformly distributed components and the output is designed as bidimensional, the network weights will organize into a reticular pattern, as shown in Figure 12.
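The training loop above can be sketched for a small one-dimensional map; the grid size and the (linearly decreasing) radius and gain schedules in the following Python code are illustrative assumptions, since the text does not prescribe particular schedules.

```python
import numpy as np

# Sketch of the SOFM training steps above for a small one-dimensional map.
# Grid size, radius schedule and gain schedule are illustrative assumptions.

rng = np.random.default_rng(4)
M, N, T = 10, 2, 2000                 # output nodes, input dimension, iterations
W = rng.random((M, N))                # step 1: random weight initialization

for t in range(T):
    x = rng.random(N)                           # step 2: new input pattern
    d = ((x - W) ** 2).sum(axis=1)              # step 3: distance to every node
    j_star = int(np.argmin(d))                  # step 4: winner node
    radius = max(1, int(3 * (1 - t / T)))       # shrinking neighborhood NE(t)
    eta = 0.5 * (1 - t / T)                     # decreasing gain term
    for j in range(max(0, j_star - radius), min(M, j_star + radius + 1)):
        W[j] += eta * (x - W[j])                # step 5: update winner + neighbors

print(np.round(W, 2))  # weights self-organize: neighboring nodes end up close
```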

Figure 12. Kohonen Map for the two-dimensional case: the network weights organize into a regular lattice covering the input distribution


8. FUTURE PERSPECTIVES

Artificial neural networks are inspired by the biological functioning of the human brain, which they attempt to emulate. This is the main link between biological and artificial neural networks; from this starting point, the two disciplines follow separate ways. The present understanding of brain mechanisms is so limited that the systems designer does not have sufficient data to emulate its performance. Therefore, the engineer has to go one step beyond the biological knowledge, searching for and devising useful algorithms and structures that efficiently solve given problems. In the vast majority of cases, this search delivers a result that diverges completely from the biological reality, and the brain similarities become metaphors.

Despite this faint and often nonexistent analogy between biology and artificial neural networks, the results of the latter frequently evoke comparisons with the former, because they are often reminiscent of the workings of the brain. Unfortunately, these comparisons are not benign: they produce unrealistic expectations that lead to disappointment. Research based on false expectations can evaporate when illuminated by the light of reality, as happened in the sixties. This promising research field could be eclipsed again if we do not restrain the temptation of comparing our results with those of the brain.

It has been said that NNs are capable of being applied to all activities specific to the human brain. Currently, they are considered an alternative for all those tasks where conventional computation does not achieve satisfactory results. There has been speculation about a near future where NNs will earn a place alongside classical computation. However, this will only happen if researchers achieve sufficient knowledge for that development; currently, the theoretical knowledge is not robust enough to justify such predictions.

REFERENCES

W.S. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115–133, 1943.
W. Pitts and W.S. McCulloch, How We Know Universals, Bulletin of Mathematical Biophysics, 9:127–147, 1947.
D.O. Hebb, Organization of Behaviour, Science Editions, New York, 1961.
F. Rosenblatt, Principles of Neurodynamics, Science Editions, New York, 1962.
B. Widrow and M.E. Hoff, Adaptive Switching Circuits, In IRE WESCON Convention Record, pages 96–104, 1960.
M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.
G. Cybenko, Approximation by Superposition of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, 2:303–314, 1989.
K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 2(5):359–366, 1989.
K. Hornik, M. Stinchcombe and H. White, Universal Approximation of an Unknown Mapping and Its Derivatives using Multilayer Feedforward Networks, Neural Networks, 3:551–560, 1990.


S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing, Ontario, 1994.
P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, PhD thesis, Harvard University, Boston, 1974.
D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning Internal Representations by Error Propagation, In D.E. Rumelhart, J.L. McClelland and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–362, MIT Press, Cambridge, MA, 1986.
D.B. Parker, Learning Logic, Technical Report TR-47, MIT Center for Research in Computational Economics and Management Science, Cambridge, MA, 1985.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks: What's New Since Lippmann?, IEEE Signal Processing Magazine, 2:721–729, January 1993.
N.B. Karayiannis and A.N. Venetsanopoulos, Artificial Neural Networks: Learning Algorithms, Performance Evaluation and Applications, Kluwer Academic Publishers, Boston, MA, 1993.
H.A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Boston, MA, 1994.
B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall Signal Processing Series, Englewood Cliffs, NJ, 1985.
R.P. Lippmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, 328–339, April 1987.
T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.


CHAPTER 3

ARTIFICIAL NEURAL NETWORKS

D. T. PHAM, M. S. PACKIANATHER, A. A. AFIFY
Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION

Artificial neural networks are computational models of the brain. There are manytypes of neural networks representing the brain’s structure and operation withvarying degrees of sophistication. This chapter provides an introduction to the maintypes of networks and presents examples of each type.

1. TYPES OF NEURAL NETWORKS

Neural networks generally consist of a number of interconnected processing ele-ments (PEs) or neurons. How the inter-neuron connections are arranged and thenature of the connections determine the structure of a network. How the strengths ofthe connections are adjusted or trained to achieve a desired overall behaviour of thenetwork is governed by its learning algorithm. Neural networks can be classifiedaccording to their structures and learning algorithms.

1.1 Structural Categorisation

In terms of their structures, neural networks can be divided into two types: feedforward networks and recurrent networks.

Feedforward networks: In a feedforward network, the neurons are generallygrouped into layers. Signals flow from the input layer through to the output layervia unidirectional connections, the neurons being connected from one layer to thenext, but not within the same layer. Examples of feedforward networks include themulti-layer perceptron (MLP) [Rumelhart and McClelland, 1986], the radial basisfunction (RBF) network [Broomhead and Lowe, 1988; Moody and Darken, 1989],the learning vector quantization (LVQ) network [Kohonen, 1989], the cerebellar


model articulation control (CMAC) network [Albus, 1975a], the group-method ofdata handling (GMDH) network [Hecht-Nielsen, 1990] and some spiking neuralnetworks [Maass, 1997]. Feedforward networks can most naturally perform staticmappings between an input space and an output space: the output at a given instantis a function only of the input at that instant.

Recurrent networks: In a recurrent network, the outputs of some neurons are fed back to the same neurons or to neurons in preceding layers. Thus, signals can flow in both forward and backward directions. Examples of recurrent networks include the Hopfield network [Hopfield, 1982], the Elman network [Elman, 1990] and the Jordan network [Jordan, 1986]. Recurrent networks have a dynamic memory: their outputs at a given instant reflect the current input as well as previous inputs and outputs.

1.2 Learning Algorithm Categorisation

Neural networks are trained by two main types of learning algorithms: supervised and unsupervised learning algorithms. In addition, there exists a third type, reinforcement learning, which can be regarded as a special form of supervised learning.

Supervised learning: A supervised learning algorithm adjusts the strengths orweights of the inter-neuron connections according to the difference between thedesired and actual network outputs corresponding to a given input. Thus, supervisedlearning requires a teacher or supervisor to provide desired or target output signals.Examples of supervised learning algorithms include the delta rule [Widrow andHoff, 1960], the generalised delta rule or backpropagation algorithm [Rumelhartand McClelland, 1986] and the LVQ algorithm [Kohonen, 1989].

Unsupervised learning: Unsupervised learning algorithms do not require thedesired outputs to be known. During training, only input patterns are presented to theneural network which automatically adapts the weights of its connections to clusterthe input patterns into groups with similar features. Examples of unsupervisedlearning algorithms include the Kohonen [Kohonen, 1989] and Carpenter-GrossbergAdaptive Resonance Theory (ART) [Carpenter and Grossberg, 1988] competitivelearning algorithms.

Reinforcement learning: As mentioned before, reinforcement learning is a spe-cial case of supervised learning. Instead of using a teacher to give target outputs,a reinforcement learning algorithm employs a critic only to evaluate the goodnessof the neural network output corresponding to a given input. An example of areinforcement learning algorithm is the genetic algorithm (GA) [Holland, 1975;Goldberg, 1989].

2. NEURAL NETWORK EXAMPLES

This section briefly describes the example neural networks and associated learningalgorithms cited previously.


2.1 Multi-layer Perceptron (MLP)

MLPs are perhaps the best known type of feedforward networks. Figure 1a shows an MLP with three layers: an input layer, an output layer and an intermediate or hidden layer. Neurons in the input layer only act as buffers for distributing the input signals x_i to neurons in the hidden layer. Each neuron j (Figure 1b) in the hidden layer sums up its input signals x_i after weighting them with the strengths of the respective connections w_ji from the input layer and computes its output y_j as a function f of the sum, viz.

(1) y_j = f\left(\sum_i w_{ji} x_i\right)

f can be a simple threshold function or a sigmoidal, hyperbolic tangent or radial basis function (see Table 1).

The output of neurons in the output layer is computed similarly.

The backpropagation (BP) algorithm, a gradient descent algorithm, is the most commonly adopted MLP training algorithm.

Figure 1a. A multi-layer perceptron

Figure 1b. Details of a neuron


Table 1. Activation functions

    Linear:                  f(s) = s
    Threshold:               f(s) = +1 if s > s_t; −1 otherwise
    Sigmoid:                 f(s) = 1/(1 + exp(−s))
    Hyperbolic tangent:      f(s) = (1 − exp(−2s))/(1 + exp(−2s))
    Radial basis function:   f(s) = exp(−s²/σ²)

It gives the change Δw_ji in the weight of a connection between neurons i and j as follows:

(2) \Delta w_{ji} = \eta\,\delta_j x_i

where η is a parameter called the learning rate and δ_j is a factor depending on whether neuron j is an output neuron or a hidden neuron. For output neurons,

(3) \delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right)

and for hidden neurons,

(4) \delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj}\,\delta_q

In Equation (3), net_j is the total weighted sum of input signals to neuron j and y_j^{(t)} is the target output for neuron j.

As there are no target outputs for hidden neurons, in Equation (4), the difference between the target and actual output of a hidden neuron j is replaced by the weighted sum of the δ_q terms already obtained for neurons q connected to the output of j. Thus, iteratively, beginning with the output layer, the δ term is computed for neurons in all layers and weight updates determined for all connections. The weight updating process can take place after the presentation of each training pattern (pattern-based training) or after the presentation of the whole set of training patterns (batch training). In either case, a training epoch is said to have been completed when all training patterns have been presented once to the MLP.

For all but the most trivial problems, several epochs are required for the MLP to be properly trained. A commonly adopted method to speed up the training is to add a "momentum" term to Equation (2), which effectively lets the previous weight change influence the new weight change, viz:

(5) \Delta w_{ji}(k+1) = \eta\,\delta_j x_i + \mu\,\Delta w_{ji}(k)

where Δw_ji(k+1) and Δw_ji(k) are the weight changes in epochs (k+1) and (k) respectively and μ is the "momentum" coefficient.
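
To make the update rule concrete, the following minimal Python sketch (an illustration only; the function and variable names are ours, not from the text) performs one pattern-based BP step with the momentum term of Equation (5) for a single-hidden-layer MLP with sigmoidal activations:

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def bp_step(x, t, W1, W2, dW1, dW2, eta=0.1, mu=0.8):
        """One pattern-based BP update for a 1-hidden-layer MLP."""
        h = sigmoid(W1 @ x)                      # hidden-layer outputs
        y = sigmoid(W2 @ h)                      # network outputs
        # Output-layer delta: f'(net) * (target - actual), Equation (3)
        delta2 = y * (1 - y) * (t - y)
        # Hidden-layer delta: f'(net) * weighted sum of output deltas, Equation (4)
        delta1 = h * (1 - h) * (W2.T @ delta2)
        # Weight changes with momentum, Equation (5)
        dW2 = eta * np.outer(delta2, h) + mu * dW2
        dW1 = eta * np.outer(delta1, x) + mu * dW1
        return W1 + dW1, W2 + dW2, dW1, dW2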

Page 78: Computational Intelligence: for Engineering and Manufacturing

ARTIFICIAL NEURAL NETWORKS 71

Another learning method suitable for training MLPs is the genetic algorithm (GA). This is an optimisation algorithm based on evolution principles. The weights of the connections are considered genes in a chromosome. The goodness or fitness of the chromosome is directly related to how well trained the MLP is. The algorithm starts with a randomly generated population of chromosomes and applies genetic operators to create new and fitter populations. The most common genetic operators are the selection, crossover and mutation operators. The selection operator chooses chromosomes from the current population for reproduction. Usually, a biased selection procedure is adopted which favours the fitter chromosomes. The crossover operator creates two new chromosomes from two existing chromosomes by cutting them at a random position and exchanging the parts following the cut. The mutation operator produces a new chromosome by randomly changing the genes of an existing chromosome. Together, these operators simulate a guided random search method which can eventually yield the optimum set of weights to minimise the differences between the actual and target outputs of the neural network. Further details of genetic algorithms can be found in the chapter on Soft Computing and its Applications in Engineering and Manufacture.
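
As an illustration of this guided random search, here is a minimal sketch of a GA over flattened weight vectors. It is a toy under our own assumptions (population size, mutation rate and the fitness callable are ours), not the procedure of any particular study:

    import numpy as np

    def evolve(fitness, n_weights, pop_size=50, generations=100,
               p_mut=0.05, rng=np.random.default_rng(0)):
        """fitness(chromosome) should return a larger value for better MLPs."""
        pop = rng.normal(0.0, 1.0, (pop_size, n_weights))   # random initial population
        for _ in range(generations):
            fit = np.array([fitness(c) for c in pop])
            # Biased (fitness-proportional) selection
            probs = fit - fit.min() + 1e-9
            probs /= probs.sum()
            parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
            # One-point crossover on consecutive pairs
            children = parents.copy()
            for i in range(0, pop_size - 1, 2):
                cut = rng.integers(1, n_weights)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
            # Mutation: randomly perturb a few genes
            mask = rng.random(children.shape) < p_mut
            children[mask] += rng.normal(0.0, 0.5, mask.sum())
            pop = children
        return pop[np.argmax([fitness(c) for c in pop])]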

2.2 Radial Basis Function (RBF) Network

Large multi-layer perceptron (MLP) networks take a long time to train. This has led to the construction of alternative networks such as the Radial Basis Function (RBF) network [Cichocki and Unbehauen, 1993; Hassoun, 1995; Haykin, 1999]. The RBF network is the most widely used network after the MLP. Figure 2 shows the structure of an RBF network, which consists of three layers. The input layer neurons receive the inputs x_1, ..., x_M. The hidden layer neurons provide a set of activation functions that constitute an arbitrary "basis" for the input patterns in the input space to be expanded into the hidden space by way of a non-linear transformation. At the input of each hidden neuron, the distance between the centre of each activation or basis function and the input vector is calculated. Applying the basis function to this distance produces the output of the hidden neuron. The RBF network output y is formed by the neuron in the output layer as a weighted sum of the hidden layer neuron activations.

Figure 2. The RBF network


Figure 3. The Radial Basis Function (standard deviation σ = 1)

The basis function is generally chosen to be a standard function which is positive at its centre x = 0 and then decreases uniformly to zero on either side, as shown in Figure 3. A common choice is the Gaussian distribution function:

(6) K(x) = \exp\left(-\frac{x^2}{2}\right)

This function can be shifted to an arbitrary centre, x = c, and stretched by varying its standard deviation σ as follows:

(7) K(\|x - c\|) = \exp\left(-\frac{\|x - c\|^2}{2\sigma^2}\right)

The output of the RBF network y is given by:

(8) y = \sum_{i=1}^{N} w_i\, K(\|x - c_i\|; \sigma_i) \quad \forall x

where w_i is the weight of hidden neuron i, c_i the centre of basis function i and σ_i the standard deviation of the function. ‖x − c_i‖ is the norm of (x − c_i). There are various ways to calculate the norm. The most common is the Euclidean norm, given by:

(9) \|x - c_i\| = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + \cdots + (x_M - c_{iM})^2}

This norm gives the distance between the two points x and c_i in M-dimensional space. All points x that are at the same radial distance from c_i give the same value for the norm and hence the same value for the basis function. Hence the basis functions are called Radial Basis Functions. Obtaining the values for w_i, c_i and σ_i requires training the RBF network. Because the basis functions are differentiable, back-propagation could be used as with MLP networks. Training of a multiple-input single-output RBF network can proceed as follows:

(i) choose the number N of hidden units;
There is no firm guidance available for this. The selection of N is normally made by trial and error. In general, the smallest N that gives the RBF network an acceptable performance is adopted.


(ii) choose the centres, c_i;
Centre selection could be performed in three different ways [Haykin, 1999]:
a) Trial and error: centres can be selected by trial and error. This is not always easy if little is known about the underlying functional behaviour of the data. Usually, the centres are spread evenly or randomly over the M-dimensional input space.
b) Self-organised selection: an adaptive unsupervised method can be used to learn where to place the centres.
c) Supervised selection: a supervised learning process, commonly error correction learning, can be deployed to fix the centres.

(iii) choose the stretch constants, σ_i;
Several heuristics are available. A popular way is to set σ_i equal to the distance to the nearest neighbour: first the distances between centres are computed, then the nearest distance is chosen as the value of σ_i.

(iv) calculate the weights, w_i.
When the c_i and σ_i are known, the outputs of the hidden units (O_1, ..., O_N)^T can be calculated for any pattern of inputs x = (x_1, ..., x_M). Assuming there are P input patterns x in the training set, there will be P sets of hidden unit outputs that can be calculated. These can be assembled in an N × P matrix:

(10) O = \begin{bmatrix} O_1^{(1)} & O_1^{(2)} & \cdots & O_1^{(P)} \\ O_2^{(1)} & O_2^{(2)} & \cdots & O_2^{(P)} \\ \vdots & \vdots & & \vdots \\ O_N^{(1)} & O_N^{(2)} & \cdots & O_N^{(P)} \end{bmatrix}

If the output y^{(i)} of the RBF network corresponding to training input pattern x^{(i)} is y^{(i)} = O_1^{(i)} w_1 + O_2^{(i)} w_2 + \cdots + O_N^{(i)} w_N, the following equation can be obtained:

(11) y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(P)} \end{bmatrix} = \begin{bmatrix} O_1^{(1)} & \cdots & O_N^{(1)} \\ \vdots & & \vdots \\ O_1^{(P)} & \cdots & O_N^{(P)} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = O^T \cdot w

y is the vector of actual outputs corresponding to the training inputs x. Ideally, y should be equal to d, the desired/target outputs. The unknown coefficients w_i can be chosen to minimise the sum-squared error of y compared with d. It can be shown that this is achieved when:

(12) w = (O\,O^T)^{-1}\, O\, d
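
The four training steps can be condensed into a short sketch. The Python code below is illustrative (random centre selection and the helper names are our assumptions); it builds the matrix O of Equation (10) and solves Equation (12) for the weights:

    import numpy as np

    def train_rbf(X, d, N, rng=np.random.default_rng(0)):
        """X: P x M training inputs, d: P targets, N: number of hidden units."""
        centres = X[rng.choice(len(X), N, replace=False)]            # step (ii)
        dists = np.linalg.norm(centres[:, None] - centres[None], axis=2)
        np.fill_diagonal(dists, np.inf)
        sigmas = dists.min(axis=1)                                   # step (iii): nearest neighbour
        # Hidden-unit outputs for all P patterns: the N x P matrix O of Eq. (10)
        O = np.exp(-np.linalg.norm(centres[:, None] - X[None], axis=2) ** 2
                   / (2 * sigmas[:, None] ** 2))
        # Least-squares weights, Equation (12): w = (O O^T)^(-1) O d
        w = np.linalg.solve(O @ O.T, O @ d)                          # step (iv)
        return centres, sigmas, w

    def rbf_output(x, centres, sigmas, w):
        phi = np.exp(-np.linalg.norm(centres - x, axis=1) ** 2 / (2 * sigmas ** 2))
        return w @ phi                                               # Equation (8)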


2.3 Learning Vector Quantization (LVQ) Network

Figure 4 shows an LVQ network which comprises three layers of neurons: an input buffer layer, a hidden layer and an output layer. The network is fully connected between the input and hidden layers and partially connected between the hidden and output layers, with each output neuron linked to a different cluster of hidden neurons. The weights of the connections between the hidden and output neurons are fixed to 1. The weights of the input-hidden neuron connections form the components of reference vectors (one reference vector is assigned to each hidden neuron). They are modified during the training of the network. Both the hidden neurons (also known as Kohonen neurons) and the output neurons have binary outputs. When an input pattern is supplied to the network, the hidden neuron whose reference vector is closest to the input pattern is said to win the competition for being activated and is thus allowed to produce a "1". All other hidden neurons are forced to produce a "0". The output neuron connected to the cluster of hidden neurons that contains the winning neuron also emits a "1" and all other output neurons a "0". The output neuron that produces a "1" gives the class of the input pattern, each output neuron being dedicated to a different class.

Figure 4. Learning Vector Quantization network

The simplest LVQ training procedure is as follows (a sketch implementing these steps appears after the list):
(i) initialise the weights of the reference vectors;
(ii) present a training input pattern to the network;
(iii) calculate the (Euclidean) distance between the input pattern and each reference vector;
(iv) update the weights of the reference vector that is closest to the input pattern, that is, the reference vector of the winning hidden neuron. If the latter belongs to the cluster connected to the output neuron of the class that the input pattern is known to belong to, the reference vector is brought closer to the input pattern. Otherwise, the reference vector is moved away from the input pattern;
(v) return to (ii) with a new training input pattern and repeat the procedure until all training patterns are correctly classified (or a stopping criterion is met).

For other LVQ training procedures, see for example [Pham and Oztemel, 1994].
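
As an illustration of steps (i)-(v), a minimal sketch follows; the array names and the fixed learning rate alpha are our assumptions:

    import numpy as np

    def train_lvq(X, labels, refs, ref_labels, alpha=0.1, epochs=20):
        """refs: initial reference vectors; ref_labels: class of each hidden neuron."""
        for _ in range(epochs):
            for x, c in zip(X, labels):
                # step (iii): winner = closest reference vector (Euclidean distance)
                j = np.argmin(np.linalg.norm(refs - x, axis=1))
                # step (iv): attract if the winner's class matches, otherwise repel
                if ref_labels[j] == c:
                    refs[j] += alpha * (x - refs[j])
                else:
                    refs[j] -= alpha * (x - refs[j])
        return refs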

2.4 CMAC Network

CMAC (Cerebellar Model Articulation Control) [Albus, 1975a, 1975b, 1979a, 1979b; An et al., 1994] can be considered a supervised feedforward neural network with the characteristics of a fuzzy associative memory. A basic CMAC module is shown in Figure 5.

CMAC consists of a series of mappings:

(13) S \xrightarrow{e} M \xrightarrow{f} A \xrightarrow{g} u

where

S = {input vectors}
M = {intermediate variables}
A = {association cell vectors}
u = output of CMAC ≡ h(S)
h ≡ g · f · e

(a) Input encoding (S → M mapping)

The S → M mapping is a set of submappings, one for each input variable:

(14) S \to M = \begin{bmatrix} s_1 \to m_1 \\ s_2 \to m_2 \\ \vdots \\ s_n \to m_n \end{bmatrix}

Figure 5. A basic CMAC module


The range of s_1 is coarsely discretised using the quantising functions q_1, q_2, ..., q_k. Each function divides the range into k intervals. The intervals produced by function q_{j+1} are offset by one kth of the range compared to their counterparts produced by function q_j. m_i is a set of k intervals generated by q_1 to q_k respectively.

An example is given in Figure 6 to illustrate the internal mappings within a CMAC module. The S → M mapping is shown in the leftmost part of the figure. In Figure 6, two input variables s_1 and s_2 are represented with unity resolution in the range of 0 to 8. The range of each input variable is described using three quantising functions. For example, the range of s_1 is described by functions q_1, q_2 and q_3. q_1 divides the range into intervals A, B, C and D; q_2 gives intervals E, F, G and H; and q_3 provides intervals I, J, K and L. That is,

q_1 = {A, B, C, D}
q_2 = {E, F, G, H}
q_3 = {I, J, K, L}

For every value of s_1, there exists a set of elements, m_1, which are the intersection of the functions q_1 to q_3, such that the value of s_1 uniquely defines set m_1 and vice versa. For example, value s_1 = 5 maps to set m_1 = {B, G, K} and vice versa. Similarly, value s_2 = 4 maps to set m_2 = {b, g, j} and vice versa.

The S → M mapping gives CMAC two advantages: the first is that a single precise variable s_i can be transmitted over several imprecise information channels. Each channel carries only a small part of the information of s_i. This increases the reliability of the information transmission. The other advantage is that small changes in the value of s_i have no influence on most of the elements in m_i. This leads to the property of input generalisation, which is important in an environment where random noise exists.

Figure 6. Internal mappings within a CMAC module


(b) Address computing (M → A mapping)

A is a set of address vectors associated with weight tables. A is obtained by combining the elements of m_i. For example, in Figure 6, the sets m_1 = {B, G, K} and m_2 = {b, g, j} are combined to give the set of elements A = {a_1, a_2, a_3} = {Bb, Gg, Kj}.

(c) Output computing (A → U mapping)

This mapping involves looking up the weight tables and adding the contents of the addressed locations to yield the output of the network. The following formula is employed:

(15) u = \sum_i w_i(a_i)

That is, only the weights associated with the addresses a_i in A are summed. For this given example, these weights are:

w(Bb) = x_1
w(Gg) = x_2
w(Kj) = x_3

Thus the output is:

(16) u = x_1 + x_2 + x_3

Training a CMAC module consists of adjusting the stored weights. Assuming that f is the function that CMAC has to learn, the following training steps could be adopted (a sketch is given after the list):

(i) select a point S in the input space and obtain the current output u corresponding to S;
(ii) let û be the desired output of CMAC, that is, û = f(S);
(iii) if |û − u| ≤ ε, where ε is an acceptable error, then do nothing; the desired value is already stored in CMAC. However, if |û − u| > ε, then add to every weight which contributed to u the quantity

(17) \Delta = \eta\,\frac{\hat{u} - u}{|A|}

where |A| is the number of weights which contributed to u and η is the learning rate.
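
A minimal sketch of the three CMAC mappings and the training rule of Equation (17) follows. The hashed weight tables and the uniform quantiser width are implementation assumptions of ours, not details from the text:

    from collections import defaultdict

    class CMAC:
        def __init__(self, k=3, width=3.0, eta=0.5):
            self.k, self.width, self.eta = k, width, eta
            self.tables = [defaultdict(float) for _ in range(k)]   # weight tables

    # S -> M -> A: quantiser q_(j+1) is offset by 1/k of an interval relative to q_j;
    # the tuple of interval indices is used directly as a table address.
        def addresses(self, s):
            return [tuple(int((si + j * self.width / self.k) // self.width)
                          for si in s) for j in range(self.k)]

        def output(self, s):
            # Equation (15): sum only the weights addressed by the input
            return sum(t[a] for t, a in zip(self.tables, self.addresses(s)))

        def train(self, s, u_desired):
            err = u_desired - self.output(s)
            # Equation (17): spread the correction over the |A| = k addressed weights
            for t, a in zip(self.tables, self.addresses(s)):
                t[a] += self.eta * err / self.k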

2.5 Group Method of Data Handling (GMDH) Network

Figure 7 shows a GMDH network and the details of one of its neurons. Unlike the feedforward neural networks previously described, which have a fixed structure,


Figure 7a. A trained GMDH network

Note: Each GMDH neuron is an N-Adaline, which is an Adaptive Linear Element with a nonlinear preprocessor

Figure 7b. Details of a GMDH neuron

a GMDH network has a structure which grows during training. Each neuron in a GMDH network usually has two inputs x_1 and x_2 and produces an output y that is a quadratic combination of these inputs, viz.

(18) y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2^2 + w_5 x_2

Training a GMDH network consists of configuring the network starting with the input layer, adjusting the weights of each neuron, and increasing the number of layers until the accuracy of the mapping achieved with the network deteriorates.


The number of neurons in the first layer depends on the number of external inputs available. For each pair of external inputs, one neuron is used.

Training proceeds with presenting an input pattern to the input layer and adapting the weights of each neuron according to a suitable learning algorithm, such as the delta rule (see for example [Pham and Liu, 1994]), viz.

(19) W_{k+1} = W_k + \eta\,\frac{X_k}{\|X_k\|^2}\left(y_k^d - W_k^T X_k\right)

where W_k, the weight vector of a neuron at time k, and X_k, the modified input vector to the neuron at time k, are defined as

(20) W_k = [w_0, w_1, w_2, w_3, w_4, w_5]^T

(21) X_k = [1, x_1, x_1^2, x_1 x_2, x_2^2, x_2]^T

and y_k^d is the desired network output at time k.

Note that, for this description, it is assumed that the GMDH network only has one output. Equation (19) shows that the desired network output is presented to each neuron in the input layer and an attempt is made to train each neuron to produce that output. When the sum of the mean square errors SE over all the desired outputs in the training data set for a given neuron reaches the minimum for that neuron, the weights of the neuron are frozen and its training halted. When the training has ended for all neurons in a layer, the training for the layer stops. Neurons that produce SE values below a given threshold when another set of data (known as the selection data set) is presented to the network are selected to grow the next layer. At each stage, the smallest SE value achieved for the selection data set is recorded. If the smallest SE value for the current layer is less than that for the previous layer (that is, the accuracy of the network is improving), a new layer is generated, the size of which depends on the number of neurons just selected. The training and selection processes are repeated until the SE value deteriorates. The best neuron in the immediately preceding layer is then taken as the output neuron for the network.
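
As an illustration, the sketch below trains a single GMDH neuron (N-Adaline) with the normalised delta rule of Equations (19)-(21); the function names are ours:

    import numpy as np

    def gmdh_features(x1, x2):
        # Nonlinear preprocessor, Equation (21)
        return np.array([1.0, x1, x1**2, x1 * x2, x2**2, x2])

    def train_neuron(pairs, targets, epochs=50, eta=0.5):
        W = np.zeros(6)                       # [w0 ... w5], Equation (20)
        for _ in range(epochs):
            for (x1, x2), yd in zip(pairs, targets):
                X = gmdh_features(x1, x2)
                # Equation (19): W <- W + eta * X/||X||^2 * (yd - W^T X)
                W += eta * X / (X @ X) * (yd - W @ X)
        return W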

2.6 Hopfield Network

Figure 8 shows one version of a Hopfield network. This network normally accepts binary and bipolar inputs (+1 or −1). It has a single "layer" of neurons, each connected to all the others, giving it a recurrent structure, as mentioned earlier. The training of a Hopfield network takes only one step, the weights w_ij of the network being assigned directly as follows:

(22) w_{ij} = \begin{cases} \frac{1}{N} \sum_{c=1}^{P} x_i^c x_j^c, & i \neq j \\ 0, & i = j \end{cases}

where w_ij is the connection weight from neuron i to neuron j, and x_i^c (which is either +1 or −1) is the ith component of the training input pattern for class c, P the number of classes and N the number of neurons (or the number of components in the input pattern).

Figure 8. A Hopfield network

Note from Equation (22) that w_ij = w_ji and w_ii = 0, a set of conditions that guarantee the stability of the network. When an unknown pattern is input to the network, its outputs are initially set equal to the components of the unknown pattern, viz.

(23) y_i(0) = x_i, \quad 1 \le i \le N

Starting with these initial values, the network iterates according to the following equation until it reaches a minimum energy state, i.e. its outputs stabilise to constant values:

(24) y_i(k+1) = f\left[\sum_{j=1}^{N} w_{ij}\, y_j(k)\right], \quad 1 \le i \le N

where f is a hard limiting function defined as

(25) f(x) = \begin{cases} -1, & x < 0 \\ 1, & x > 0 \end{cases}
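
A compact sketch of the one-step weight assignment of Equation (22) and the recall iteration of Equations (23)-(25) follows; the synchronous update and the resolution of ties at zero to +1 are assumptions of ours, since the text leaves them open:

    import numpy as np

    def store(patterns):
        """patterns: P x N array of +/-1 values, one row per class."""
        P, N = patterns.shape
        W = (patterns.T @ patterns) / N       # (1/N) * sum_c x_i^c x_j^c
        np.fill_diagonal(W, 0.0)              # w_ii = 0
        return W

    def recall(W, x, max_iters=100):
        y = x.copy()                          # Equation (23): y_i(0) = x_i
        for _ in range(max_iters):
            y_new = np.where(W @ y >= 0, 1, -1)   # synchronous form of Eqs. (24)-(25)
            if np.array_equal(y_new, y):      # outputs have stabilised
                break
            y = y_new
        return y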

2.7 Elman and Jordan Nets

Figures 9a and 9b show an Elman net and a Jordan net, respectively.

Figure 9a. An Elman network

Figure 9b. A Jordan network

These networks have a multi-layered structure similar to the structure of MLPs. In both nets, in addition to an ordinary hidden layer, there is another special hidden layer sometimes called the context or state layer. This layer receives feedback signals from the ordinary hidden layer (in the case of an Elman net) or from the output layer (in the case of a Jordan net). The Jordan net also has connections from each neuron in the context layer back to itself. With both nets, the outputs of neurons in the context layer are fed forward to the hidden layer. If only the forward connections are to be adapted and the feedback connections are preset to constant values, these networks can be considered ordinary feedforward networks and the BP algorithm used to train them. Otherwise, a GA could be employed [Pham and Karaboga, 1993; Karaboga, 1994]. For improved versions of the Elman and Jordan nets, see [Pham and Liu, 1992; Pham and Oh, 1992].
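
To make the role of the context layer concrete, here is a minimal sketch of one Elman forward step; the naming is ours and the feedback weights W_ctx are assumed preset to constant values, as described above:

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def elman_step(x, context, W_in, W_ctx, W_out):
        # Hidden units see the current input plus the fed-back context units
        h = sigmoid(W_in @ x + W_ctx @ context)
        y = sigmoid(W_out @ h)
        return y, h        # h becomes the context for the next time step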


2.8 Kohonen Network

A Kohonen network or self-organising feature map has two layers, an input buffer layer to receive the input pattern and an output layer (see Figure 10). Neurons in the output layer are usually arranged into a regular two-dimensional array. Each output neuron is connected to all input neurons. The weights of the connections form the components of the reference vector associated with the given output neuron.

Training a Kohonen network involves the following steps (a sketch implementing them is given below):
(i) initialise the reference vectors of all output neurons to small random values;
(ii) present a training input pattern;
(iii) determine the winning output neuron, i.e. the neuron whose reference vector is closest to the input pattern. The Euclidean distance between a reference vector and the input vector is usually adopted as the distance measure;
(iv) update the reference vector of the winning neuron and those of its neighbours. These reference vectors are brought closer to the input vector. The adjustment is greatest for the reference vector of the winning neuron and decreased for reference vectors of neurons further away. The size of the neighbourhood of a neuron is reduced as training proceeds until, towards the end of training, only the reference vector of a winning neuron is adjusted.

In a well-trained Kohonen network, output neurons that are close to one another have similar reference vectors. After training, a labelling procedure is adopted where input patterns of known classes are fed to the network and class labels are assigned to output neurons that are activated by those input patterns. As with the LVQ network, an output neuron is activated by an input pattern if it wins the competition against other output neurons, that is, if its reference vector is closest to the input pattern.

Figure 10. A Kohonen network
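
As an illustration of steps (i)-(iv), the sketch below trains a small self-organising map; the Gaussian neighbourhood function and the linear decay schedules are common choices we have assumed, not prescriptions from the text:

    import numpy as np

    def train_som(X, grid_shape=(10, 10), epochs=20, alpha0=0.5,
                  rng=np.random.default_rng(0)):
        rows, cols = grid_shape
        refs = rng.random((rows * cols, X.shape[1]))          # step (i)
        coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
        sigma0 = max(rows, cols) / 2.0
        for e in range(epochs):
            frac = e / epochs
            alpha = alpha0 * (1 - frac)                       # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3                # shrinking neighbourhood
            for x in X:                                       # step (ii)
                win = np.argmin(np.linalg.norm(refs - x, axis=1))   # step (iii)
                d2 = ((coords - coords[win]) ** 2).sum(axis=1)
                h = np.exp(-d2 / (2 * sigma ** 2))            # neighbourhood weighting
                refs += alpha * h[:, None] * (x - refs)       # step (iv)
        return refs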


2.9 ART Networks

There are different versions of the ART network. Figure 11 shows the ART-1 version for dealing with binary inputs. Later versions, such as ART-2, can also handle continuous-valued inputs.

ART-1

As illustrated in Figure 11, an ART-1 network has two layers, an input layer and an output layer. The two layers are fully interconnected; the connections are in both the forward (or bottom-up) direction and the feedback (or top-down) direction. The vector W_i of weights of the bottom-up connections to an output neuron i forms an exemplar of the class it represents. All the W_i vectors constitute the long-term memory of the network. They are employed to select the winning neuron, the latter again being the neuron whose W_i vector is most similar to the current input pattern. The vector V_i of the weights of the top-down connections from an output neuron i is used for vigilance testing, that is, determining whether an input pattern is sufficiently close to a stored exemplar. The vigilance vectors V_i form the short-term memory of the network. V_i and W_i are related in that W_i is a normalised copy of V_i, viz.

(26) W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}}

where ε is a small constant and V_ji is the jth component of V_i (i.e. the weight of the connection from output neuron i to input neuron j).

Figure 11. An ART-1 network


Training an ART-1 network occurs continuously when the network is in use and involves the following steps (a sketch implementing them is given at the end of this subsection):

(i) initialise the exemplar and vigilance vectors W_i and V_i for all output neurons, setting all the components of each V_i to 1 and computing W_i according to Equation (26). An output neuron with all its vigilance weights set to 1 is known as an uncommitted neuron in the sense that it is not assigned to represent any pattern classes;

(ii) present a new input pattern x;
(iii) enable all output neurons so that they can participate in the competition for activation;
(iv) find the winning output neuron among the competing neurons, i.e. the neuron for which x · W_i is largest; a winning neuron can be an uncommitted neuron, as is the case at the beginning of training or if there are no better output neurons;
(v) test whether the input pattern x is sufficiently similar to the vigilance vector V_i of the winning neuron. Similarity is measured by the fraction r of bits in x that are also in V_i, viz.

(27) r = \frac{x \cdot V_i}{\sum_i x_i}

x is deemed to be sufficiently similar to V_i if r is at least equal to the vigilance threshold ρ (0 < ρ ≤ 1);

(vi) go to step (vii) if r ≥ ρ (i.e. there is resonance); else disable the winning neuron temporarily from further competition and go to step (iv), repeating this procedure until there are no further enabled neurons;
(vii) adjust the vigilance vector V_i of the most recent winning neuron by logically ANDing it with x, thus deleting bits in V_i that are not also in x; compute the bottom-up exemplar vector W_i using the new V_i according to Equation (26); activate the winning output neuron;
(viii) go to step (ii).

The above training procedure ensures that if the same sequence of training patterns is repeatedly presented to the network, its long-term and short-term memories are unchanged (i.e. the network is stable). Also, provided there are sufficient output neurons to represent all the different classes, new patterns can always be learnt, as a new pattern can be assigned to an uncommitted output neuron if it does not match previously stored exemplars well (i.e. the network is plastic).
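
A minimal sketch of the ART-1 procedure follows; it assumes binary 0/1 input vectors, and the names rho (vigilance threshold) and eps (the small constant of Equation (26)) are ours:

    import numpy as np

    def art1(patterns, n_out, rho=0.8, eps=0.5):
        V = np.ones((n_out, patterns.shape[1]))              # step (i): vigilance vectors
        W = V / (eps + V.sum(axis=1, keepdims=True))         # Equation (26)
        classes = []
        for x in patterns:                                   # step (ii)
            enabled = np.ones(n_out, dtype=bool)             # step (iii)
            while True:
                scores = np.where(enabled, W @ x, -np.inf)
                i = int(np.argmax(scores))                   # step (iv): winner
                r = (x @ V[i]) / x.sum()                     # Equation (27)
                if r >= rho:                                 # step (vi): resonance
                    V[i] = np.logical_and(V[i], x)           # step (vii): AND with x
                    W[i] = V[i] / (eps + V[i].sum())
                    classes.append(i)
                    break
                enabled[i] = False                           # disable and search again
                if not enabled.any():
                    classes.append(-1)                       # memory capacity exhausted
                    break
        return classes, V, W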

ART-2

The architecture of an ART-2 network [Carpenter and Grossberg, 1987; Pham and Chan, 1998; 2001] is depicted in Figure 12. In this particular configuration, the "feature representation" field (F1) consists of 4 loops. An input pattern will be circulated in the lower two loops first. Inherent noise in the input pattern will be suppressed (this is controlled by the parameters a and b and the feedback function f(·)) and prominent features in it will be accentuated. Then the enhanced input pattern will be passed to the upper two F1 loops and will excite the neurons in the "category representation" field (F2) via the bottom-up weights. The "established class" neuron in F2 that receives the strongest stimulation will fire. This neuron will read out a "top-down expectation" in the form of a set of top-down weights sometimes referred to as class templates. This top-down expectation will be compared against the enhanced input pattern by the vigilance mechanism. If the vigilance test is passed, the top-down and bottom-up weights will be updated and, along with the enhanced input pattern, will circulate repeatedly in the two upper F1 loops until stability is achieved. The time taken by the network to reach a stable state depends on how close the input pattern is to passing the vigilance test. If it passes the test comfortably, i.e. the input pattern is quite similar to the top-down expectation, stability will be quick to achieve. Otherwise, more iterations are required. After the top-down and bottom-up weights have been updated, the current firing neuron will become an established class neuron. If the vigilance test fails, the current firing neuron will be disabled. Another search within the remaining established class neurons in the F2 layer will be conducted. If none of the established class neurons has a top-down expectation similar to the input pattern, an unoccupied F2 neuron will be assigned to classify the input pattern. This procedure repeats itself until either all the patterns are classified or the memory capacity of F2 has been exhausted.

The basic ART-2 training algorithm can be summarised as follows:
(i) initialising the top-down and bottom-up long-term memory traces;
(ii) presenting an input pattern from the training data set to the network;
(iii) triggering the neuron with the highest total input in the category representation field;
(iv) checking the match between the input pattern and the exemplar in the top-down filter (long-term memory) using a vigilance parameter;
(v) starting the learning process if the mismatch is within the tolerance level defined by the vigilance parameter and then going to step (viii); otherwise, moving to the next step;
(vi) disabling the current active neuron in the category representation field and returning to step (iii); go to step (vii) if all the established classes have been tried;
(vii) establishing a new class for the given input pattern;
(viii) repeating (ii) to (vii) until the network stabilises or a specified number of iterations is completed.

In the recall mode, only steps (ii), (iii), (iv) and (viii) will be utilised.

Dynamics of ART-2: The dynamics of the ART-2 network illustrated in Figure 12 is controlled by a set of mathematical equations. They are as follows:

(28) w'_i = I_i + a\,u'_i

(29) x'_i = \frac{w'_i}{\|W'\|}


Figure 12. Architecture of an ART-2 network

(30) v'_i = f(x'_i) + b\,f(q'_i)

(31) u'_i = \frac{v'_i}{\|V'\|}

(32) p'_i = u'_i

(33) q'_i = \frac{p'_i}{\|P'\|}

(34) w_i = q'_i

(35) x_i = \frac{w_i}{\|W\|}

(36) v_i = f(x_i) + b\,f(q_i)

(37) u_i = \frac{v_i}{\|V\|}


(38) p_i = u_i + \sum_j g(Y_j)\, z_{ji}

(39) q_i = \frac{p_i}{\|P\|}

The symbol ‖X‖ represents the L2 norm of the vector X. If X = (x_1, x_2, ..., x_n), then \|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}. The output of the jth neuron in the classification layer is denoted by g(Y_j). The L2 norm is used in the equations for the purpose of normalising the input data. The function f(·) used in Equations (30) and (36) is a non-linear function whose purpose is to suppress the noise in the input pattern down to a prescribed level. The definition of f(·) is

(40) f(x) = \begin{cases} 0 & \text{if } 0 \le x < \theta \\ x & \text{if } x \ge \theta \end{cases}

where θ is a user-defined parameter with a value between 0 and 1.

Learning Mechanism of ART-2: When an input pattern is applied to the ART-2 network, it will pass through the 4 loops comprising F1 and then stimulate the classification neurons in F2. The total excitation received by the jth neuron in the classification layer is equal to T_j, where

(41) T_j = \sum_i p_i\, z_{ij}

The neuron which is stimulated by the strongest total input signal will fire by generating an output with the constant value d. Therefore, for the winning neuron, g(Y_j) equals d. When a winning neuron is determined, all the other neurons will be prohibited from firing. The value d will be used to multiply the top-down expectation of the firing class before the top-down expectation pattern is read out for comparison in the vigilance test. When the winning neuron fires, all the other neurons are inhibited from firing, so it can be inferred that when there is a firing neuron (say j), Equation (38) becomes:

(42) p_i = u_i + d\,z_{ji}

Otherwise, if there is no winning neuron, it can be simplified to:

(43) p_i = u_i

The top-down expectation pattern is merged with the enhanced input pattern at point r_i before they enter the vigilance test (see Figure 12). r_i is defined by

(44) r_i = \frac{q'_i + c\,p_i}{\|Q'\| + \|cP\|}


The vigilance test is failed and the firing neuron will be reset if the following condition is true:

(45) \frac{\rho}{\|R\|} > 1

where ρ is the vigilance parameter.

On the other hand, if the vigilance test is passed (in other words, the current input pattern can be accepted as a member of the firing neuron), the top-down and the bottom-up weights are updated so that the special features present in the current input pattern can be incorporated into the class exemplar represented by the firing neuron. The updating equations are as follows:

(46) \frac{d}{dt}\, z_{ji} = d\left[p_i - z_{ji}\right]

(47) \frac{d}{dt}\, z_{ij} = d\left[p_i - z_{ij}\right]

The bottom-up weights are denoted by Z_ij and the top-down weights by Z_ji. According to the recommendations in [Carpenter and Grossberg, 1987], all the top-down weights should be initialised with the value 0 at the beginning of the learning process. This can be expressed by the following equation:

(48) Z_{ji}(0) = 0

This measure is designed to prevent a neuron from being reset when it is allocated to classify an input pattern for the first time. The bottom-up weights are initialised using the equation:

(49) Z_{ij}(0) = \frac{1}{(1-d)\sqrt{M}}

where M is the number of neurons in the input layer. This number is equal to the dimension of the input vectors. This arrangement ensures that, after all the neurons with top-down expectations similar to the input pattern have been searched, it would be easy for the input pattern to access a previously uncommitted neuron.

2.10 Spiking Neural Network

Experiments with biological neural systems have shown that they use the timing of electrical pulses or "spikes" to encode and transmit information. Spiking neural networks, also known as pulsed neural networks, are attempts at modelling the operation of biological neural systems more closely than is the case with other artificial neural networks.

An example of a spiking neural network is shown in Figure 13. Each connection between neurons i and j could contain multiple connections, each associated with a weight value and a delay [Natschläger and Ruf, 1998].


Figure 13. Spiking neural network topology showing a single connection composed of multiple weights w_{ij}^k with corresponding delays d_{ij}^k

Figure 14. Different shapes of response functions. a) Excitatory post-synaptic potential (EPSP) function; b) inhibitory post-synaptic potential (IPSP) function

Page 97: Computational Intelligence: for Engineering and Manufacturing

90 CHAPTER 3

In the leaky integrate-and-fire model proposed by Maass [Maass, 1997], a neuron is regarded as a homogeneous unit that generates spikes when the total excitation exceeds a threshold value.

Consider a network that consists of a finite set V of such spiking neurons, a set E ⊆ V × V of synapses, a weight w_{u,v} ≥ 0 and a response function ε_{u,v}: R⁺ → R for each synapse (u, v) ∈ E (where R⁺ := {x ∈ R : x ≥ 0}), and a threshold function Θ_v: R⁺ → R for each neuron v ∈ V. If F_u ⊆ R⁺ is the set of firing times of a neuron u, then the potential at the trigger zone of each neuron v at time t is given by:

(50) P_v(t) = \sum_{u:\,(u,v)\in E}\; \sum_{s\in F_u,\, s<t} w_{u,v}\, \varepsilon_{u,v}(t-s)

In the simplest model of a spiking neuron, a neuron v fires whenever its potential P_v(t) reaches a certain threshold Θ_v(t). This potential P_v(t) is the sum of the so-called excitatory post-synaptic potentials (EPSPs) and inhibitory post-synaptic potentials (IPSPs), which result from the firing of other neurons u that are connected through a synapse to neuron v. The firing of a presynaptic neuron u at time s contributes to the potential P_v(t) at time t an amount that is modelled by the term w_{u,v} ε_{u,v}(t − s), where w_{u,v} is the weight of the connection between neurons u and v, and ε_{u,v}(t − s) is the response function. Some biologically realistic shapes of the post-synaptic potentials (PSPs) are shown in Figure 14. The change in threshold as a function of time is illustrated in Figure 15.
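
As an illustration of Equation (50), the sketch below evaluates the potential P_v(t) from recorded presynaptic firing times; the alpha-shaped EPSP kernel is a common modelling assumption of ours, not mandated by the text:

    import math

    def epsp(t, tau=5.0):
        """Alpha-function post-synaptic response; zero for t <= 0."""
        return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0

    def potential(t, firing_times, weights, kernel=epsp):
        """firing_times: {u: [s1, s2, ...]}; weights: {u: w_uv} for synapses (u, v)."""
        return sum(weights[u] * kernel(t - s)
                   for u, spikes in firing_times.items()
                   for s in spikes if s < t)

    # The neuron v fires when potential(t, ...) reaches the threshold Theta_v(t).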

Figure 15. Firing threshold of a neuron

Learning can be achieved in spiking neural networks as in traditional neural networks for tasks such as classification, pattern recognition and function approximation [Iannella and Back, 2001; Bohte et al., 2002a; 2002b]. Different learning algorithms have been proposed for the training of spiking neural networks [Maass and Bishop, 1998; Gerstner and Kistler, 2002].

A supervised learning algorithm has been proposed [Bohte et al., 2002b], where it is shown that a feedforward network of spiking neurons can be trained for classification tasks by means of error backpropagation. Unsupervised learning can also be achieved for spiking neural networks [Bohte et al., 2002a], with self-organisation as for Radial Basis Function (RBF) networks.

3. SUMMARY

This chapter has presented the main types of existing neural networks and has described examples of each type. For an overview of the different systems engineering applications of these neural networks, see for example the chapter on Soft Computing and its Applications in Engineering and Manufacture.

4. ACKNOWLEDGEMENTS

This work was carried out within the ALFA project "Novel Intelligent Automation and Control Systems 11" (NIACS 11), the ERDF (Objective One) projects "Innovation in Manufacturing Centre" (IMC), "Innovative Technologies for Effective Enterprises" (ITEE) and "Supporting Innovative Product Engineering and Responsive Manufacturing" (SUPERMAN), and within the project "Innovative Production Machines and Systems" (I*PROMS).

REFERENCES

Albus J S, (1975a), "A new approach to manipulator control: cerebellar model articulation control (CMAC)", Trans. ASME, J. of Dynamic Syst., Meas. and Contr., 97, 220-227.

Albus J S, (1975b), "Data storage in the cerebellar model articulation controller (CMAC)", Trans. ASME, J. of Dynamic Syst., Meas. and Contr., 97, 228-233.

Albus J S, (1979a), "A model of the brain for robot control", Byte, 54-95.

Albus J S, (1979b), "Mechanisms of planning and problem solving in the brain", Math. Biosci., 45, 247-293.

An P E, Brown M, Harris C J, Lawrence A J and Moore C J, (1994), "Associative memory neural networks: adaptive modelling theory, software implementations and graphical user", Engng. Appli. Artif. Intell., 7 (1), 1-21.

Bohte S M, La Poutre H and Kok J N, (2002a), "Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks", IEEE Trans. on Neural Networks, 13 (2), 415-425.

Bohte S M, La Poutre H and Kok J N, (2002b), "Error-backpropagation in temporally encoded networks of spiking neurons", Neurocomputing, 17-37.

Broomhead D S and Lowe D, (1988), "Multivariable functional interpolation and adaptive networks", Complex Systems, 2, 321-355.

Carpenter G A and Grossberg S, (1987), "ART2: Self-organisation of stable category recognition codes for analog input patterns", Appl. Optics, 26 (23), 4919-4930.

Carpenter G A and Grossberg S, (1988), "The ART of adaptive pattern recognition by a self-organising neural network", Computer, 77-88.

Cichocki A and Unbehauen R, (1993), Neural Networks for Optimisation and Signal Processing, Chichester: Wiley.

Elman J L, (1990), "Finding structure in time", Cognitive Science, 14, 179-211.

Gerstner W and Kistler W M, (2002), Spiking Neuron Models: Single Neurons, Populations and Plasticity, Cambridge University Press, UK.

Goldberg D, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Reading, MA: Addison-Wesley.

Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.

Haykin S, (1999), Neural Networks: A Comprehensive Foundation, 2nd Edition, Upper Saddle River, NJ: Prentice Hall.

Hecht-Nielsen R, (1990), Neurocomputing, Reading, MA: Addison-Wesley.

Holland J H, (1975), Adaptation in Natural and Artificial Systems, Ann Arbor, MI: University of Michigan Press.

Hopfield J J, (1982), "Neural networks and physical systems with emergent collective computational abilities", Proc. National Academy of Sciences, 79, 2554-2558.

Iannella N and Back A D, (2001), "Spiking neural network architecture for nonlinear function approximation", Neural Networks, Special Issue, 14 (6), 922-931.

Jordan M I, (1986), "Attractor dynamics and parallelism in a connectionist sequential machine", Proc. 8th Annual Conf. of the Cognitive Science Society, 531-546.

Karaboga D, (1994), Design of Fuzzy Logic Controllers Using Genetic Algorithms, PhD thesis, University of Wales, Cardiff, UK.

Kohonen T, (1989), Self-Organisation and Associative Memory (3rd ed.), Berlin: Springer-Verlag.

Maass W, (1997), "Networks of spiking neurons: The third generation of neural network models", Neural Networks, 10, 1659-1671.

Maass W and Bishop C M, (1998), Pulsed Neural Networks, Cambridge: MIT Press.

Moody J and Darken C J, (1989), "Fast learning in networks of locally-tuned processing units", Neural Computation, 1 (2), 281-294.

Natschläger T and Ruf B, (1998), "Spatial and temporal pattern analysis via spiking neurons", Network: Computation in Neural Systems, 9 (3), 319-332.

Pham D T and Chan A J, (1998), "Control chart pattern recognition using a new type of self-organising neural network", Proc. of the Institution of Mechanical Engineers, 212 (Part I), 115-127.

Pham D T and Chan A J, (2001), "Unsupervised adaptive resonance theory neural networks for control chart pattern recognition", Proc. of the Institution of Mechanical Engineers, 215 (Part B), 59-67.

Pham D T and Karaboga D, (1993), "Dynamic system identification using recurrent neural networks and genetic algorithms", Proc. 9th Int. Conf. on Mathematical and Computer Modelling, San Francisco.

Pham D T and Liu X, (1992), "Dynamic system modelling using partially recurrent neural networks", Journal of Systems Engineering, 2 (2), 90-97.

Pham D T and Liu X, (1994), "Modelling and prediction using GMDH networks of Adalines with nonlinear preprocessors", Int. J. Systems Science, 25 (11), 1743-1759.

Pham D T and Oh S J, (1992), "A recurrent backpropagation neural network for dynamic system identification", Journal of Systems Engineering, 2 (4), 213-223.

Pham D T and Oztemel E, (1994), "Control chart pattern recognition using learning vector quantization networks", Int. J. Production Research, 32 (3), 721-729.

Rumelhart D and McClelland J, (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volumes 1 and 2, Cambridge: MIT Press.

Widrow B and Hoff M E, (1960), "Adaptive switching circuits", Proc. 1960 IRE WESCON Convention Record, Part 4, IRE, New York, 96-104.


CHAPTER 4

APPLICATION OF NEURAL NETWORKS

D. ANDINA¹, A. VEGA-CORONA², J. I. SEIJAS³, M. J. ALARCÓN¹

¹ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]
² Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato (UG), Salamanca, Gto., México. [email protected]
³ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. [email protected]

Abstract: This chapter examines which factors should be considered when deciding whether a Neural Network (NN) solution is suitable for a given problem. This is followed by a detailed example of a successful and useful application: a Neural Binary Detector.

INTRODUCTION

There are three main criteria which need to be applied when deciding whether a given problem lends itself to a neural solution.

a) Unknown Algorithm. The solution to the problem cannot be explicitly described by an algorithm, a set of equations or a set of rules. Making the decision between a conventional and a neural computing solution on the basis of this criterion is not always entirely clear. There are problems where both conventional and neural computing may be able to provide appropriate solutions. The choice then depends on the resources available and the ultimate goals of the designer.

b) Conviction of the existence of a solution. There must exist some evidence that an input-output mapping exists between a set of input variables x and corresponding output data y, such as y = f(x). The form of f, however, is not known. For example, let us suppose that we have input data patterns and corresponding desired output patterns. We intend to use a supervised algorithm to train a NN that will establish high-order (non-linear) correlations between the input features x and the output data y in order to minimize the output error. But, to train the NN, the fact that a database with a set of input and output pairs can be assembled does not necessarily mean that a mapping can be constructed between the input variables and the desired data. Also, no matter how good the algorithm is, efficient results will not be achieved unless relevant features are used as the input to the network.

c) Availability of data. There should be a large amount of data available, i.e. many different examples with which to train the network. If there are any doubts about their availability then, in all probability, the NN will not be efficient in solving the problem. Lack of suitable data is one of the main causes of problems during neural computing applications. Such data have to be collected and compiled in a computer-readable form. The designer of the NN application must take into account that there may be considerable practical problems associated with collecting and processing data; special instrumentation and recording facilities may be required, and specific experiments may be needed to ensure that the data cover the necessary range of conditions. For example, if a NN is to be applied to detect signals in noisy environments, data corresponding to all expected values of noise and signal should be used to train the network.

1. FEASIBILITY STUDY

If a given application satisfies the three main criteria stated above, the designer has to choose the type of NN to apply. A thorough literature search for references to related work should be carried out at this design stage. Nowadays, neural computing has a large applications literature and it is highly likely that there are papers on applications similar to the one being considered.

Once the literature search has been carried out, an outline design should be prepared as early as possible in the feasibility study stage. This may only be a paper exercise, but it could extend to the collection of a sample of data and even some prototype work. Especially relevant are the consideration of pre-processing the input, to extract and select the features that characterize the problem, and a preliminary assessment of the required neural network architecture. If the prototype work is successful, and the designer has sufficient training data available, the design of the NN can start, a task that involves a great deal of empirical research, as will be shown in the following section.

2. APPLICATION OF NNs TO BINARY DETECTION

It is expected that neural hardware will provide crucial tools for the enhancement and development of algorithms applied to automatic classification. In this sense, some relevant studies of the possibilities of neural nets [Kim and Guest, 1990; Decatur, 1992; Roth, 1990] have selected four principal areas of application: automatic target recognition, speech recognition, seismic signal processing and sonar signal processing.

The suitability of Neural Networks (NNs) as detectors is based on their capability to incorporate a great quantity of information from several classes, their ability to generalize from noisy or incomplete information, and their robustness to ambiguities. Another advantage of NNs is the computational simplicity of their nodes. Although they are computationally intensive on general-purpose computers, they can be implemented in specific massively parallel processing hardware that can outperform implementations on the most powerful computers.

Neural networks can have interesting robustness properties when applied as binary detectors. This type of network has proved its ability in classification problems, and the binary detection problem can be reduced to deciding whether an input has to be classified as one of two outputs, 0 or 1. However, NN detectors have some typical drawbacks: slow and unpredictable training, and some difficulty in adding new information to the net, which requires too much time for retraining. These problems become more critical as the size of the net increases.

This application example deals with the possibilities of applying a Multilayer Perceptron (MLP) Neural Network to binary detection. After an optimization process, the performance of the proposed neural detector is compared with the optimal one, whose performance is given by the Optimum Neyman-Pearson detector, commonly used in radar detection applications.

The detector input is modeled as a signal with additive noise (given by its complex envelope). The binary output is 1 or 0. As the MLP output can accurately estimate the a posteriori probabilities of the input classes [Ruck et al., 1990], a threshold T is established at the NN output to satisfy the Neyman-Pearson requirements [Van Trees, 1968]. The performance is evaluated by Monte-Carlo simulations over several input models. Their Receiver Operating Characteristics (ROC) and detection curves are obtained.

The main topics of designing the network are:
(a) The network structure design. Variations in performance are analyzed for different numbers of inputs and hidden layers, and different numbers of nodes in these hidden layers.
(b) The network training. Using the BackPropagation (BP) algorithm with a momentum term, we study the training parameters: initial set of weights, threshold value for training, momentum value, and the training method. The preparation of the training set, whose key parameter was found after preliminary experimental results to be the Training Signal-to-Noise Ratio (TSNR), is discussed. Also, it is shown how to find the TSNR value that maximizes the detection probability (P_d) for a given false alarm probability (P_fa), as required by the Neyman-Pearson criterion.
(c) The dependence of the training and the performance on the criterion function to be minimized by the BP algorithm. The following criterion functions are analyzed: Least Mean Square (LMS) Error [Hush and Horne, 1993], Minimum Misclassification Error (MME) [Telfer and Szu, 1994; Telfer and Szu, 1992] and the El-Jaroudi and Makhoul (JM) criterion [El-Jaroudi and Makhoul, 1990].
(d) The results (training time, threshold values, error probability curves, ROC and detection curves) are analyzed, and appropriate conclusions are extracted.


3. THE NEURAL DETECTOR

The detector under consideration is a modified envelope detector [Andina and Sanz-González, 1995; Root, 1970], as shown in Figure 1. The binary detection problem is reduced to deciding whether an input complex value (the complex envelope involving signal and noise) has to be classified as one of two outputs, 0 or 1. The need to process complex signals with an all-real-coefficient NN requires splitting the input into its real and imaginary parts (the number of inputs doubles the number of integrated pulses); then, a threshold T is established at the NN output.

The input r(t) is a band-pass signal, and the complex envelope x(t) = x_C(t) + j x_S(t) is sampled every T_0 seconds. Then

(1) x(kT_0) = x_C(kT_0) + j\,x_S(kT_0), \quad k = 1, \ldots, M \quad (j = \sqrt{-1})

At the neural network output, values in [0, 1] are obtained. A threshold value T ∈ (0, 1) is chosen so that output values in [0, T) will be considered as binary output 0 (decision D_0) and values in [T, 1] will represent 1 (decision D_1).

The two hypotheses H_0 (target absent) and H_1 (target present) are defined as follows:

(2a) H_0:\; x(kT_0) = n(kT_0)

(2b) H_1:\; x(kT_0) = S(kT_0) \cdot e^{j\theta(kT_0)} + n(kT_0)

where T_0 is the pulse repetition period, k varies from 1 to the number of integrated pulses (M), S(kT_0) is the signal amplitude sequence, θ is the signal phase and n(kT_0) is the complex envelope of the noise sequence, i.e. n(kT_0) = n_C(kT_0) + j n_S(kT_0).

3.1 The Multi-Layer Perceptron (MLP)

The NNs have clear advantages over classical detectors as nonparametric detectors. In this case, the statistical description of the input is not available. The only information available in the design process is the performance of the detector on a group of patterns called training patterns. For this task, the BP algorithm carries out an implicit histogram of the input distribution, adapting freely to the distributions of each class; thus, NNs contribute a new level of sophistication to the classical techniques of nonparametric detection.

Figure 1. The Neural Detector

For detection purposes, the Multi-Layer Perceptron (MLP) trained with the BackPropagation (BP) algorithm has been found more powerful than other types of NNs [Kim and Guest, 1990]. While other types of NNs can learn topological maps using lateral inhibition and Hebbian learning, an MLP trained with BP can also discover topological maps. Furthermore, the MLP can be superior even to parametric Bayesian detectors when the input distribution departs from the assumptions.

3.2 The MLP Structure

Since mathematical methods to calculate the dimensions of the net are not yet available, one main question is: "How should the net be designed in order to obtain the desired decision regions in a reasonable amount of time?" Generally, each problem requires different capacities of the net and it is not clear that a general rule to calculate its size will be found. If the net is too small, it may not be capable of forming a good model of the problem; if it is too large, it could implement several solutions, and many of them would probably be suboptimal.

For the majority of problems, only one hidden layer has been shown to be necessary, and this seems to be the case for binary detection. After a thorough study of different MLP structures, which included growing and pruning of the NN, no performance improvements were found by adding more than one hidden layer, which has the disadvantage of critically increasing the training time. Thus, the structure chosen is an MLP with one hidden layer, and we write 2M/N/1 for an MLP with 2M input nodes, N hidden-layer nodes and one output node.

Also, there is no way of establishing a priori the number of nodes in the hidden layer. For realizing the input-output relation exactly, it has been demonstrated that an upper bound for the number of nodes in the hidden layer of an MLP is the number of training patterns [Hush and Horne, 1993]. But for practical purposes, the number of hidden nodes should be much lower than the number of training patterns; otherwise, the net simply "memorizes" the training set, losing its generalization capability. In general, the size of the net has to be determined by means of a trial-and-error procedure. Therefore, to choose the size of the hidden layer, empirical curves such as the one presented in Figure 2 have been used [Andina, 1995].

The parameter Training Signal-to-Noise Ratio (TSNR) is the signal-to-noise ratio used to generate the training set patterns, and is one of the key design parameters of the NN. If $\sigma^2$ is the noise power and A is the received signal amplitude (Marcum model [Marcum, 1960]), the Signal-to-Noise Ratio (SNR) is defined as

(3) $\mathrm{SNR} = \dfrac{A^2}{\sigma^2}$

More details will be given later.


Figure 2. Error probability (Pe, %) vs. number of nodes (N) in the hidden layer of an MLP, for Training Signal-to-Noise Ratios (TSNR) of 0, 3, 6 and 12 dB

In Figure 2, each curve presents a knee where the most effective trade-off between complexity and performance is found. After a thorough study of the values of M, N and TSNR, the structure 16/8/1 has been chosen as the most efficient one. In Section 2 of this chapter, it will be shown that there is a range of the number of integrated pulses M (half the number of inputs) where the NN works efficiently.

4. THE TRAINING ALGORITHM

4.1 The BackPropagation (BP) Algorithm

Even if the size of the net has been precisely determined, finding adequate weights is a difficult problem. The BackPropagation (BP) algorithm modifies the output-layer weights during training as (see Appendix A, Section 1)

(4) $w_{ij}^{(L)}(t+1) = w_{ij}^{(L)}(t) - \eta\, \dfrac{\partial E(W(t))}{\partial w_{ij}^{(L)}(t)}$

where $w_{ij}^{(L)}$ is the weight connecting the output of the i-th node of layer L−1 to the j-th node in the output layer L, t is the iteration counter, η is the learning rate and E(W) is the criterion or error function, which depends on the weight matrix W.


In order to improve the learning time, it is common to include a momentum term within the basic BP algorithm. This term is $\alpha\,\big(w_{ij}^{(l)}(t) - w_{ij}^{(l)}(t-1)\big)$, where 0 < α < 1 and $w_{ij}^{(l)}(t)$ is the weight that connects the i-th output of layer l−1 to the j-th node in layer l. By including this term, the current search direction becomes an exponentially weighted average of past directions, helping to keep the weights moving across flat portions of the error surface after they have descended from the steep portions. In the case of our binary detector, α = 0.8 has been used. This value has been chosen from empirical results [Andina, 1995].
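As an illustration, one weight update with momentum can be sketched as follows (a minimal Python sketch; the function and array names are ours, not those of the original implementation):

import numpy as np

def update_with_momentum(w, w_prev, grad, eta, alpha=0.8):
    # w(t+1) = w(t) - eta * dE/dw + alpha * (w(t) - w(t-1));
    # alpha = 0.8 is the empirical momentum value used for the binary detector
    return w - eta * grad + alpha * (w - w_prev)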

The learning dynamic used is cross-validation, a method that monitors the generalization capability of the net. This method requires splitting the learning data into two sets: a training set, employed to modify the net weights, and a test set, used to measure its generalization capability. During training, the performance of the net on the training set will continue improving, but its performance on the test set will improve only up to a point, beyond which it will begin to degrade. It is at this point that the net begins to be overtrained (it becomes excessively specialized on the training set), losing generalization capability; consequently, the training should be finished. In practice, the training has been stopped after a typical number of iterations, choosing the net with the smallest error probability (Pe) over the test set. Although there is no guarantee that an absolute minimum has been reached, for the optimized network the smallest Pe is typically achieved in the order of 1,500 iterations.
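A minimal sketch of this stopping rule, where train_epoch and error_probability are hypothetical helpers standing in for the BP update and the Pe measurement (both our assumptions, not the original code), could look like this:

def train_with_early_stopping(weights, train_epoch, error_probability,
                              training_set, test_set, max_iterations=10000):
    # train_epoch and error_probability are hypothetical callables supplied by the caller
    best_pe, best_weights = 1.0, weights
    for _ in range(max_iterations):
        weights = train_epoch(weights, training_set)   # BP updates on the training set
        pe = error_probability(weights, test_set)      # Pe measured on the test set
        if pe < best_pe:                               # keep the net with the smallest Pe
            best_pe, best_weights = pe, weights
    return best_weights, best_pe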

The Least-Mean-Squares (LMS) criterion is the most widely used for training a Multi-Layer Perceptron (MLP). However, depending on the application, there is no reason to think that this criterion is the optimal one [Barnard and Casasent, 1989]. With the purpose of using an adequate criterion function for our detector, we have analyzed the following criterion functions (see also Appendix A, Section 2):
(a) Least-Mean-Squares (LMS). This criterion is the most widely used in the Back-Propagation (BP) learning algorithm. It has been proved that it approximates the Bayes optimal discriminant function and yields a least-squares estimate of the a posteriori probability of the class given the input [Ruck et al., 1990]. It minimizes the expression

(5) $E_{LMS} = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{2}\, (y_p - \hat{y}_p)^2$

where P is the number of training pairs, p is the training-pair counter, $\hat{y}_p \in (0,1)$ is the net output (the neural detector has only one output, as mentioned in Section 3), and $y_p$ is the desired output, 0 or 1 for the binary case.

One of the LMS drawbacks is that a least-squares estimate of the probability can lead to gross errors at low probabilities [El-Jaroudi and Makhoul, 1990].

(b) Minimum Misclassification Error (MME). This criterion minimizes the number of misclassified training samples. It approximates class boundaries directly from the criterion function, and it could perform better than LMS for lower-complexity


networks [Telfer and Szu, 1992; Telfer and Szu, 1994]. In our detector, it minimizes the expression

(6) $E_{MME} = \frac{1}{P}\left\{ P - \sum_{p=1}^{P} f\!\big( (2y_p - 1)\, W_p^T Y_p^{(L-1)} \big) \right\}$

where $Y_p^{(L-1)}$ is the output vector of layer L−1, $W_p^T$ is the weight vector of the output layer L, and f(·) is the node activation function (see Appendix A, Section 3).

(c) El-Jaroudi and Makhoul (JM) criterion. It is similar to the Kullback-Leibler information measure and results in superior probability estimates compared to least squares [El-Jaroudi and Makhoul, 1990]. For the binary detection case it minimizes

(7) $E_{JM} = -\frac{1}{P} \sum_{p=1}^{P} \ln\big( 1 - |y_p - \hat{y}_p| \big)$

Details about how this criterion has been implemented are given in Appendix A, Section 4.
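For reference, the three criterion functions above can be sketched numerically as follows (a Python sketch under the notation of this section, where y holds the desired outputs in {0, 1} and y_hat the net outputs in (0, 1); for MME, z is assumed to hold the output-layer products $(2y_p-1)\,W_p^T Y_p^{(L-1)}$ already divided by the sign factor, and f the logistic activation — both our assumptions):

import numpy as np

def e_lms(y, y_hat):
    # Equation (5): average of 0.5 * (y_p - y_hat_p)^2
    return np.mean(0.5 * (y - y_hat) ** 2)

def e_jm(y, y_hat):
    # Equation (7): minus the average of ln(1 - |y_p - y_hat_p|)
    return -np.mean(np.log(1.0 - np.abs(y - y_hat)))

def e_mme(y, z, f=lambda v: 1.0 / (1.0 + np.exp(-v))):
    # Equation (6): 1 - average of f((2*y_p - 1) * W_p^T Y_p), with z = W_p^T Y_p
    return 1.0 - np.mean(f((2.0 * y - 1.0) * z))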

4.2 The Training Sets

The training set has been formed by an equal number of signal-plus-noise patterns and noise-only patterns (i.e., P(H0) = P(H1) during the training), presented alternately, so the desired output varies from D0 (detecting noise) to D1 (detecting target) at each iteration. Other ways of presenting the training pairs, such as in a random sequence, are also suitable.

The pattern sets are classical input models for radar detection [Marcum, 1960; Swerling, 1960]. The net inputs are samples of the complex envelope of the signal. There are NNs with complex coefficients [Kim and Guest, 1990] to process complex signals, but it seems more convenient to sample the "in-phase" and "quadrature" components, obtaining an all-real-coefficient net. This gives generality to the study, because the same NN can be used with a different preprocessing [Andina, 1995].

There are no methods that indicate the exact number of training patterns to be used. Assuming that the test and training patterns have the same distribution, limits have been found on the number of training patterns necessary to achieve a given error probability: this number is approximately the number of weights divided by the desired error. If we used this upper limit, the number of patterns necessary for training would become prohibitive. After an empirical study, an upper limit of 2,000 training patterns has proved to be sufficient for any criterion function or target model.

The use of a model for the laboratory experiments is, in part, unavoidable. The acquisition of real patterns representative of the environment is expensive,


difficult and sometimes even impossible. We must not overestimate the value of real data, and it is really difficult to obtain data under all propagation conditions. Learning from a model during the construction of the system lets the designer train the MLP easily. The robustness of the system may then be sufficient to achieve quasi-optimal results (see Section 5.2) over a real input distribution; or, if there is time for on-line training, the NN can then be refined.

5. COMPUTER RESULTS

In order to make the resulting network independent of the initial weight values, each one has been trained four times, with random initial weight values in (−0.1, 0.1), selecting as the final network the NN that provides the smallest probability of error. The training threshold has been set to 0.5.

The signal input model in the following figures corresponds to the Marcum model [Marcum, 1960], that is

(8) $x(kT_0) = A\, e^{j\theta_0} + n(kT_0)$

where

A: signal amplitude of constant value.
$\theta_0$: initial phase, constant for each input pattern and uniformly distributed in [0, 2π) between patterns.
$n(kT_0)$: complex white Gaussian noise of zero mean and variance $\sigma^2$ in each component.

The signal-to-noise ratio for training or testing is defined as

(9) $\mathrm{TSNR\ or\ SNR} = \dfrac{A^2}{\sigma^2}$

The parameters for the study are the following:

Training signal-to-noise ratio: TSNR
Input signal-to-noise ratio: SNR
Structure: N0/N1/1 (N0 inputs / N1 nodes in the hidden layer / 1 output)

Let us call the probability of detection, Pd, the probability that the detector decides D1 (target present) under the hypothesis H1 (target present), Pd = Pr(D1|H1), and the false alarm probability, Pfa, the probability of deciding D1 under H0 (target absent), Pfa = Pr(D1|H0). The performance of the detector is measured by detection curves (Pd vs. SNR for a fixed Pfa) and ROC curves (Receiver Operating Characteristic curves, Pd vs. Pfa for a fixed SNR).
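These two probabilities can be estimated empirically by Monte Carlo counting over patterns generated under each hypothesis; a sketch, with a hypothetical mlp_output function standing in for the trained detector, might be:

import numpy as np

def estimate_pd_pfa(mlp_output, patterns_H1, patterns_H0, T=0.5):
    # Pd = Pr(D1 | H1) and Pfa = Pr(D1 | H0), with D1 decided when output > T
    out_H1 = np.array([mlp_output(x) for x in patterns_H1])
    out_H0 = np.array([mlp_output(x) for x in patterns_H0])
    return np.mean(out_H1 > T), np.mean(out_H0 > T)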


5.1 The Criterion Function

In this section we present significant results of the comparison of criterion functions.

5.1.1 Convergence of the algorithm

The best results are those of the JM criterion, followed by LMS. The MME requires, in general, a higher number of training iterations than the others. In Figure 3, the results for the JM and LMS criteria are presented for a Training Signal-to-Noise Ratio (TSNR) of 13 dB, showing that significant convergence improvements are obtained by using JM instead of LMS. This conclusion does not depend on the value of TSNR.

Another general result is that raising the TSNR decreases the number of training iterations, as the decision regions to be separated by the MLP become more distinct.

5.1.2 Detection curves

First, we compare the detection characteristics of the networks for a fixed value of Pfa. In Figure 4, we present the best results for each criterion (these results depend on TSNR [Andina et al., 1995]) for two Pfa values: 10⁻² and 10⁻³. As can be seen, the performance differences become clearer as the design conditions become more restrictive (i.e. lower values of the false alarm probability, Pfa). The results show that the JM criterion is the best for our detection problem. These results support the idea suggested in [El-Jaroudi and Makhoul, 1990] that the estimation of the a posteriori probabilities carried out by the JM criterion is more accurate than that of LMS.

The error surface for the JM criterion is also more appropriate for the gradient search (faster convergence of learning).

The value of TSNR that achieves the best performance is 13 dB for the JM criterion and 6 dB for LMS. The MME criterion presents the worst characteristics (see also Figure 5). This criterion does not adapt to the application, because it minimizes the classification error for a training threshold different from the threshold T used to achieve the design value of Pfa.

Figure 3. Error probability (Pe) vs. training iterations (0 to 10,000) for TSNR = 13 dB: (a) JM criterion, (b) LMS criterion


Figure 4. Detection probability (Pd) vs. Signal-to-Noise Ratio (SNR, −2 to 12 dB) for an MLP trained with different criterion functions and Training Signal-to-Noise Ratios (TSNR). LMS6dB means LMS criterion and TSNR = 6 dB, MME6dB means MME criterion and TSNR = 6 dB, and so on. False alarm probability Pfa = (a) 0.01, (b) 0.001


Figure 5. Detection probability (Pd) vs. false alarm probability (Pfa) for an MLP trained with all the criteria under study. SNR = 6 dB

Unless a way can be found to estimate the value of T before the training, the network will be forced to suboptimal performance in the direct operating mode, degrading its results.

5.1.3 Receiver Operating Characteristic (ROC) curves

Now the performance of the networks, as Pd vs. Pfa for a fixed SNR, is presented. In Figure 5, it can be observed that JM provides the best results. The MME criterion presents the worst characteristics, as in the case of the detection curves.

5.2 Robustness Under Different Target Models

For more complex target models, such as the classical Swerling I (SWI) target model [Swerling, 1960], where S in Equation (2b) has a Rayleigh distribution, the results are also very near those of the optimal detector (the Neyman-Pearson detector), as shown in Figure 6.

In Figure 6, an interesting phenomenon is observed. Here, for the purpose of evaluating the robustness of the network, a net trained with the simple Marcum model is evaluated on the SWI case. The network is not only capable of generalizing and obtaining good results over the more complicated input model, but its results are even superior to those of the network trained over the same SWI distribution. This result suggests that the neural network may achieve better results when trained over simple hypotheses and then left to generalize over more complicated ones. This characteristic may be very useful in real cases, where input distributions will generally differ from those assumed at the design stage.


Figure 6. Detection probability (Pd) vs. Signal-to-Noise Ratio (SNR, −10 to 15 dB) for an MLP of structure 16/8/1 (16 inputs, 8 nodes in the hidden layer and one output) and TSNR = 6, 13 and 15 dB, compared with the optimum detector. Pfa = 0.01. (a) Swerling I for training and testing; (b) net trained with the Marcum model and tested over the Swerling I target model

APPENDIX: ON BACKPROPAGATION AND THE CRITERION FUNCTIONS

1. THE BACKPROPAGATION ALGORITHM

The BackPropagation algorithm modifies the weight value of input i in node j, $w_{ij}$, following the expression

(A.1) $w_{ij}^{(l)}(t+1) = w_{ij}^{(l)}(t) - \eta\, \Delta_{ij}^{(l)}(t)$

where

l is the layer counter, l = 0, ..., L: l = 1 is the first hidden layer and l = L is the output layer.
t is the iteration counter.
η is the learning rate.
Δ is the gradient of the criterion function, E(W).

Let us define the error term for each node as

(A.2) $\delta_j^{(l)} = -\frac{\partial E(W)}{\partial y_j^{(l)}}\, f_j^{(l)\prime} = -\frac{\partial E(W)}{\partial w_{ij}^{(l)}} \cdot \frac{1}{y_i^{(l-1)}}$

where $y_j^{(l)}$ is the actual output of node j in layer l, $f_j^{(l)\prime}$ is the derivative of the activation function of the same node, and

(A.3) $W_j^{(l)} = \big[ w_{0j}^{(l)},\; w_{1j}^{(l)},\; \dots,\; w_{N_{l-1}\,j}^{(l)} \big]$

Then we have

(A.4) $\Delta_{ij}^{(l)} = \frac{\partial E(W)}{\partial w_{ij}^{(l)}} = -\delta_j^{(l)}\, y_i^{(l-1)}$


Applying the chain rule, the error gradient for each node can also be expressed as a function of the error terms in the previous layers, and Equation (A.1) can be expressed as

(A.5) $w_{ij}^{(l)}(t+1) = \begin{cases} w_{ij}^{(L)}(t) + \eta\, y_i^{(L-1)}(t)\, \delta_j^{(L)}(t), & l = L \\ w_{ij}^{(l)}(t) + \eta\, y_i^{(l-1)}(t)\, \delta_j^{(l)}(t), & l = L-1, L-2, \dots, 1 \end{cases}$

where $0 \leq i \leq N_{l-1} - 1$ and $0 \leq j \leq N_l - 1$, $N_l$ being the number of nodes in layer l, and

(A.6) $\delta_j^{(l)}(t) = f_j^{(l)\prime} \sum_{n=0}^{N_{l+1}-1} \delta_n^{(l+1)}(t)\, w_{jn}^{(l+1)}(t)$

Now, to adapt the algorithm to each criterion, it is only necessary to calculate the value of $\delta_j^{(L)}$ and apply Equation (A.5) (bias values, always set to "1", must be added to properly train the network).
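As a compact illustration, the output-layer error term derived in the following sections for each criterion (Equations (A.9), (A.12) and (A.19)) can be written in a few lines of Python (a sketch with our own names: y is the desired output, y_hat the actual output and f_prime the derivative of the output activation):

import numpy as np

def delta_L(criterion, y, y_hat, f_prime):
    if criterion == "LMS":    # Equation (A.9)
        return (y - y_hat) * f_prime
    if criterion == "MME":    # Equation (A.12)
        return (2.0 * y - 1.0) * f_prime
    if criterion == "JM":     # Equation (A.19)
        return np.sign(y - y_hat) / (1.0 - np.abs(y - y_hat)) * f_prime
    raise ValueError("unknown criterion: " + criterion)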

2. LEAST MEAN SQUARES (LMS)

The criterion function to minimize is the mean square error between the actual output $\hat{y}$ and the desired output y over the training set, i.e.

(A.7) $E_{LMS} = \frac{1}{P} \sum_{p=1}^{P} \sum_{j=1}^{N_s} \frac{1}{2}\, (y_{j,p} - \hat{y}_{j,p})^2$

where P is the number of training pairs and $N_s$ the number of output nodes. Approximating Equation (A.7) by its value over one pattern,

(A.8) $\varepsilon_{LMS} = \sum_{j=1}^{N_s} \frac{1}{2}\, (y_j - \hat{y}_j)^2$

Since L is the number of layers of the MLP, $\hat{y}_j = y_j^{(L)}$. For the binary case, $N_s = 1$, and from Equations (A.8) and (A.2)

(A.9) $\frac{\partial E(W)}{\partial \hat{y}^{(L)}} = -(y - \hat{y}^{(L)}) \;\Rightarrow\; \delta^{(L)} = (y - \hat{y}^{(L)}) \cdot f^{(L)\prime}$

3. MINIMUM MISCLASSIFICATION ERROR (MME)

This criterion minimizes the classification error on the training set and is proposed in [Telfer and Szu, 1994; Telfer and Szu, 1992]. The criterion function, when the network has one output node whose output values are in (0, 1) and whenever the final values of the components of the weight matrix are large in modulus (which usually happens [Telfer and Szu, 1994; Telfer and Szu, 1992]), is

(A.10) $E_{MME} = \frac{1}{P} \left\{ P - \sum_{p=1}^{P} f\!\big( (2y_p - 1)\, W^T Y_p^{(L-1)} \big) \right\}$


where f(·) is the node activation function. Minimizing Equation (A.10) in the step-by-step training is equivalent to minimizing

(A.11) $\varepsilon_{MME}(W) = 1 - f\big( (2y - 1)\, W^T Y^{(L-1)} \big)$

For the binary case,

(A.12) $\delta^{(L)} = (2y - 1) \cdot f^{(L)\prime}$

4. THE JM CRITERION

This criterion is similar to the Kullback-Leibler information measure, and it minimizes

(A.13) $H_{Q/P} = \sum_i P(i) \cdot \ln \frac{P(i)}{Q(i)}$

where $P(i) = y_i = \Pr(H_i|X)$ is the a posteriori probability of hypothesis $H_i$ (i = 0, 1) and $Q(i) = \hat{y}_i$ is the network estimate of $\Pr(H_i|X)$. When there are enough training vectors, minimizing Equation (A.13) is equivalent to minimizing

(A.14) $E_{JM} = -\sum_i \int p(x) \cdot \ln \hat{y}_i(x|H_i, W)\, dx$

where x is the input vector. For the binary case,

(A.15) $E_{JM} = -\frac{1}{P} \left[ \sum_{X_p \in H_0} \ln \hat{y}_0(x|H_0, W) + \sum_{X_p \in H_1} \ln \hat{y}_1(x|H_1, W) \right]$

where P is the number of input patterns. As there is just one output node with values in (0, 1), $\Pr(H_1|X) = \hat{y}$ and $\Pr(H_0|X) = 1 - \hat{y}$, and Equation (A.15) can be written as

(A.16) $E_{JM} = -\frac{1}{P} \left[ \sum_{X_p \in H_0} \ln\big( 1 - \hat{y}(x|H_0, W) \big) + \sum_{X_p \in H_1} \ln \hat{y}(x|H_1, W) \right]$

After some simple manipulations, Equation (A.16) can be written as a function of the output $\hat{y}$

(A.17) $E_{JM} = -\frac{1}{P} \sum_{p=1}^{P} \ln\big( 1 - |y_p - \hat{y}_p| \big)$

and, for each iteration step

(A.18) $\varepsilon_{JM}(W) = -\ln\big( 1 - |y - \hat{y}(W)| \big)$

Finally, we have

(A.19) $\delta^{(L)} = \frac{\mathrm{sgn}(y - \hat{y})}{1 - |y - \hat{y}|} \cdot f^{(L)\prime}$

where sgn(·) is the sign function.


REFERENCES

M.S. Kim, C.C. Guest, Modification of backpropagation networks for complex-valued signal processing in frequency domain, Proc. Int. Joint Conf. Neural Networks, IJCNN, Vol. III, pp. 27–31, San Diego, June 1990.
S.E. Decatur, Application of neural networks to terrain classification, Proc. Int. Conf. Neural Networks, pp. 283–288, 1989.
N. Miller, M.W. McKenna, T.C. Lau, Office of Naval Research Contributions to Neural Networks and Signal Processing in Oceanic Engineering, IEEE Journal of Oceanic Engineering, vol. 17, no. 4, Oct. 1992.
M.W. Roth, Survey of neural network technology for automatic target recognition, IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 28–43, Mar. 1990.
D.W. Ruck, S.K. Rogers, M. Kabrisky, M.E. Oxley, B.W. Suter, The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function, IEEE Trans. Neural Networks, vol. 1, no. 4, pp. 296–298, Dec. 1990.
H.L. Van Trees, Detection, Estimation and Modulation Theory, Part I, Wiley and Sons, New York, 1968.
D.R. Hush, B.G. Horne, Progress in Supervised Neural Networks. What's new since Lippmann?, IEEE Signal Processing Magazine, pp. 8–51, Jan. 1993.
B.A. Telfer, H.H. Szu, Implementing the Minimum-Misclassification-Error Energy Function for Target Recognition, Neural Networks, vol. 7, no. 5, pp. 809–818, 1994.
B.A. Telfer, H.H. Szu, Energy Functions for Minimizing Misclassification Error With Minimum-Complexity Networks, Proc. Int. Joint Conf. Neural Networks, IJCNN, vol. IV, pp. 214–219, 1992.
A. El-Jaroudi, J. Makhoul, A New Error Criterion For Posterior Probability Estimation With Neural Nets, Proc. Int. Joint Conf. Neural Networks, IJCNN, vol. I, pp. 185–192, 1990.
D. Andina, J.L. Sanz-González, On the problem of Binary Detection with Neural Networks, Proc. 38th Midwest Symposium on Circuits and Systems, Rio de Janeiro, Brazil, vol. I, pp. 554–557, Aug. 1995.
W.L. Root, An Introduction to the Theory of the Detection of Signals in Noise, Proc. of the IEEE, vol. 58, pp. 610–622, May 1970.
D. Andina, Optimización de Detectores Neuronales: Aplicación a Radar y Sonar, Ph.D. Dissertation (in Spanish), ETSIT-M, Polytechnic University of Madrid, Spain, Dec. 1995.
J.L. Marcum, A Statistical Theory of Target Detection by Pulsed Radar, IRE Trans. on Information Theory, vol. IT-6, no. 2, pp. 59–144, Apr. 1960.
E. Barnard, D. Casasent, A Comparison Between Criterion Functions for Linear Classifiers, with Application to Neural Nets, IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1030–1040, Oct. 1989.
P. Swerling, Probability of detection for fluctuating targets, IRE Trans. on Information Theory, vol. IT-6, no. 2, pp. 269–308, Apr. 1960.
D. Andina, J.L. Sanz-González, J.A. Jiménez-Pajares, A Comparison of Criterion Functions for a Neural Network Applied to Binary Detection, Proc. Int. Conf. Neural Networks, ICNN, Perth, Australia, Vol. I, pp. 329–333, Nov. 1995.


CHAPTER 5

RADIAL BASIS FUNCTION NETWORKS AND THEIR APPLICATION IN COMMUNICATION SYSTEMS

ASCENSIÓN GALLARDO ANTOLÍN 1, JUAN PASCUAL GARCÍA 2, JOSÉ LUIS SANCHO GÓMEZ 3

1 Departamento de Teoría de la Señal y Comunicaciones, EPS-Universidad Carlos III de Madrid, Avda. de la Universidad, 30, 28911 Leganés (Madrid), Spain. [email protected]
2 Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Campus de la Muralla del Mar, s/n, 30202 Cartagena (Murcia), Spain. [email protected]
3 Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Campus de la Muralla del Mar, s/n, 30202 Cartagena (Murcia), Spain. [email protected]

Abstract: Among the different types of Neural Networks (NN), the most popular and frequently used architectures are the Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks, due to their approximation capabilities. In this chapter we discuss the use of RBF networks to solve problems in different areas related to communication systems. In the first part of the chapter, we review the structure of RBF networks and the main procedures to train them. In the second part, the main applications of RBF networks in communication systems are presented and described. In particular, we focus our attention on antenna array signal processing (direction-of-arrival estimation and beamforming) and channel equalization (intersymbol interference and co-channel interference). Other applications such as coding/decoding, system identification, fault detection in access networks and automatic recognition of wireless standards are also mentioned

Keywords: Neural networks, radial basis functions, communication systems, channel equalization, antenna array signal processing

1. RADIAL BASIS FUNCTION NETWORKS

Radial Basis Function (RBF) networks are one type of layered feedforward neural network (NN) capable of approximating any continuous function to a certain precision level. The operation of RBF networks can be viewed as a function approximation that consists of a multidimensional curve-fitting problem.


The function approximation is realized by means of a first nonlinear mapping from the input space to a high-dimensional hidden space and a second linear mapping from the hidden space to the final output space. This operation is used in complex pattern classification problems because once the input space has been mapped nonlinearly into a high-dimensional space, the patterns are more easily separable by means of a linear transformation. The support for this operation is given by Cover's theorem on the separability of patterns [Cover, 1965]. According to this theorem, a complex pattern-classification problem is more likely to be linearly separable if a nonlinear cast into a high-dimensional space is done. The aforementioned mapping used in the pattern classification problem can be viewed as a surface construction that can also be used in interpolation problems.

This approach to RBF network design was first taken in [Broomhead and Lowe, 1988]. In RBF networks, the function that interpolates the data is of the form [Powell, 1988]:

(1) $F(x) = \sum_{i=1}^{N} w_i\, \varphi(\|x - c_i\|)$

where φ(·) is the radial basis function, ‖·‖ is the Euclidean norm, N is the number of training data pairs, and $c_i$ is the center of the i-th radial basis function. In this approximation, there are as many centers as inputs, their values being the same as the input vectors, i.e., $c_i = x_i$. During the training stage, a known set of input-output data pairs is delivered to the RBF network to select the centers and compute the output-layer weights. The function F has to satisfy the interpolation condition $F(x_i) = d_i$, where $d_i$ is the desired training output value. The method applied to compute the output-layer weights is easier to explain if the RBF network operation is expressed in matrix form:

(2) $d = \Phi w$

where $d = [d_1, d_2, \dots, d_N]^T$ is the desired output vector, $\Phi$ is an N-by-N matrix with elements $\Phi_{ji} = \varphi(\|x_j - c_i\|)$, j, i = 1, 2, ..., N, and $w = [w_1, w_2, \dots, w_N]^T$ is the weight vector. This weight vector is computed as follows:

(3) $w = \Phi^{-1} d$

where $\Phi^{-1}$ is the inverse of the matrix $\Phi$. Micchelli's theorem [Micchelli, 1986] gives the conditions under which the RBF

network produces an output surface or function F that passes through all the training points. To measure the generalization capability of the trained network, i.e., its behavior with patterns that have not been used in the training phase, new points are delivered to the RBF network. The strict interpolation surface usually implies a poor generalization capability, especially when the number of training points is high. The overfitting produced by strict interpolation is undesirable in most cases because the new RBF network outputs, different from the training outputs, will not


be correct. Due to the noise present in the training data or the lack of data in some regions of the input-output space, the construction of the interpolation surface is an ill-posed problem. Regularization theory was proposed by Poggio and Girosi to solve the ill-posed surface reconstruction problem [Poggio and Girosi, 1990]. It can be proved that this regularization network is a universal approximator and that the approximation performed is optimal. Because regularization network training is a very computationally demanding task, a suboptimal solution was proposed in [Poggio and Girosi, 1990]. This proposal allows different RBFs, so that the function approximation is realized by:

(4) $F(x) = \sum_{i=1}^{m} w_i\, \varphi_i(\|x - x_i\|)$

where the number of radial basis functions is m ≤ N. Typical examples of radial basis functions are inverse multiquadric functions and Gaussian functions.

In [Park and Sandberg, 1991], the universal approximation theorem for RBF networks is proved. This theoretical support allows the design of RBF networks with the capability of approximating any continuous function when an appropriate set of parameters is selected.

2. ARCHITECTURE

An RBF network consists of two layers, as seen in Figure 1. The input level (not considered as a layer) is responsible for data acquisition. The first layer is called the hidden layer because it lies between the input and the output layer; it performs a nonlinear transformation by means of the radial basis functions. Finally, the output layer produces the output of the network by means of a linear transformation.

Figure 1 shows how each input datum is delivered to every neuron (RBF) in the hidden layer. A bias term is applied by multiplying a constant function by its linear weight. Every radial basis function output is multiplied by the corresponding linear weight, and all are summed up to produce the RBF network output.

Figure 1. Radial Basis Function network structure: the inputs x_1, x_2, ..., x_n feed the radial basis functions φ of the hidden layer plus a constant unit (bias weight w_0 = b); the outputs, weighted by w_1, ..., w_m, are summed in the output layer to produce F(x)


The architecture depicted in Figure 1 refers to a one-dimensional output space. It is straightforward to generalize the model to the case of a multidimensional output space. Among all possible radial basis functions, one of the most used is the Gaussian function. Thus:

(5) $\varphi_i(x) = \exp\!\left( -\frac{1}{2\sigma_i^2}\, \|x - c_i\|^2 \right)$

where $c_i$ is the radial basis center, $\sigma_i^2$ is the variance of the Gaussian function, and x is the input vector. The variances are usually chosen to be common to all the Gaussian functions, although other alternatives have been proposed; one of them is described in the next section. The Gaussian radial basis functions considered above can be generalized to allow for arbitrary covariance matrices $\Sigma_i$. Thus, the basis functions take the form:

(6) $\varphi_i(x) = \exp\!\left( -\frac{1}{2}\, (x - c_i)^T \Sigma_i^{-1} (x - c_i) \right)$

It is sometimes useful to consider the covariance matrices equal and introduce a weighted norm of the form:

(7) $\frac{1}{2}\, \Sigma^{-1} = C^T C$

(8) $C = \mathrm{diag}\!\left( \frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \dots, \frac{1}{\sigma_m} \right)$

where C is the norm-weighting matrix and $\Sigma^{-1}$ is the inverse covariance matrix. If this weighting matrix is diagonal with different values along the diagonal, then the Gaussian function will have different variances along the different dimensions of the input space.
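A minimal sketch of the forward pass of the network in Figure 1, for Gaussian basis functions with a common variance (numpy code with illustrative names; not any particular published implementation):

import numpy as np

def rbf_forward(x, centers, weights, bias, sigma):
    # Equations (1) and (5): Gaussian activations followed by a linear output layer
    dist2 = np.sum((centers - x) ** 2, axis=1)    # squared Euclidean norms ||x - c_i||^2
    phi = np.exp(-dist2 / (2.0 * sigma ** 2))     # Gaussian radial basis outputs
    return bias + phi @ weights                   # F(x) = b + sum_i w_i * phi_i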

3. TRAINING ALGORITHMS

The design of an RBF network is composed of two phases: the training process and the testing phase. During the first, the network learns from a well-known set of input-output data pairs called the training set. After the training process is completed, the generalization capability of the trained network is checked in the testing phase. To do this, a set of input-output data pairs different from those used in the training phase is delivered to the RBF network to measure its performance.

The aim of a training algorithm is to choose the centers, the variances (or the covariance matrix if norm weighting is used) and the linear weights of the output layer. The most important issue in RBF network design is center selection. Usually, training strategies combine supervised and unsupervised approaches. Supervised algorithms take into account the desired output values to compute the weights. An unsupervised algorithm only uses the input data to calculate the radial basis centers and variances. Following this explanation, four different training algorithms are presented.


3.1 Fixed Centers Selected at Random

This unsupervised strategy consists of selecting the centers of the Gaussian radial basis functions randomly from the input data of the training set. Each center, $c_k$, is equal to one training input pattern. The variance of each Gaussian function is calculated as:

(9) $\sigma = \dfrac{d_{\max}}{\sqrt{2m}}$

where $d_{\max}$ is the maximum distance between the selected centers and m is the number of centers [Haykin, 1999]. Choosing too peaked or too flat Gaussian functions should be avoided, since extreme variance values produce bad performance. The output-layer weights can be computed with the pseudo-inverse procedure. The RBF network operation can be expressed in matrix form as:

(10) $\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} \varphi(x_1, c_1) & \varphi(x_1, c_2) & \cdots & \varphi(x_1, c_m) \\ \varphi(x_2, c_1) & \varphi(x_2, c_2) & \cdots & \varphi(x_2, c_m) \\ \vdots & & & \vdots \\ \varphi(x_N, c_1) & \varphi(x_N, c_2) & \cdots & \varphi(x_N, c_m) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix}$

(11) $d = \Phi w$

where d is the vector of desired training outputs, w is the weight vector, and $\Phi$ is a non-square matrix whose components are the outputs of the radial basis functions evaluated at the training points.

The weights are calculated as:

(12) $w = \Phi^{+} d$

(13) $\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$

where $\Phi^{+}$ is the pseudo-inverse of the $\Phi$ matrix. If the training set is large, the pseudo-inverse calculation requires extensive computation; nevertheless, many efficient algorithms exist to compute the pseudo-inverse matrix [Golub and Van Loan, 1996]. Another problem arises when $\Phi^T \Phi$ is singular or nearly singular. In this case, the direct solution given by Equation (12) can lead to numerical difficulties. In practice, such problems are best resolved by using the technique of Singular Value Decomposition (SVD) to find a solution for the weights [Golub and Van Loan, 1996].
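The whole strategy fits in a few lines; the following sketch (our code, with illustrative names) uses numpy's least-squares solver, which internally follows the SVD route recommended above, instead of forming the pseudo-inverse explicitly:

import numpy as np

def train_fixed_random_centers(X, d, m, rng=np.random.default_rng(0)):
    # X: training inputs (N, n); d: desired outputs (N,); m: number of centers
    centers = X[rng.choice(len(X), size=m, replace=False)]       # centers picked at random
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = d_max / np.sqrt(2.0 * m)                             # Equation (9)
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-dist2 / (2.0 * sigma ** 2))                    # matrix of Equation (10)
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)                  # Equations (12)-(13)
    return centers, sigma, w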

3.2 Self-Organized Selection of Centers

Fixed centers selected at random is a simple and fast training strategy, but it usually leads to poor performance or large RBF networks. Moreover, if the centers are chosen close together, ill-conditioning is produced in the pseudo-inverse problem. The training algorithms included in the self-organized selection of centers category


try to avoid the previous problems by means of clustering procedures. In this section, two algorithms are explained: k-means clustering and self-organizing map clustering. The RBF network training is completed with the last-layer weight estimation. Procedures such as the pseudo-inverse or a gradient descent algorithm may be used for this last part of the training. The most frequently used gradient descent algorithm is the Widrow-Hoff or Least Mean-Square (LMS) rule, due to its simplicity, its ability to operate satisfactorily in an unknown environment, and its ability to track variations of the input statistics [Haykin, 2002].

3.2.1 K-means clustering algorithm

This learning strategy allocates the centers in the regions of the input space where the relevant data are present. The algorithm begins with the random selection of m centers. At each iteration of the algorithm, the training input pattern x is assigned to the k-th center if $\|x - c_k\| < \|x - c_j\|$ for j = 1, 2, ..., m and j ≠ k. The new center is computed as $c_k = \frac{1}{N_k} \sum_{x \in S_k} x$, where $N_k$ is the number of input patterns that belong to the cluster $S_k$, defined as:

(14) $S_k = \{ x : \|x - c_k\| < \|x - c_j\| \text{ for } j = 1, 2, \dots, m \text{ and } j \neq k \}$

The procedure stops when there is no variation in the position of the centers. After the centers have been calculated, several ways of calculating the variance of the Gaussian functions may be used. Normally, the variance of each Gaussian function is the mean of the distances to a certain group of neighboring centers or input patterns; for example, we may select as the variance value of the k-th Gaussian function the mean of the distances to the p nearest centers, $\sigma_k = \frac{1}{p} \sum_{i=1}^{p} l_i$, where $l_i$ is the distance between the k-th center and the i-th nearest center and $l_1 \leq \dots \leq l_i \leq \dots \leq l_m$. Usually, the Euclidean norm is used to compute the distances. Another alternative for selecting $\sigma_k$ consists of computing the distance between the current center and the nearest center belonging to another class and establishing $\sigma_k = \lambda\, \min(\|c_j - c_k\|)$, with j = 1, 2, ..., m, j ≠ k, and λ a scale factor. The $\sigma_k$ can also be estimated making use of the mean of the input patterns belonging to the cluster $S_k$.
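A compact sketch of the k-means center selection together with the p-nearest-centers variance heuristic (illustrative numpy code, not the implementation of any of the cited works):

import numpy as np

def kmeans_centers(X, m, iters=100, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for each pattern (Equation (14))
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(m)])
        if np.allclose(new, centers):
            break                          # no variation in the center positions
        centers = new
    return centers

def variances_from_neighbors(centers, p=2):
    # sigma_k = mean distance from center k to its p nearest other centers
    D = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    return np.sort(D, axis=1)[:, 1:p + 1].mean(axis=1)   # skip the zero self-distance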

3.2.2 Self-Organizing map clustering: SOM

The Self-Organizing Map is a method to classify the input patterns into groups or clusters, each characterized by a particular center. This iterative algorithm, developed by T. Kohonen, can be divided into two steps [Kohonen, 1990]. First, an input training pattern is randomly selected and assigned to the closest ("winner") cluster; this is called the competitive phase. In a second step, the center of the winner cluster, together with a predefined neighborhood, is updated so that it is moved towards the input vector. This is called the updating phase.

The initial positions of the centers are initialized to random values. All the centers have to be different, and it is recommendable that the centers take small Euclidean norms. In the neural network structure, a neighborhood $N_c$ is defined around each


neuron. At the beginning of the algorithm, because of the random initialization of the centers, the centers of the neurons belonging to each neighborhood set $N_c$ can be located at distant positions in the input space. The aim of the algorithm is therefore to place the centers of the corresponding neighborhood-set neurons at nearby positions in the input space.

During the i-th iteration of the competitive phase, all the Euclidean distances between the given input pattern and the cluster centers are calculated. The winner center, $c_k$, is the one with the minimum distance, i.e., $k(x(i)) = \arg\min_j \|x(i) - c_j(i)\|$ for j = 1, 2, ..., m. During the i-th iteration of the updating phase, the centers belonging to the neighborhood $N_c$ of the winner center $c_k$ are adjusted, moving towards the input; the rest of the centers are not updated. Thus, the updating equations are:

(15) $c_k(i+1) = \begin{cases} c_k(i) + \eta(i)\,\big(x - c_k(i)\big), & \text{if } k \in N_c \\ c_k(i), & \text{if } k \notin N_c \end{cases}$

The parameter η(i) is an adaptation rate that may take values between 0 and 1. It is related to the gain used in stochastic approximation processes; as in those methods, it should decrease with time. Once the centers are calculated according to Equation (15), the final structure is such that similar input vectors activate the same output (center or neuron), while different vectors activate different neurons.
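A sketch of one SOM iteration implementing Equation (15) (numpy; the neighborhood is represented here simply as a function returning a set of center indices, an illustrative simplification):

import numpy as np

def som_step(x, centers, neighborhood_of, eta):
    # competitive phase: find the winner center with minimum Euclidean distance
    k = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    # updating phase: move the winner's neighborhood towards the input, Equation (15)
    for j in neighborhood_of(k):
        centers[j] += eta * (x - centers[j])
    return centers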

3.3 Supervised Selection of Centers

One approach to a supervised selection of centers is to apply a gradient descent algorithm to a network error function. The supervised selection of centers tries to avoid the inherent problems found in the previous training strategies. Heuristic methods such as selecting the centers at random or k-means clustering usually lead to large networks for a given error, or do not generalize well. The time to converge to a solution is higher in supervised selection than in the algorithms previously explained. The supervised procedure can also be extended to the calculation of the Gaussian variances and the output-layer weights. In the most general form of the supervised training algorithm, the gradient descent procedure is applied to the evaluation of the covariance matrices $\Sigma_k^{-1}$. Once the error function has been defined, the relevant gradients of the error function with respect to the corresponding parameters are calculated. The change at every iteration is proportional to the evaluated gradient. One possible error function is the Sum of Squared Errors (SSE), defined as:

(16) $E = \frac{1}{2} \sum_{j=1}^{N} e_j^2$


where N is the number of training patterns and $e_j$ is the error committed by the RBF network when one pattern is presented. This error can be written as:

(17) $e_j = d_j - \sum_{i=1}^{m} w_i\, \varphi(\|x_j - c_i\|)$

The following equations implement the updating step of this supervised training method for the case of a one-dimensional output space [Haykin, 1999]. It is straightforward to generalize them to the general case of a multidimensional output space. The initial values of the centers, covariance matrices and weights are chosen to be in a useful region of the parameter space. This initial constraint is imposed to reduce the probability of getting stuck in a local minimum.

The updating equations for the k-th radial basis center in the j-th iteration are calculated as follows:

(18) $\frac{\partial E(j)}{\partial c_k(j)} = 2\, w_k(j) \sum_{i=1}^{N} e_i(j)\, \varphi'\big(\|x_i - c_k(j)\|\big)\, \Sigma_k^{-1}\, \big(x_i - c_k(j)\big)$

(19) $c_k(j+1) = c_k(j) - \eta_1\, \frac{\partial E(j)}{\partial c_k(j)}$

The linear weight updating is carried out by means of the equations:

(20) $\frac{\partial E(j)}{\partial w_k(j)} = \sum_{i=1}^{N} e_i(j)\, \varphi\big(\|x_i - c_k(j)\|\big)$

(21) $w_k(j+1) = w_k(j) - \eta_2\, \frac{\partial E(j)}{\partial w_k(j)}$

Finally, the covariance matrix $\Sigma_k^{-1}$ is adjusted with the following equations:

(22) $\frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)} = -w_k(j) \sum_{i=1}^{N} e_i(j)\, \varphi'\big(\|x_i - c_k(j)\|\big)\, Q_{ik}(j)$

(23) $Q_{ik}(j) = \big(x_i - c_k(j)\big)\,\big(x_i - c_k(j)\big)^T$

(24) $\Sigma_k^{-1}(j+1) = \Sigma_k^{-1}(j) - \eta_3\, \frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)}$

In the above equations, the parameter updating is made after all the training patterns have been presented to the network; this is a batch update. The adjustment of each network parameter can also be accomplished in a continuous way; in this second case, the parameter updates are applied after every pattern presentation. This procedure is known as sequential or pattern-based updating. The $\eta_1$, $\eta_2$ and $\eta_3$ terms are the learning-rate parameters of the centers, weights and covariance matrices, respectively.
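A sketch of one batch update of the centers and linear weights for the common-variance Gaussian case (illustrative numpy code; here the gradient signs come from differentiating Equation (16) directly for this particular basis function, and the full covariance update of Equations (22)-(24) would follow the same pattern):

import numpy as np

def supervised_step(X, d, centers, w, sigma, eta1, eta2):
    # one batch gradient-descent step on Equation (16), Gaussian basis, common sigma
    diff = X[:, None, :] - centers[None, :, :]                     # (N, m, n)
    phi = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))    # (N, m)
    e = d - phi @ w                                                # errors, Equation (17)
    grad_w = -phi.T @ e                                            # dE/dw_k
    grad_c = -(w[None, :] * e[:, None] * phi)[:, :, None] * diff / sigma ** 2
    grad_c = grad_c.sum(axis=0)                                    # dE/dc_k, (m, n)
    return centers - eta1 * grad_c, w - eta2 * grad_w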


3.4 Orthogonal Least Squares (OLS)

The Orthogonal Least Squares algorithm can be applied to the RBF network center-selection problem [Chen et al., 1991]. The objective of OLS is to select an appropriate set of centers in a rational way, maximizing the contribution of the selected centers to the desired response. This iterative algorithm completes the training process by evaluating the output linear weights. From the point of view of this procedure, the RBF network can be considered a special case of the linear regression model. In matrix notation, the output of the RBF network is given by Pw. In this way, we can write:

(25) $d = Pw + E$

where $d = [d(1), d(2), \dots, d(N)]^T$ is the desired output vector of the training data set, $P = [p_1, \dots, p_M]$ is the regression matrix, $w = [w_1, \dots, w_M]^T$ is the weight vector, and $E = [e_1, e_2, \dots, e_N]^T$ is the error vector of the training patterns. In this case, the output space is one-dimensional. The objective of the OLS algorithm is to find w and the $p_j$, with M as low as possible, such that the energy of E is minimized.

Each column of the regression matrix is a regressor vector, calculated by evaluating on every input datum the radial basis nonlinear function with the corresponding center: $p_i = [\varphi(\|x_1 - c_i\|), \varphi(\|x_2 - c_i\|), \dots, \varphi(\|x_N - c_i\|)]^T$. We can consider that the regressor vectors $p_i$ form a set of basis vectors. The least-squares solution ŵ of the above problem satisfies the condition that Pŵ is the projection of d onto the space spanned by the regressors. The square of this projection is part of the desired output energy. Since the initial regressors are generally correlated, and in order to measure the individual contribution to the output energy, OLS transforms the initial set of M regressors into a set of m ≤ M orthogonal basis vectors. The individual contribution of each orthogonal vector to the desired output energy is then easy to find. It is usual to select as the initial regressor set the one that results from using all the training input data as centers; in this last case, we have N regressor vectors. The orthogonalization of the initial regressor vectors is realized by means of the well-known Gram-Schmidt method. This method decomposes the regression matrix into the product of two matrices U and A:

(26) $P = UA$

where the matrix U is composed of orthogonal columns and A is a triangular matrix. At each iteration of the algorithm, a column of U and the corresponding column of A are calculated. Each column of U is the result of the orthogonalization of a selected regressor p. An error-reduction ratio is calculated to select the regressor to be orthogonalized at the corresponding iteration. Each column of U represents the selection of a certain center. The elements of the corresponding column of the matrix A are calculated with the selected regressor and the already-calculated orthogonal columns of U. The algorithm stops when a maximum pre-set error is


reached. The triangular matrix A is used in the computation of the output weights. In this procedure, the variance is common to all the radial basis functions and is a value to be set prior to the algorithm's start.

In [Shertinsky and Picard, 1996], it is shown that OLS does not produce the smallest set of centers if a nonorthogonal basis is used. When the basis is nonorthogonal, the energy contributions of the basis vectors are not independent; in this case, OLS is not able to determine the regressor that produces the maximal alignment in a global sense. Despite this suboptimal performance, OLS has proved its validity in numerous applications and usually leads to more compact sets of centers than the previous learning strategies [Haykin, 1999]. In recent years, some versions of OLS have been developed. In the next sections, a brief explanation of two OLS variants is given.
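A simplified sketch of the forward OLS selection with Gram-Schmidt orthogonalization (illustrative numpy code following the description above, not the exact algorithm of [Chen et al., 1991]):

import numpy as np

def ols_select(P, d, m):
    # Greedily pick m regressors (columns of P, floats) by explained output energy
    N, M = P.shape
    selected, U = [], []
    for _ in range(m):
        best_k, best_err, best_u = None, -1.0, None
        for k in range(M):
            if k in selected:
                continue
            u = P[:, k].copy()
            for uq in U:                       # Gram-Schmidt against chosen columns
                u -= (uq @ P[:, k]) / (uq @ uq) * uq
            if u @ u < 1e-12:
                continue                       # regressor already spanned, skip it
            err = (u @ d) ** 2 / (u @ u)       # energy of d explained by this u
            if err > best_err:
                best_k, best_err, best_u = k, err, u
        selected.append(best_k)
        U.append(best_u)
    return selected                            # indices of the chosen centers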

3.4.1 Recursive Orthogonal Least Squares (ROLS)

Recursive orthogonal least squares is useful in time-varying problems [Gomm and Yu, 2000]. In this proposal, classical OLS is transformed into a set of matrix equations in which the orthogonal decomposition is made recursively; the matrices at each iteration depend on the previous ones. A recursive residual equation is introduced to measure the RBF network accuracy. Higher values of the residual represent better accuracy levels. In [Gomm and Yu, 2000], two operation modes of ROLS are described. In the backward mode, a center is removed at each iteration: the center that causes the smallest increase in the residual is removed from the network. In the forward mode, a center is added at each iteration: the center that allows the maximum increment of the residual is selected. In [Gomm and Yu, 2000], Givens rotations are used to achieve an efficient implementation of the forward and backward techniques.

3.4.2 Locally Regularised Orthogonal Least-Squares (LROLS)

Locally regularised orthogonal least-squares is an algorithm developed to design RBF networks that generalize well in situations with high noise levels [Chen, 2002]. Moreover, the local regularization enforces the sparsity of the subset model obtained by the original OLS training algorithm. The LROLS technique combines the local regularization approach with OLS training. To achieve local regularization, each orthogonal regressor $u_j$ has an associated regularization parameter $\lambda_j$. This new parameter is included in a new error-reduction ratio equation. Just as in classical OLS, this new error-reduction ratio is evaluated to select the orthogonal regressor $u_j$. Since the optimal value of the regularization parameter is unknown, an iterative procedure is carried out: at each iteration, a regressor subset is selected with the current values of the regularization parameters. The iterative procedure stops when there are no changes in the regularization parameters.


4. RELATION WITH SUPPORT VECTOR MACHINES (SVM)

Support Vector Machines are another kind of feedforward neural network, developed by Vapnik [Boser, Guyon and Vapnik, 1992], [Cortes and Vapnik, 1995], [Vapnik, 1995] and [Vapnik, 1998]. The aim of an SVM is to construct an optimal hyperplane that allows a maximum-margin classification in the case of linearly separable patterns. When the patterns are nonseparable, the constructed hyperplane minimizes the probability of classification error. The hyperplane solution is found thanks to the application of the method of structural risk minimization, so that the optimal hyperplane is the one that minimizes the Vapnik-Chervonenkis dimension.

As seen in the introduction, Cover's theorem assures that a set of nonlinearly separable patterns is linearly separable with high probability if the input space is mapped into a higher-dimensional space. This new feature space must have a high enough dimension and the mapping has to be nonlinear. In an SVM, the optimal hyperplane is constructed not in the input space but in the new feature space. The hyperplane construction involves the evaluation of an inner-product kernel. Mercer's theorem [Hochstadt, 1989] allows one to determine whether a certain kernel is an inner-product kernel and therefore acceptable in the design of an SVM. Once the kernel has been chosen, the optimum SVM design implies the maximization of an objective function subject to certain constraints by means of Lagrange multipliers. The design finishes when a weight matrix and a set of support vectors have been evaluated. The SVM output calculation consists of several operations: first, the inner-product kernel between the input and the support vectors is computed; each inner-product kernel is then multiplied by the corresponding weight; finally, all the products between inner-product kernels and weights are summed up to produce the desired output. It is easy to implement the SVM operation as a neural network structure. Depending on the inner-product kernel, there are different types of support vector machines. The polynomial learning machine uses the function $(x^T x_i + 1)^p$. A two-layer perceptron is constructed if the function $\tanh(\beta_0 x^T x_i + \beta_1)$ is applied. If the Gaussian function $\exp\!\big(-\frac{1}{2\sigma^2}\|x - x_i\|^2\big)$ is used, then an RBF network is obtained. RBF networks are, from this point of view, a special case of SVM. The SVM design algorithm allows the construction of RBF networks while avoiding the heuristics needed in conventional RBF network training algorithms.

5. APPLICATIONS OF RADIAL BASIS FUNCTION NETWORKS TO COMMUNICATION SYSTEMS

The general aim of this section is to cover the applications of radial basis function (RBF) networks to different areas related to communication systems. In particular, we focus our attention on antenna array signal processing and channel equalization. We also mention other applications such as coding/decoding, system identification, fault detection in access networks and automatic recognition of wireless standards.


5.1 Antenna Array Signal Processing

Recently, antenna array signal processing (AASP) has been receiving growing attention from researchers as a possible means of improving the performance and capacity of many communication systems, such as mobile radio systems.

AASP basically comprises two research areas: direction-of-arrival (DoA) estimation and beamforming.

Conventional methods for AASP are usually linear. However, they can be improved by using non-linear techniques, typically neural networks. Interested readers can find a complete review of neural methods for antenna array signal processing in [Du et al., 2002].

Radial basis function networks stand out among other neural approaches for several reasons. Firstly, RBFs have the property of universal approximation, i.e. they can approximate continuous functions arbitrarily well when a large number of neurons is considered; the importance of this property will be highlighted in the subsection devoted to RBF-based DoA estimators. Secondly, it has been demonstrated that RBF methods are very robust against noise, so they perform well over a wide range of signal-to-noise ratios (SNRs). Finally, their training process is faster compared to other neural networks such as the multilayer perceptron (MLP), allowing on-line adjustment of the network. This last property is very important because DoA estimation and beamforming are usually applications in which both adaptive adjustment to time-varying conditions and real-time processing are required.

The rest of the section is devoted to the description of some of the most relevant contributions in the field of RBF-based methods applied to AASP problems.

5.1.1 Direction-of-Arrival (DoA) estimation

The purpose of DoA algorithms is to obtain the direction of arrival of signals from the information contained in the measurements of the antenna array outputs. The antenna array can be viewed as a device which performs a non-linear mapping, G : Θ → S, from the space of angles of arrival Θ to the space of sensor outputs S.

So, one method for estimating the values of Θ is to approximate the inverse mapping F : S → Θ. Due to their universal approximation property, RBFs can be used for this purpose.

Figure 2 is a block diagram of a DoA estimator based on RBF networks. As can be observed, this system is composed of the sensor array, a preprocessing stage and an RBF network. The array output S is passed through the preprocessing module to remove irrelevant information. This step is fundamental because it contributes to minimizing the required size of the RBF network. The preprocessed data is the input of the RBF network, which performs the non-linear mapping by approximating the function F. Finally, the RBF output is the estimate of the directions of arrival.

Several RBF-based DoA estimators have been developed following the generic scheme in Figure 2. Two representative examples are [Lo et al., 1994] and [El Zooghby et al., 1997]. In both cases, the preprocessing stage performs the computation of the normalized covariance matrix of the sensor output in order to eliminate


Figure 2. Block diagram of a DoA estimator based on RBF networks: the sensor outputs s_1, s_2, ..., s_k feed a preprocessing stage whose output enters the RBF network, which produces the DoA estimate (θ)

the initial phase and gain of the array output. A similar scheme was used in [Southall et al., 1995], but in this case the RBF input consisted of the sines and cosines of the phase differences between the measured signals of the elements and the reference element. All these approaches outperform one of the most used algorithms in this field, the MUSIC (Multiple Signal Classification) algorithm [Schmidt, 1986], in both accuracy and speed.

Another example of a tracking system based on RBF networks is proposed in [Mukai et al., 2002]. In this article, the authors used an adaptive RBF in conjunction with an array-feed compensation system for acquisition and continuous tracking of a Deep Space Network (DSN) antenna for communications in the Ka-band at 32 GHz.

However, the extension of these methods to multiple-source tracking is not straightforward. In fact, for detecting more than three sources, the resulting RBF network presents a large size and its training is impracticable [Du et al., 2002]. In addition, knowledge of the number of sources is required.

The solution proposed in [El Zooghby et al., 2000] overcomes these limitations. In this work, El Zooghby et al. developed a new DoA algorithm (the N-MUST algorithm) with application to smart antennas for wireless terrestrial and satellite mobile communications. N-MUST consists of two stages. In the first stage (detection), a coarse detection of sources is performed. This result is refined in the second stage (estimation), in which several RBF networks are trained for different ranges of angles of arrival.


5.1.2 Beamforming

The main objective of beamforming techniques is to acquire and reconstruct the original signal of the desired source while rejecting the rest of the non-desired sources.

Beamforming is of special importance in modern mobile satellite communication systems and global positioning systems (GPS) because they require smart antennas capable of distinguishing signals from multiple sources. These systems must adapt the radiation pattern of the antenna in order to cancel the interfering signals and emphasize the desired ones.

In [El Zooghby et al., 1998] it was demonstrated that RBF networks can be used for this purpose. In this case, RBF-based beamforming was applied to one- and two-dimensional adaptive arrays, obtaining results very close to the optimum Wiener solution.

In the context of digital beamforming (DBF) antenna arrays, Xu et al. proposed the combination of the so-called Constant Modulus Algorithm (CMA) and a novel RBF-based beamformer for interference cancellation in Gaussian and non-Gaussian noise [Xu et al., 2003]. The presence of non-Gaussian noise (for example, atmospheric noise) is typical of satellite communications. Under these noisy conditions, the authors showed that their approach outperformed the conventional CMA algorithm.

5.2 Channel Equalization

Digital communication systems are often impaired by several types of distortion which produce significant changes in both the amplitude and the phase of the transmitted signals. One of the most common distortions is intersymbol interference (ISI), which causes an overlap of the transmitted symbols over successive time intervals. ISI is usually a result of the restricted bandwidth of the communication channel, and it is also produced when several versions of the transmitted signal arrive at the receiver due to multi-path propagation. Other sources of distortion which degrade the performance of the communication system are noise, the intrinsic characteristics of the transmission channel, co-channel and adjacent-channel interference, non-linear distortions (for example, those produced by memoryless non-linear devices such as the travelling-wave tube (TWT) amplifiers included in satellite communication systems), fading, time-varying characteristics, etc.

Figure 3 illustrates a digital communication system in which the channel includes the effects of the transmitter filter, the transmission medium and the receiver filter. Stationary linear dispersive channels can be characterized by an N-tap finite impulse response (FIR) digital filter, given by the coefficient vector h = (h_0, h_1, …, h_{N−1}). As can be observed in Figure 3, the digital data sequence x(k) is passed through this FIR filter and corrupted by additive zero-mean Gaussian noise, n(k), producing the distorted received sequence, y(k). The relationship between the transmitted signal, x(k), and the received signal, y(k), can be expressed as

(27) y(k) = Σ_{i=0}^{N−1} h_i x(k−i) + n(k)


Figure 3. Digital communications system: the transmitted signal x(k) passes through the channel, the noise n(k) is added to produce the received signal y(k), and the equalizer yields the estimated signal x̂(k)

For a more general model of the channel (for example, non-linear channels), the received sequence, y(k), is related to the input, x(k), of the channel through the expression

(28) y(k) = f_h(x(k), x(k−1), …, x(k−N+1)) + n(k)

in which f_h is some linear or non-linear function which models the behavior of the channel.
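The following Python sketch simulates the channel model of Eqs. (27)–(28) in its linear form; the tap values, the noise level and the binary alphabet are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the channel model of Eq. (27): a linear FIR channel
# plus additive zero-mean Gaussian noise. Tap values and noise level are
# arbitrary illustrative choices, not taken from the text.
rng = np.random.default_rng(0)

h = np.array([1.0, 0.5, 0.25])          # coefficient vector (h0, h1, h2)
x = rng.choice([-1.0, 1.0], size=1000)  # binary transmitted sequence x(k)
noise = 0.1 * rng.standard_normal(x.size)

# y(k) = sum_i h_i x(k - i) + n(k); np.convolve implements the FIR sum.
y = np.convolve(x, h)[: x.size] + noise
```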

In this context, adaptive channel equalization is a major issue in digital communication systems. The role of channel equalization is to recover the transmitted symbols from the distorted received signal. Therefore, the equalizer is the part of the receiver employed to mitigate channel disturbances such as noise, intersymbol interference and the other distortions previously mentioned. In addition, in most practical communication systems the channel characteristics vary with time. Therefore, it is necessary to build adaptive equalizers.

Equalizers can be classified into two groups: sequence equalizers and symbol-by-symbol equalizers. Estimation theory suggests that the best performance for symbol detection is obtained by using equalizers belonging to the first group [Proakis, 2001]. Such equalizers are maximum likelihood sequence estimators (MLSE) and they are usually implemented by means of the well-known Viterbi algorithm. However, the entire transmitted sequence and certain knowledge of the channel characteristics (i.e. a channel estimator) are required for decoding the optimum sequence of symbols. In addition, sequence equalizers present a high computational complexity and a large decision delay. All these requirements seriously limit their use in many practical communication systems.

Symbol-by-symbol equalizers, in which only one output symbol is estimated in each symbol period, are the most popular alternative to MLSE equalizers. There exist three common symbol-decision architectures: transversal equalizers (TE), decision feedback equalizers (DFE) and growing memory structures [Mulgrew, 1996], [Mulgrew, 1998].

In TE architectures, the equalizer reconstructs each input symbol x(k) using M consecutive channel outputs (y(k), y(k−1), …, y(k−M+1)). In this way, the equalizer output, x̂(k), is an estimate of the channel input. The integer M is known as the equalizer order and determines the behavior of the detector. In fact, the performance and the complexity of the system increase with the number M of


received symbols used for making the decision. Often, the equalizer operates with a decision delay d, so that at time k the equalizer produces an estimate of the input symbol x(k−d).

The DFE architecture is an extension of the TE one. In this case, for the estimation of x(k−d) not only the M most recent channel observations are used, but also the P past detected symbols (x̂(k−d−1), x̂(k−d−2), …, x̂(k−d−P+1)), where P is the equalizer feedback order.
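As a sketch of how a DFE assembles its input, the snippet below collects the M most recent channel outputs and P past decisions; the exact indexing convention (here x̂(k−d−1) … x̂(k−d−P)) is an assumption made for illustration, and the function name is hypothetical.

```python
import numpy as np

def dfe_input(y, x_hat, k, M, P, d):
    """Illustrative sketch of the input vector of a decision feedback
    equalizer: M recent channel outputs plus P past detected symbols.
    One common convention is assumed for the feedback indices."""
    feedforward = [y[k - i] for i in range(M)]             # y(k) ... y(k-M+1)
    feedback = [x_hat[k - d - j] for j in range(1, P + 1)] # past decisions
    return np.array(feedforward + feedback)
```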

Finally, equalizers with a growing memory structure are usually implemented using recurrent networks. Recurrent networks offer the possibility of improving the quality of symbol-by-symbol decisions by recursively using all the previously received symbols, instead of only the last M channel outputs as in the TE architecture, while the memory requirements of the equalizer are kept as small as possible. Thus, recurrent networks offer a compromise between performance and complexity.

A traditional method of adaptively compensating for ISI is to construct an approximation to the inverse of the channel [Qureshi, 1985]. From this perspective, equalization is viewed as a deconvolution problem and the equalizer can be implemented as a linear adaptive filter. Linear transversal equalizers (LTE) are the most common example of this approach. However, several shortcomings of this method have been reported [Chen et al., 1993a], [Mulgrew, 1996]: firstly, adaptive filters cannot effectively mitigate the additive noise component and, secondly, the linear approach completely ignores the fact that, in the absence of noise, a received sequence y(k) can take only values belonging to a predefined finite alphabet of symbols.

Precisely the latter consideration allows channel equalization to be viewed as a decision problem. This is the reason why non-linear approaches (many of them based on neural networks) have attracted the interest of several researchers in the field of channel equalization. In fact, from Bayesian decision theory [Proakis, 2001], it can be seen that the optimal solution for the symbol-detection decision corresponds to a non-linear classification problem.

Several neural networks have been proposed for channel equalization, such as multilayer feedforward neural networks (MLP) and radial basis function networks (RBF) [Ibnkahla, 2000]. The main drawback of MLP networks is their training process, which requires a great number of examples and is very time-consuming [Mulgrew, 1996]. This fact restricts the use of MLP networks in scenarios in which adaptation is a fundamental issue, as occurs in channel equalization problems. On the other hand, RBF networks have a structure with a close relationship to the optimal Bayesian equalizer [Chen et al., 1993a] and they are well suited for solving non-linear classification problems. For these reasons, in recent years RBFs have been considered an attractive alternative to linear approaches.

In the next subsections, we summarize some of the most important research works related to the application of RBF networks to channel equalization. In particular, we will focus on the methods developed for the mitigation of intersymbol and co-channel interference.


5.2.1 Intersymbol interference

Numerous transversal adaptive equalizers based on RBF networks have been introduced in the literature to overcome ISI and non-linear distortions in communication systems. In all cases, the training phase is the most important process involved in the equalizer design. Many communication channels transmit at certain time intervals a training sequence which can be used for estimating the network parameters.

Early adaptive RBF-based equalizers were designed for real-valued binary channels. In these systems, the training of the RBF networks is usually carried out in two stages [Chen et al., 1993a]. The first stage consists of a clustering algorithm which computes the optimal centers of the network. If the training symbol sequence provided by the communication system is available, the learning of the network centers can be done in a supervised manner. If not, an unsupervised clustering algorithm must be used (generally, the unsupervised k-means algorithm). In the second stage, the network weights are updated using, for example, a least mean squares (LMS) algorithm.
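A minimal sketch of this two-stage training is given below, assuming Gaussian basis functions, zero decision delay, and scikit-learn's KMeans for the clustering stage; all parameter values are illustrative and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf_equalizer(y, x, M=3, n_centers=8, sigma=0.5, mu=0.05, epochs=5):
    """Sketch of the two-stage RBF equalizer training described above:
    unsupervised k-means for the centers, then LMS for the weights."""
    # Build input vectors of M consecutive channel outputs.
    X = np.array([y[k - M + 1 : k + 1][::-1] for k in range(M - 1, len(y))])
    d = x[M - 1 : len(y)]                       # desired symbols (delay 0)

    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    w = np.zeros(n_centers)

    def phi(v):  # Gaussian basis functions
        return np.exp(-np.sum((centers - v) ** 2, axis=1) / (2 * sigma**2))

    for _ in range(epochs):                     # LMS update of the weights
        for v, target in zip(X, d):
            g = phi(v)
            w += mu * (target - w @ g) * g
    return centers, w
```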

Chen et al. [Chen et al., 1994] developed a complex-valued version of their previous equalizer which can be used for two-dimensional signalling such as quadrature amplitude modulation (QAM) channels. In [Cha, 1995] another approach for training complex-valued RBF network equalizers was proposed: a stochastic gradient (SG) algorithm. The main advantage of this algorithm is the possibility of training simultaneously all the free parameters of the network (centers and weights). In both cases, it has been shown that the proposed approach is capable of successfully equalizing slowly time-varying non-linear channels and outperforms classical linear equalizers.

Achieving RBF networks with a minimum size (and complexity) is one of the main issues considered in this application area. Gan et al. [Gan et al., 1999] developed an equalizer in which the number of centers depends only on the decision delay and not on the equalizer order as in previous approaches. They obtained very good results in the equalization of fast time-varying complex-valued channels (4-QAM). In this context, it is worth mentioning the application of Minimal Resource Allocation Networks (MRAN) to the equalization of real-valued [Chandra Kumar et al., 1998] or complex-valued channels [Jianping et al., 2002]. MRAN is a network which uses a scheme for adding or removing RBF centers depending on the channel conditions, yielding a minimal network structure.

Satellite mobile communication channels are characterized by the presence of linear distortions due to the emitter and receiver filters, and non-linearities caused by on-board power amplifiers and multi-path propagation. Bouchired et al. reported that RBF networks have also been applied to satellite channel equalization, outperforming LTE and MLP-based equalizers [Bouchired et al., 1998a]. In the same application area, Bouchired et al. obtained very successful results in the equalization of satellite channels by using Kohonen's self-organizing maps (SOMs) to improve the clustering stage of an RBF network equalizer [Bouchired et al., 1998b].

The application of RBF networks to decision feedback equalizers has also been considered. Chen et al. [Chen et al., 1993b] obtained significant improvements


in both performance and reduction of computational complexity by combining RBF networks with DFE structures. They applied this novel equalizer to highly non-stationary channels (such as fading mobile radio channels) and demonstrated the superior performance of their approach in comparison to a conventional Viterbi-based equalizer.

As mentioned before, higher-order equalizers perform better than low-order ones. In RBF-based equalization, an increase of the order implies an exponential growth in the number of centers of the RBFs involved. To overcome this drawback, recurrent RBF networks have been proposed in [Cid-Sueiro, 1994], [Mimura and Furukawa, 2001] for linear and non-linear channels.

5.2.2 Co-channel interference

Many communication systems such as cellular mobile channels are impaired by so-called Multiple Access Interference (MAI). It originates from the frequency reuse plan which allows several cells to share the same set of frequencies. In this way, the signal received at the mobile station consists of the sum of all signals sent by the base station in the same cell and those sent in the other cells transmitting in the same frequency band or in adjacent frequency bands (co-channel interference, CCI, or adjacent-channel interference, ACI). Figure 4 illustrates a digital communication system with co-channel interference. Although, traditionally, the main objective of channel equalization has been to reduce the effects of ISI, several works have shown that equalizers are able to mitigate the distortions due to co-channel and adjacent-channel interference.

Adaptive RBF networks have been applied successfully to overcome CCI. In [Chen and Mulgrew, 1992] an RBF-based transversal equalizer was constructed

Figure 4. Digital communication system with co-channel interference: the transmitted signal x(k) passes through its own channel and is summed with the interfering signals xc1(k), …, xcN(k) coming from co-channels #1 to #N and with the noise n(k), producing the received signal y(k)


for this purpose. This approach exploits the differences between the interference signals and noise to improve the performance of a conventional equalizer.

Chen et al. extended their previous work to an equalizer with a decision feedback architecture [Chen et al., 1996] and an adaptive implementation. They showed that in the presence of severe CCI the RBF approach produced significant improvements in the performance of the system in comparison to the conventional MLSE approach.

Finally, in [Chen et al., 2003] an RBF-based adaptive equalizer operating in the presence of intersymbol interference, additive Gaussian noise and co-channel interference was proposed. The novelty of this study was the definition of a new algorithm (LBER) for training the RBF network. LBER is a stochastic gradient adaptive algorithm which tries to minimize the bit error rate (BER) instead of the mean square error (MSE), which is the criterion used in conventional equalizers. As the real goal of equalization is minimum BER, this method significantly improved the results obtained with a linear MMSE equalizer.

5.2.3 Blind equalization

The equalization techniques presented in the previous subsections rely on the existence of a training signal which is transmitted at regular intervals. This allows the regular adjustment of the RBF network parameters in order to improve its performance in the presence of time-varying channels. However, when the training sequence is not available or the channel is highly time-varying (as, for example, the Rician fading channels typical of outdoor mobile communication environments), it is necessary to develop more sophisticated techniques generally known as blind equalization methods.

Most blind equalization methods include a channel estimator in order to adapt the centers and weights of the RBF network using unsupervised training techniques. For example, in [Cid-Sueiro, 1993] the channel response was estimated through an unsupervised, non-decision-directed algorithm applied to a recurrent RBF network equalizer.

However, other approaches do not require a channel estimator. For example, Gomes and Barroso modified the RBF equalizer proposed by Chen et al. [Chen et al., 1993a] for blind equalization. In this case, an unsupervised clustering algorithm was used for updating the RBF centers and radii, while the network weights were adjusted by minimizing an appropriate cost function [Gomes and Barroso, 1997]. Finally, in [Lin and Yamashita, 2002] the RBF equalizer was directly designed using only the received signal and a novel cluster map algorithm.

5.3 Other RBF Networks Applications

In the previous sections, some of the most common applications of radial basis function networks in the field of digital communication systems have been described. However, in this context there are other important areas in which RBFs have demonstrated their usefulness and good performance. These areas are summarized in the following paragraphs.


Several authors have applied RBFs to coding-decoding systems. In [Kaminsky and Deshpande, 2003], the functional and structural similarities between the directed graphs of conventional Viterbi decoding and RBF networks were exploited to develop an adaptive RBF-based decoder for a trellis coded modulation (TCM) system, which was able to take advantage of its learning capabilities to improve the decoding decisions. Müller and Elmirghani employed RBFs to construct a novel chaos-based coding/decoding method [Müller and Elmirghani, 2002] and showed that this new strategy presented high noise robustness compared to other conventional schemes for the same problem.

System modelling and identification is another fundamental issue in many communication systems. It can be used for channel estimation as part of a blind equalization system or for computer-based simulation of communication systems. Two good examples are the studies conducted in [Yingwei et al., 1996] and [Leong et al., 2002], in which RBFs were used for the adaptive identification of non-linear systems, taking advantage of their universal approximation property.

RBFs can also be applied successfully to the problem of fault detection in access networks, as shown in [Zhou and Austin, 1999], in which RBF networks were used to automatically learn the relationship between several telephone line parameters and fault occurrences.

Finally, RBF networks can be used for the automatic recognition of modulation schemes or the identification of wireless standards. These issues have a direct application in the design of wireless reconfigurable receivers [Palicot and Roland, 2003]. The interest of this novel concept of device is clear given the fast growth of services which may have to be carried over several heterogeneous wireless networks, such as the Global System for Mobile Communications (GSM) or the Universal Mobile Telecommunication System (UMTS) standards.

REFERENCES

Boser, B., Guyon, I. and Vapnik, V.N., A training algorithm for optimal margin classifiers, In Fifth Annual Workshop on Computational Learning Theory, 144–152, San Mateo, CA, 1992.

Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., Equalization of satellite UMTS channels using RBF networks, In Proceedings of IEEE Workshop on Personal Indoor and Mobile Radio Communications (PIMRC), Boston, USA, September 1998a.

Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., Equalization of satellite mobile communication channels using combined self-organizing maps and RBF networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'98), 3377–3380, Seattle, WA, USA, April 1998b.

Broomhead, D.S. and Lowe, D., Multivariable functional interpolation and adaptive networks, Complex Systems, 2:321–355, 1988.

Cha, I. and Kassam, S.A., Channel equalization using adaptive complex radial basis function networks, IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, 122–131, January 1995.

Chandra Kumar, P., Saratchandran, P. and Sundararajan, N., Non-linear channel equalisation using minimal radial basis function neural networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'98), 3373–3376, Seattle, WA, USA, April 1998.

Chen, S., Cowan, C.F.N. and Grant, P.M., Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks, vol. 2, 302–309, March 1991.

Chen, S. and Mulgrew, B., Overcoming co-channel interference using an adaptive radial basis function equalizer, EURASIP Signal Processing Journal, vol. 28, no. 1, 91–107, 1992.

Chen, S., Mulgrew, B. and Grant, P.M., A clustering technique for digital communications channel equalization using radial basis function networks, IEEE Transactions on Neural Networks, vol. 4, no. 4, 570–579, July 1993a.

Chen, S., Mulgrew, B. and McLaughlin, S., Adaptive Bayesian equalizer with decision feedback, IEEE Transactions on Signal Processing, vol. 41, no. 9, 2918–2927, September 1993b.

Chen, S., McLaughlin, S. and Mulgrew, B., Complex-valued radial basis function networks, Part II: Application to digital communication channel equalization, Signal Processing, vol. 36, no. 2, 175–188, March 1994.

Chen, S., McLaughlin, S., Mulgrew, B. and Grant, P.M., Bayesian decision feedback equalizer for overcoming co-channel interference, Proc. Inst. Elect. Eng., vol. 143, 219–225, August 1996.

Chen, S., Multi-output regression using a locally regularised orthogonal least-squares algorithm, IEE Proceedings - Vision, Image and Signal Processing, 149(4):185–195, 2002.

Chen, S., Mulgrew, B. and Hanzo, L., Least bit error rate adaptive nonlinear equalisers for binary signalling, IEE Proceedings - Communications, vol. 150, no. 1, 29–36, February 2003.

Cid-Sueiro, J. and Figueiras-Vidal, A.R., Recurrent radial basis function networks for optimal blind equalization, In IEEE-SP Workshop on Neural Networks for Signal Processing, 562–571, Baltimore, MA, USA, September 1993.

Cid-Sueiro, J., Artés-Rodríguez, A. and Figueiras-Vidal, A.R., Recurrent radial basis function networks for optimal symbol-by-symbol equalization, Signal Processing, vol. 40, no. 1, 53–63, October 1994.

Cortes, C. and Vapnik, V.N., Support vector networks, Machine Learning, 20:273–297, 1995.

Cover, T.M., Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Transactions on Electronic Computers, EC-14:326–334, 1965.

Du, K.-L., Lai, A.K.Y., Cheng, K.K.M. and Swamy, M.N.S., Neural methods for antenna array signal processing: a review, Signal Processing, vol. 82, 547–561, 2002.

El Zooghby, A.H., Christodoulou, C.G. and Georgiopoulos, M., Performance of radial basis function networks with antenna arrays, IEEE Transactions on Antennas and Propagation, vol. 45, no. 11, 1611–1617, 1997.

El Zooghby, A.H., Christodoulou, C.G. and Georgiopoulos, M., Neural network-based adaptive beamforming for one- and two-dimensional antenna arrays, IEEE Transactions on Antennas and Propagation, vol. 46, no. 12, 1891–1893, 1998.

El Zooghby, A.H., Christodoulou, C.G. and Georgiopoulos, M., A neural network-based smart antenna for multiple source tracking, IEEE Transactions on Antennas and Propagation, vol. 48, no. 5, 768–776, May 2000.

Gan, Q., Saratchandran, P., Sundararajan, N. and Subramanian, K.R., A complex valued radial basis function network for equalization of fast time varying channels, IEEE Transactions on Neural Networks, vol. 10, no. 4, 958–960, July 1999.

Golub, G.H. and Van Loan, C.G., Matrix Computations, Johns Hopkins University Press, 1996.

Gomes, J. and Barroso, V., Using a RBF network for blind equalization: design and performance evaluation, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'97), vol. 4, 3285–3288, April 1997.

Gomm, B.J. and Yu, D.L., Selecting radial basis function network centers with recursive orthogonal least squares training, IEEE Transactions on Neural Networks, 11(2):306–314, 2000.

Haykin, S., Neural Networks: A Comprehensive Foundation, Prentice-Hall, New York, 1999.

Haykin, S., Adaptive Filter Theory, Prentice Hall, New Jersey, 2002.

Hochstadt, H., Integral Equations, Wiley Classics Library, John Wiley and Sons Inc., New York, 1989. ISBN 0-471-50404-1. Reprint of the 1973 original, A Wiley-Interscience Publication.

Ibnkahla, M., Applications of neural networks to digital communications - a survey, Signal Processing, vol. 80, 1185–1215, 2000.

Jianping, D., Sundararajan, N. and Saratchandran, P., Communication channel equalization using complex-valued radial basis function neural networks, IEEE Transactions on Neural Networks, vol. 13, no. 3, 687–696, May 2002.

Kaminsky, E.J. and Deshpande, N., TCM decoding using neural networks, Engineering Applications of Artificial Intelligence, vol. 16, 473–489, 2003.

Kohonen, T., The self-organizing map, Proceedings of the IEEE, 78(9):1464–1480, 1990.

Leong, T.K., Saratchandran, P. and Sundararajan, N., Real-time performance evaluation of the minimal radial basis function network for identification of time varying nonlinear systems, Computers and Electrical Engineering, vol. 28, 103–117, 2002.

Lin, H. and Yamashita, K., Blind RBF equalizer for received signal constellation independent channel, In Proceedings of 8th International Conference on Communication Systems (ICCS'02), 82–86, 2002.

Lo, T., Leung, H. and Litva, J., Radial basis function neural network for direction-of-arrivals estimation, IEEE Signal Processing Letters, vol. 1, no. 2, 45–47, February 1994.

Micchelli, C.A., Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constructive Approximation, 2:11–22, 1986.

Mimura, M. and Furukawa, T., A recurrent RBF network for non-linear channels, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'01), 1297–1300, 2001.

Mukai, R., Vilnrotter, V.A., Arabshahi, P. and Jamnejad, V., Adaptive acquisition and tracking for deep space array feed antennas, IEEE Transactions on Neural Networks, vol. 13, no. 5, 1149–1162, September 2002.

Mulgrew, B., Applying radial basis functions, IEEE Signal Processing Magazine, vol. 13, no. 2, 50–65, March 1996.

Mulgrew, B., Nonlinear signal processing for adaptive equalisation and multi-user detection, In Proceedings of IX European Signal Processing Conference (EUSIPCO'98), 537–544, Rhodes, Greece, September 1998.

Müller, A. and Elmirghani, J.M.H., Novel approaches to signal transmission based on chaotic signals and artificial neural networks, IEEE Transactions on Communications, vol. 50, no. 3, 384–390, March 2002.

Palicot, J. and Roland, C., A new concept for wireless reconfigurable receivers, IEEE Communications Magazine, 124–132, July 2003.

Park, J. and Sandberg, I.W., Universal approximation using radial basis function networks, Neural Computation, 3:246–257, 1991.

Poggio, T. and Girosi, F., Networks for approximation and learning, Proceedings of the IEEE, 78:1481–1497, 1990.

Powell, M.J.D., Radial basis function approximations to polynomials, In Numerical Analysis 1987 Proceedings, 223–241, 1988.

Proakis, J.G., Digital Communications, McGraw-Hill, Boston, 4th edition, 2001.

Qureshi, S.U.H., Adaptive equalization, Proceedings of the IEEE, vol. 73, 1349–1387, 1985.

Schmidt, R., Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, 276–280, March 1986.

Shertinsky, A. and Picard, R.W., On the efficiency of the orthogonal least squares training method for radial basis function networks, IEEE Transactions on Neural Networks, 7(1):195–200, 1996.

Southall, H.L., Simmers, J.A. and O'Donnel, T.H., Direction finding in phased arrays with a neural network beamformer, IEEE Transactions on Antennas and Propagation, vol. 43, no. 12, 1369–1374, 1995.

Vapnik, V.N., The Nature of Statistical Learning Theory, Wiley, New York, 1995.

Vapnik, V.N., Statistical Learning Theory, Wiley, New York, 1998.

Xu, C.Q., Law, C.L. and Yoshida, S., Interference rejection in non-Gaussian noise for satellite communications using non-linear beamforming, International Journal of Satellite Communications and Networking, vol. 21, 13–22, 2003.

Yingwei, L., Sundararajan, N. and Saratchandran, P., Adaptive nonlinear system identification using minimal radial basis function neural networks, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'96), vol. 6, 3521–3524, 1996.

Zhou, P. and Austin, J., Neural network approach to improving fault location in local telephone networks, In Proceedings of Artificial Neural Networks, 958–963, 1999.


CHAPTER 6

BIOLOGICAL CLUES FOR UP-TO-DATE ARTIFICIALNEURONS

JAVIER ROPERO PELÁEZ, JOSÉ ROBERTO CASTILLO PIQUEIRA
Escola Politécnica da Universidade de São Paulo, Departamento de Engenharia de Telecomunicações e Controle

Abstract: The McCulloch-Pitts neuron is a simple yet powerful building block for most of today's neural networks. However, recent advances in neuroscience show that this classical paradigm can certainly be improved. For example, biological synaptic weights have new properties, like synaptic normalization and metaplasticity, that are crucial for developing new neural-network architectures. Other peculiar biological mechanisms, like the synchronization among neurons, which allows the identification of the neuron with maximal activation, and the dual (high/low frequency) behavior of some biological neurons, can be used to improve the performance of artificial neural networks. As an example, a new neural network that mimics the human thalamus will be analyzed and tested.

INTRODUCTION

Warren McCulloch and Walter Pitts published in 1943 the pioneering work entitled "A Logical Calculus of the Ideas Immanent in Nervous Activity" [McCulloch and Pitts, 1943], in which they proposed that neurons are essentially binary units performing Boolean computations. For their proposal they used the limited knowledge about the nervous system available at that moment. Their work was the source of inspiration for many neural network designers: from Frank Rosenblatt, the creator of the Perceptron [Rosenblatt, 1956], a neural network for recognizing characters, to Rumelhart, Hinton and Williams [Rumelhart et al., 1986], who developed the backpropagation algorithm for adjusting the synaptic weights in multi-layered neural networks [McClelland, 1986] [McClelland and Rumelhart, 1986]. Many other neural networks were developed inspired by the McCulloch and Pitts model, like Kohonen's self-organizing map [Kohonen, 1982], Hopfield's auto-associative model [Hopfield, 1982], Grossberg's ART [Carpenter and Grossberg, 1988], etc. All these neural networks essentially use McCulloch's neuron model although they



lack many recently discovered properties of real neurons. For the proper functioning of these networks, a great deal of mathematical machinery has to counterbalance the lack of up-to-date neuron properties.

The purpose of this work is to review some of the recently found properties of individual neurons that can improve the conventional model of the neuron. The following section introduces these properties in a bottom-up order: first the synaptic plasticity properties, followed by the neuron properties, and ending with the network properties that are relevant for the functioning of individual neurons. Afterwards, a new, updated paradigm will be presented in which synapses, neurons and networks are revisited. Finally, we integrate the updated neuron model in a model of a real brain architecture, the thalamus, at the core of the brain.

In the following sections the essentials are explained in the text, while the finer details are given in the figures, which are self-explanatory.

1. BIOLOGICAL PROPERTIES

1.1 Synaptic Plasticity Properties

Synaptic plasticity, the occurrence of sustained changes in synapses, was envisioned in 1894 by Santiago Ramón y Cajal, who pointed out that learning could produce changes in the communication between neurons and that these changes could be the essential mechanism of memory. In 1948 Konorski alluded to persistent plastic changes in memory, and in 1949 Hebb postulated that during learning synaptic connections are strengthened due to the correlated activity of presynaptic and postsynaptic neurons. The mathematical formalization of the Hebb rule is:

(1) Δw(n) = ε · A(n) B(n)

in which the increment of weight at instant n is proportional (through factor ε) to the presynaptic activity at time n, A(n), and to the postsynaptic activity, B(n). Unfortunately, this rule is not able to produce reversible changes in synapses. Such reversible changes appeared in later models, like Sejnowski's covariance rule and the BCM (Bienenstock, Cooper and Munro) model [Bienenstock et al., 1982].

None of these models were formulated considering the real nature of plasticity in terms of molecular interactions (see Figure 1) [Bear et al., 2001]. From this point of view it is relevant to notice that events A and B, which appear equal in importance in the Hebbian learning rule, are not equally important during the plastic changes that take place in real synapses. We will call this property directionality. The presynaptic event A, a presynaptic action potential, and the postsynaptic event B, a postsynaptic increase of potential above a certain value, do not play the same role in real synapses. For example, in the Hebbian learning rule, when event A is present and event B is not, no alteration of synaptic strength takes place. However, under these same circumstances real synapses are depressed, because the low calcium concentration in the postsynaptic area leads to synaptic depression (see Figure 1).


Figure 1. Potentiation and depression are accomplished in biological synapses in the following way: the neurotransmitter glutamate is released into the synaptic cleft from the presynaptic neuron. This neurotransmitter acts like a key that opens the postsynaptic gates, which are of two types: NMDA and AMPA gates (channels). To open the NMDA gates it is also necessary to expel the magnesium that blocks the NMDA channels; for this purpose, a certain voltage should be attained in the postsynaptic area. Once the channels are opened, Na+ ions are allowed to enter both types of channels. The hydrated Ca2+ molecule is big, and for this reason it only passes through the NMDA channels. A large amount of Ca2+ entering the postsynaptic space produces an increment of the number and size of postsynaptic channels (synaptic potentiation or LTP), while a moderate amount produces a decrement of them (synaptic depression or LTD)

When event A takes place without event B, little activation is expected in the postsynaptic neuron. This is different from the case in which events A and B both occur, where a high activation probably takes place in the postsynaptic neuron. In the just mentioned BCM model [Bienenstock et al., 1982], these two intervals of high and low activation correspond respectively to regions of increment and depression of synaptic strength. Artola et al. [Artola et al., 1990] corroborated the BCM model in slices of rat visual cortex. This model is explained in more detail in Figure 2, where two thresholds separate the different regions: the LTD (Long Term Depression) threshold separates the region of no change of synaptic strength from the region of synaptic depression, and the LTP (Long Term Potentiation) threshold separates the region of synaptic depression from the region of synaptic potentiation.

1.1.1 Synaptic metaplasticity

Synaptic metaplasticity [Abraham and Tate, 1997] [Pompa and Friedrich, 1998], which some authors consider the plasticity of synaptic plasticity [Abraham and Bear, 1996], consists in the dependence of the LTP threshold on the initial synaptic weight. If the initial weight is low, the LTP threshold is also low; when the initial weight is high, the LTP threshold is placed at higher values of postsynaptic activation (see the detailed explanation in Figure 3).


Figure 2. Change of synaptic strength due to the postsynaptic activity in biological neurons. If postsynaptic activity is high, a positive change of synaptic strength takes place. If the postsynaptic activity is between the LTD and LTP thresholds, a negative change of synaptic strength takes place. For lower values of postsynaptic activity, neither an increment nor a decrement of synaptic strength is noticed

Figure 3. Synaptic metaplasticity consists in the shift of the LTP threshold according to the initial weight of the synapse. For higher values of the initial synaptic weight, the change-of-strength curve is elongated so that the LTP threshold corresponds to higher values of postsynaptic activity


None of the models presented above considered the property of synaptic metaplasticity. In section 2.1 we will propose an alternative model that also accomplishes this important property.

1.1.2 Synaptic normalization

Synaptic normalization is a property of biological synapses [Turrigiano, 1998] consisting in the normalization of the synaptic weights of a neuron. This is equivalent to the mathematical operation of dividing the synaptic weights by their norm, which is calculated as the overall sum of the weights. For example, if the set of weights of a neuron is expressed as the vector W = (0.4, 0.5, 0.1, 0.2, 0.7, 0.1), its norm is calculated as norm(W) = 0.4+0.5+0.1+0.2+0.7+0.1 = 2. Dividing W by this norm, norm(W) = 2, a new W' is obtained: W' = (0.2, 0.25, 0.05, 0.1, 0.35, 0.05). In biological neurons the norm can be multiplied by an arbitrary multiplicative factor k. This norm is therefore calculated as

norm(W) = k Σ_{i=1}^{n} w_i

where w_i is each one of the n synaptic weights. The biological counterpart of synaptic normalization is explained in more detail in Figure 4.
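The following minimal Python sketch reproduces the normalization just described, with the multiplicative factor k left as a free parameter (k = 1 recovers the worked example above).

```python
import numpy as np

def normalize_weights(w, k=1.0):
    """Divide the synaptic weights by their norm, k * sum(w)."""
    return w / (k * np.sum(w))

w = np.array([0.4, 0.5, 0.1, 0.2, 0.7, 0.1])   # norm(W) = 2 for k = 1
print(normalize_weights(w))                     # [0.2 0.25 0.05 0.1 0.35 0.05]
```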

Synaptic metaplasticity and synaptic normalization are considered homeostatic properties of synapses. Homeostasis is the property of a system to return to its equilibrium state when it is forced to leave that state. In this sense, metaplasticity favours synaptic depression when the initial synaptic weight is high and favours synaptic potentiation when the synaptic weight is low. Synaptic normalization is also a homeostatic property because, when the total synaptic

Figure 4. Synaptic normalization: the evolution of two nearby synapses before, during and after the stimulation of the second of them. In the initial stage, the two synapses have two ionic channels each, giving four ionic channels in total. During the stimulation (center), the second synapse increments the number of its ion channels to six, while the number of ion channels in the first synapse remains the same: two. Therefore the weight of the second synapse becomes three times bigger than that of the first, in a proportion 6:2 = 3. After some hours the total number of channels in both synapses goes back to the initial value, four: one channel in the first synapse and three channels in the second. However, the relative proportion of ion channels obtained during the stimulation, 6:2 = 3, is maintained, being in the last stage 3:1 = 3


weight of a set of nearby synapses is altered, the tendency is to draw this set back to its original overall weight.

Up to this point some homeostatic properties of synapses have been explained. In the following section we will start with a homeostatic property of neurons: the spike threshold adaptation property.

1.2 Neuron’s Properties

1.2.1 Spike threshold adaptation

The firing rate of a biological neuron depends on the value of its synaptic inputs according to the solid line in Figure 5. If the synaptic input is high, the neuron fires at its maximal rate, and if it is low, it stops firing. In an intermediate range, the firing rate is almost proportional to the synaptic input. This is the only useful range for the activity of the neuron: if its synaptic inputs vary inside either of the more extreme ranges, this variation produces no variation at the neuron's output. For this reason it is very important that the neuron is tuned to the intermediate region, which is the region of maximal firing rate variation.

This tuning actually occurs in real neurons, according to the studies of Desai and colleagues [Desai et al., 1999]. When the synaptic input activity is very high, the sigmoidal activation function shifts to the right, and when the synaptic input activity is very low, this function shifts to the left (see Figure 5). The shift of the neuron's activation function was a characteristic of Rosenblatt's Perceptron [Rosenblatt, 1956], a pioneering neural network for character recognition.
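A minimal sketch of this homeostatic tuning is shown below: a logistic activation function whose threshold drifts so as to keep the mean output near a target level. The adaptation rule, its rate and the slope are assumptions made for illustration, not the mechanism reported in [Desai et al., 1999].

```python
import numpy as np

def firing_rate(s, theta, slope=4.0):
    """Sigmoidal firing rate for synaptic input s and threshold theta."""
    return 1.0 / (1.0 + np.exp(-slope * (s - theta)))

theta, target, rate = 0.5, 0.5, 0.01
for s in np.random.uniform(0.6, 1.0, size=5000):   # persistently high input
    out = firing_rate(s, theta)
    theta += rate * (out - target)   # high activity shifts the curve right

print(theta)   # theta has drifted toward the high-input region
```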

1.2.2 Burst and tonic firing

The purpose of this paragraph is to illustrate that neurons are capable of a dual frequency behaviour. Thalamo-cortical neurons, which are placed in the

Figure 5. The activation function of a biological neuron is capable of shifting to the right or to the left depending on the synaptic input of the neuron. Let us suppose that the activation function is given by the solid line. If synaptic input levels become very high, the curve shifts to the right to avoid the saturation of the firing rate. Conversely, if synaptic input levels are very low, the activation function shifts to the left in order to prevent the neuron from staying silent


Figure 6. Some types of neurons, like thalamo-cortical neurons, present a dual firing behaviour: in their tonic firing mode the frequency of their response is proportional to the stimulus (10–165 Hz). However, when they are stimulated and afterwards inhibited during at least 100 msec, their response changes to burst firing with much higher frequency rates (150–320 Hz)

thalamus, at the core of the brain, are able to fire either in tonic or in burst mode, as shown in Figure 6. The main characteristic of the tonic mode is that the spiking frequency is proportional to the stimulus, lying in the range of 10 to 165 Hz. However, in the burst mode the frequency is not related to the input activation, lying in the range of 150 to 320 Hz. This burst mode is very interesting because it takes place after a precise sequence of preliminary events. For the burst mode to happen, the thalamo-cortical neuron needs to be positively stimulated and afterwards inhibited during at least 100 msec. After these two previous events, the burst firing is produced when a slight positive stimulation is given to the neuron. For a deeper study of these mechanisms see [Llinas and Jahnsen, 1982], [Llinas, 1994], [Steriade and Llinas, 1988].

The purpose of this dual behaviour is still a matter of controversy. Ropero [Ropero, 1997] proposed that the tonic mode serves for intrathalamic operations. When these intrathalamic operations are concluded, the result is relayed to the cortex via the burst firing mode.

1.3 Network Properties

1.3.1 Synchronization among neurons

Some types of neurons, for example the granule cells of the olfactory bulb and the reticular cells of the thalamus, are able to synchronize their activity and, afterwards, oscillate together [McCormick and Pape, 1990], [Steriade et al., 1987]. One of the causes of this behaviour is that these neurons possess dendro-dendritic electric contacts [Deschenes et al., 1985] in which the potential is communicated directly from one neuron to the other without any kind of neurotransmitter in between. The situation is as if we had a set of ping-pong balls tied together by fine cords and we used two very big bats to play with them: the movement of the balls becomes more and more uniform and synchronized during the play.

The kinetic energy given by each one of the bats to the balls corresponds to the electric energy of the ions entering the neurons. One type of ion increments the inner potential of the neurons when it is below a certain threshold, and another type of ion reduces the potential when it is above an upper voltage threshold.


This interplay between the ions and the potential sharing of dendro-dendritically connected neurons generates the synchronized oscillations. This behaviour was modelled and programmed in Matlab [Ropero, 2003], with the results shown in Figure 7.
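The following sketch illustrates the same qualitative phenomenon with a mean-field phase-coupling model (a Kuramoto-style simplification, not the Matlab model of [Ropero, 2003]): partially shared activity pulls a 7×7 net of oscillators into synchrony.

```python
import numpy as np

# Each neuron is an oscillator whose phase is partially pulled toward the
# mean phase of the net at every step, standing in for the sharing of
# potential through dendro-dendritic contacts. Parameters are assumptions.
rng = np.random.default_rng(1)
phase = rng.uniform(0, 2 * np.pi, size=(7, 7))   # oscillatory activities
omega = 0.1                                       # common angular step
coupling = 0.05                                   # strength of sharing

for step in range(500):
    mean_field = np.angle(np.mean(np.exp(1j * phase)))   # shared potential
    phase += omega + coupling * np.sin(mean_field - phase)

# The order parameter approaches 1 as the neurons synchronize.
print(abs(np.mean(np.exp(1j * phase))))
```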

1.3.2 Normalizing inhibition

Inhibitory neurons were once supposed to perform only subtraction [Carandini and Heeger, 1994] over other neurons, and this property was used for biasing the neurons in conventional neural network models like backpropagation or radial basis networks. The operation of biasing the neurons was equivalent to shifting the activation function of these neurons to the right or to the left, in a similar way to the one explained in section 1.2.1. This kind of subtracting or biasing inhibition is performed by means of GABA-B (gamma-aminobutyric acid) neurotransmission in real neurons. However, in many cases inhibition is performed by means of GABA-A instead of GABA-B, the effect of GABA-A inhibition being divisive and not subtractive.

We postulate that this GABA-A inhibition could perform a scaling or normalizing effect on the input patterns arriving at a certain layer of the brain.

Many structures in the brain have a layered organization. The input to each layer goes to two types of neurons: (A) neurons that make an excitatory projection onto the following layer; (B) GABA-A neurons that produce inhibition inside their own layer, thereby creating an inhibitory divisive field in the layer (see Figure 8).

Figure 7. The height of each intersection of lines over the surface represents the activation of a 7×7 net of neurons. If each of the neurons in this net has an oscillatory activity and the potential of each of them is partially shared with the other neurons, a synchronization of the activities takes place. From top to bottom and from left to right, a computer simulation of the synchronization of a 7×7 net of neurons is shown


Figure 8. Normalization of synaptic inputs due to an inhibitory field of GABA-A inhibitory interneurons. The six neurons of the lower layer of the figure form an excitatory input I = (I1, I2, I3, I4, I5, I6) impinging on a second layer of neurons (middle). This pattern produces an excitation (+) over the six neurons in the middle layer and over GABA-A inhibitory interneurons that are not shown. Once these inhibitory interneurons are activated, they create an inhibitory field that divides the activation of these middle-layer neurons by n(I) = n(I1)+n(I2)+n(I3)+n(I4)+n(I5)+n(I6). In this way the neuron at the top receives a normalized input i = (i1, i2, i3, i4, i5, i6) that is the result of dividing each of the components of pattern I by the constant n(I)

The activation of the excitatory and inhibitory neurons in each layer has almost the same absolute value, because the input pattern impinges at the same time on excitatory and inhibitory neurons. Therefore this inhibitory divisive field is proportional to this activation.

This divisive inhibition is able to produce a sort of normalization of the input patterns (see Figure 8 for more details).
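A minimal sketch of this divisive normalization is given below; note that patterns differing only in overall intensity map to the same normalized input.

```python
import numpy as np

# Divisive (GABA-A style) inhibition: the inhibitory field divides the
# layer's activations by the summed input, so the overall intensity
# (gain) of the pattern is factored out.
def divisive_inhibition(I):
    return I / np.sum(I)

I = np.array([0.4, 0.5, 0.1, 0.2, 0.7, 0.1])
print(divisive_inhibition(I))          # normalized pattern i
print(divisive_inhibition(3 * I))      # same result: gain is factored out
```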

2. UPDATING THE McCULLOCH-PITTS MODEL

Up to this point we have introduced several properties of real neurons with remarkable interest for computational purposes. Using some of them, we now try to update some of the characteristics of the McCulloch-Pitts paradigm of neural computation.

2.1 Up-to-date Synaptic Model

The classical model of synaptic weight alteration due to Hebb lacked many of the properties mentioned in the previous sections. Here we propose another


model that not only mimics the way biological reinforcement and depression are produced but also accomplishes the property of metaplasticity [Ropero and Simoes, 1999].

In our model the synaptic weight between the presynaptic neuron A and thepostsynaptic neuron B is calculated as:

(2) w_AB = P(B/A)

where B is a postsynaptic activation above a specific threshold and A a presynaptic action potential. As shown, the synaptic weight is calculated as a conditional probability. The above expression can also be written as:

(3) w_AB = P(B/A) = n(A ∩ B) / n(A)

in which the operator "n( )", number of times, quantifies how many times a certain event takes place; for example, how many times event A, event B or the intersection of A and B occurs. Starting with different values of the numerator and denominator, i.e. different initial weights, and allowing the postsynaptic neuron to fire according to a non-linear squashing function (logistic), a 3-D version of Figure 2 is obtained in Figure 9.

In this figure a continuous line drawn on the surface shows the evolution of the LTP threshold as a function of the initial weight. It can be noticed that a very simple statistical expression is able to account for a big variety of properties like

Figure 9. The computer simulation shows that metaplasticity takes place when the synaptic weight is calculated using the conditional probability P(B/A), B being a suprathreshold activation of the postsynaptic neuron and A the presynaptic action potential. A line joins the different LTP thresholds, each one of them corresponding to a different initial synaptic weight


reinforcement, depression and metaplasticity. Therefore, taking into account that conditional probabilities can be computed in synapses, the obvious question that arises here is: are real synapses the tiniest pieces of the most fascinating statistical computer ever imagined?
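As a concrete illustration, the sketch below implements the weight of Eqs. (2)–(3) as a running ratio of event counts; the initial counts (which fix the initial weight) and the 0/1 event encoding are illustrative assumptions.

```python
class ProbabilisticSynapse:
    """Weight as the running conditional probability P(B/A), estimated
    from counts of the presynaptic event A and the joint event A and B."""

    def __init__(self, n_A=1, n_AB=1):
        self.n_A, self.n_AB = n_A, n_AB     # initial counts set the weight

    @property
    def weight(self):                        # w_AB = n(A and B) / n(A)
        return self.n_AB / self.n_A

    def update(self, A, B):
        """A: presynaptic spike (0/1); B: suprathreshold activation (0/1)."""
        if A:
            self.n_A += 1
            self.n_AB += B
        return self.weight

syn = ProbabilisticSynapse()
for A, B in [(1, 1), (1, 1), (1, 0), (1, 1)]:
    print(syn.update(A, B))   # the weight rises with A-B coincidences
                              # and decays when A occurs without B
```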

2.2 Up-to-date Neuron Model

We propose a neuron model (see Figure 10) using the just presented equation for modelling the synaptic weights [Ropero, 1996]. In the soma, each of the excitatory postsynaptic potentials (EPSPs) is summed. An EPSP is obtained by multiplying the probability P(I_i) of an action potential in the presynaptic neuron by the corresponding weight P(O/I_i).

Although there are no probabilities at the presynaptic space, but action potentials at different frequencies, the product P(I_i)P(O/I_i) can approximate each of the EPSPs. These EPSPs are formed by the sum of voltage humps, each one of them produced by a presynaptic action potential, in a process known as temporal summation. When these humps are close together, they ride over previous humps, creating a taller EPSP. When they are far apart, as for example when the presynaptic action potential rate is low, they can hardly ride over each other and the resulting EPSP is low. Given that the maximal frequency of presynaptic action potentials is limited, the height of the resulting EPSP is also limited. This maximal height corresponds to a P(I_i)P(O/I_i) of value 1.

All the EPSPs travel from the dendrites to the soma, where they are summed. This sum is the so-called activation of the neuron, which is transformed afterwards into a frequency of action potentials by means of a logistic or sigmoidal function.

To prevent the saturation of the weights, a normalization of the input pattern by means of divisive inhibition is commonplace in the brain.
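A minimal sketch of this activation rule (the one depicted in Figure 10) is given below, with an assumed logistic slope and threshold.

```python
import numpy as np

def neuron_output(p_inputs, weights, slope=4.0, threshold=0.5):
    """Activation = sum_i P(I_i) * P(O/I_i), squashed by a logistic
    function to yield a firing probability/frequency."""
    activation = np.dot(p_inputs, weights)
    return 1.0 / (1.0 + np.exp(-slope * (activation - threshold)))

p_I = np.array([0.9, 0.2, 0.6])   # presynaptic firing probabilities P(I_i)
w = np.array([0.7, 0.1, 0.5])     # conditional-probability weights P(O/I_i)
print(neuron_output(p_I, w))
```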

Figure 10. Model of a neuron based on conditional probabilities for calculating the synaptic weights. In each synapse the probability of a presynaptic action potential is multiplied by the synaptic weight, and the result gives the postsynaptic activation in that synapse. The sum of the postsynaptic activations gives the activation of the neuron, which is calculated as P(O/I) = P(O/I1)P(I1) + P(O/I2)P(I2) + P(O/I3)P(I3)


2.3 Up-to-date Network Model

If the same pattern is input to several neurons instead of only one, a competitive process can take place so that only one neuron, the one whose activation is maximal, becomes the winner of the competition. When the winner fires, the remaining neurons are kept silent. Silencing the non-winning neurons is usually done by inhibitory feedback or lateral inhibition. To avoid a single neuron becoming the winner for every pattern, the probabilistic synapses should be normalized over time (see Figure 11). This is one of the possible roles of biological synaptic normalization: giving every neuron the same opportunity to fire.

But what biological mechanisms are involved in the selection of this winning neuron? In section 1.3.1 it was explained that the synchronized oscillation of neurons is a mechanism found at least in the thalamus and the olfactory bulb. This synchronized oscillation can allow the finding of the neuron with maximal activation: if a common oscillatory potential is summed to the activations of a layer of neurons, the neuron whose total activation reaches a certain firing threshold first is at the same time the one with the biggest activation [Ropero, 2003].
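The competitive selection itself can be sketched in a few lines; the numbers reproduce the example of Figure 11, while the winner selection by argmax stands in for the oscillatory mechanism just described.

```python
import numpy as np

# With normalized weights, the neuron whose weight vector best matches
# the input pattern has the maximal activation and wins the competition.
T = np.array([0.2, 0.4, 0.5])                    # input pattern of frequencies
W = np.array([[0.6, 0.4, 0.2],                   # neuron a
              [0.2, 0.4, 0.6],                   # neuron b (most similar to T)
              [0.4, 0.6, 0.2]])                  # neuron c

y = W @ T                                        # activations y1, y2, y3
print(y)                 # [0.38, 0.5, 0.42] as in Figure 11
print(np.argmax(y))      # index 1: neuron b wins the competition
```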

Figure 11. Synaptic normalization allows a competitive process among neurons. The neuron whose synaptic weight distribution w_ij is most similar to the input pattern of frequencies T = (t1, t2, t3) = (0.2, 0.4, 0.5) is also the one with maximal activation y_i = Σ_j w_ij t_j. This is the case of neuron b, whose weights (0.2, 0.4, 0.6) are most similar to vector T, so that the sum of the products of the input frequencies by their weights yields the maximal value (y1 = 0.38, y2 = 0.50, y3 = 0.42). Notice that, due to synaptic normalization, the number of ionic channels is the same in the three neurons. In summary, synaptic normalization is the property that allows the neuron whose weight distribution is most similar to the input pattern to exhibit the maximal activation


3. JOINING THE BLOCKS: A NEURAL NETWORK MODEL OF THE THALAMUS

Probabilistic synapses, synchronized oscillations, weight normalization and normalizing inhibition are all properties that were used to implement a realistic computational model of the thalamus. The thalamus is a structure at the core of the brain that relays sensory information from the senses to the cortex. The function of this structure was unknown; the model we propose helps in this way to understand the role of the thalamus in the computations of the brain [Ropero, 1996], [Ropero, 1997], [Pelaez, 2000]. The thalamus is basically a two-layered brain structure. The first layer is formed by thalamo-cortical neurons that receive sensory patterns and, after approximately 100 msec, send the result of the inner thalamic computation to the cortex. The second layer, formed by reticular neurons that oscillate synchronously, performs a competitive process by which each one of its neurons fires in the presence of specific characteristics of the input patterns.

When several of these neurons fire, they produce several inhibitory masks that, when superposed, create over the first layer a negative replica of the input pattern shown in Figure 12. If the input patterns are damaged or noisy, the negative replica recreates a perfect version of the input without defects or noise. Pattern reconstruction and noise rejection are two of the tasks that we postulate the thalamus is able to perform. For these tasks, a process of learning must take place at the level of the thalamus.

Our computer model of the thalamus, programmed in Matlab, has these two layers, each one of 9 × 9 = 81 neurons. The two layers are completely interconnected with each other, giving 2 × 81 × 81 = 13122 connections. It learned 36 characters over several epochs and is able to recognize and complete damaged or noisy patterns (see Figure 12). The learning capability of the model suggests that the real thalamus also has learning capabilities, a fact that had been completely ignored until now in thalamus research.

4. CONCLUSIONS

In this review we have presented several properties of synapses, neurons and networks that were not considered in previous neural network models but that have interesting computational potential.

The McCulloch-Pitts neuron model was based on the restricted knowledge about neurons that existed in the forties. Nowadays, a more comprehensive knowledge of the amazing properties of neurons can be used to update the McCulloch-Pitts model. In the case of synaptic plasticity, we presented several properties of synaptic weights such as directionality, the existence of both potentiation and depression thresholds, metaplasticity and normalization. Regarding neurons, relevant properties were introduced to the reader such as spike threshold adaptation and the dual frequency behaviour of some types of neurons. Finally, concerning networks of neurons, we studied the synchronization of a set of neurons and the normalizing inhibition produced by a set of GABA-A neurons over the input pattern of another neuron.


Figure 12. A biologically realistic computer model of the thalamus constituted by two layers of 9 × 9 = 81 neurons each. An example of the pattern reconstruction capability of the thalamus model is shown. (a) After being trained with 36 different characters (letters and numbers), a very noisy and damaged testing pattern is input which vaguely resembles a B. (b) An "I"-shaped sustained feedback inhibition over the first layer is produced by a reticular neuron in the second layer. After firing, the reticular neuron rests in refractoriness. This inhibition reduces the subsequent activation in the first layer. (c) Another neuron fires and immediately enters the refractory period, producing another sustained inhibition that is superposed over the previous one. Both inhibitions together are shaped like an E. (d) Finally, another reticular neuron fires and the total inhibition completely reconstructs the letter B, showing the reconstruction capability of the thalamic model. The central figure of each screen gives the value of the activations of the net of reticular neurons

With all these elements in mind, we proposed a new equation for synaptic reinforcement based on conditional probabilities. The paradigm of a neuron was also modified, taking into account that the neuron is always integrated in a network. For example, if the neuron were detached from the inhibitory field that normalizes its inputs, its active synaptic weights would increase without bound and the neuron would be saturated most of the time.

It was also shown that the normalization of synaptic weights is an important condition for allowing a competitive process between neurons. An example of such competition, and of all the mentioned properties working together, is the model of the thalamus that we programmed in Matlab. It learned 36 characters and exhibits the property of completing damaged or noisy patterns.


We expect that the reader benefits from this paper's account of recently found neural properties when creating new artificial neural networks or trying to emulate the functioning of the brain.

REFERENCES

Abraham, W.C., and Bear, M.F. (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends in Neuroscience 19:126–130.

Abraham, W.C., and Tate, W.P. (1997) Metaplasticity: a new vista across the field of synaptic plasticity. Progress in Neurobiology 52:303–323.

Artola, A., Brocher, S., and Singer, W. (1990) Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347:69–72.

Bear, M.F., Connors, B.W., and Paradiso, M.A. (2001) Neuroscience: Exploring the Brain. Lippincott, Williams & Wilkins. USA.

Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience 2(1):32–48.

Carandini, M., and Heeger, D.J. (1994) Summation and division by neurons in primate visual cortex. Science 264(5163):1333–1336.

Carpenter, G., and Grossberg, S. (1988) The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21(3):77–88.

Deschenes, M., Madariaga-Domich, A., and Steriade, M. (1985) Dendrodendritic synapses in the cat reticularis thalami nucleus: a structural basis for thalamic spindle synchronization. Brain Research 334:165–168.

Desai, N.S., Rutherford, L.C., and Turrigiano, G.G. (1999) Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience 2:515–520.

Hopfield, J.J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79:2554–2558.

Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59–69.

Llinás, R., and Jahnsen, H. (1982) Electrophysiology of mammalian thalamic neurones in vitro. Nature 297:406–408.

Llinas, R., Ribary, U., Joliot, M., and Wang, X.J. (1994) Content and context in temporal thalamocortical binding. In G. Buzsaki et al. (Eds.), Temporal Coding in the Brain (pp. 151–172). Berlin: Springer-Verlag.

McClelland, J.L., Rumelhart, D.E., and The PDP Research Group. (1986) Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

McClelland, J.L., and Rumelhart, D.E. (1988) Explorations in parallel distributed processing. Cambridge, MA: MIT Press.

McCormick, D.A., and Pape, H.-C. (1990) Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurons. Journal of Physiology (London) 431:291–318.

McCulloch, W., and Pitts, W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5:115–133.

Ropero Peláez, J. (1996) A formal representation of thalamus and cortex computation. Proceedings of the International Conference of Brain Processes, Theories and Models. Edited by Roberto Moreno-Díaz and José Mira-Mira. MIT Press.

Ropero Peláez, J. (1997) Plato's theory of ideas revisited. Neural Networks 10(7):1269–1288.

Ropero Pelaez, J., and Godoy Simoes, M. (1999) A computational model of synaptic metaplasticity. Proceedings of the International Joint Conference on Neural Networks 1999. Washington DC.

Ropero Peláez, J. (2000) Towards a neural network based therapy for hallucinatory disorders. Neural Networks 13:1047–1061.


Ropero Peláez, J. (2003) PhD Thesis in Neuroscience: Aprendizaje en un modelo computacional del tálamo. Faculty of Medicine. Autónoma University of Madrid.

Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.

Steriade, M., Domich, L., Oakson, G., and Deschenes, M. (1987) The deafferented reticular thalamic nucleus generates spindle rhythmicity. The Journal of Neurophysiology 57:260–273.

Steriade, M., and Llinas, R.R. (1988) The functional state of the thalamus and the associated neuronal interplay. Physiological Review 68(3):649–739.

Tompa, P., and Friedrich, P. (1998) Synaptic metaplasticity and the local charge effect in postsynaptic densities. Trends in Neuroscience 21(3):97–101.

Turrigiano, G.G., Leslie, K.R., Desai, N.S., Rutherford, L.C., and Nelson, S.B. (1998) Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature 391:892–896.


CHAPTER 7

SUPPORT VECTOR MACHINES

JAIME GÓMEZ SÁENZ DE TEJADA1, JUAN SEIJAS MARTÍNEZ-ECHEVARRÍA2

1 Escuela Politécnica Superior, Universidad Autónoma de Madrid
2 Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid

Abstract: Support Vector Machines are the most recent algorithm in the Machine Learning community. After a bit less than a decade of life, they have displayed many advantages with respect to the best older methods: generalization capacity, ease of use and solution uniqueness. They have also shown some disadvantages: maximum data handling and speed in the training phase. However, these disadvantages will be overcome in the near future as computer power increases, leaving an all-purpose learning method that is both cheap to use and gives the best performance. This chapter provides an overview of the main SVM configuration, its mathematical applications and the easiest implementation

Keywords: Support Vector Machines, Machine Learning

INTRODUCTION

Machine Learning has become one of the main fields in artificial intelligence today. Whether in pattern recognition or in function estimation, statistical Machine Learning tries to find a numerical hypothesis which adapts correctly to the given data, that is, machines able to generalize the statistical distribution of a representative data set. Once we have generated the hypothesis, all future unknown patterns following the same distribution will be correctly classified.

From the principles of statistical mechanics, a handful of algorithms have been devised to solve the classification problem, such as decision trees, k-nearest neighbour, neural networks, Bayesian classifiers, radial basis function classifiers and, as a newcomer, support vector machines (from now on SVM).

The basic SVM is a supervised classification algorithm introduced by Vladimir Vapnik, motivated by VC (Vapnik-Chervonenkis) theory [Vapnik, 1995], from which the Structural Risk Minimization concept was derived.


In the late 70's, Vapnik studied the numeric solution of convex quadratic problems applied to Machine Learning, and defined an immediate ancestor of SVM called the 'Generalization portrait'. In the early 90's, Vapnik joined Bell Laboratories, where his ideas evolved until the creation of the term 'support vector machines' in 1995.

Nevertheless, the basic mathematics behind SVM were developed much earlier. The concept of generating a hyperplane outside the input space to separate data in input space, the heart of SVM, was settled in 1964. The study of convex quadratic problems gave the Karush-Kuhn-Tucker optimality conditions in 1939, while the definition of valid kernel functions for the transformation described above was formulated by Mercer in 1909.

This chapter provides an introductory view of SVM, so that any computer scientist or engineer can develop his own SVM implementation and apply it to any real-world machine-learning problem. For that purpose, we will sacrifice some mathematical completeness for the sake of clarity. The chapter has four sections: first, the SVM will be defined and analysed; second, the main mathematical uses of the SVM principle will be developed; third, a comparison between SVM and neural networks will be studied; last, the best current implementation approach will be shown.

Support Vector Machines are easy to understand, not too difficult to implement, and child's play to use. If you need a generic Machine Learning method, forget about neural networks or any other method you previously learnt: the SVM family globally outperforms them all.

1. SVM DEFINITION

1.1 Structural Risk

Classifiers having a large number of adjustable parameters (and thus great capacity) will most probably generate overfitting, learning the training data set without errors but with poor generalization ability. On the contrary, a classifier with insufficient capacity will not be able to generate a hypothesis complex enough to model the data. A middle point must be found where the adjustable parameters are neither too many nor too few, for both the training and test sets. For that reason, it is essential to choose the kind of functions a learning machine can implement.

For a given problem, the machine must have a low classification error and also small capacity. Capacity is defined as the ability of a given machine to learn any training set without errors. For example, the 1-nearest neighbour classifier has infinite capacity, but is a poor classifier for unseen test data with complex distributions and noisy sets. A machine with great capacity will tend to overfit the data, making it no longer useful because it does not learn. For extended information about these issues, see [Burges, 1998].

There are a handful of mathematical bound expressions that define the relations between a machine's learning ability and its performance. The underlying theory tries to find under which circumstances, and how fast, the performance measure converges as the number of input data for training increases. In the limit, with an infinite number of points, we could have a correct performance value, better than just an estimation.


With respect to the SVM, we will use one limit definition in particular, which will take us to the Structural Risk Minimization (SRM) principle [Vapnik, 1995].

Suppose we have l observations, the input data in the training phase. Each observation consists of a pair of values (xi, yi), where xi ∈ ℝⁿ, i = 1, ..., l, is a vector and yi ∈ {1, −1} is the fixed associated label, given by a consistent data source. We assume there is an unknown probability distribution P(x, y) from which the data points have been drawn. Data are always assumed to be independently drawn and identically distributed.

Suppose we have a machine whose task is to learn the mapping xi → yi. This generic machine is really defined by a set of possible mappings x → f(x, α), where the functions f(x, α) are generic, defined by the set of adjustable parameters α. This machine is by definition deterministic, that is, for a given input vector xi and a parameter set α, we will always obtain the same output f(xi, α). Choosing the parameter set α gives a trained machine. For example, a neural network with a fixed architecture and fixed weights (parameter set α) would be a trained machine as defined in these paragraphs.

Thus, the expected error in the phase test for a trained machine is:

(1) R��� =∫ 1

2�y− f�x����dP�x� y�

The value R(α) is called the expected risk. Nevertheless, this expression is difficult to use because the probability distribution P(x, y) is unknown. Thus, a variation of the formula is developed to use the finite number of available observations. It is called the empirical risk, and is defined as the measured mean error rate on the training set:

(2) Remp(α) = (1/2l) ∑i=1..l |yi − f(xi, α)|
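As a concrete reading of equation (2), the sketch below (our own illustration) counts each misclassification as one error, since with labels in {+1, −1} the quantity |yi − f(xi)|/2 is 1 for an error and 0 otherwise:

# Empirical risk (eq. 2): mean error rate measured on a labelled sample.
def empirical_risk(f, xs, ys):
    return sum(abs(y - f(x)) for x, y in zip(xs, ys)) / (2 * len(xs))

# Example with a fixed threshold classifier on scalar inputs:
f = lambda x: 1 if x < 1.5 else -1
print(empirical_risk(f, [1.0, 2.0, 3.0], [1, -1, 1]))   # one error in three: 0.333...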

Remp(α) is a fixed number for a given parameter set α and test set. It has been shown in [Vapnik, 1995] that the following bound holds:

(3) R(α) ≤ Remp(α) + g(h)

where g(h) is a real number which is directly related to the VC dimension.

Again, a learning machine is defined as a set of parameterised functions (called a family of functions) having a similar structure. The Vapnik-Chervonenkis (VC) dimension is a non-negative integer that measures the generalization capacity previously defined. The VC dimension of a given learning machine is defined as the maximum number of points that can be correctly classified using functions belonging to the family set. In other words: if the VC dimension equals h, then there exists a set of h points that can be classified with family functions regardless of the point labels. Note that, first, there cannot exist a set of h + 1 points satisfying the constraint; and second, you only need one set of h points for the definition to be applicable (it does not say "for all sets of h points").


Figure 1. Three wisely chosen points

Let's try an example. Suppose we are in ℝ² space and the learning machine L1 is defined as the set of "one straight line" classifiers. In figure 1 we choose three points. We see (you can try) that, for every combination of labels (8 possible combinations using 3 points with two labels), they can be separated using one straight line. For each combination a different straight line would be used, but it would still be a member of the family set. Therefore the VC dimension of the analysed learning machine is at least 3. If we try 4 points (any 4-point set) we will not be able to satisfy all constraints, so we can state that "one straight line" classifiers in ℝ² space have VC dimension equal to 3.

Another example. Suppose we are in ℝ² space and the learning machine L2 is defined as the set of "two-segment line" classifiers (continuous but non-differentiable at the joint point). In figure 2 we choose five points. Again, try all possible label combinations (now 32). Using a two-segment line you can separate all 32 cases, but it would not be possible to do so for 6 points (any set of 6 points). Therefore the VC dimension of this learning machine is 5.

Figure 2. Five wisely chosen points


Figure 3. Training set and two valid classifiers, "straight line" (dashed line) and "two-segment line" (solid line)

When facing a problem that can be classified using different learning machines, as can be seen in figure 3, which one is better?

The SRM principle will try to find the learning machine with the lowest VC dimension that correctly classifies all data points. The consequences are analysed in section 1.12.

As regards the SVM definition, the SRM principle and the VC dimension concept require that the chosen classifier be the one with the largest margin (linear SVM use the family of linear hyperplanes in input space), defined in the next section.

1.2 Linear SVM for Separable Data

The simplest case for a SVM is that of linear machines trained with a separable data set (see Figure 4a).

Figure 4a. Linear separable training set


Suppose we have a training data set made of pairs (xi, yi), i = 1, ..., l, such that xi ∈ ℝᵈ, yi ∈ {1, −1}. Suppose there exists a hyperplane in ℝᵈ which separates the positive from the negative examples (according to their yi value). Points that are exactly on the hyperplane satisfy the condition:

(4) w•x +b = 0

where w is the vector perpendicular to the hyperplane (regardless of the norm), |b|/||w|| (absolute value of the term b divided by the module of the vector w) is the distance from the origin to the hyperplane, and the operator • is defined as the dot product in the Euclidean space to which the data belong (we will use the scalar product between two d-dimension vectors).

Let d+ (d−) be the shortest distance between the plane and a positive (negative) example; the margin of the hyperplane is defined as d+ + d−. We can say that, by maximizing the classifier margin, we will decrease the risk limit defined in (3). This is the basis for the following SVM mathematical development.

For the linear and separable case, the SVM algorithm calculates the separating hyperplane that maximizes the classifier margin. Thus, all training data must satisfy the following constraints:

(5) w•xi + b ≥ +1, for yi = +1

(6) w•xi + b ≤ −1, for yi = −1

which can be formulated in one expression:

(7) yi(w•xi + b) − 1 ≥ 0 ∀i

All points for which equality in inequality (5) holds are on the hyperplane H1: w•xi + b = 1, parallel to the separator hyperplane and at distance |1 − b|/||w|| from the origin. In much the same way, those points for which equality in inequality (6) holds are on the hyperplane H2: w•xi + b = −1, parallel to H1 and to the separator hyperplane, and at distance |−1 − b|/||w|| from the origin.

Thus, d+ = d− = 1/||w||, and so the margin is 2/||w||. We must find a pair of planes (H1, H2) that maximize the margin, minimizing ||w||², with respect to the constraints defined in inequality (7).

Note that, in the training phase, no data point will be between H1 and H2 or on the wrong side of its class plane (that is the reason for calling it the separable case). Those points for which equality in inequality (7) holds (those placed on H1 or H2), and which, if eliminated from the training set, would give a different solution (by definition they would change d+ or d−), are called support vectors.

The name comes from the fact that the learning machine is completely defined by these points and their weights on the hyperplane. All other training points, which are at a greater distance from the hyperplane than the support vectors, serve no purpose: if we had begun the training without them, the solution would have remained the same (see Figure 4b).


Figure 4b. Linear SVM classifier. Support vectors are encircled, the margin is shown with two dashed lines and the separator hyperplane is shown with a solid line

The problem can be reformulated using Lagrange multipliers. This will help us to add constraints to the problem more easily, and will let the training data appear only in the form of dot products between vectors, which will allow us to generalize the SVM algorithm to the non-linear case. The general rule for creating the Lagrange formulation is: for constraints of the type c ≥ 0, the constraint equation is multiplied by a Lagrange multiplier and subtracted from the objective function. Thus, we introduce non-negative Lagrange multipliers αi, i = 1, ..., l, one for each constraint in inequality (7), that is, one for each training point. The Lagrangian we obtain is:

(8) LP = ½ ||w||² − ∑i=1..l αi yi (w•xi + b) + ∑i=1..l αi

We want to minimize LP with respect to w and b (the variables that define the plane), while requiring that the partial derivatives of LP with respect to the αi vanish. By definition, this is a convex quadratic optimisation problem, because the objective function is convex and the constraints also define a convex set [Burges, 1998].

This means we can solve the problem using the dual formulation [Fletcher, 1987]. This Wolfe dual formulation has the following property: the maximization of LD (in contrast with the primal formulation LP) under the defined constraints occurs at the same values of w and b as the minimization of LP described in the previous paragraph. All partial derivatives must be zero at the optimum. Calculating the partial derivatives of LP with respect to b and w, we obtain the following conditions:

(9) w = ∑i=1..l αi yi xi

(10) ∑i=1..l αi yi = 0


which, substituting in equation (8), gives:

(11) LD = ∑i=1..l αi − ½ ∑i=1..l ∑j=1..l αi yi αj yj (xi•xj)

Therefore, the problem is now written as "Maximize LD with respect to all αi, subject to αi ≥ 0 and condition (10)".

There is a Lagrange multiplier for each training point, but only those having αi > 0 are of any importance in calculating the separator hyperplane with equation (9). These are the support vectors, which were defined in previous paragraphs.

The geometric interpretation of (11) is easier if the second term is substituted using (9). Suppose we are in an intermediate optimisation state, and we want to calculate the second term for point i = 0. That term is:

∑j=1..l α0 y0 αj yj (x0•xj) = α0 y0 ∑j=1..l αj yj (x0•xj) = α0 y0 (x0 • ∑j=1..l αj yj xj) = α0 y0 (x0•w)

The scalar product of a point and a vector normal to the hyperplane gives the projection of the point over the vector, that is, the relative distance between point and hyperplane. The underlying concept of the formula is:

term = α0 × (correctness of classification) × (distance between the point and the currently defined separator plane)

At each step i, the relation between xi and the current-state w is calculated. Therefore, we can deduce some hand-made optimisation rules:
A) If classification is correct, the term is negative, so αi should decrease, and thus reduce its weight (its importance) in the calculation of the current w, in case the optimum has not been reached.
B) If the distance is big with respect to other points of the same class, and the point is correctly classified, αi should decrease, while the multiplier αk of another same-class point closer to the margin should increase.

Note that, when evaluating the correctness of a point during the training phase, the point itself is used. If a point is misclassified, the algorithm will increase its multiplier as much as needed, forcing the hyperplane definition until this point's condition is satisfied. For the linear separable case this strategy is valid, because sooner or later the point must be correctly classified. But for non-linear or non-separable cases, this strategy may give poor results. If there is some noise in the training data, the algorithm will try to force the hyperplane definition to classify points that are wrong. This will generate overfitting, so the performance will be poorer.

Therefore, the SVM training algorithm consists of the following basic steps:


1. Identify all training data points and their labels.
2. Optimize (maximize) the dual Lagrangian, maintaining the constraints αi ≥ 0 and (10). For that purpose, there are many convex quadratic optimisation methods described in the mathematical literature [Fletcher, 1987]. The result of the optimisation phase is the set of all Lagrange multiplier values αi. Basic optimization methods have important limits on the resources (time and memory) needed for big problems (more than 10,000 patterns). Thus, at the beginning of SVM history, efficient optimization algorithms were the basic research line. In the final section the best SVM algorithm will be shown: SMO.
3. Throw away all those points which are not support vectors after the training process (i.e. those having αi = 0), and calculate the values of w and b from the support vectors and formulas (9) and (7). Then, we will have a completely defined optimum separator hyperplane.
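These three steps can be sketched end to end with a generic convex optimiser standing in for a dedicated method such as SMO. The following NumPy/SciPy illustration (our own, not the chapter's implementation) trains the hard-margin machine on the toy set worked by hand in the next section:

import numpy as np
from scipy.optimize import minimize

# Step 1: training points and labels (the example of section 1.4).
X = np.array([[1., 1.], [2., 1.], [3., 1.]])
y = np.array([1., -1., -1.])

# Step 2: maximize L_D (eq. 11), i.e. minimize its negative, subject to
# alpha_i >= 0 and the equality constraint (10).
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j (x_i . x_j)
neg_LD = lambda a: 0.5 * a @ G @ a - a.sum()
res = minimize(neg_LD, np.zeros(len(y)),
               bounds=[(0., None)] * len(y),
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x                                   # close to (2, 2, 0)

# Step 3: discard non-support vectors, recover w (eq. 9) and b (eq. 7).
sv = alpha > 1e-6
w = (alpha * y) @ X                             # close to (-2, 0)
b = np.mean(y[sv] - X[sv] @ w)                  # close to 3
print(alpha, w, b)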

1.3 Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions represent necessary and sufficient conditions for a solution to exist for the problem defined in step 2 of the previous algorithm. This solution identifies the optimum value of the objective function (LP) with respect to all available parameters (all αi).

Many SVM algorithms use these KKT conditions to identify whether the machine's current state is the optimum and, if not, which points violate these optimality conditions the most. For the basic SVM definition given in this chapter, the optimality conditions are:

(7.9) w = ∑i=1..l αi yi xi

(7.10) ∑i=1..l αi yi = 0

(7.7) yi(w•xi + b) − 1 ≥ 0

(7.7 bis) αi (yi(w•xi + b) − 1) = 0

αi ≥ 0

Most of them have been introduced in previous sections of this chapter, but they are repeated here for better comprehension of the optimisation process.

The new equation (7.7 bis) is easy to interpret. It concerns the points that must hold equality in inequality (7). It could be stated in the following words: "Any training point either holds equality in inequality (7), or its Lagrange multiplier vanishes, i.e. αi = 0".

If a point holds equality in (7) and αi ≠ 0, then the point is on the margin hyperplane and is a support vector. It can also happen that both conditions hold, that is, equality (7) holds and αi = 0. In that case, the point is on its class margin hyperplane but it is not needed for the hyperplane definition, and therefore it is not a support vector.
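The conditions translate directly into a numerical optimality test. A small sketch (our own, following the hard-margin definitions above) returns the quantities that must vanish or stay non-negative at the optimum:

import numpy as np

def kkt_check(X, y, alpha, tol=1e-6):
    # KKT residuals for a candidate multiplier set alpha.
    w = (alpha * y) @ X                 # condition (7.9)
    balance = alpha @ y                 # condition (7.10): must be ~0
    sv = alpha > tol
    b_all = y[sv] - X[sv] @ w           # every support vector must yield
    b = b_all.mean()                    # one and the same b at the optimum
    margins = y * (X @ w + b) - 1       # condition (7.7): must be >= 0
    slack = alpha * margins             # condition (7.7 bis): must be ~0
    return balance, b_all, margins, slack

For the example of the next section, alpha = (2, 2, 0) gives b = 3 from both support vectors, non-negative margins and zero complementary slackness, confirming the optimum.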


1.4 Optimisation Example

To show the optimisation process with more clarity, we will introduce an example. Suppose we have 3 points {(1, 1), (2, 1), (3, 1)} ∈ ℝ² with labels (+1, −1, −1) respectively (see figure 5).

Suppose the initialisation routine defines the Lagrange multipliers as α1 = 2, α2 = 1, α3 = 1 (holding condition (10)). We use formulas (9) and (11) to calculate the following:

w1 = 2(1, 1) − 1(2, 1) − 1(3, 1) = (−3, 0)

LD1 = 4 − ½(−6 + 6 + 9) = −0.5

Then, we check whether this is a valid solution for our SVM. For that purpose, we use the KKT conditions, especially condition (7). Note that all three points would be support vectors, so they must yield the same value of b when substituted in condition (7). At this optimisation stage this is not true for w1, because we obtain b = 4, b = 5 and b = 8. Thus we can say, without doubt, that this is not a solution.

Now we must find another set of Lagrange multipliers that brings an increase of LD. Point 3 is farthest from the current pseudo-hyperplane (being correctly classified), so it is a good candidate for decreasing its weight in the definition of w (see section 1.2). Suppose the new Lagrange multiplier values are α1 = 1, α2 = 1, α3 = 0 (condition (10) must always hold).

w2 = 1(1, 1) − 1(2, 1) − 0(3, 1) = (−1, 0)

LD2 = 2 − ½(−1 + 2) = 1.5

We made a good choice because LD has increased. Nevertheless, we still do not satisfy the KKT conditions. When we substitute in equation (7), we obtain b = 2 and b = 1 for the two points respectively (we now have only two support vectors).

Figure 5. A linear separable set with margin and separator hyperplane


Now that we have two support vectors of different classes, their Lagrange multipliers must change in the same way for condition (10) to hold. We increase, for instance, to α1 = 2, α2 = 2, α3 = 0.

w3 = 2(1, 1) − 2(2, 1) − 0(3, 1) = (−2, 0)

LD3 = 4 − ½(−4 + 8) = 2

Again, LD has increased, so we have chosen wisely. Moreover, at this optimisation step the KKT conditions hold, with the same value of b for all support vectors, b = 3. We can assert without any doubt that the optimum has been reached. If, for instance, we continue to increase the multipliers to α1 = 3, α2 = 3, α3 = 0, the result would not be valid. We would obtain:

w4 = 3(1, 1) − 3(2, 1) − 0(3, 1) = (−3, 0)

LD4 = 6 − ½(−9 + 18) = 1.5

The convexity required of the objective function can be observed: LD1 < LD2 < LD3 > LD4. Moreover, as the example is so small, some degree of uniform quadratic convexity can be seen, as LD2 = LD4 beneath the optimum.

During the optimisation process, while the KKT conditions do not hold, the unique separator hyperplane does not exist. At each new step (each new set of values of α) there is only one hyperplane direction, but as many separator hyperplanes as support vectors in the training set (different values of b). These hyperplanes need not have a geometric meaning; they do not try to separate the data, even though they could. As we get closer to the optimum (increasing LD), all support-vector-defined hyperplanes come closer to each other (less difference in the b value). The limit is reached when LD reaches the optimum value and all hyperplanes match up with only one value of b: the separator hyperplane.

This concept differs largely from the search process followed by other similar methods, like the perceptron. The latter always defines a separator hyperplane that evolves at each training step, trying to classify correctly all training data. For that reason, it can reach a state in which all data points are correctly classified but whose margin is not the optimum. That is called a local minimum, where the perceptron will be trapped and will not be able to continue. The SVM algorithm performs a quadratic optimisation in which no intermediate state can be considered a valid solution. There will be only one solution, it will be global, and it will be the best you can have.

Even though the soft-margin SVM definition will take place in later sections, this is a good place to see what happens when the optimisation algorithm is applied to a non-separable data set. Suppose we have again the 3 points used before, {(1, 1), (2, 1), (3, 1)} ∈ ℝ², but now with different labels (+1, −1, +1) (see figure 6). We have changed the third point's label, so the training set becomes non-separable with a linear machine. Nevertheless, this information is not given to the SVM algorithm.


Figure 6. A linear non-separable set

Suppose we initialise the values as α1 = 1, α2 = 2, α3 = 1 (condition (10) holds).

w1 = 1(1, 1) − 2(2, 1) + 1(3, 1) = (0, 0)

LD1 = 4 − ½(+0 − 0 + 0) = 4

Of course, this cannot be a solution. We do not need to check the KKT conditions, because w = (0, 0) does not define a hyperplane. At this stage we cannot guess which points are better to change, so we do it randomly. Suppose we define a new state α1 = 1.5, α2 = 2, α3 = 0.5 (there are not many more alternatives).

w2 = 1.5(1, 1) − 2(2, 1) + 0.5(3, 1) = (−1, 0)

LD2 = 4 − ½(−1.5 + 4 − 1.5) = 3.5

We obtain LD2 < LD1, so we can be sure this is not a solution and, even more, this way will take us nowhere. We choose another possible set, α1 = 2, α2 = 4, α3 = 2.

w3 = 2(1, 1) − 4(2, 1) + 2(3, 1) = (0, 0)

LD3 = 8 − ½(+0 − 0 + 0) = 8

As in the first case, this cannot be a solution. But LD has increased quite a lot, and we could think this is getting us closer to the solution. Note, however, that we could keep increasing the α multipliers in this way, knowing that LD = α1 + α2 + α3 whenever w = (0, 0), and so the objective function increases without limit (note that in this example the problem is not characterized by a quadratic function but by a linear one, so there cannot be an optimisation solution).

Therefore, if the objective function increases without limit, then we are applying a linear separable machine to a linear non-separable training set.
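That divergence is easy to exhibit numerically: with the flipped label, any multiplier set giving w = (0, 0) makes the quadratic term of LD vanish, so scaling the multipliers grows the objective linearly (our own check, reusing the dual of eq. (11)):

import numpy as np

X = np.array([[1., 1.], [2., 1.], [3., 1.]])
y = np.array([1., -1., 1.])                 # the non-separable labelling
G = (y[:, None] * X) @ (y[:, None] * X).T
LD = lambda a: a.sum() - 0.5 * a @ G @ a

a0 = np.array([1., 2., 1.])                 # yields w = (0, 0) and holds (10)
for t in (1, 2, 4, 8):
    print(LD(t * a0))                       # 4, 8, 16, 32: unbounded growth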


1.5 Test Phase

As has been said, once we have trained a SVM we obtain the values w and b. With these values we define a separating hyperplane, w•x + b = 0, parallel to H1 and H2 and placed in the middle, at the same distance from both. To classify an unseen pattern x, we just need to know on which side of the separator hyperplane the point lies, i.e., the sign of (w•x + b). Note that in the test phase we may find data points placed in between H1 and H2 which, had they been used during training, would have changed the solution somehow. This concept may be useful when developing SVM training algorithms, because it could find support vectors a priori, before the whole training, saving computational power.

Up until now we have mentioned only the binary case, that is, data can only have two classes. SVM classifiers can be easily extended to the multiple-class case: for n classes, we just need to generate n−1 binary classifiers, each separating one class from the rest. Nevertheless, this multiple classifier is O(n) more complex in time (memory resources are more difficult to estimate) than one binary classifier, in the training as well as the test phase. As this extension does not provide major new advances, it will not be mentioned in the rest of this chapter.
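Both the binary test and the one-against-the-rest extension reduce to sign tests over (w•x + b). A schematic sketch (the names are ours and the (w, b) pairs are assumed to be already trained elsewhere):

import numpy as np

def classify(x, w, b):
    # Binary test phase: which side of the separator hyperplane is x on?
    return 1 if (w @ x + b) >= 0 else -1

def classify_multiclass(x, machines, default_label):
    # 'machines' maps n-1 of the n class labels to trained (w, b) pairs,
    # each separating its class from the rest; the remaining class is
    # assigned when no machine claims the pattern.
    for label, (w, b) in machines.items():
        if (w @ x + b) >= 0:
            return label
    return default_label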

1.6 Non-Separable Linear Case

Now that we know everything that is needed to create and use a simple SVM, we will upgrade its definition so that it will be able to deal with any real-life problem.

When the above-described algorithm for separable data is used on non-separable data (see figure 7), no solution will be found, as the value of LD will grow without limit (see section 1.4). For the non-separable data to satisfy the initial constraints, we have to introduce the concept of a soft margin. This means that the algorithm will allow some training points to violate those constraints, so that the rest of the training data will be correctly classified (regardless of the violating points).

Figure 7. A linear non-separable set, which needs a soft-margin classifier. The distribution is defined as class = 1 if x1 + x2 > 7.5; class = −1 otherwise. The distribution has some noise


For that purpose we introduce positive slack variables ξi, one for each point, such that the following inequalities hold [Cortes and Vapnik, 1995]:

(12) w•xi + b ≥ +1 − ξi, for yi = +1

(13) w•xi + b ≤ −1 + ξi, for yi = −1

The values ξi are not fixed prior to the training; they will be calculated during the optimisation process. And because they are not fixed, we can be certain that all points will satisfy inequalities (12) and (13): just increase ξi until the inequality holds. This would seem to solve our troubles: now there will always be a solution. But the solution may not be close enough to the true distribution underlying the data; if so, the solution is useless, and we have just changed the name of our worries.

The introduction of these variables must be accompanied by an increase of the primal Lagrangian LP, so that classification errors during training are penalized. For a training pattern classification error to take place, its associated ξi must be greater than 1, so

∑i=1..l ξi

is a good estimate of an upper bound on the number of training errors over the complete training set. Therefore, the objective function to be minimized changes from ½||w||² to

½||w||² + C ∑i=1..l ξi

where C is a parametrizable non-negative real value. This value corresponds to the global penalization given to training errors. This objective function could have been different: we could have devised other methods for forcing the ξi values to be as small as possible. The choice of exactly that function follows simplicity reasons: the problem continues to be convex quadratic, and neither the ξi nor the Lagrange multipliers associated with these new constraints appear in the dual formulation of the problem. Therefore, we have to maximize LD:

(14) LD = ∑i=1..l αi − ½ ∑i=1..l ∑j=1..l αi yi αj yj (xi•xj)

with constraints:

(15) 0 ≤ αi ≤ C

(16) w = ∑i=1..l αi yi xi

(17) ∑i=1..l αi yi = 0


The only difference between the previous algorithm and this last one is that now the αi have an upper bound C. The training algorithm will not allow any point to increase its weight indefinitely, and so a solution will eventually be found.

The error term in the optimisation process goes to those points that have ξi > 0, either because they are incorrectly classified or because they lie inside the margin. For any point that satisfies ξi > 0, it can be stated that αi = C. It is still a support vector, and it will be treated as such in the calculation of w, but in the optimisation process its weight will grow no more. The soft margin philosophy (against the hard margin defined in section 1.2) is not to forbid training errors, nor even to minimize them alone. The idea is to minimize the whole objective function, in which the errors exert some pressure, as does the hypothesis robustness, identified as the margin maximization between those well-classified points at each side of the separating hyperplane (characterised by constraint (7)).

Suppose, for instance, the case shown in figure 7. A hard margin classifier cannot be found, but many soft margin classifiers will satisfy the constraints, the only difference being the C value. The first approach for newcomers is usually the hardest soft margin possible, one that looks like figure 8. It is a valid solution, but it has a very small margin. By the definition of structural risk minimization, if we increase the margin, test errors should decrease (better generalization performance). On the other hand, training errors should be avoided (or at least limited), so a balance must be found between margin maximization and error permissibility. A small quantity of noise may be accepted without modifying the generalization performance, by creating a hypothesis that is developed from some common properties satisfied by the data (the internal, true data distribution). In the case of figure 9, it is easily seen that more training points become errors, but the classifier is much closer to the underlying distribution concept.

The new parameter C becomes the only value (until now) that must be provided in the SVM architecture. As has been said, C serves as a balance between error permissibility and generalization goodness.

Figure 8. The figure 7 set, with a rather hard soft margin classifier


Figure 9. The figure 7 set, with a softer margin classifier

– If C is small, then errors are cheap. The margin will grow, and so will the number of training patterns that violate the margin.

– If C is big, then the value of w has small relevance in the objective function optimisation compared with the training errors. We are approaching the hard margin philosophy. Because the value of w is closely related to the margin maximization, decreasing the relevance of w will take us to a smaller margin and, maybe, to a worse generalization ability.
To choose a good C value, model complexity and expected data noise must be evaluated as a whole.
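In implementation terms, the soft margin changes a single line of the hard-margin training sketch shown earlier: the multipliers acquire the upper bound C of constraint (15). A minimal illustration (our own, on a small invented noisy set in the spirit of figure 7):

import numpy as np
from scipy.optimize import minimize

X = np.array([[1., 1.], [2., 3.], [3., 2.], [6., 6.],   # class -1 (the last point is noise)
              [6., 5.], [5., 6.], [7., 7.]])            # class +1
y = np.array([-1., -1., -1., -1., 1., 1., 1.])

G = (y[:, None] * X) @ (y[:, None] * X).T
neg_LD = lambda a: 0.5 * a @ G @ a - a.sum()

C = 1.0                                      # the error-penalty parameter
res = minimize(neg_LD, np.zeros(len(y)),
               bounds=[(0., C)] * len(y),    # box constraint (15): the only change
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
print(res.x)   # the noisy point's multiplier typically sits at the bound C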

1.7 Non-Linear Case

In most real-life cases, data cannot be separated using a linear hyperplane in input space. Even the use of slack variables could lead to a poor classifier, in case the linear deviations are caused by the hypothesis structure and not by noisy data. The next step is to introduce into the SVM algorithm non-linear separating surfaces instead of hyperplanes (see figure 10).

For that purpose, we generate a mapping of the input data into another Euclidean space H, whose dimension is higher than that of the input space. We use a mapping function Φ such that:

Φ: ℝᵈ → H

In the dual formulation of the problem, the input data vectors appear only as inner products xi•xj in the space they belong to. Now they will only appear as Φ(xi)•Φ(xj) in the space H.

The space H will usually be a very high dimension space; it could even be an infinite dimension space. Therefore, performing operations in this space could be too costly. But if we could find a kernel function K such that K(xi, xj) = Φ(xi)•Φ(xj), then we would not need to explicitly map the data vectors into the space H; we would not even


Figure 10. Non-linear distribution set

need to know what Φ is. Now we just have to define a valid kernel function K, and substitute K(xi, xj) everywhere xi•xj appears in the algorithm.

When we use a much higher dimension space, many new data features, linear and non-linear, arise. Each new dimension offers a new possible correlation view, a new attribute with which we can separate the data, a new factor with which to create the hypothesis. It is the responsibility of the training process to discriminate those attributes that contain useful hyperplane-definition information from those that do not, by assigning them a bigger weight in the linear combination of all features.

For those cases where there is some user information about data correlation, an explicit mapping can be generated. Nevertheless this is not usual, and could lead to an inefficient implementation, depending on the credibility of the previous knowledge. Using generic mapping functions (we will see them later) offers the possibility of generating an enormous number of new features without taking care of the meaning of each one. In fact, these spaces tend to be of the order of thousands, millions or even infinitely many dimensions. It is difficult to visualize such a big geometrical space. It seems easier to identify it with a set of non-linear relations between input attributes, which can be assembled with linear relations in the optimisation process to create a surface (a hyperplane in feature space, an indefinable curve in input space) capable of separating the input data of one class from the other.

If we replace xi•xj by K(xi, xj) everywhere in the training phase formulas, the algorithm defined in section 1.2 will generate a linear SVM in a high dimensional space (specified by the Φ mapping function). And, most important, it will do it in roughly the same time complexity as a simple linear SVM created in input space (without mapping). All further development stays the same, as we are still creating a linear separator, although in a different space.

In the linear case, the output of the training phase was the value of w and b, with which the hyperplane was completely defined, and so the test phase had just to see on which side of the hyperplane the new pattern was.

Now, we cannot explicitly calculate w, because it is defined in the space H only and we do not know exactly how the mapping is made.


Through the support vector extension, the value of w can be written as:

(18) w = ∑i=1..N αi yi Φ(si)

so we can write the classification function as:

(19) f(x) = ∑i=1..N (αi yi Φ(si)•Φ(x)) + b = ∑i=1..N (αi yi K(si, x)) + b

where the si are the N support vectors, identified in the training phase as those patterns whose Lagrange multiplier is not zero. With this definition we avoid calculating the mapping function Φ once more.
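Equation (19) is what a non-linear SVM actually evaluates in the test phase. A sketch (our own; the kernel, support vectors, multipliers and b are assumed to come from a previous training):

import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    # Eq. (19): f(x) = sum_i alpha_i y_i K(s_i, x) + b; the class is its sign.
    s = sum(a * y_i * kernel(s_i, x)
            for a, y_i, s_i in zip(alphas, labels, support_vectors))
    return np.sign(s + b)

# For instance, with the polynomial kernel of eq. (20) and degree p = 2:
poly2 = lambda u, v: (u @ v + 1.0) ** 2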

Note that the soft margin concept still applies to a non-linear classifier. Actually, its implementation remains very simple: the Lagrange multipliers have an upper limit. In this case the soft margin applies to the linear classifier in the high dimension space. The clearest advantage is that we can still assert there is a solution. The use of a non-linear surface as separator function does not guarantee a solution will be found at all, even though it is more probable. Moreover, using the soft margin alternative gives the classifier more robustness against noisy training patterns.

The time complexity of the training phase does not change, but the test phase is different. In the linear case, having calculated w explicitly, the algorithm's complexity is O(1), using the inner product as the basic operation (which is O(d) if multiply-add is the basic operation). For the non-linear case, we need to perform O(N) operations, where N was previously defined as the number of support vectors.

Because of the relation between the number of support vectors and complexity, algorithms have been devised that try to minimize, or even replace, the support vectors during and after training, so that this phase may be competitive enough with other machine learning methods, such as neural networks.

1.8 Mapping Function Example

For a better understanding of the concept of new useful feature generation, we will show an example.

Suppose we have a data set {(xi, ci)} in ℝ² × {+1, −1} as shown in figure 11. It can be seen that this is not a linearly separable case, and the soft margin linear separator is not enough. In this example, the training data has no noise.

We define Φ as a mapping function ℝ² → ℝ³ with the form:

Φ: (x1, x2) → (x1, x2, x1x2)

Therefore, we have added a new feature to the input definition, which gives us information about a specific kind of relation between the two initial variables. Thus, we can calculate the kernel function:

K(x, x′) = Φ(x)•Φ(x′) = (x1, x2, x1x2)•(x′1, x′2, x′1x′2) = x1x′1 + x2x′2 + x1x2x′1x′2
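The identity between the explicit mapping and the implicit kernel can be verified numerically for arbitrary points (a quick sanity check of our own):

import numpy as np

phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])            # explicit mapping
K = lambda x, z: x[0]*z[0] + x[1]*z[1] + x[0]*x[1]*z[0]*z[1]   # implicit kernel

x, z = np.array([2., 3.]), np.array([0.5, 4.])
print(phi(x) @ phi(z), K(x, z))   # both print 25.0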


Figure 11. Non-linear distribution set. The distribution is defined as class = 1 if x1x2 < 1.45; class = −1 otherwise

Figure 12. Feature space view of the main points from figure 11. The margin (h1 − h2) is partially shown using solid lines

We have defined the mapping function and the new space implicitly, using the inner product in input space as the only valid operator.

In figure 12, the most important points from the training data set have been represented, as well as the separator hyperplane the SVM algorithm would find and the points that become support vectors. The separator hyperplane is z = 1.45. Note that in the final hypothesis only one of the three available features is required to create the hyperplane (it is defined using just the third component). This will be a very common case in non-linear SVM: just a few features will form the linear combination defining the separator hyperplane.

To represent the curve in input space that describes the generated hyperplane, we need to use the inverse mapping:

Φ⁻¹: (x1, x2, x1x2) → (x1, x2)


Figure 13. Non-linear classifier for the figure 11 set. Support vectors are encircled, the margin is shown using dashed lines and the separator curve is shown with a solid line

As the new axis z was defined as z = x1x2 in the high dimensional space, the points that lie on the hyperplane hold x1x2 = 1.45, and so the curve in input space can be defined as x2 = 1.45/x1.

In figure 13, the final result can be observed, with the hyperbola x2 = 1.45/x1 as the non-linear class separator surface. The support vectors in this figure are those that were identified during training and highlighted in figure 12. It should not be assumed that points lying near the non-linear separator surface in input space must become support vectors, although this tends to happen. The mapping function does not necessarily preserve input data relation properties, but the concept behind the support vector is "significant point", and the points that carry more information are those that lie near other-class points in input space.

In real world cases this function Φ will not be useful, unless clear and easy a-priori information is given to the SVM engineer. Nevertheless, it is a valid mapping function and generates a valid kernel function. For this to happen, the function K(x, y) must satisfy some constraints, known as the Mercer conditions.

1.9 Mercer Conditions

Not all kernel functions are valid, that is, not all describe a Euclidean space with the properties required in previous sections. It is enough to satisfy the Mercer conditions [Vapnik, 1995], which can be written as:

There exists a function K(x, y) = Φ(x)•Φ(y) if and only if, for all g(x) such that ∫ g(x)² dx is finite, the following inequality holds:

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0


For most cases, this is a very complicated condition to check, because it says "for all g(x)". It has been demonstrated for

K(x, y) = ∑p=1..P Cp (x•y)ᵖ

when Cp is a positive real number and p is a positive integer.

1.10 Kernel Examples

The first (and only) basic kernels used to develop pattern recognition, as well as non-linear regression and principal component analysis, with SVM are (for any pair of vectors x, y ∈ ℝᵈ):

(20) K(x, y) = (x•y + 1)ᵖ

(21) K(x, y) = exp(−||x − y||²/2σ²)

(22) K(x, y) = tanh(κ(x•y) − δ)

Kernel (20) is a non-homogeneous polynomial classifier of degree p (another used variation is the homogeneous polynomial kernel, without the term '+1'). It creates a space H with as many dimensions (data features) as p-combinations of x and y. All possible relations between input attributes up to degree p appear in the new space. The margin maximization algorithm will discriminate those carrying information from those that do not (which should be most of them), so the number of adjustable parameters required to obtain a good solution decreases.

Kernel (21) is a Gaussian radial basis function (RBF). The new space dimension is not fixed, depends on the actual data distribution, and can become infinite. The visual effect of this kernel is that nearby patterns form class clusters, as big as they can. The clusters have the support vectors as centres (in feature space), and the radius is given by the value of σ and the support vector weight obtained during training.

Kernel (22) is similar to a two-layer sigmoidal neural network. Using the neural network kernel, the first layer is composed of N sets of weights, each set consisting of d weights; the second layer is composed of N weights (the αi), so that an evaluation requires a weighted sum of sigmoids evaluated on dot products. The structure and weights (which define the related neural network architecture) are given automatically by the training process. Not all values of κ and δ satisfy the Mercer conditions [Vapnik, 1995].

We say (20), (21) and (22) are basic functions because new kernel functions can be formulated by combining them while still satisfying the Mercer conditions. A linear combination of two Mercer kernels is a Mercer kernel. This can be easily demonstrated knowing that the integral operator is distributive with respect to the addition operator. Also, other kinds of slight changes can be implemented from the basic functions, looking for a kernel function incorporating a priori information about the internal distribution.


Nevertheless, it has been experimentally observed that, in many cases, the kernel choice is not a determining factor in the machine's performance. For a real world problem whose internal distribution is not particularly fitted to some kind of kernel, the support vector sets tend to be very similar, no matter what non-linear function is used. Of course the weights are fairly different, as the evaluating function is different. But the result, the separating surface, tends to have a very similar geometrical shape, especially where the data density is high. As was said in previous sections, the reason could be that those patterns that are important because they lie near other-class patterns continue to be important regardless of the mapping function, so they become support vectors.

Last, we define the kernel matrix as a symmetric square matrix of order M (where M is the number of training patterns), where position (i, j) holds the kernel function value K(xi, xj).
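A practical counterpart of the Mercer requirement is that this kernel matrix must be symmetric positive semidefinite on any data set. A sketch (our own; sigma, kappa and delta are the kernel parameters named above) building the matrix for the basic kernels and checking its eigenvalues:

import numpy as np

poly = lambda x, y, p=3: (x @ y + 1.0) ** p                                  # eq. (20)
rbf = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y)**2) / (2 * sigma**2))   # eq. (21)
sigm = lambda x, y, kappa=1.0, delta=1.0: np.tanh(kappa * (x @ y) - delta)   # eq. (22)

X = np.random.rand(5, 2)                              # M = 5 training patterns
Kmat = np.array([[rbf(a, b) for b in X] for a in X])  # M x M kernel matrix

# For a valid (Mercer) kernel, no eigenvalue may be negative:
print(np.linalg.eigvalsh(Kmat) >= -1e-10)             # all True for the RBF kernel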

1.11 Global Solutions and Uniqueness

As has been shown in previous sections, the result of SVM training is a global solution of the optimisation process, i.e., the parameter set (values of w, b and αi) which gives the objective function maximum. This term is opposed to 'local solution', defined as a parameter set whose objective function is optimal only when compared within its vicinity. In the SVM algorithm, any local solution is also a global solution because the problem is a convex quadratic problem.

Nevertheless, the global solution may not be unique. There could be more than one parameter set where the objective function reaches the same optimum value. This is not inconsistent with the global solution definition. Solution uniqueness is guaranteed only in case the problem is strictly convex. The SVM training definition assures the problem is convex, but the training data will determine whether the problem is strictly convex or not.

Non-uniqueness occurs in two different ways:
• When the w and b values are not unique. In this case all w and b values between two solutions are also global solutions. This is easy to accept, as the problem is convex.
• When the w and b values are unique, but the value of w comes from different sets of αi values. Reaching one solution or the other depends on the training algorithm's randomness. Remember that there can be training data points that lie on the margin hyperplane but are not support vectors. Much as three points in a row define just one straight line and throwing away any of the three would give the same result, it is easy to create one training set that would generate different hard margin classifier support vector sets depending on the listing order, although the separator hyperplane would remain unchanged.

1.12 Generalization Performance Analysis

The Mercer condition tells us whether a kernel function defines a new Euclidean space or not, but it does not define how the Φ mapping function must be applied, or the new space morphology.


For easy cases, the feature space dimensions can be deduced. For instance, the p-degree homogeneous polynomial kernel has

C(d + p − 1, p)

new features or dimensions. For a 4-degree polynomial kernel using 16×16 pixel images (256 initial features), the new space dimension is 183181376. In real world cases we will never have training sets that big. A classification machine with a huge 'features over data' ratio would undoubtedly produce overfitting.

Let us use an easier example: a 3-degree polynomial with 8×8 pixel data. The new space dimension is 45,760. If you are using a simple multi-layer perceptron neural net, the ratio between the number of weights and data points should not be greater than around 15%. Suppose you generate a hidden layer with 45,760 units (the new features), 64 units in the input layer and one unit as output. The number of weights in the net is around 2,974,400 (almost 3 million). Therefore, the minimum training data set should have 19,829,333 patterns (almost 20 million). Now, that is an awfully big data set.
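The arithmetic of both examples can be checked with a few lines of Python (a sketch of our own; math.comb requires Python 3.8+):

```python
from math import comb

def poly_feature_dim(d, p):
    """Dimension of the feature space of a p-degree homogeneous
    polynomial kernel over d input features: C(d + p - 1, p)."""
    return comb(d + p - 1, p)

print(poly_feature_dim(256, 4))   # 183181376 (16x16 images, degree 4)
print(poly_feature_dim(64, 3))    # 45760     (8x8 images, degree 3)

weights = 64 * 45760 + 45760 * 1  # one-hidden-layer perceptron weight count
print(weights)                    # 2974400
print(round(weights / 0.15))      # 19829333 patterns needed at the 15% ratio
```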

Of course, not all 45,760 new features are important. Many of them will have a null weight. But you cannot know beforehand which features will be needed and which ones will not. Some algorithms have been designed to shrink the neural net while training, but even in this case the distinction between a useful feature and a disturbing one is not easy to make.

A separating hyperplane in feature space H must have dim(H)+1 parameters. Any classification system needing so many parameters to create a discrimination function will be resource and time inefficient. Nevertheless SVM have good classification and generalization performance, in spite of treating data in an enormous space, which could even be infinite. The reason has not been formally demonstrated, although the maximum margin requirement has much to say about it. Within the SVM, the solution has at most l+1 adjustable parameters, l being the number of training patterns. After the training, the solution has N+1 parameters, N being the number of support vectors, which is much less than the number of new features.

In section 2.1 we left a question about which classifier is better out of two possible choices. The answer is "the one having the lowest VC dimension", which is the same as saying "the simplest". It was shown that the bound on the risk is related to the VC dimension: the lower the VC dimension, the lower the risk bound. However, this does not tell you which one will have the lowest actual risk. There is no way to know it beforehand.

This approach is not only mathematically motivated; we can also support it with some philosophy. A 14th-century English philosopher, William of Ockham, enunciated what is known as Ockham's razor: given some evidence and two hypotheses, one simple and one complex, both satisfying the evidence, the simplest hypothesis is the most probable to be true. It does not say which one is true, but if you had to bet and you had no additional knowledge or evidence, you should go for the first hypothesis. That is what learning is all about, be it machine or human: choose


the hypothesis which seems most probable given the current evidence. Whenever you make a new assumption (using an unnecessarily complex hypothesis) you are most probably farther from the truth.

That answers the big question: why is the generalization performance of support vector machines good even when using a high dimension feature space? Because SVM performance is not related to the dimension of the space where data are separated, but to the classifier VC dimension. Therefore, the SVM classifier depends on the simplicity of the data hypothesis, not on the number of available features. If a simple hypothesis can do the separating job, the SVM will use it, with no overfitting.

There is no magic any more. The SVM algorithm gives the simplest hypothesis, that is, the most probable one. But that does not mean there cannot be a better answer for a given problem. In spite of our hard SVM militancy we do not deny that SVM have been slightly outperformed (mostly by specific neural networks) in some experimental benches. The answer is simple: luck. The SVM gave the most probable answer after one general-purpose execution. But the true internal distribution may have been slightly more complex, even though it did not show on the training data.

If you are trying a neural network architecture with a bit more complexity, which way will you go? You cannot say unless you have additional information. The successful architect engineer would most probably try all possible ways. That means trying hundreds of different architectures and finally using the one having the best error rates on the test set. But that approach falls down in many places: first, the engineer must decide how much complexity the answer should have (not an easy task at all); second, the training set must be slightly deviated from the internal distribution for the SVM to lie behind; third, if you are generating many classifiers and you use the test set to decide which one is better, then the test set is no longer a good validation set, because you are using it as a secondary training set (even though it is reported as a validation set when publishing the results); last, the engineer spends a lot of time in the training phase.

And in spite of all this extra work, in the cases where SVM are outperformed, they are still very near the highest results in this scientific ranking. Which means that in the real world it is difficult to find the SVM outperformed.

Support Vector Machines are not easy to implement, but they are very easy to use. Nevertheless, their use has some limits. As it is a statistical method, symbolic learning does not suit it too well. For instance, the parity problem with few data makes the SVM decide that all points are support vectors. This is a clear hint of bad generalization performance, because it means: "one point has no relation with any other point". In those cases an SVM is no better than a simple Nearest Neighbour classification algorithm.

Other Machine Learning paradigms, for instance C4.5, are able to work with input data having parameters with unknown values (C4.5 uses the '?' symbol). The algorithm identifies this value and treats the information accordingly. However, the SVM algorithm does not allow unknown values, diminishing its applicability to some data sets.


Inside the previously defined scope, SVM has a very light bias. It is a true general-purpose machine learning method. Although a priori information can be included inside the kernel function, the number of new features is so wide that, regardless of the internal data distribution, there will always be a nearby hypothesis model using those new features. The training algorithm has embedded some sort of balance between using too few features (too simple a hypothesis) and using too many (overfitting).

The basic achievement in using SVM is that you just choose a generic kernel function (we won't say "any kernel will do", but it is not too far from the truth) and the confidence degree C (up until now, mostly heuristics are used, but you will soon find it is quite easy). Then you push the button, and after some time you will have the best classification machine. No need for an experienced engineer or scientist. No complicated architectures. No tailoring. No second thoughts. Child's play.
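As a sketch of that "push the button" workflow (assuming scikit-learn is available; the toy data and parameter values are our own, not from the chapter):

```python
import numpy as np
from sklearn.svm import SVC   # an SMO-style implementation built on LIBSVM

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a simple non-linear toy labelling

clf = SVC(kernel='rbf', C=10.0)  # a generic kernel and the confidence degree C
clf.fit(X, y)                    # push the button
print(clf.n_support_)            # support vectors found for each class
```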

2. SVM MATHEMATICAL APPLICATIONS

The initial mathematical development for SVM has been applied to different approaches inside the Machine Learning scope. All of them are based on the structural risk minimization principle, on the Lagrange formulation of the problem, and on the non-linear case generalization. For each approach you only need to define the requirements all points must satisfy, their effect on the objective function and the mathematical steps through the Lagrange formulations.

2.1 Pattern Recognition

The first approach to SVM was in the pattern recognition field. In fact, the search for a new statistical paradigm able to optimise the class separation problem was what boosted V. Vapnik in his quadratic programming research.

For that reason, the SVM definition developed in the previous sections, and its implementation shown in the next sections, apply specifically to pattern recognition. Nevertheless, most concepts also apply to the other approaches defined in this section.

2.2 Regression

Historically, the second approach of the SVM was non-linear regression and function estimation, called SVRM (Support Vector Regression Machines) [Vapnik et al., 1997].

This field can be divided into two parts: first, 'function approximation' tries to find the curve that best adapts to the training data, acquired without noise (which makes it very similar to usual interpolation methods); second, function estimation (regression), where data are noisy and their distribution is unknown,


and the method tries to estimate unseen data points, including extrapolation, as simply as possible.

The SVRM algorithm treats both cases in a very similar way. For each case, the cost function can be slightly changed.

2.2.1 Definition

Suppose we have a training set with M data pairs (x_i, y_i), where x_i ∈ R^d, i = 1, ..., M (up until now, just the same as the pattern recognition case), and where y_i ∈ R is not a label any more but a real number which represents the value of the function we want to estimate at x_i, i.e. y_i = f_r(x_i) + n_i, n_i being the noise associated to point i.

We want to find a function f(x) having a maximum deviation of ε with respect to all training y_i. In the basic case, there can be no training points having a distance to the expected value bigger than ε, so the resulting curve must fit all points. This case can be used only when the data describe a linear function with a noise level n_i < ε for all i.

The estimating function has the form:

(23)  f(x) = (w \cdot x) + b

w being the vector defining the curve in input space, and b the free term (the bias).

Similarly to the pattern recognition case, the structural risk minimization principle demands the greatest possible simplicity of the approximation function. We will try to minimize ||w||², which will give us the flattest linear function among those satisfying the constraints (unlike the margin maximization definition in pattern recognition). Therefore, the optimisation problem is written as:

Minimize

\frac{1}{2}\|w\|^2

with respect to the constraints:

(24)  y_i - (w \cdot x_i) - b \le \epsilon
      (w \cdot x_i) + b - y_i \le \epsilon

Nevertheless, following the same reasoning as in section 2, this inflexible formulation is only valid when there is at least one solution satisfying conditions (24). Because this is usually an unreal case, without noise in the data (it could be used for an interpolation approach), the soft margin idea must be introduced. We define positive slack variables ξ_i that give information about how far the expected value is from the true value for point i. Thus, we are introducing into the algorithm the ability to admit errors (points not satisfying the constraints), while keeping the ability to find a solution representing the data distribution well enough. Likewise, a new cost function must be defined giving a balance between the number of allowed errors and the simplicity (and usefulness) of the final estimating function. This cost function, c(x, y, f), must fulfil some properties, discussed in the next section.


To continue with the formulation development through this section we will use the ε-insensitive cost function [Vapnik et al., 1997], partly because it was the first one proposed, and because it is the simplest to interpret and optimise. This cost function is continuous but non-derivable, so variables must be duplicated (the formulation gets longer but not more difficult). Now every α and ξ turn into (α, α*) and (ξ, ξ*), where the one without asterisk is associated to the y_i ≥ f(x_i) case, and the one with asterisk is associated to the y_i < f(x_i) case. Note that both cases cannot be true for any one point, so for all training points at least one of the duplicated variables will be zero.

Thus, the objective function becomes:

(25)  L_P = \frac{1}{2}\|w\|^2 + C\left(\frac{1}{M}\sum_{i=1}^{M} c(x_i, y_i, f)\right)

After the primal and dual formulation development (just like the pattern recognition case), the problem can be written as:

Maximise

L_D = -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(x_i \cdot x_j) - \epsilon\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i(\alpha_i - \alpha_i^*)

with respect to:

(26)  \sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C]

C having the same meaning as in section 2: an error permissibility balance parameter.

This development remains defined as a convex quadratic optimisation problem, which has to satisfy the Karush-Kuhn-Tucker conditions at optimality. Therefore, the implementation methods defined for pattern recognition are applicable, although with some differences caused by the cost function. In the case of the ε-insensitive cost function, the duplicated Lagrange multipliers must be treated specifically. This will happen for all non-derivable cost functions.

Again, support vectors are those training points whose Lagrange multipliers are not zero (in the case of duplication, it means one of the multipliers is non-zero). Moreover, those points having a non-zero slack variable are considered training errors, and have the corresponding multiplier set to the maximum, α_i^(*) = C (where the symbol (*) means "whichever of the duplicated items applies"). Support vectors having a Lagrange multiplier not at bound (0 < α < C) are placed on the ε margin (they are needed to define the ε margin) and have a zero slack variable. Basically, the concepts behind the support vectors, the weights and the geometrical meaning remain the same as in the pattern recognition case (see figure 14).


Figure 14. Linear regression machine. Support vectors are encircled, the margin/tube is shown with dashed lines and the estimated function is shown with a solid line

The value of the bias b can be calculated from a non-bound support vector, i.e. a non-error support vector. The equalities to be used are those in inequalities (24):

(27)  b = y_i - (w \cdot s_i) - \epsilon \quad \text{if } \alpha_i \neq 0 \text{ and } \alpha_i \neq C

(28)  b = y_i - (w \cdot s_i) + \epsilon \quad \text{if } \alpha_i^* \neq 0 \text{ and } \alpha_i^* \neq C

In case all support vectors are errors (very unusual, and in any case most probably a bad solution) the calculation of b is much more complex, and can be done during the optimisation itself.

Likewise, we can define a non-linear mapping from input space to a feature space, where the algorithm will try to find the flattest function approximating the data well enough. The 'flat' property can usually be seen in the corresponding input-space non-linear curve: its shape is the one having the smallest tangent values through the point set. The mapping concept and development are similar to those described in previous sections: using a kernel function K(x, y) that performs all operations implicitly in feature space, usually a much higher dimensional space (see figures 15a and 15b).

Therefore, the non-linear problem is defined as:

Maximize

L_D = -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \epsilon\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i(\alpha_i - \alpha_i^*)

with respect to:

(29)  \sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C]


Figure 15a. Non-linear regression machine. The dots follow the sinc(x) function, the dashed lines are the ε-tube, and the solid line is the SVRM function estimate. Note that the support vectors are those corresponding to the 3 tangent points (1 in the middle, at x = 0, and the other two at the limits)

Figure 15b. Non-linear regression machine. The dots follow the same sinc(x) function, and the other elements follow the figure 15a notation. Note that as the ε-tube decreases, the function estimate gets more accurate. At the limit, if the noise allowance approaches 0, the function estimation error will also be 0 in this example

and the support vector expansion of w and the estimated function are written:

(30)  w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \Phi(s_i)

(31)  f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, K(s_i, x)

s_i being the resulting N support vectors, and M the size of the complete training set.
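A minimal numpy sketch (our own illustration) of evaluating the estimated function (31) from a trained machine; the bias b of (23), obtained as in (27)/(28), is added explicitly, and the support vectors and coefficients below are toy values:

```python
import numpy as np

def svr_predict(x, support_vectors, betas, b, kernel):
    """f(x) = sum_i beta_i * K(s_i, x) + b, with beta_i = alpha_i - alpha_i*
    for each support vector s_i (b obtained as in (27)/(28))."""
    return sum(beta * kernel(s, x)
               for s, beta in zip(support_vectors, betas)) + b

rbf = lambda u, v, gamma=1.0: np.exp(-gamma * np.dot(u - v, u - v))
S = np.array([[0.0], [1.0], [2.0]])   # toy support vectors
betas = np.array([0.5, -0.2, 0.7])    # toy (alpha - alpha*) values
print(svr_predict(np.array([1.5]), S, betas, b=0.1, kernel=rbf))
```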


It has been observed through a number of experiments [Osuna and Girosi, 1998] that SVRM tend to use a relatively low number of support vectors compared to other similar machine learning processes. The reason could be the flexibility allowed while errors are below a threshold, generating simpler surfaces, and thus needing fewer support vectors to define them.

Moreover, it has been proved that the algorithm works well when a non-linear kernel is applied in spite of having few training data. Other well-known methods will easily overfit the data, while the SVRM dynamically controls its generalization ability, generating a hypothesis simple enough to model the training data distribution better.

2.2.2 Cost Functions and �-SVRM

The cost function is one of the key elements in SVRM. As was said in the previous section, real data are usually acquired with a certain noise figure of unknown distribution. The cost function is in charge of accepting noise deviations and penalizing wide deviations, whether they are caused by noise or by a currently too simple hypothesis. The point is how to tell the difference between noise and hypothesis complexity.

Nevertheless, this function must satisfy certain features. For the sake of problem resolution usefulness, the cost function must be convex, thus maintaining problem convexity and assuring solution existence, uniqueness and globality. Moreover, for the mathematical development to remain simple, it is required to be symmetric and to have at most two discontinuities in the first derivative, at ±ε, with ε ≥ 0. Therefore, even if we knew the noise distribution, it would be too complex to introduce that additional information inside the algorithm. We would then have to find a convex cost function that may adjust to the noise distribution, but we would still be using an approximation. Not to mention the mathematical development for the new cost function, notably difficult for non-expert mathematicians. The conclusion is: just use a general purpose cost function and let the SVRM automatic learning do the engineering job.

The development described in the previous subsection refers to the ε-insensitive cost function, which is the most commonly used, and is defined as:

(32)  c(\xi) = \begin{cases} 0 & \text{if } |\xi| \le \epsilon \\ |\xi| - \epsilon & \text{if } |\xi| > \epsilon \end{cases}
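The function is trivial to implement; a short numpy sketch (our own illustration):

```python
import numpy as np

def eps_insensitive(xi, eps=0.1):
    """The epsilon-insensitive cost (32): zero inside the tube, linear outside."""
    return np.maximum(np.abs(xi) - eps, 0.0)

print(eps_insensitive(np.array([-0.05, 0.05, 0.3])))  # [0.  0.  0.2]
```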

This kind of function has an additional parameter ε, which helps to adjust the maximum allowable deviation for any given point. A validation process is required to adjust this parameter, even though its value can be approximated from any additional knowledge about the noise or data distributions.

To finish the SVRM section, we will summarize a variation of the ε-SVRM (which uses the ε-insensitive cost function), called ν-SVRM [Schölkopf et al., 1998].

The difference lies not in the cost function itself (which remains the ε-insensitive one), but in the objective function. The ε-SVRM gave the objective function as:


(33)  L_P = \frac{1}{2}\|w\|^2 + C\left(\frac{1}{M}\sum_{i=1}^{M} (\xi_i + \xi_i^*)\right)

and now, in ν-SVRM, the objective function is:

(34)  L_P = \frac{1}{2}\|w\|^2 + C\left(\nu\epsilon + \frac{1}{M}\sum_{i=1}^{M} (\xi_i + \xi_i^*)\right)

with respect to the same constraints as in (24). The resulting dual formulation problem is:

Maximize

L_D = \sum_{i=1}^{M} y_i(\alpha_i - \alpha_i^*) - \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)

with respect to:

(35)  \sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0, \qquad \sum_{i=1}^{M} (\alpha_i + \alpha_i^*) \le C\nu, \qquad \alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]

leaving the estimating function in the same form as in (31). The values of b and ε can be calculated after training using constraints (24) for the non-bound support vectors. If the value of ε increases, the first term of the cost effect in (34), νε, will increase proportionally, while the second term will decrease, as some points will benefit from the softer constraints and will be inside the ε bound (it also decreases proportionally to the new lucky points). For the objective function to attain the optimum, the value of ε must increase until the fraction of error points (out of bounds) is less than or equal to the value of ν. Therefore the new parameter ν is an upper limit on the fraction of training errors (which are related to the number of support vectors). Obviously it must satisfy ν ∈ [0, 1].

It seems easier to pick a good ν value than a good ε value. Moreover, ν-SVRM is a superset of ε-SVRM: after training with the first method we can calculate the ε parameter value, which can be used in an ε-SVRM algorithm giving exactly the same solution obtained in the first place.
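As a hedged illustration of the two parameterizations (assuming scikit-learn, whose SVR and NuSVR follow the ε- and ν-formulations; the data and parameter values are toy choices of our own):

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.randn(80)   # noisy sinc, as in figure 15

nu_model = NuSVR(nu=0.3, C=10.0, kernel='rbf').fit(X, y)       # pick a fraction
eps_model = SVR(epsilon=0.05, C=10.0, kernel='rbf').fit(X, y)  # pick a tube
print(len(nu_model.support_), len(eps_model.support_))  # support vector counts
```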

2.3 Principal Component Analysis

Support Vector Machines (regression included) and non-linear Principal Component Analysis (PCA) were the first applications developed in Machine Learning under the idea of a high


dimensional space mapping using Mercer kernels. They differ in the problem to solve even though they use similar means. SVM is a supervised algorithm, i.e. the system state changes depending on whether the output for a given pattern equals the expected correct value or not. On the other hand, kernel PCA is an unsupervised algorithm, i.e. there are no labels, and the output is the covariance analysis of the training data distribution [Schölkopf et al., 1998].

PCA is an efficient method to extract the structure present in the input data, and can be achieved by calculating the system eigenvalues and eigenvectors. Formally speaking, kernel PCA is an input space base transformation for diagonalizing the normalized input data covariance matrix estimate, which has the form:

(36)  C = \frac{1}{M}\sum_{i=1}^{M} x_i x_i^T

where M is the number of patterns x_i. The new coordinates in the eigenvector base, i.e. the orthogonal projections of the data over the eigenvectors, are called principal components. The eigenvalues λ and eigenvectors V must be non-zero and satisfy λV = CV.

We introduce the usual non-linearity concept, with the mapping function and its corresponding kernel. We assume there exist coefficients α_1, ..., α_M such that:

(37)  V = \sum_{i=1}^{M} \alpha_i \Phi(x_i)

with the corresponding kernel matrix K (as defined in previous sections). Then we arrive at the problem:

(38)  M\lambda\alpha = K\alpha

λ being the eigenvalues and α = (α_1, ..., α_M) the eigenvector coefficients. To extract the principal components of a given pattern, the data projections in feature space are calculated in the following form:

(39)  V^k \cdot \Phi(x) = \sum_{i=1}^{M} \alpha_i^k \left(\Phi(x_i) \cdot \Phi(x)\right) = \sum_{i=1}^{M} \alpha_i^k K(x_i, x)

In this notation, k is a super-index representing the k-th eigenvector and its k-th coefficient set. Note that after the previous calculation process, k non-zero eigenvalues and eigenvectors are obtained, each of them with a set of M coefficients.

To implement the kernel PCA algorithm, the following steps must be taken:
1. The kernel matrix, of size M×M, must be calculated:

K_{ij} = K(x_i, x_j) \quad \text{for all } i, j \in [1, M]


Here comes the first problem when using kernel PCA. Any matrix calculation resources grow at least with the square of its size, so with current algorithms and hardware no more than 5000 data points should be used. If you are provided with more data for training (which should never be seen as an unfortunate case), a representative subset must be created heuristically.

2. Diagonalize the matrix K to calculate the eigenvalues and eigenvectors of equation (38), using traditional methods, and normalize such vectors. After this calculation we obtain the coefficients α^k = (α^k_1, ..., α^k_M), to be used in the projection phase.
3. To extract the non-linear principal components of a given pattern, the point projections over the eigenvectors must be calculated using equation (39). The number of principal components (non-zero eigenvectors) to be used is the designer's choice. But not all of them should be used: if so, the process would be useless. Just choose the first k principal components, those with a significant amount of information and very little noise.

After this simple process, you get a data space change. From the input space we changed to a k-dimensional space (k being a fraction of M), in which each dimension gives a useful feature taken from the non-linear correlations in the training data set. That is a conceptual difference between SVM training and kernel PCA training: in the first case new features are implicitly generated, and many are dropped after training; in the second case k new features are explicitly generated, all of them with a lot of information, ordered from the most important downwards. The value k has an upper bound of M, the number of training patterns. The new features are made explicit, so, obviously, the number k must not be too high or the computation effort would be inefficient. So only the first components should be used, those having the greatest possible variance, i.e. the biggest eigenvalues, i.e. the most discriminant information.
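A minimal numpy sketch of steps 1-3 (our own; feature-space centring of K, part of a full kernel PCA, is omitted for brevity, and the kernel and its parameter are generic choices):

```python
import numpy as np

def kernel_pca_fit(X, kernel, k):
    """Steps 1-2: build K, diagonalize it and keep the top-k normalised
    coefficient sets alpha^k (the columns of the returned matrix)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # step 1: M x M
    eigvals, eigvecs = np.linalg.eigh(K)                      # step 2
    idx = np.argsort(eigvals)[::-1][:k]                       # biggest first
    return eigvecs[:, idx] / np.sqrt(eigvals[idx])            # normalise

def kernel_pca_project(x, X, alphas, kernel):
    """Step 3 / equation (39): the k non-linear principal components of x."""
    kx = np.array([kernel(xi, x) for xi in X])
    return alphas.T @ kx

rbf = lambda u, v, gamma=2.0: np.exp(-gamma * np.dot(u - v, u - v))
X = np.random.randn(100, 2)
A = kernel_pca_fit(X, rbf, k=5)
print(kernel_pca_project(X[0], X, A, rbf))  # 5 explicit new features
```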

The usefulness of non-linear PCA in pattern recognition has been tested thoroughly, attaining classification performances as good as the best non-linear SVM and well above neural networks. The process is very simple: first calculate the projection coefficients and select the best ones; then transform all patterns (training, validation and test) explicitly into the new space; afterwards use these data to train a linear or non-linear classification machine (SVM, neural networks, decision trees, …, any one will do) which will be the true supervised classification process. When using kernel PCA for classification, usually a linear SVM is used for the supervised training, giving enough flexibility to solve any non-linear problem.

The described process is very much like a one-hidden-layer neural network, in which the architecture and the first layer weights are obtained by optimised means: the covariance matrix eigenvectors.

Also, because of the explicit calculation of the new features, multiclass SVM can be trained easily: the first layer would be common (as in neural networks), and the second layer (linear discriminant) can be calculated using the hyperplane w value (now it can be calculated because the feature space is no longer implicit), giving O(1) complexity.


3. SVM VERSUS NEURAL NETWORKS

Neural Networks (NN) have led the Machine Learning field since the 1980s thanks to their development and interpretation simplicity, while having very competitive generalization ability. Nevertheless, after 20 years have gone by, design and development complexity has increased considerably when trying to solve secondary problems such as convergence speed, new error calculation concepts, new activation functions with additional constraints, local minimum prevention, and so on. So many years of active research have turned NN from initial simplicity to current complexity, fit only for specialised engineers.

Probably, as time goes by, SVM will follow a similar path, from current simplicity to some complexity degree, needing a human expert to take out all of its potential. Complexity by itself is not bad: higher method complexity usually leads to better performance or classification rates. But NN basic research is currently scarce. For the most part, it is about new applications where NN give better results for specific architectures, so a qualitative jump is needed: SVM. This is quite a natural step in human research, and many examples can be shown. When some technology gets to its limit, a new approach must be issued. At first, both methods' performance may be similar, but the new one will eventually outperform the old method. We believe we are currently at the beginning of a technology jump, so it is a nice time to change sides.

All along this chapter, the relation between SVM and NN has been widely established. It can be stated that an SVM object topology can be developed as a one-hidden-layer perceptron. It has been demonstrated in the NN literature that the family of one-hidden-layer perceptrons can act as a universal discriminator, i.e. it can approximate any function.

For a sigmoidal activation function, the similarity between NN and SVM with kernel (22) is complete (see figure 16). After the SVM optimisation process, we obtain a network having d units in the first layer (input space dimension), N units in

Figure 16. A neural network approach for the SVM implicit architecture. Note that the layers are completely connected (although not explicitly shown for figure clarity). Also, all weights are equal to 1 except the ones connecting the hidden layer and the output layer, which are equal to the corresponding support vector's α


the hidden layer (number of support vectors), and one unit in the output layer (binary classifier), with weights α connecting the last two layers. For other kernels, the similarity is somewhat lesser, although the resulting topology is like the (d, N, 1) one-hidden-layer perceptron. Only the kernel function makes the difference. On the other hand, the SVM with RBF kernel is like an RBF classification network, in which the clusters and their characteristics have been calculated using an automatic optimal algorithm.

When using a similar kernel and activation function, an important difference can be observed. SVM tend to have a bigger number of support vectors (hidden-layer units) when facing a complex or noisy training set. Neural networks can attain similar classification performances with far fewer internal units. This is essential to the test phase speed, because it depends directly on the number of elements in the hidden layer. The number of multiply-add operations done in the test phase of either method is N(d+1).

The reason for such a difference is mainly that the support vectors defining the hidden layer are constrained to be training points. Neural networks do not have such a constraint, so they need fewer elements to model the same function (they have more degrees of freedom). This does not mean that the NN solution is better; it is quicker in the test phase, and its topology complexity is lower, but generalization performance is not affected.

Moreover, note that the training phase allowed errors (including those points lying inside the margin) become support vectors. When optimising complex or noisy training sets with loose error penalization, the number of training errors can be very large.

But this problem was also solved during the first SVM research steps. In [Burges, 1996] the "reduced support vector set" method is described. Given a trained SVM, this method creates a smaller support vector set representing approximately the same information as the whole support vector set. In this case the former constraint is eliminated, because the new virtual support vectors need not be training points. The result is very much like the NN approach topology.

This new expansion solves the classification speed problem, making the SVM competitive against other Machine Learning methods. Nevertheless, it is seldom used because of its considerable development difficulty.

Even more similarity can be found between NN and SVM classifiers using kernel PCA as the feature extraction step. Units in the hidden layer are calculated explicitly using the eigenvector projection instead of the kernel calculation. These units are not significant training points but true features, all of which share the concepts underlying the internal data distribution. Thus, the classifier topology should be very similar to the one generated by an experienced NN architect, because they have heavy statistical meaning.

The only flexibility a NN offers that the SVM cannot reach is the multiple-hidden-layer approach (using kernel PCA plus a non-linear SVM one could get up to 2 hidden layers, but it is seldom used). In spite of the fact that a one-hidden-layer


topology is a universal discriminator, having more hidden layers can make the training process much more efficient.

Using that capability, maybe there are fewer units in the net, or convergence is faster. But the training algorithms grow more complex, and the overfitting and local minimum problems will still be there. Therefore, the main differences between both methods are:
• Training one SVM requires much more computation resources than training one NN.
• Classification speed is usually slower in SVM.
• The SVM result is the optimum, while NN can be stuck in local minima. Therefore SVM usually outperform NN in classification performance.
• SVM parameters are few and easy to use, while NN require an experienced engineer to create and try the right architecture.
• SVM usually need only one execution to give their best results, while NN usually require many tries to take out their best.

Outside the scientific community, money rules. Expert engineer time is much more expensive than computing resources, and the difference will grow higher. If Machine Learning algorithms are to be introduced massively in commercial products such as knowledge management or data mining, automatic methods must be used. In the real world, new data is always coming; new profiles arise while others are no longer valid. Neural network flexibility must be tailored by an expert to fit the current state. But, for a company, it may not be worth the cost of tailoring a Machine Learning system that will become obsolete within some months. It is unavoidable: craftsmen will eventually be replaced by machines.

4. SVM OPTIMISATION METHODS

4.1 Optimisation Methods Overview

SVM development tries to solve the problem described in (14): maximize L_D with respect to the Lagrange multipliers, subject to constraints (15) and (16).

When SVM appeared, the first approach to solve this problem was using standard optimisation methods, such as gradient-descent or quasi-Newton. These methods, quite veteran in the mathematical literature, mainly apply complex operators over the Hessian matrix (the matrix of partial derivatives). These one-step processes are computationally as well as memory intensive. Memory resources for the matrices are O(M²), while computational resources are O(M³), M being the number of patterns in the training set. For instance, a 5000-point set will require 100 Mbytes of storage memory using single precision floating point numbers. Any process over such an enormous data structure will be very inefficient, beyond many machines' ability.
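The memory figure is easy to verify with a back-of-the-envelope sketch:

```python
M = 5000
hessian_bytes = M * M * 4        # single-precision floats, M x M Hessian
print(hessian_bytes / 2**20)     # ~95.4 MiB, i.e. about 100 Mbytes
```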

The main research line in the first years of life of SVM was the search for alternative optimisation methods, developed explicitly for SVM mathematical use. Many new approaches were published before one of them pleased all researchers with its simplicity and efficiency. The main methods, in chronological order of appearance, are the following:


• The chunking method, developed by Vapnik. Points that are not support vectors do not affect the Hessian matrix calculation; therefore, if we take them out before the matrix calculation begins, the resulting Lagrange multipliers remain the same. At the same time, the matrix calculation itself is easier; now its complexity is O(N³), N being the number of support vectors, and N << M. The problem is that we cannot know beforehand which patterns will become support vectors and which ones will not. Therefore, the algorithm is described as:
  • Divide the training set randomly into subsets of the same (heuristic) size.
  • Initialise the "support vector set" to void.
  • Repeat until there are no more subsets:
    • Join the "support vector set" with the next subset.
    • Optimise the joined subset (as if it were a complete SVM training set) using basic optimisation techniques (gradient-descent, etc.).
    • Assign to the "support vector set" the patterns identified in the last optimisation execution as support vectors.
The idea behind this algorithm is, at each step, to eliminate those points that are not support vectors and, therefore, are not likely to become support vectors in a complete set optimisation. However, one could eliminate in one step significant patterns that would become support vectors only in the presence of patterns not in the current set. Of course, if N and M are about the same magnitude, then this algorithm is even worse than the basic methods.

• The decomposition method, described in [Osuna et al., 1997]. Following the chunking idea, we can optimise just a small training point subset with respect to the complete training set. The similarities with the chunking method are: first, small subsets are trained at each step; second, basic optimisation algorithms are used at each step. And it also has differences: first, each step calculates the optimisation state for a small subset with respect to the whole training set (therefore, no support vectors will be missed); second, the values of N and M are not so strictly constrained. The resulting value you get is an approximation having more guarantees of being close to the basic method approach (which gives the exact result), but at the same time it can handle many more data patterns. The size of the small subset is heuristics-driven, depending on the hypothesis complexity, the number of data points or the noise figures.

• The SMO method, described in [Platt, 1999]. The decomposition method is taken to the limit. The smallest data set you can optimise has two points, for condition (16) to hold. But the remarkable difference is that a two-point optimisation step can be calculated analytically, using a simple equation. Therefore, the optimisation method becomes very simple:
  • Until we are close enough to the correct solution:
    • Choose two points.
    • Optimise those two points with respect to the complete training set.

Generating an SVM optimisation algorithm is no longer a mathematician's field. Calculating the optimisation step is easy, and selecting the two-point set is made using a fixed set of heuristics, accepted by the scientific community as the best you


could imagine (until a better one is published). The SMO algorithm is the quickest and the most scalable of them all, so there is no more controversy about which optimisation method should be used.

The greatest disadvantages of SVM with respect to other Machine Learning methods are the training speed and the training set size limit. Using the decomposition algorithm, the biggest training set successfully tried had around 100,000 patterns. Time complexity is a bit harder to calculate, but it has been estimated empirically to be O(N²), while memory requirements decrease dramatically using SMO (no matrices needed).

When facing a really big problem (more than 100,000 points), the SVM user should try divide-and-conquer algorithms, allowing the SMO algorithm to perform more time-efficient tasks.

4.2 SMO Algorithm

After describing the main features of the optimisation methods, we will show in more detail the SMO algorithm basics. This section is divided in two parts: the takeStep function, where the two-point optimisation equation is implemented; and the point pair choice, where the heuristics are implemented and one usually ignored (but important to efficiency) parameter appears.

4.2.1 The takeStep Function

After reading Platt's paper [Platt, 1999], where the first SMO pseudo-code appears, the non-mathematician reader will have that uncomfortable "this is too complex for me to understand" feeling.

Because we do not want to lose your interest in SVM (after all, you read all the way up until now), we will follow Platt's pseudo-code regarding the takeStep function, adding some intermediate steps that will make it easier for engineers without a very strong mathematical background.

Given two points, A and B, defined by an input vector x and a label y as (x_A, y_A) and (x_B, y_B), we distinguish two cases: one where y_A = y_B and another where y_A ≠ y_B.

• Case y_A = y_B.

Because of constraint (16), α_A + α_B = H before and after the optimisation, where H is a constant real value, calculated using the previous α_A and α_B values.

The SVM tries to maximize L_D with respect to all α_i. In one SMO step we want to maximize L_D with respect to α_A and α_B. All other Lagrange multipliers are constants in this optimisation step.

The main condition for this step is \frac{\partial L_D}{\partial \alpha_A} = 0, where

L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)


We can separate this expression into two adding expressions:

Set_0 = \sum_{i=1}^{l} \alpha_i

Set_u = -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)

Suppose the set L is defined as the set of all training patterns except A and B. We can further separate the expression Set_u into nine adding expressions:

Set_1: when i = A and j = A
Set_2: when i = A and j = B
Set_3: when i = A, j ≠ B and j ≠ A
Set_4: when i = B and j = B
Set_5: when i = B and j = A
Set_6: when i = B, j ≠ B and j ≠ A
Set_7: when i ≠ A, i ≠ B and j = A
Set_8: when i ≠ A, i ≠ B and j = B
Set_9: when i ≠ A, i ≠ B, j ≠ A and j ≠ B

and substitute α_B by (H − α_A) everywhere it appears. Then

\frac{\partial L_D}{\partial \alpha_A} = \frac{\partial Set_0}{\partial \alpha_A} + \frac{\partial Set_1}{\partial \alpha_A} + \frac{\partial Set_2}{\partial \alpha_A} + \frac{\partial Set_3}{\partial \alpha_A} + \frac{\partial Set_4}{\partial \alpha_A} + \frac{\partial Set_5}{\partial \alpha_A} + \frac{\partial Set_6}{\partial \alpha_A} + \frac{\partial Set_7}{\partial \alpha_A} + \frac{\partial Set_8}{\partial \alpha_A} + \frac{\partial Set_9}{\partial \alpha_A} = 0

The complete expression is:

\frac{\partial L_D}{\partial \alpha_A} =
\frac{\partial\left(\alpha_1 + \alpha_2 + \cdots + \alpha_A + (H - \alpha_A) + \cdots + \alpha_l\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\,\alpha_A^2 K_{AA}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\,(H\alpha_A - \alpha_A^2) K_{AB}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\sum_{j \in L} \alpha_j \alpha_A y_j y_A K_{jA}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\,(H - \alpha_A)^2 K_{BB}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\,(H\alpha_A - \alpha_A^2) K_{AB}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\sum_{j \in L} \alpha_j (H - \alpha_A) y_j y_A K_{jB}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\sum_{i \in L} \alpha_i \alpha_A y_i y_A K_{iA}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\sum_{i \in L} \alpha_i (H - \alpha_A) y_i y_A K_{iB}\right)}{\partial \alpha_A}
+ \frac{\partial\left(-\frac{1}{2}\sum_{i \in L}\sum_{j \in L} \alpha_i \alpha_j y_i y_j K_{ij}\right)}{\partial \alpha_A}

After partial derivation:

\frac{\partial L_D}{\partial \alpha_A} = 0
- \frac{1}{2}\, 2\alpha_A K_{AA}
- \frac{1}{2}\,(H - 2\alpha_A) K_{AB}
- \frac{1}{2}\sum_{i \in L} \alpha_i y_i y_A K_{Ai}
- \frac{1}{2}\,(-2)(H - \alpha_A) K_{BB}
- \frac{1}{2}\,(H - 2\alpha_A) K_{AB}
- \frac{1}{2}\,(-1)\sum_{i \in L} \alpha_i y_i y_B K_{Bi}
- \frac{1}{2}\sum_{i \in L} \alpha_i y_i y_A K_{Ai}
- \frac{1}{2}\,(-1)\sum_{i \in L} \alpha_i y_i y_B K_{Bi}
+ 0


Adding up the members with equal K_{mn} term (and changing the overall sign), the condition becomes:

y_A \sum_{i \in L} \alpha_i y_i K_{Ai} - y_B \sum_{i \in L} \alpha_i y_i K_{Bi} + \alpha_A K_{AA} + (H - 2\alpha_A) K_{AB} - (H - \alpha_A) K_{BB} = 0

where the index i ranges over the set L (all training points except A and B). Therefore,

(40)  \alpha_A = \frac{y_B \sum_{i \in L} \alpha_i y_i K_{Bi} - y_A \sum_{i \in L} \alpha_i y_i K_{Ai} + H(K_{BB} - K_{AB})}{K_{AA} - 2K_{AB} + K_{BB}}

and α_B is very easy to compute, because α_B = H − α_A.

• Case y_A ≠ y_B.

Because of constraint (16), α_B = H + α_A before and after the optimisation, where H is a constant real value, calculated using the previous α_A and α_B values. The mathematical development is very similar to the first case (here the derivative of the first set is 2 instead of 0, and y_A y_B = −1 flips several signs). The formula to calculate the step is:

(41)  \alpha_A = \frac{-y_B \sum_{i \in L} \alpha_i y_i K_{Bi} - y_A \sum_{i \in L} \alpha_i y_i K_{Ai} + H(K_{AB} - K_{BB}) + 2}{K_{AA} - 2K_{AB} + K_{BB}}

and α_B = H + α_A.

Note that in the soft margin algorithm the α values must be between 0 and C. This is not guaranteed by the formulas written here. Therefore, a last step is needed: clipping. The α_A value must be set inside the defined bounds prior to the α_B calculation. Afterwards, the α_B value must be checked to be inside the same bounds, and otherwise clipped. Note that this clipping procedure only needs two checks, and that the H relation must always hold. Nevertheless, numerical stability must be ensured in the clipping procedure by using the add and subtract arithmetic operations only, avoiding inexact floating point multiply operations.

In both cases, you still have to calculate a couple of Σ terms, so the step complexity is related to the number of points in the training set. Suppose you implement an internal cache where those terms are calculated a priori. Then, the step complexity would be O(1). But this improvement is not free: now you have to update the cache value for all training points after each successful step, so the complexity becomes O(M) all over again. Actually you can limit the cache update only to the current


non-bound support vectors, that is, the points that will most probably be optimised in the next steps. Then, the complexity decreases to O(N'), N' being the number of non-bound support vectors, and usually

N' < N << M

The cache used stores the error term (usually called E in code implementations), where

E(x_A) = E_A = \text{SVMCurrentLearnedFunction}(x_A) - y_A = (w \cdot x_A) + b - y_A = \sum_{i=1}^{N} \alpha_i y_i K_{Ai} + b - y_A

Therefore, the previously defined term becomes:

\sum_{i \in L} \alpha_i y_i K_{Ai} = \sum_{i=1}^{N} \alpha_i y_i K_{Ai} - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB} = E_A - b + y_A - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB}

Substituting this term in (40) and (41) and expanding H gives a common and very simple formula for both cases (same and different class):

(42)  \alpha_A^{\text{new}} = \alpha_A^{\text{old}} + \frac{y_A (E_B - E_A)}{K_{AA} - 2K_{AB} + K_{BB}}
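A sketch of the resulting analytic step, following the clipping procedure described above (Platt's full pseudo-code additionally derives the exact joint bounds of the pair; the function and variable names are our own):

```python
def take_step(i1, i2, alpha, y, E, K, C):
    """One SMO two-point update using (42), then clipping to [0, C].
    alpha, y, E are arrays; K is the kernel matrix; returns True on change."""
    if i1 == i2:
        return False
    eta = K[i1, i1] - 2.0 * K[i1, i2] + K[i2, i2]   # denominator of (42)
    if eta <= 0:
        return False  # degenerate pair; Platt (1999) treats this case apart
    a1 = alpha[i1] + y[i1] * (E[i2] - E[i1]) / eta  # equation (42)
    a1 = min(max(a1, 0.0), C)                       # clip alpha_A first
    # keep the constraint-(16) invariant y1*a1 + y2*a2 constant:
    a2 = alpha[i2] + y[i1] * y[i2] * (alpha[i1] - a1)
    a2 = min(max(a2, 0.0), C)                       # then check/clip alpha_B
    alpha[i1], alpha[i2] = a1, a2
    return True
```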

This cache concept comes along with the SMO algorithm from its very beginnings. It may seem that no great advantage is gained by using and updating the cache instead of calculating the Σ terms at each optimisation step. The reason why it is done so lies in the heuristics.

4.2.2 The Heuristics

The greatest cost in the SMO algorithm is the error cache update. The optimisation step itself has a very low impact when using this cache. Therefore, as was said in the previous section, the fewer cache values updated, the better.

So, a two-fold search pattern must be used: first, the best optimisation pair must be tried at each step (the highest increase in L_D); second, the fewer pairs inspected for one step, the better.

For the first issue, we could use some of the hand-made optimisation rule examples given in section 2.2. In much the same way as those rules were figured out, the main heuristic defined in SMO is: choose the pair of points whose error cache difference is the greatest. Note that one of the effects of a two-point optimisation


is that both points become "well classified", i.e., their error value becomes 0. Looking at the L_D formula analysis in section 2.2, it can be seen that classification correctness is one of the main signs of reaching the optimum. If you optimise the two points which are most poorly classified, then the increase in the L_D value seems the highest. This approach is only a heuristic. Maybe it cannot be formally demonstrated, but it seems very plausible.

For the second issue, we need a search pattern using the least possible number of error calculations. When in an intermediate optimisation step, it has been demonstrated that those points whose Lagrange multiplier is at bound (α_i = 0 or α_i = C) will probably remain that way throughout the rest of the training. Thus, updating these points' error cache is not worthwhile.

Most of the time the non-bound points will be used for the optimisation procedure (and therefore, only these points will have their cache updated). In that usefulness order, support vectors at bound (α_i = C) and non-support vectors (α_i = 0) will most probably remain non-significant throughout the rest of the training.

Therefore, the choice of both points follows the same search pattern:
• First, try successful steps using every non-bound point (fine-grain).
• If not successful, try on the whole set (coarse-grain).

The training procedure can be seen as a two-loop approach, each loop selecting one of the optimisation pair points (called i1 and i2 in the original SMO pseudo-code).

Loop until L_D is high enough
    Choose next i1 following the two-set approach heuristic.
    Loop until an i2 choice is good enough or i1 cannot be optimised,
    choosing i2 in one of three ways:
        First try the point having the greatest error (cache) difference.
        If not successful, try all support vectors.
        If not successful, try all patterns.
        If not successful, i1 cannot be optimised at this step.
    If the i2 choice is successful, then takeStep(i1, i2).

Actually, the implementation is slightly different, as the takeStep procedure is used to assert whether the optimisation step is successful. But the idea behind this simple pseudo-code is quite a close approximation.

Remember that points at bound do not have their error cache updated. Therefore, when choosing i2, if the first try fails, then non-updated-cache points are used, decreasing the current step efficiency.

Much in the same line, there is a usually forgotten parameter in the SMO code that is very important to attain a greater efficiency degree. In the previous pseudo-code, there is no mention of when to switch the set used to choose i1. Current implementations use a very simple decision: when you reach an optimisation state where no first-set points can be optimised, switch to the bigger set. This condition is also used for the algorithm termination: when you are already using the bigger set and no points can be optimised, the current numerical approximation is good enough.


The effect of these sharp limits is lower efficiency. Suppose you are at an early optimisation step where at-bound points are still far from being in their correct state (a very common case). A set of non-bound points is ready for optimisation. The sharp-limit algorithm will optimise that small set until it is completely correct. That approach is not at all cheap; it is like generating a correct smaller SVM using some constant information. At each optimisation step, many other points could become non-optimal. Then, another pair will be chosen, and the previous pair could become non-optimal again. Fine-grain optimisation processes move at a very slow pace. And after all that work is performed, one good coarse-grain optimisation step can make all the previous fine-grain computation useless. Of course, fine-grain steps are needed as well as coarse-grain steps. A balance must be implemented, optimising the computational cost. One coarse-grain step always costs more than a fine-grain step, but the objective function increase in the former can be much greater than in the latter. In early steps, coarse-grain optimising performs better, while in close-to-optimum steps, fine-grain optimising is more cost-efficient. However, this analysis is not defined as a sharp rule; it is more like a fuzzy rule, where some heuristics must be used.

Note that you can always know approximately how far you are from the solution at any point in the optimisation procedure. At optimality, the difference L_P − L_D is 0. In complex problems, the difference will hardly be 0, as a numerical approximation is used and some degree of floating point error is allowed (using a precision parameter). But it gives a very good estimate of whether a fine-grain or a coarse-grain optimisation step is more likely to give better results. Note that the difference is non-negative, as L_P is always greater than (or equal to, at optimality) L_D (see section 2.2).

The basic SMO has been enhanced in several ways, fulfilling the premonition enounced in section 4: SVM algorithms will get more complex. Keerthi proposed in [Keerthi et al., 1999] two modifications of the SMO heuristics that highly upgrade performance, but they are a bit more difficult to follow. Also, C. J. Lin created and maintains LIBSVM (Library for Support Vector Machines), a freely available, open-source, unreadable, optimised implementation of SMO, considered worldwide as the fastest SVM training library.

5. CONCLUSIONS

The SVM community is growing fast. Nowadays all commercial pattern recognition toolboxes have the SVM algorithms as a standard Machine Learning method. In the early years, from 1995 to 1999, its implementation was too hard for the engineering community to develop and maintain. But after the publication of the SMO algorithm in 1999, the whole Machine Learning community is walking with big strides towards accepting the SVM as a reference method.

SVM theory has very strong foundations. Statistical learning theory has its roots deep in mathematical and logic knowledge, and SVM is just a new mathematically


developed appendix. Although its implementation may seem difficult, the solid background makes the SVM basics very easy to understand, making the results clear and useful for analysis.

6. ACKNOWLEDGEMENTS

The authors would like to thank Ignacio Melgar for the text review, and the I+D department at Sener Ingeniería y Sistemas for their support.

REFERENCES

C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. 13th International Conference on Machine Learning, pages 71-77, San Mateo, CA, 1996.

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining, 2(2), pages 121-167, 1998.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.

S. Keerthi, S. Shevade, C. Bhattacharyya and K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production Engineering, National University of Singapore, Singapore, August 1999.

E. Osuna, R. Freund and F. Girosi. An improved algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing 7, pages 276-285, Amelia Island, FL, 1997.

E. Osuna and F. Girosi. Reducing run-time complexity in SVMs. In Proceedings of the 14th International Conference on Pattern Recognition, pages 271-284, Brisbane, Australia, 1998.

J. Platt. Fast training of support vector machines using sequential minimal optimisation. In B. Schölkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, Cambridge, MA, 1999.

B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vector kernels. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 640-646, Cambridge, MA, 1998. MIT Press.

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281-287, Cambridge, MA, 1997. MIT Press.


CHAPTER 8

FRACTALS AS PRE-PROCESSING TOOL FOR COMPUTATIONAL INTELLIGENCE APPLICATION

ANA M. TARQUIS¹, VALERIANO MÉNDEZ¹, JUAN B. GRAU¹, JOSÉ M. ANTÓN¹, DIEGO ANDINA¹,²

¹ Dpto. de Matemática Aplicada, E.T.S. de Ingenieros Agrónomos, U.P.M., Av. Complutense s.n., Ciudad Universitaria, Madrid 28040, Spain.
² Dpto. de Señales, Sistemas y Radiocomunicaciones, E.T.S. Ingenieros de Telecomunicación, U.P.M., Av. Complutense s.n., Ciudad Universitaria, Madrid 28040, Spain.

Abstract: Preprocessing is the process of adapting the input of our Computational Intelligence (CI) problem to the CI technique applied. Images are inputs of many problems, and fractal processing of the images to extract relevant geometric characteristics is a very important tool. This chapter is dedicated to fractal preprocessing. In Pedology, fractal models were fitted to match the structure of soils, and techniques for multifractal analysis of soil images were developed, as described in a state-of-the-art panorama. A box-counting method and a gliding box method are presented, both obtaining sets of dimension parameters from images; they are evaluated in a discussed case study on images of samples, and the second seems preferable. Finally, a comprehensive list of references is given

Keywords: pedology; soil structure; multifractal; soil images; box-counting method; gliding-box method; capacity dimension; information dimension; correlation dimension; multiscale heterogeneity; fractal models; porous media; partition function

INTRODUCTION

Methods of analysis based on fractal theories have been developed for Pedology to describe the structure of soil, as natural soils, due to a combination of geology and the action of water, air and organisms, present a structure down to very small scales that is related to their physical, biological and agricultural properties. That structure tends to correspond to fractal paradigms, somehow maintaining a fairly similar structure when the scale is reduced to small microscopic ranges. Fractal models with a reduced number of parameters have been developed to describe naturally complex soils, and different experimental methods were created to compare theory with


reality or to evaluate parameters, including representative images of soils treated so as to maintain natural structure features while marking pores or flow patterns, etc. The next section contains a condensed panorama of the corresponding state of the art, with references to authors and methods. The third section explains the most common methods of subdividing an image to calculate fractal dimensions. Later, two related methods of image description are exposed with formulae, a box-counting method and a gliding box method, that assume a multifractal structure, and the type of image analysis resulting in parameters that include the capacity, information and correlation dimensions. Next, a case study of three obtained images is presented; using both methods a wide range of generalized dimensions is plotted for three samples, and these results are described and discussed to assess the methods. Finally, conclusions and a list of references are provided.

1. STATE OF THE ART

Many parameters may be used in the attempt to describe a disordered morphology, but the spatial arrangement of its most prominent features is a challenging problem throughout a wide range of disciplines (Ripley, 1988; Griffith, 1988; Baveye and Boast, 1998). In the case of 2-dimensional images of soil sections, several works have tried to describe the spatial structure by applying fractal techniques. Some studied the spatial arrangement of pore and solid spaces on images of sections of resin-impregnated soil (Protz and VandenBygaart, 1998; VandenBygaart and Protz, 1999). Thin soil sections are analysed by transmitted light to obtain images from which pores, filled with a resin, and solid spaces can be separated using image analysis techniques (Moran et al., 1989; Vogel and Kretzschmar, 1996). In other soil science areas, dye tracers are frequently used to study flow patterns in structured soils, and with modern photographic techniques they may provide excellent spatial resolution of the flow paths (Flury and Fluhler, 1994; 1995). The most common approach to describing dye patterns has been descriptive statistics of the vertical variation in dye coverage or shape parameters (Flury et al., 1994) based on a black and white image, like the ones used for soil-pore structure.

A main objective was to extract fractal dimensions which characterize multiscale and self-similar geometric structures within the images. In other words, we expect the image viewed at different resolutions to look the same. At a given resolution we should see the matrix as a collection of subsets similar to each other and, furthermore, similar to the whole. If such a hierarchical organization exists, it can be characterized by a mass fractal dimension D. These structures are either one of the phases (black or white) or the interface between the two within a 2-dimensional image. That such dimensions can be extracted from soil images is evident, whether from soil pore structure (Brakensiek et al., 1992; Peyton et al., 1994; Crawford et al., 1993; 1995; Anderson et al., 1996; 1998; Pachepsky et al., 1996; Giménez et al., 1997; 1998; Oleschko et al., 1997; 1998a; 1998b; Hallett et al., 1998; Bartoli et al., 1991; 1998; 1999; Dathe et al., 2001; Bird et al., 2000; 2003) or from flow paths in soils (Hatano and Booltink, 1992; Hatano et al., 1992; Booltink et al., 1993).


More recently, interest has turned to multifractal analysis of soil images. A multifractal, or more precisely a geometrical multifractal (Tel and Vicsek, 1987), is a non-uniform fractal which, unlike a uniform fractal, exhibits local density fluctuations. Its characterization requires not a single dimension but a sequence of generalized fractal dimensions (Pachepsky et al., 2000). A multifractal analysis to extract these dimensions from a soil image may have utility for more complex distributions, if there is marked variation in local density or porosity.

Multifractal analysis (MFA) has been applied to images of rock pore systems (Muller and McCauley, 1992; Saucier, 1992; Muller et al., 1995; Muller, 1996; Saucier and Muller, 1999; Saucier et al., 2002), reviewed in the context of soils by Tarquis et al. (2003) and recently applied to soils by Posadas et al. (2003). In other areas, researchers have calculated fractal dimensions of dye patterns in horizontal sections of undisturbed soil cores and related them to the outflow of percolating dye solution. Baveye et al. (1998) calculated the information dimension (D1) and the correlation dimension (D2) of dye stain patterns in vertical sections in a field soil. Based on these calculations, models have approximated the variability shown in the dye images using diffusion limited aggregation techniques (Persson et al., 2001).

Several authors have shown that obtaining the exact value of the generalized dimensions is not an easy task (Baveye et al., 1998; Baveye and Boast, 1998; Crawford et al., 1999; Ogawa et al., 1999), pointing to practical difficulties in extracting generalized dimensions (Vicsek, 1990; Buczkowski et al., 1998). Merits and limitations of multifractal analysis have been discussed by Aharony (1990), Beghdadi et al. (1993), and Andraud et al. (1994). In the same way, Chhabra et al. (1989) pointed out the risks in the estimation of the Hausdorff dimension and cited different sources of errors in some physical cases. The difficulties arising in practice are due to the fact that the relevant quantities used in the multifractal concept are estimated asymptotically, while in image analysis these estimations are much coarser and limited by the finite resolution of the image (Ahammer et al., 2003) and of the measure built on it, such as the number of black points in a box (Buczkowski et al., 1998). Some authors have also pointed out the influence of the percentage of black pixels in a 2-dimensional image on the Dq obtained (Dathe and Thullner, 2005; Bird et al., 2005; Tarquis et al., 2005; Dathe et al., 2005).

On the other hand, MFA involves partitioning the space of study into boxes to construct samples with multiple scales. The number of samples at a given scale is restricted by the size of the partitioning space and the data resolution, which is usually another main factor influencing statistical estimation in MFA (Cheng and Agterberg, 1996).

The purpose of this chapter is to ascertain the successful extraction of a spectrum of generalized dimensions from a soil image, trying to avoid the restrictions of the box-counting method, and to discern the existence or non-existence of a multifractal distribution of black or white within the image.

In the following sections, we review the box-counting algorithm used to obtain the generalized fractal dimensions of multifractal analysis, as well as the gliding-box method. We show some examples to highlight shortcomings of these methods.


2. FRACTAL CALCULATIONS

The main purpose of this section is to introduce some of the fractal methods used in the context of black and white image analysis of soil structure. A complete treatment of fractal and multifractal theories can be found, among others, in Feder (1989) and Baveye and Boast (1998).

2.1 Box-counting Method

This methodology is classical in this field and has generated a large volume of work. If a fractal line in a 2-dimensional space is covered by boxes of side length ε, the number of such boxes, n(ε), needed to cover the line when ε → 0 is (Mandelbrot, 1982):

(1) $n(\varepsilon) = c\,\varepsilon^{-D_L}$

The length of the line studied (e.g., the pore-solid interface), L(ε), can be defined at different scales and is equal to ε·n(ε). At small ε values the method provides a good approximation to the length of the line because the resolution of the image is approached. At larger sizes the difference between ε·n(ε) and the "true" length increases. Thus, DL, or capacity dimension, is estimated using small ε values (Giménez et al., 1997b). The box-counting method is also used to obtain a fractal dimension of pore space by counting boxes that are occupied by at least one pixel belonging to the class "pore" (Giménez et al., 1997b).
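As an illustration of Equation (1), the following sketch estimates DL by counting, for several small box sizes, the boxes that contain at least one foreground pixel of a binary image. It is a minimal sketch under our own naming: the helper function and the random placeholder image are ours, not the procedure of any of the works cited above.

```python
import numpy as np

def box_count(image, sizes):
    """For each box side r, count the boxes containing at least one
    foreground pixel of a square binary image."""
    n = image.shape[0]
    counts = []
    for r in sizes:
        # partition the image into r x r blocks and flag the non-empty ones
        blocks = image[:n - n % r, :n - n % r].reshape(n // r, r, n // r, r)
        counts.append(np.count_nonzero(blocks.any(axis=(1, 3))))
    return np.array(counts)

# n(eps) = c * eps**(-D_L) (Eq. 1): the slope of log n vs. log eps is -D_L
image = np.random.rand(512, 512) < 0.2   # placeholder; a real boundary map would be used
sizes = np.array([2, 4, 8, 16, 32])      # small eps values only, as argued above
slope, _ = np.polyfit(np.log(sizes), np.log(box_count(image, sizes)), 1)
print(f"capacity dimension estimate D_L = {-slope:.3f}")
```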

2.2 Dilation Method

Dathe et al. (2001) is the only published report of this method in soil science. The dilation method follows essentially the same procedure as the box-counting method, but instead of boxes it uses other structuring elements to cover the object under study, e.g., circles (Dathe et al., 2001). The image is formed by pixels, which are either square or rectangular in shape. If circles are used, the measure of scale is their diameter (as the side length of the box is in the box-counting technique). If we want to have the same dilation in any direction, the orthogonal and diagonal increments should be biased by √2, which corresponds to the hypotenuse of a square of unit side length (Kaye, 1989). The length of the studied object is counted by numbers of circles, and the slope of the regression line between the log of the object length and the log of the element diameter is defined by the relation:

(2) $L(\varepsilon) = c\,\varepsilon^{1-D_L}$

Dathe et al. (2001) applied the box-counting and dilation methods to the same images and found non-significant differences in the values of the fractal dimensions obtained with the two methods. They pointed out, however, that the fractal dimensions estimated with the two methods are conceptually different: the box-counting dimension is the Kolmogorov dimension, while the dimension obtained with the dilation method is the embedding dimension (Minkowski-Bouligand). For further details, see Takayasu's work (Takayasu, 1990).
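One possible realization of the dilation idea is the Minkowski-sausage variant sketched below: the object is dilated with disks of growing diameter and the length estimate L(ε) is the dilated area divided by the diameter, fitted to Equation (2). The disk helper, the toy straight segment and the radii are our own illustrative choices, not the setup of Dathe et al. (2001).

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """Binary disk-shaped structuring element of the given pixel radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def dilation_lengths(obj, radii):
    """Length estimates L(eps) = dilated_area / eps, with eps the disk diameter."""
    areas = [np.count_nonzero(ndimage.binary_dilation(obj, disk(r)))
             for r in radii]
    return np.array(areas) / (2.0 * np.array(radii))

# L(eps) = c * eps**(1 - D_L) (Eq. 2): the regression slope gives 1 - D_L
obj = np.zeros((256, 256), bool)
obj[128, 32:224] = True                  # toy object: a straight segment, D_L ~ 1
radii = np.array([1, 2, 4, 8, 16])
slope, _ = np.polyfit(np.log(2 * radii), np.log(dilation_lengths(obj, radii)), 1)
print(f"dilation-method dimension estimate D_L = {1 - slope:.3f}")
```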

2.3 Random Walk

Fractal methods can also be used to describe the dynamic properties of fractal networks (Crawford et al., 1993; Anderson et al., 1996). Characterizations of fractals involving space and time are achieved through the use of fractons (Kaye, 1989) or the spectral dimension (Orbach, 1986). For example, Crawford et al. (1999) related measurements of the spectral dimension d to diffusion through soil, associating d with the degree to which the network delays the diffusing particle in a given direction.

The determination of d is based on random walks, where for each walk the number of steps taken (ns) and the number of different pore pixels visited (Sn) are computed. At the beginning of the random walk a pore pixel is randomly chosen; then a random step is taken to another pore pixel from the eight pixels surrounding the present one (see Figure 1A). If the new pore pixel has not been visited by the random walk, Sn and ns are increased by one; otherwise only ns is increased. The random walk stops when a certain number of null steps (the step goes into a site that has been used previously during the walk) is reached or the random walk arrives at an edge of the image (for further details see Crawford et al., 1990).

Figure 1. Possible steps taken in: a) an eight-connected random walk, and b) a four-connected random walk. The present position of the pore pixel is marked by an x, and the arrows indicate the possible next pore pixel. (From Tarquis et al. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim Eds., CRC Press, 2003. With permission)


A graphical representation of these random walks is shown in Figure 2. The number of walks and the maximum number of null steps for each walk can vary (Anderson et al., 1996). A four-connected random walk (Figure 1B) can also be used instead of an eight-connected one (Anderson et al., 1996).

For each random walk, d is calculated based on the relation:

(3) $n_s = c\,S_n^{d/2}$

where c is a constant. The mean value of the d calculated for each walk is the spectral dimension.
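A simplified sketch of one such walk is given below. It follows the description above only loosely (for instance, a move attempt onto a solid pixel is simply retried, and the stopping rules of Crawford et al. differ in detail), and all names and the random test image are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
MOVES = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]

def random_walk(pore, max_null=1000):
    """Eight-connected random walk over pore pixels, recording the number of
    steps n_s and of distinct pore pixels visited S_n along the trajectory."""
    ys, xs = np.nonzero(pore)
    k = rng.integers(len(ys))
    y, x = int(ys[k]), int(xs[k])            # random pore starting pixel
    visited = {(y, x)}
    ns, null = 0, 0
    ns_hist, Sn_hist = [], []
    while null < max_null:
        dy, dx = MOVES[rng.integers(8)]
        ny, nx = y + dy, x + dx
        if not (0 <= ny < pore.shape[0] and 0 <= nx < pore.shape[1]):
            break                            # the walk reached an image edge
        if not pore[ny, nx]:
            continue                         # solid pixel: retry another move
        ns += 1
        if (ny, nx) in visited:
            null += 1                        # null step: site already visited
        else:
            visited.add((ny, nx))
        y, x = ny, nx
        ns_hist.append(ns)
        Sn_hist.append(len(visited))
    return np.array(ns_hist), np.array(Sn_hist)

# n_s = c * S_n**(d/2) (Eq. 3): slope of log n_s vs. log S_n estimates d/2
pore = np.random.default_rng(1).random((256, 256)) < 0.4   # toy pore map
ns, Sn = random_walk(pore)
d = 2 * np.polyfit(np.log(Sn), np.log(ns), 1)[0]
print(f"spectral dimension estimate from one walk: d = {d:.3f}")
```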

Figure 2. A simplified example of one random walk through the pore space (Anderson et al., 1996). (From Anderson et al., Soil Sci. Soc. Am. J., 60, 962, 1996. With permission)


3. CALCULATION OF GENERALIZED FRACTAL DIMENSIONS

3.1 Box-counting Method

Generalized dimensions calculated using the box-counting technique basically account for the mass contained in each box. An image is divided into a number n(r) of boxes of size r (pixels in each dimension, making r·r pixels per box), and for each box the fraction μi of pore space in that box is defined and calculated as

(4) $\mu_i = \frac{m_i}{M} = m_i \Big/ \sum_{i=1}^{n(r)} m_i$

where mi is the number of pore-class pixels in box i and M is the total number of pore-class pixels in the image. In this case, the pore space area is the measure, whereas the support is the pore space itself. The next step is to define the generating function χ(q, r) as:

(5) $\chi(q,r) = \sum_{i=1}^{n(r)} \mu_i(q,r), \quad q \in \mathbb{R}$

where

(6) $\mu_i(q,r) = \mu_i^{q} = \left( m_i \Big/ \sum_{i=1}^{n(r)} m_i \right)^{q}$

μi is a weighted measure that represents the percentage of pore space in the i-th box, and q is the weight or moment of the measure. When computing boxes of size r, the possible values of mi run from 0 to r·r. Therefore, let Nj(r) be the number of boxes containing j pixels of pore space in that grid. Equations (5) and (6) then become (Barnsley et al., 1988):

(7) $\chi(q,r) = \sum_{i=1}^{n(r)} \mu_i^{q} = \sum_{i=1}^{n(r)} \left(\frac{m_i}{M}\right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r)\left(\frac{j}{M}\right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r)\left(\frac{j}{\sum_{k=1}^{r \cdot r} k\,N_k(r)}\right)^{q}$

Using the distribution function Nj(r), calculations become simpler and computational errors are smaller.
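The sketch below computes χ(q, r) exactly in this way: box masses mi are accumulated, their distribution Nj(r) is obtained with np.unique, and empty boxes are excluded so that negative moments remain finite. It is a minimal sketch under our own naming, not code from the chapter's case study.

```python
import numpy as np

def generating_function(image, r, qs):
    """chi(q, r) of Eq. (7), computed from the distribution N_j(r) of box
    masses (number of pore pixels per box of side r)."""
    n = image.shape[0]
    blocks = image[:n - n % r, :n - n % r].reshape(n // r, r, n // r, r)
    masses = blocks.sum(axis=(1, 3)).ravel()   # m_i for every box
    masses = masses[masses > 0]                # empty boxes would blow up q < 0
    j, Nj = np.unique(masses, return_counts=True)
    M = float(np.sum(j * Nj))                  # total pore pixels, as in Eq. (7)
    return np.array([np.sum(Nj * (j / M) ** q) for q in qs])

image = np.random.rand(1024, 1024) < 0.2       # placeholder binary image
qs = np.arange(-10, 11)
for r in (8, 16, 32, 64, 128):
    chi = generating_function(image, r, qs)
    print(r, np.round(np.log(chi[[0, 10, 20]]), 2))   # q = -10, 0, +10
```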

A log-log plot of a self-similar measure, χ(q, r), vs. r at various values of q gives

(8) $\chi(q,r) \sim r^{\tau(q)}$


where τ(q) is the qth mass exponent (Feder, 1989). We can express τ(q) as:

(9) $\tau(q) = \lim_{r \to 0} \frac{\log \chi(q,r)}{\log r}$

Then, the generalized dimension, Dq, can be introduced by the following scaling relationship (Feder, 1989):

(10) $D_q = \lim_{r \to 0} \frac{1}{q-1}\,\frac{\log \chi(q,r)}{\log r}$

And, therefore

(11) $\tau(q) = (q-1)\,D_q$

For the case q = 1, Equation (11) cannot be applied and the following equation should be used:

(12) $D_1 = \lim_{r \to 0} \frac{\sum_{i=1}^{n(r)} \mu_i(1,r)\,\log \mu_i(1,r)}{\log r}$

The generalized dimensions, Dq, for q = 0, 1 and 2 are known as the capacity, the information, and the correlation dimensions, respectively (Hentschel and Procaccia, 1983). The capacity dimension is the box-counting or fractal dimension. The information dimension is related to the entropy of the system, whereas the correlation dimension computes the correlation of measures contained in boxes of various sizes (Posadas et al., 2003).
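Putting Equations (10)-(12) together, a hedged sketch of the full estimation follows: Dq is the regression slope of log χ(q, r) against log r divided by (q − 1), with the entropy form used for q = 1. The helper names, box sizes and test image are our own assumptions.

```python
import numpy as np

def box_measures(image, r):
    """Box measures mu_i of Eq. (4) for boxes of side r (empty boxes dropped)."""
    n = image.shape[0]
    blocks = image[:n - n % r, :n - n % r].reshape(n // r, r, n // r, r)
    m = blocks.sum(axis=(1, 3)).ravel().astype(float)
    m = m[m > 0]
    return m / m.sum()

def D_q(image, rs, q):
    """D_q from the slope of log chi(q, r) vs. log r (Eqs. 10-11); the
    entropy form of Eq. (12) is used for q = 1."""
    logr = np.log(rs)
    if np.isclose(q, 1.0):
        ent = [np.sum(mu * np.log(mu)) for mu in (box_measures(image, r) for r in rs)]
        return np.polyfit(logr, ent, 1)[0]
    logchi = [np.log(np.sum(box_measures(image, r) ** q)) for r in rs]
    return np.polyfit(logr, logchi, 1)[0] / (q - 1.0)

image = np.random.rand(1024, 1024) < 0.2       # placeholder binary image
rs = np.array([8, 16, 32, 64, 128])
for q, name in [(0, "capacity D0"), (1, "information D1"), (2, "correlation D2")]:
    print(name, "=", round(D_q(image, rs, q), 3))
```

For a spatially uniform random image such as this placeholder, all three estimates should come out close to 2, the dimension of the plane.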

Given these definitions and the behaviour to expect in the case of a multifractal measure, it is again instructive to seek the lower and upper bounds for χ(q, r) in order to establish what scope exists for behaviour other than that associated with a multifractal measure (for further explanation see Bird et al., 2005). Following Bird et al.'s work (2005), a brief explanation is given to establish bounds for χ(q, r). Four separate ranges of values of the parameter q should be considered:

Case q > 1

The smallest value that χ(q, r) can take corresponds to a uniform distribution of pore phase over the image. The largest value that χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The lower and upper bounds are then as follows:

(13) $\left(\frac{L}{r}\right)^{2(1-q)} < \chi(q,r) < \left(\frac{L}{r}\right)^{2(1-q)} f^{(1-q)}, \quad q > 1$

where f is the fraction of the image occupied by black pixels and L is the length of the image.


Case q = 1

The smallest value of the entropy corresponds to the case in which each grid block covering the pore phase is entirely filled by pore phase. The largest value corresponds to a uniform distribution. Lower and upper bounds are then as follows:

(14) $2\ln\!\left(\frac{L}{r}\right) + \ln(f) < -\sum_{i=1}^{n(r)} \mu_i \ln(\mu_i) < 2\ln\!\left(\frac{L}{r}\right), \quad q = 1$

Case 0 ≤ q < 1

The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The largest value corresponds to a uniform distribution of pore phase over the image. Lower and upper bounds are as follows:

(15) $\left(\frac{L}{r}\right)^{2(1-q)} f^{(1-q)} < \chi(q,r) < \left(\frac{L}{r}\right)^{2(1-q)}, \quad 0 \le q < 1$

Case q < 0

The smallest value χ(q, r) can take corresponds to the case in which each grid block covering the pore phase is entirely filled by pore. The function is monotonically decreasing; therefore, the value corresponding to r = 1 pixel can be selected as an upper bound:

(16) $\left(\frac{L}{r}\right)^{2(1-q)} f^{(1-q)} < \chi(q,r) < L^{2(1-q)} f^{(1-q)}, \quad q < 0$

Having defined these bounds, we now seek to examine their significance in terms of extracting generalized dimensions from image data. For q > 1 and for 0 ≤ q < 1, the bounding functions, when plotted on the log-log plot used to extract the dimension, yield two parallel lines with a vertical separation of (1−q) ln(f).

For q = 1, the bounding functions, when included in the plot of entropy against ln(r), again yield two parallel lines of slope 2 with a separation of ln(f). Thus, in these cases we reach the same impasse as with the fractal analysis: depending on f, and independently of the actual geometry considered, the data can be so constrained as to yield convincing straight-line fits with associated derived dimensions.
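These bounds are easy to tabulate. The sketch below (our own code, with an assumed image side and black fraction) evaluates the bounding functions of Equations (13) and (15) and shows that, on logarithmic axes, they differ by the constant (1 − q) ln(f) at every r, which is what constrains the fits.

```python
import numpy as np

def chi_bounds(q, L, r, f):
    """Bounding values for chi(q, r) from Eqs. (13) and (15): the term
    (L/r)**(2(1-q)) with and without the porosity factor f**(1-q)."""
    base = (L / r) ** (2.0 * (1.0 - q))
    if q > 1:
        return base, base * f ** (1.0 - q)     # Eq. (13): lower, upper
    return base * f ** (1.0 - q), base         # Eq. (15): lower, upper

L, f, q = 1024, 0.19, 2.0                      # assumed image side, black fraction
for r in (8, 32, 128):
    lo, hi = chi_bounds(q, L, r, f)
    gap = np.log(hi) - np.log(lo)              # constant: (q - 1) * ln(1/f)
    print(f"r={r:4d}  ln lower={np.log(lo):8.2f}  ln upper={np.log(hi):8.2f}  gap={gap:.2f}")
```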

3.2 Gliding Box Method

The gliding-box method was originally used for lacunarity analysis (Allain and Cloitre, 1991). Later, it was modified by Cheng (1997a, 1997b) for estimating τ(q) as follows:

(17) $\langle \tau(q) \rangle + D = -\frac{\log\left(\langle M(q,r) \rangle\right)}{\log(r/r_{min})}$


where D is the dimension of the Euclidean space in which the image is embedded (in this case D = 2) and M represents the multiplier measured on each pixel as:

(18) $M(q,r) = \left(\frac{\mu(r_{min})}{\mu(r)}\right)^{q}$

For further details see Grau et al. (2006). The advantage of using Equation (17) in comparison with Equation (9) is that the estimation is independent of the box size r, which allows the use of only two successive box sizes to estimate τ(q). Equation (18) requires that μ(rmin) not be null.

Once this estimation is done, Equation (11) can be applied to estimate Dq. For the case q = 1 the following relationship is applied, based on the work of Saucier and Muller (1999):

(19) $D_1 = 2D_2 - D_3$
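A hedged sketch of this estimator is given below: window sums over gliding boxes of sides rmin and r are obtained with scipy.ndimage.uniform_filter, the multipliers of Equation (18) are averaged where μ(rmin) is non-zero, and Equations (17), (11) and (19) yield τ(q), Dq and D1. The border handling and the choice rmin = 4, r = 8 are our own simplifications, not the settings of the case study below.

```python
import numpy as np
from scipy import ndimage

def gliding_box_tau(image, r_min, r, qs, D=2):
    """tau(q) from Eq. (17), averaging the multipliers of Eq. (18) over all
    gliding-box positions where mu(r_min) is non-zero."""
    pore = image.astype(float)
    # uniform_filter returns window means; multiply by area to get window sums
    m_small = ndimage.uniform_filter(pore, r_min) * r_min ** 2
    m_large = ndimage.uniform_filter(pore, r) * r ** 2
    keep = m_small > 0.5                       # Eq. (18) needs mu(r_min) != 0
    ratio = m_small[keep] / m_large[keep]      # mu(r_min)/mu(r) per position
    return np.array([-np.log(np.mean(ratio ** q)) / np.log(r / r_min) - D
                     for q in qs])

image = np.random.rand(512, 512) < 0.2         # placeholder binary image
qs = np.array([-2.0, 0.0, 2.0, 3.0])
tau = gliding_box_tau(image, r_min=4, r=8, qs=qs)
Dq = tau / (qs - 1.0)                          # Eq. (11), valid for q != 1
D1 = 2 * Dq[2] - Dq[3]                         # Eq. (19): D1 = 2*D2 - D3
print("D(-2), D0, D2, D3 =", np.round(Dq, 3), " D1 ~", round(D1, 3))
```

Note that for q = 0 the multiplier is identically 1, so this construction forces D0 = 2, the point taken up again in the discussion of Figure 8.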

4. IMAGES FOR THE CASE STUDY

Three soil samples were selected with the aim of representing a range of void pattern distributions in soils and a wide range of porosity values, from 5% to 47%. Each of the samples was prepared for image analysis following the procedure described by Protz and VandenBygaart (1998).

The data were obtained by imaging thin sections with a Kodak 460 RGB camera using transmitted and circularly polarized illumination. The data were cropped from 3060 × 2036 pixels to 3000 × 2000 pixels. Then the data were classified with EASI/PACE software and the void bitmap was separated; each individual pixel was 18.6 × 18.6 microns. The images of these soils are shown in Figure 3.

To avoid any interference from edge effects in the calculations using the box-counting method, an area of 1024 × 1024 pixels in the upper left corner of the original images was selected.

5. RESULTS OF THE CASE STUDY AND DISCUSSION

5.1 Generating Function with the Box-counting Method

For the three binary images, χ(q, r) was calculated and a bi-log plot of χ(q, r) versus r was made to observe the behavior. All plots showed a clear pattern in the data. In Figure 4, for example, at negative q there were two distinctive areas: one where there was a linear relationship between log(r) and log(χ(q, r)), and another where the value of log χ(q, r) was almost constant versus log(r). For the three images, the box size at which the behavior changes is around 64 pixels. These two phases were not evident for positive q values (see Figure 4).

The existence of a plateau phase of log(χ(q, r)) can be explained by the nature of the measure under consideration. At r values close to 1, the variation in the number of black pixels is based on a few pixels, the extreme case being r = 1, where the measure can only take the value 0 or 1. Thus, for small boxes of size r the proportions among their values are mainly constant. However, when the box size passes a certain size, a scaling pattern begins.


Figure 3. Soil binary images, pore phase in black pixels, of samples: (a) ADS, (b) BUSO and (c) EHV1. The images have 5.65%, 19.17% and 46.67% porosity, respectively


Figure 4. Bi-log plot of χ(q, r) versus box size (r) at different mass exponents (q), from q = −10 to q = +10: A) ADS; B) BUSO; C) EHV1



5.2 Generalized Dimensions Using the Box-counting Method

If all of the regression points are considered, the Dq values, mainly for q < 0, were quite different from those obtained if only the regression points in the linear-behavior range were chosen (Figure 5). Between the two criteria, almost any Dq can be obtained, but for q ≥ 0 the differences are not significant. Many authors have pointed out this fact since the first applications of multifractal analysis to experimental results (Tarquis et al., 2005).

The implications of these changes in Dq, quite noticeable in this case, make impossible any comparison and calculation of the amplitude of the dimensions (D−10 − D+10), as has been done in several works.

The differences among the Dq representations (Figure 5, filled circles) are mainly found in the negative part. In particular, comparing ADS (Figure 5A, filled circles) with the rest, it is evident that it does not show a multifractal behavior.

All the D0 values obtained are 2 (the dimension of the plane). This overestimation is due to the fact that the studied range was selected to give an optimum fit for all the q values. However, looking at the lower and upper bounds of the box-counting plots for q = 0 (Figure 6), it is quite clear that, regardless of the structure in the image, a linear fit will be obtained with a high r2.

The standard errors (data not shown) of the Dq obtained in the linear-behavior phase are minimal, and the r2 values of the regression analysis very high. However, this is not surprising if we realize that only three points are being used. In addition, the number of boxes of each size is very low: for a size of 128 × 128 pixels the number of boxes is 64, and for 256 × 256 pixels it is 16, when analyzing an image of 1024 × 1024 pixels, which is considered a representative elementary area (VandenBygaart and Protz, 1999).

This size restriction is avoided by using the gliding-box method, whose results are discussed in the next section.

5.3 Generalized Dimensions Using the Gliding Box Method

For the three binary images, ⟨M(q, r)⟩ was calculated and a bi-log plot of ⟨M(q, r)⟩ versus r/rmin was made. All plots showed a linear relationship, as expected, with enough points to calculate a linear regression and to estimate Dq from the line's slope (Figure 7). In the case of EHV1 for q < −6 (Figure 7C), the linear relationship is not as clear as in the rest of the images.

Finally, a comparison of the Dq values obtained with both methods can be studied in Figure 8. In all of the graphics, D0 appears again with a value of 2, imposed by the gliding-box method as explained in section 3.2.

For ADS (Figure 8A) both curves are similar. The displayed range of Dq values has been changed on purpose, to show that the plotting scale could induce an error in our conclusions, when it was already evident in Figure 5 that Dq was an almost constant value.


Figure 5. Generalized dimensions (Dq) from q = −10 to q = +10 for all points of the regression line (filled squares) and for the three selected points based on the bi-log plot of χ(q, r) (filled circles) of each image: A) ADS; B) BUSO; C) EHV1


The differences between the two methods for BUSO and EHV1 (Figures 8B and 8C, respectively) are bigger for the negative q values, although for positive values Dq shows a stronger decay (Grau et al., 2006).

6. CONCLUSIONS

Over the last years, fractal and multifractal concepts have been increasingly applied in the analysis of porous materials, including soils, and in the development of fractal models of porous media. In terms of modeling, it is important to characterize the multiscale heterogeneity of soil structure in a useful way, but the blind application of these analyses does not achieve it.

Figure 6. Box-counting plots for EHV1 soil images, q = 0, with upper and lower bounds: (a) solid phase, (b) pore phase. (From Bird et al., J. of Hydrol., 322, 211, 2006. With permission)


Figure 7. Bi-log plot of ⟨M(r, q)⟩ versus box size ratio (r/rmin) at different mass exponents (q), from q = −10 to q = +10: A) ADS; B) BUSO; C) EHV1


Figure 8. Generalized dimensions (Dq) from q = −10 to q = +10 based on the gliding-box method (empty squares) and on the box-counting method (filled circles) using the same box size range: A) ADS; B) BUSO; C) EHV1


The results obtained by the box-counting and gliding-box methods for multifractal modeling of soil pore images show that the gliding-box method provides more consistent results, as it creates a larger number of large-size boxes than the box-counting method and avoids the restriction that the box-counting method imposes on the partition function.

7. ACKNOWLEDGEMENTS

We thank Dr. Richard Heck of Guelph University for the soil images. We are very indebted to Dr. N. Bird, Dr. Q. Cheng and Dr. D. Giménez for helpful discussions. This work was supported by the Technical University of Madrid (UPM) and the Madrid Autonomous Community (CAM), Project No. M050020163.

REFERENCES

Aharony, A., 1990. Multifractals in physics – successes, dangers and challenges. Physica A 168, 479–489.

Ahammer, H., De Vaney, T.T.J. and Tritthart, H.A., 2003. How much resolution is enough? Influence of downscaling the pixel resolution of digital images on the generalised dimensions. Physica D 181(3–4), 147–156.

Allain, C. and Cloitre, M., 1991. Characterizing the lacunarity of random and deterministic fractal sets. Physical Review A 44, 3552–3558.

Anderson, A.N., McBratney, A.B. and FitzPatrick, E.A., 1996. Soil Mass, Surface, and Spectral Fractal Dimensions Estimated from Thin Section Photographs. Soil Sci. Soc. Am. J. 60, 962–969.

Anderson, A.N., McBratney, A.B. and Crawford, J.W., 1998. Applications of fractals to soil studies. Adv. Agron. 63, 1.

Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.O., Saupe, D. and Voss, R.F., 1988. The Science of Fractal Images. Edited by H.O. Peitgen and D. Saupe, Springer-Verlag, New York.

Bartoli, F., Philippy, R., Doirisse, S., Niquet, S. and Dubuit, M., 1991. Structure and self-similarity in silty and sandy soils; the fractal approach. J. Soil Sci. 42, 167–185.

Bartoli, F., Bird, N.R., Gomendy, V., Vivier, H. and Niquet, S., 1999. The relation between silty soil structures and their mercury porosimetry curve counterparts: fractals and percolation. Eur. J. Soil Sci. 50(9).

Bartoli, F., Dutartre, P., Gomendy, V., Niquet, S. and Vivier, H., 1998. Fractal and soil structures. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 203–232.

Baveye, P. and Boast, C.W., 1998. Fractal Geometry, Fragmentation Processes and the Physics of Scale-Invariance: An Introduction. In: Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 1.

Baveye, P., Boast, C.W., Ogawa, S., Parlange, J.Y. and Steenhuis, T., 1998. Influence of image resolution and thresholding on the apparent mass fractal characteristics of preferential flow patterns in field soils. Water Resour. Res. 34, 2783–2796.

Bird, N., Díaz, M.C., Saa, A. and Tarquis, A.M., 2006. Fractal and Multifractal Analysis of Pore-Scale Images of Soil. J. Hydrol. 322, 211–219.

Bird, N.R.A., Perrier, E. and Rieu, M., 2000. The water retention function for a model of soil structure with pore and solid fractal distributions. Eur. J. Soil Sci. 51, 55–63.

Bird, N.R.A. and Perrier, E.M.A., 2003. The pore-solid fractal model of soil density scaling. Eur. J. Soil Sci. 54, 467–476.

Booltink, H.W.G., Hatano, R. and Bouma, J., 1993. Measurement and simulation of bypass flow in a structured clay soil; a physico-morphological approach. J. Hydrol. 148, 149–168.


Brakensiek, D.L., Rawls, W.J., Logsdon, S.D. and Edwards, W.M., 1992. Fractal description of macroporosity. Soil Sci. Soc. Am. J. 56, 1721–1723.

Buczkowski, S., Hildgen, P. and Cartilier, L., 1998. Measurements of fractal dimension by box-counting: a critical analysis of data scatter. Physica A 252, 23–34.

Cheng, Q. and Agterberg, F.P., 1996. Comparison between two types of multifractal modeling. Mathematical Geology 28(8), 1001–1015.

Cheng, Q., 1997a. Discrete multifractals. Mathematical Geology 29(2), 245–266.

Cheng, Q., 1997b. Multifractal modeling and lacunarity analysis. Mathematical Geology 29(7), 919–932.

Crawford, J.W., Baveye, P., Grindrod, P. and Rappoldt, C., 1999. Application of Fractals to Soil Properties, Landscape Patterns, and Solute Transport in Porous Media. In: Assessment of Non-Point Source Pollution in the Vadose Zone, Geophysical Monograph 108, Corwin, Loague and Ellsworth, Eds., American Geophysical Union, Washington, DC, 151.

Crawford, J.W., Ritz, K. and Young, I.M., 1993. Quantification of fungal morphology, gaseous transport and microbial dynamics in soil: an integrated framework utilising fractal geometry. Geoderma 56, 1578.

Crawford, J.W., Matsui, N. and Young, I.M., 1995. The relation between the moisture-release curve and the structure of soil. Eur. J. Soil Sci. 46, 369–375.

Dathe, A., Eins, S., Niemeyer, J. and Gerold, G., 2001. The surface fractal dimension of the soil-pore interface as measured by image analysis. Geoderma 103, 203.

Dathe, A., Tarquis, A.M. and Perrier, E., 2006. Multifractal analysis of the pore- and solid-phases in binary two-dimensional images of natural porous structures. Geoderma, doi:10.1016/j.geoderma.2006.03.024, in press.

Dathe, A. and Thullner, M., 2005. The relationship between fractal properties of solid matrix and pore space in porous media. Geoderma 129, 279–290.

Feder, J., 1989. Fractals. Plenum Press, New York, 283 pp.

Flury, M. and Fluhler, H., 1994. Brilliant blue FCF as a dye tracer for solute transport studies – A toxicological overview. J. Environ. Qual. 23, 1108–1112.

Flury, M. and Fluhler, H., 1995. Tracer characteristics of brilliant blue. Soil Sci. Soc. Am. J. 59, 22–27.

Flury, M., Fluhler, H., Jury, W.A. and Leuenberger, J., 1994. Susceptibility of soils to preferential flow of water: A field study. Water Resour. Res. 30, 1945–1954.

Giménez, D., Allmaras, R.R., Nater, E.A. and Huggins, D.R., 1997a. Fractal dimensions for volume and surface of interaggregate pores – scale effects. Geoderma 77, 19–38.

Giménez, D., Perfect, E., Rawls, W.J. and Pachepsky, Y., 1997b. Fractal models for predicting soil hydraulic properties: a review. Eng. Geol. 48, 161–183.

Gouyet, J.G., 1996. Physics and Fractal Structures. Masson, Paris.

Grau, J., Méndez, V., Tarquis, A.M., Díaz, M.C. and Saa, A., 2006. Comparison of gliding box and box-counting methods in soil image analysis. Geoderma, doi:10.1016/j.geoderma.2006.03.009, in press.

Griffith, D.A., 1988. Advanced Spatial Statistics. Kluwer Academic Publishers, Boston.

Hallett, P.D., Bird, N.R.A., Dexter, A.R. and Seville, P.K., 1998. Investigation into the fractal scaling of the structure and strength of soil aggregates. Eur. J. Soil Sci. 49, 203–211.

Hatano, R. and Booltink, H.W.G., 1992. Using Fractal Dimensions of Stained Flow Patterns in a Clay Soil to Predict Bypass Flow. J. Hydrol. 135, 121–131.

Hatano, R., Kawamura, N., Ikeda, J. and Sakuma, T., 1992. Evaluation of the effect of morphological features of flow paths on solute transport by using fractal dimensions of methylene blue staining patterns. Geoderma 53, 31.

Hentschel, H.G.R. and Procaccia, I., 1983. The infinite number of generalized dimensions of fractals and strange attractors. Physica D 8, 435.

Kaye, B.G., 1989. A Random Walk through Fractal Dimensions. VCH Verlagsgesellschaft, Weinheim, Germany, 297 pp.

Mandelbrot, B.B., 1982. The Fractal Geometry of Nature. W.H. Freeman, San Francisco, CA.

McCauley, J.L., 1992. Models of permeability and conductivity of porous media. Physica A 187, 18–54.


Moran, C.J., McBratney, A.B. and Koppi, A.J., 1989. A rapid method for analysis of soil macropore structure. I. Specimen preparation and digital binary production. Soil Sci. Soc. Am. J. 53, 921–928.

Muller, J., 1996. Characterization of pore space in chalk by multifractal analysis. J. Hydrology 187, 215–222.

Muller, J., Huseby, O.K. and Saucier, A., 1995. Influence of Multifractal Scaling of Pore Geometry on Permeabilities of Sedimentary Rocks. Chaos, Solitons & Fractals 5, 1485–1492.

Muller, J. and McCauley, J.L., 1992. Implication of Fractal Geometry for Fluid Flow Properties of Sedimentary Rocks. Transp. Porous Media 8, 133–147.

Ogawa, S., Baveye, P., Boast, C.W., Parlange, J.Y. and Steenhuis, T., 1999. Surface fractal characteristics of preferential flow patterns in field soils: evaluation and effect of image processing. Geoderma 88, 109.

Oleschko, K., Fuentes, C., Brambila, F. and Alvarez, R., 1997. Linear fractal analysis of three Mexican soils in different management systems. Soil Technol. 10, 185.

Oleschko, K., 1998a. Delesse principle and statistical fractal sets: 1. Dimensional equivalents. Soil & Tillage Research 49, 255.

Oleschko, K., Brambila, F., Aceff, F. and Mora, L.P., 1998b. From fractal analysis along a line to fractals on the plane. Soil & Tillage Research 45, 389.

Orbach, R., 1986. Dynamics of fractal networks. Science (Washington, DC) 231, 814.

Pachepsky, Y.A., Yakovchenko, V., Rabenhorst, M.C., Pooley, C. and Sikora, L.J., 1996. Fractal parameters of pore surfaces as derived from micromorphological data: effect of long term management practices. Geoderma 74, 305.

Pachepsky, Y.A., Giménez, D., Crawford, J.W. and Rawls, W.J., 2000. Conventional and fractal geometry in soil science. In: Fractals in Soil Science, Pachepsky, Crawford and Rawls, Eds., Elsevier Science, Amsterdam, 7.

Persson, M., Yasuda, H., Albergel, J., Berndtsson, R., Zante, P., Nasri, S. and Öhrström, P., 2001. Modeling plot scale dye penetration by a diffusion limited aggregation (DLA) model. J. Hydrol. 250, 98–105.

Peyton, R.L., Gantzer, C.J., Anderson, S.H., Haeffner, B.A. and Pfeifer, P., 1994. Fractal dimension to describe soil macropore structure using X ray computed tomography. Water Resource Research 30, 691.

Posadas, A.N.D., Giménez, D., Quiroz, R. and Protz, R., 2003. Multifractal Characterization of Soil Pore Spatial Distributions. Soil Sci. Soc. Am. J. 67, 1361–1369.

Protz, R. and VandenBygaart, A.J., 1998. Towards systematic image analysis in the study of soil micromorphology. Science Soils 3. (Available online at http://link.springer.de/link/service/journals/)

Ripley, B.D., 1988. Statistical Inference for Spatial Processes. Cambridge Univ. Press, Cambridge.

Saucier, A., 1992. Effective permeability of multifractal porous media. Physica A 183, 381.

Saucier, A. and Muller, J., 1993. Remarks on some properties of multifractals. Physica A 199, 350.

Saucier, A. and Muller, J., 1999. Textural analysis of disordered materials with multifractals. Physica A 267, 221.

Saucier, A., Richer, J. and Muller, J., 2002. Statistical mechanics and its applications. Physica A 311(1–2), 231–259.

Takayasu, H., 1990. Fractals in the Physical Sciences. Manchester University Press, Manchester.

Tarquis, A.M., Giménez, D., Saa, A., Díaz, M.C. and Gascó, J.M., 2003. Scaling and Multiscaling of Soil Pore Systems Determined by Image Analysis. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and Selim, Eds., CRC Press, 434 pp.

Tarquis, A.M., McInnes, K.J., Keys, J., Saa, A., García, M.R. and Díaz, M.C., 2006. Multiscaling Analysis in a Structured Clay Soil Using 2D Images. J. Hydrol. 322, 236–246.

Tel, T. and Vicsek, T., 1987. Geometrical multifractality of growing structures. J. Physics A: General 20, L835–L840.

VandenBygaart, A.J. and Protz, R., 1999. The representative elementary area (REA) in studies of quantitative soil micromorphology. Geoderma 89, 333–346.


Vicsek, T., 1990. Mass multifractals. Physica A 168, 490–497.

Vogel, H.J. and Kretzschmar, A., 1996. Topological characterization of pore space in soil-sample preparation and digital image-processing. Geoderma 73, 23–38.