
GENETIC PLANS AND THE PROBABILISTIC LEARNING SYSTEM: SYNTHESIS AND RESULTS

Larry Rendell
Department of Computer Science, University of Illinois at Urbana-Champaign
1304 West Springfield Avenue, Urbana, Illinois 61801

ABSTRACT

This paper describes new conceptual and experimental results using the probabilistic learning system PLS2. PLS2 is designed for any task in which overall performance can be measured, and in which choice of task objects or operators influences performance. The system can manage incremental learning and noisy domains.

PLS2 learns in two ways. Its lower 'perceptual' layer clusters data into economical cells or regions in augmented feature space. The upper 'genetic' level of PLS2 selects successful regions (compressed genes) from multiple, parallel cases. Intermediate between performance data and task control structures, regions promote efficient and effective learning.

Novel aspects of PLS2 include compressed genotypes, credit localization, and 'population performance'. Incipient principles of efficiency and effectiveness are suggested. Analysis of the system is confirmed by experiments demonstrating stability, efficiency, and effectiveness.

Figure 1. Layered learning system PLS2. The perceptual learning system PLS1 serves as the performance element (PE) of the genetic system PLS2. The PE of PLS1 is some task. PLS2 activates PLS1 with different knowledge structures ('cumulative region sets') which PLS2 continually improves. The basis for improvement is competition and credit localization.

1. INTRODUCTION

The author's probabilistic learning system PLS is capable of efficient and effective generalization learning in many domains [Re83a, Re83d, Re85a]. Unlike other systems [La83, Mit83, Mic83a], PLS can manage noise, and learn incrementally. While it can be used for 'single concept' learning, like the systems described in [Di82], PLS has been developed and tested in the difficult domain of heuristic search, which requires not only noise management and incremental learning, but also removal of bias acquired during task performance [Re83a]. The system can discover optimal evaluation functions (see Fig. 2). PLS has introduced some novel approaches, such as new kinds of clustering.¹


Figure 2. One use of PLS. In heuristic search, an object is a state, and its utility might be the probability of contributing to success (appearing on a solution path). E.g., for r3 this probability is 1/3. Here the pair (r3, p3) is one of three regions which may be used to create a heuristic evaluation function. Region characteristics are determined by clustering.

1. See [Re83a] for details, and [Re85a, Re85b] for discussion of PLS's 'conceptual clustering' [Mic83b], which began in [Re76, Re77]. PLS's 'utility' of domain objects provides 'category cohesiveness' [Me85]. [Re85c] introduces 'higher dimensional' clustering, which permits creation of structure. Appendix A summarizes some of these terms, which will be expanded in later sections of this paper.

Another successful approach to adaptation is genetic algorithms (GAs). Aside from their ability to discover global optima, GAs have several other important characteristics, including stability, efficiency, flexibility, and extensibility [Ho75, Ho81]. While the full behavior of genetic algorithms is not yet known in detail, certain characteristics have been established, and this approach compares very favorably with other methods of optimization [Be80, Br81, De80]. Because of their performance and potential, GAs have been applied to various AI learning tasks [Re83c, Sm80, Sm83].

In [Re83c] a combination of the above two approaches was described: the doubly layered learning system PLS2 (see Fig. 1).² PLS1, the lower level of PLS2, could be considered perceptual: it compresses goal-oriented information (task utility) into a generalized, economical, and useful form (regions; see Figs. 2, 4). The upper layer is genetic: a competition of parallel knowledge structures. In [Re83c] each of these components was argued to improve efficacy and efficiency.³

This paper extends and substantiates these claims, conceptually and empirically. The next section gives an example of a genetic algorithm which is oriented toward the current context. Section 3 describes the knowledge structure (regions) from two points of view: PLS1 and PLS2. Section 4 examines the synthesis of these two systems and considers some reasons for their efficiency. Sections 5 and 6 present and analyze the experimental results, which show the system to be stable, accurate, and efficient. The paper closes with a brief summary, and a glossary of terms used in machine learning and genetic systems.

2. For the reader unfamiliar with learning system and other terminology, Appendix A provides brief explanations.

3. PLS2 is applicable to any domain for which features and 'usefulness' or utility of objects can be defined [Re83d]. An object can represent a physical entity or an operator over the set of entities. Domains can be simple (e.g., 'single concept' learning) or complex (e.g., expert systems). State-space problems and games have been tested in [Re83a, Re83d]. The PLS approach is uniform, and can be deterministic or probabilistic. The only real difficulty with a new domain is in constructing features which bear a smooth relationship to the utility (the system can evaluate and screen features presented to it).

2. GENETIC SYSTEMS: AN EXAMPLE

This section describes a simple GA, to introduce terminology and concepts, and to provide a basis for comparison with the more complex PLS2. The reader already familiar with GAs may wish to omit all but the last part of this section.

2.1 Optimization

Many problems can be regarded as function optimization. In an AI application this may mean discovery of a good control structure for executing some task. The function to be optimized is then some measure of task success, which we may call the performance μ. In the terminology of optimization, μ is the objective function. In the context of genetic systems, μ is the fitness, payoff, or merit.⁴

The merit μ depends on some control structure, the simplest example of which is a vector of weights b = (b1, b2, ..., bn). Frequently the analytic form of μ(b) is not known, so exact methods cannot be used to optimize it (this is the case with most AI problems). But what often is available (at some cost) is the value of μ for a given control structure. In our example, let us suppose that μ can be obtained for any desired value of b by testing system performance. If μ is a well behaved, smooth function of b, and if there is just one peak in the μ surface, then this local optimum is also a global optimum, which can be efficiently discovered using hill climbing techniques. However, the behavior of μ is often unknown, and may have numerous optima; in these cases a genetic adaptive algorithm is appropriate.

2.2 Genetic Algorithm

In a GA, a structure of interest, such as a weight vector b, is called a phenotype. Fig. 3 shows a simple example with just two weights, b1 and b2. The phenotype is normally coded as a string of digits (usually bits) called the genotype B. A single digit is a gene; gene values are alleles. The position of a gene within the genotype is given by an index called the locus. Depending on the resolution desired, we might choose a greater or lesser number of sequential genes to code each bi. If we consider 5 bits to be sufficient, the length of the genotype B will be L = 5n bits (see Fig. 3).

4. μ might also be called the 'utility', but we reserve this term for another kind of quality measure, used by PLS1.


Instead of searching weight space directly for an optimal vector b, a GA searches gene space, which has dimensionality L (gene space is Hamming space if alleles are binary). A GA conducts this search in parallel, using a set of individual genotypes called a population or gene pool. By comparing the relative merits μ of individuals in a population, and by mating only the better individuals, a GA performs an informed search of gene space. This search is conducted iteratively, over repeated generations. In each new generation there are three basic operations performed: (1) selection of parents, (2) generation of offspring, and (3) replacement of individuals. (1) and (2) have been given more attention. Parent selection is usually stochastic, weighted in favor of individuals having higher μ values. Offspring generation relies on genetic operators which modify parent genotypes. Two natural examples are mutation (which alters one allele) and crossover (which slices two genotypes at a common locus and exchanges segments; see Fig. 3).
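To make these three operations concrete, the following minimal sketch (in Python) implements fitness-proportional parent selection, one-point crossover, and bitwise mutation. The merit function, genotype length, population size, and rates are illustrative placeholders, not values from the paper; in PLS the merit would come from measured task performance.

    import random

    L = 10   # genotype length in bits (e.g., 5 bits for each of two weights)
    K = 20   # population size

    def merit(genome):
        # Placeholder objective: in the paper, merit comes from measuring
        # task performance under the control structure the genome encodes.
        return sum(genome) + 1e-9            # small offset keeps weights positive

    def select_parent(population):
        # (1) Stochastic parent selection, weighted in favor of higher merit.
        weights = [merit(g) for g in population]
        return random.choices(population, weights=weights, k=1)[0]

    def crossover(a, b):
        # Slice both genotypes at a common locus and exchange segments.
        point = random.randint(1, L - 1)
        return a[:point] + b[point:], b[:point] + a[point:]

    def mutate(genome, rate=0.01):
        # Alter each allele (bit) independently with small probability.
        return [bit ^ 1 if random.random() < rate else bit for bit in genome]

    population = [[random.randint(0, 1) for _ in range(L)] for _ in range(K)]
    for generation in range(50):
        offspring = []
        while len(offspring) < K:
            p1, p2 = select_parent(population), select_parent(population)
            c1, c2 = crossover(p1, p2)       # (2) offspring generation
            offspring += [mutate(c1), mutate(c2)]
        population = offspring[:K]           # (3) replacement of individuals
    print(max(population, key=merit))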


Figure 3. Simple genetic system. The upper part of this diagram shows a small population of just seven individuals. Here the set of characteristics (the phenotype) is a simple two element vector b. This is coded by the genotype B. Each individual is associated with its measured merit μ. On the basis of their μ values, pairs of individuals are stochastically chosen as parents for genetic recombination. Their genotypes are modified by crossover to produce two new offspring.

Because the more successful parents are selected for mating,⁵ and because limited operations are performed on them to produce offspring, the effect is a combination of knowledge retention and controlled search. Holland proved that, using binary alleles, the crossover operator, and parent selection proportional to μ, a GA is K³ times more efficient than exhaustive search of gene space, where K is the population size [Ho75, Ho81]. Several empirical studies have verified the computational efficiency of GAs compared with alternative procedures for global optimization, and have discovered interesting properties of GAs, such as effects of varying K. For example, populations smaller than 50 can cause problems [Br81, De80].

2.3 Application in Heuristic Search

One AI use is search for solutions to problems, or for wins in games [Ni80]. Here we wish to learn an evaluation function H as a combination of variables x1, x2, ..., xn called attributes or features (features are often used to describe states in search). In the simplest case, H is expressed as the linear combination b1x1 + b2x2 + ... + bnxn = b·x, where the bi are weights to be learned. We want to optimize the weight vector b according to some measure μ of the performance when H is used to control search.

A rational way to define μ (which we shall use throughout this paper) is related to the average number D of states or nodes developed in solving a set of problems. Suppose D is observed for a population of K heuristic functions defined by weight vectors b_i. Since the performance improves with lower values of D, a good definition of the merit of H_i (i.e., of b_i) is the relative performance measure μ_i = D̄/D_i, where D̄ is the average over the population, i.e., D̄ = (Σ_j D_j)/K. This expression of merit could be used to assess genotypes B_i representing weight vectors b_i, as depicted in Fig. 3.
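As a small worked sketch of this definition (the node counts D_i below are invented for illustration, not experimental values):

    # Relative merit mu_i = Dbar / D_i for a population of K = 5 heuristics.
    D = [420, 510, 380, 650, 470]   # avg nodes developed under each H_i
    Dbar = sum(D) / len(D)          # population average, here 486.0
    mu = [Dbar / d for d in D]      # a lower D gives a merit above 1.0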

Instead of this simple genetic approach, however, PLS2 employs unusual genotypes and operators, some of which relate to PLS1. In the remaining sections of this paper we shall examine the advantages of the GA resulting from the combination of PLS1 with PLS2.

5. Notice that search takes place both at the level of the task domain (for good problem solutions) and at the level of the learning element (for a good control structure H).


3. PLS INFORMATION STRUCTURING: DUAL VIEWPOINT

The connection between PLS1 perceptual learning and PLS2 genetic adaptation is subtle and indirect. Basically, PLS1 deals with objects x (which can be just about anything) and their relationships to task performance. Let us call the usefulness of an object x in some task domain its utility u(x).

Since the number of objects is typically immense, even vast observation is incomplete, and generalization is required for prediction of u given a previously unencountered x. A significant step in generalization is usually the expression of x as a vector of high-level, abstract features x1, x2, ..., xn, so that x really represents not just one object but rather a large number of similar objects (e.g., in a board game, x might be a vector of features such as piece advantage, center control, etc.). A further step in generalization is to classify or categorize x's which are similar for current purposes.⁶ Since the purpose is to succeed well in a task, PLS1 classifies x's having similar utilities u.

Class formation can be accomplished in several ways, depending on the model assumed. If the task domain and features permit, objects having similar utilities may be clustered in feature space, as illustrated in Figs. 2, 4,⁷ giving a 'region set' R. Another model is the linear combination H = b·x of §2.

It is at this point that a GA like PLS2 can aid the learning process. Well performing b's or R's may be selected according to their merit μ. Note that merit μ is an overall measure of the task performance, while utility u is a quality measure localized to individual objects.

The question now is what information structures to choose for representing knowledge about task utility. For many reasons, PLS incorporates the 'region set' (Fig. 4), which represents domain knowledge by associating an object with its utility. We examine the region set from two points of view: as a PLS1 knowledge structure, and as a PLS2 genetic structure.

6. Here to classify means to form classes, categories, or concepts. This is difficult to automate.

7. PLS1 initiated what has become known as 'conceptual clustering', where not just feature values are considered, but also predetermined forms of classes (e.g., rectangles) and the whole data environment (e.g., utility). See [Re76, Re77, Re83a, Re85a, Re85b] and also Appendix A.



Figure 4. Dual interpretation of a region set R. A region set is a partition of feature space (here there are 6 regions). Points are clustered into regions according to their utility u in some task domain (e.g., u = probability of contributing to task success; see Fig. 2). Here the u values are shown inside the rectangles. A region R is the triple (r, u, e), where e is the error in u. The region set R = {R} serves both as the PLS1 knowledge structure and as the PLS2 genotype. In PLS1, R is a discrete (step) function expressing variation of utility u with features xi. In PLS2, R is a compressed version of the detailed genotype illustrated in Fig. 5.

3.1 The Region as PLS1 Knowledge Structure

In a feature space representation,⁸ an object⁹ is a vector x = (x1, x2, ..., xn). In a problem or game, the basic object is the state, frequently expressed as a vector of features such as piece advantage, center control, mobility, etc. Observations made during the course of even many problems or games normally cover just a fraction of feature space, and generalization is required for prediction.

In generalization learning, objects are abstracted to form classes, categories, or concepts. This may take the form of a partition of feature space, i.e., a set of mutually exhaustive local neighborhoods called clusters or cells [An73, Di82]. Since the goal of clustering in PLS is to aid task performance, the basis for generalization is some measure of the worth, quality, or utility of a state or cell, relative to the task.

8. Feature spaces are sometimes avoided because they cannot easily express structure. However, alternative representations as normally used are also deficient for realistic generalization learning. A new scheme mechanizes a very difficult inductive problem: feature formation [Re83d, Re85c].

9. The object or event could just as well be an operator to be applied to a state, or a state-operator pair. See [Re83d].

One measure of utility is the probability of contributing to a solution or win. In Figs. 2 and 4, probability classes are rectangular cells (for economy). The leftmost rectangle r has probability u = 0.20. The rectangle r is a category generalizing the conditions under which the utility u applies.¹⁰

In PLS, a rectangle is associated not just with its utility u, but also with the utility error e. This expression e of uncertainty in u allows quantification of the effect of noise, and provides an informed and concise means for weighting various contributions to the value of u during learning. The triple R = (r, u, e), called a region, is the main knowledge structure for PLS1. A set R = {R} of regions defines a partition in augmented feature space.

R may be used directly as a (discrete) evaluation or heuristic function H = u(r) to assess state x ∈ r in search. For example, in Fig. 4 there are six regions, which differentiate states into six utility classes. Instead of forming a discrete heuristic, R may be used indirectly as data for determining the weight vector b in a smooth evaluation function H = b·x (employing curve fitting techniques). We shall return to these algorithmic aspects of PLS in §4.
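The direct use of R as a discrete heuristic can be sketched as follows; the rectangles and utility values are invented for illustration (cf. Fig. 4), not taken from the paper's experiments.

    # A region set: each region is (low corner, high corner, utility u, error e).
    regions = [
        ((0, 0), (4, 2), 0.20, 0.05),
        ((4, 0), (8, 2), 0.05, 0.02),
        ((0, 2), (8, 6), 0.60, 0.10),
    ]

    def H(x):
        # Discrete evaluation function: the utility of the cell containing x.
        for lo, hi, u, e in regions:
            if all(l <= xi < h for xi, l, h in zip(x, lo, hi)):
                return u
        return 0.0   # default for uncovered feature space

    print(H((1.5, 1.0)))   # falls in the first rectangle, so H = 0.20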

Figure 5. Definition of maximally detailed genotype U. If the number of points in feature space is finite, and a value of the utility is associated with each point, complete information can be captured in a detailed genotype U of concatenated utilities u1u2...uL. Coordinates could be linearly ordered, as shown here for the two dimensional case. U is the fully expanded genotype corresponding to the compressed version of Fig. 4.

10. This could be expressed in other ways. The production rule form is r → u. Using logic, r is represented (0 ≤ x1 ≤ 4) ∧ (0 ≤ x2 ≤ 2).

3.2 The Region Set as Compressed and Unrestricted PLS2 Genotype

Now let us examine these information structures from the genetic viewpoint. The weight vector b or evaluation function H could be considered a GA phenotype. What might the genotype be? One choice, a simple one, was described in §2 and illustrated in Fig. 3: here the genotype B is just a binary coding of b. A different possibility is one that captures exhaustive information about the relationship between utility u and feature vector x (see Fig. 5). In this case the gene would be (x, u). If the number of genes is finite, they can be indexed and concatenated to give a very detailed genotype U, which becomes a string of values u1u2...uL coding the entire utility surface in augmented feature space.

This genotype U is unusual in some important ways. Let us compare it with the earlier example B of §2 (Fig. 3). B is simply a binary form of weight vector b. One obvious difference between B and U is that U is more verbose than B. This redundancy aspect will be considered shortly. The other important difference between B and U is that alleles within B may well interact (to express feature nonlinearity), but alleles uj within U cannot interact (since the uj express an absolute property of feature vector x, i.e., its utility for some task). As explained in the next section, this freedom from gene interdependence permits localization of credit.¹¹

The detailed genotype U codes the utility surface, which may be very irregular at worst, or very smooth at best. This surface may be locally well behaved (it may vary slowly in some volumes of feature space). In cases of local regularity, portions of U are redundant. As shown in Fig. 5, PLS2 compresses the genotype U into the region set R (examined in §3.1 from the PLS1 viewpoint). In PLS2, a single region R = (r, u, e) is a set of genes, the whole having just one allele u (we disregard the genetic coding of e). Unlike standard genotypes, which have a stationary locus for each gene and a fixed number of genes, the region set has no explicit loci, but rather a variable number of elements (regions), each representing a variable number of genes. A region compresses gene sets having similar utility, according to current knowledge.

11. While one of the strengths of a GA is its ability to manage interaction of variables (by 'co-adapting' alleles), PLS2 achieves efficient and concise knowledge representation and acquisition by flexible gene compression, and by certain other methods examined later in this paper.



4. KNOWLEDGE ACQUISITION: SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to provide barely adequate information about the utility surface. This compact representation results in economy of both space and time, and in effective learning. Some reasons for this power are considered.

The ultimate purpose of PLS is to discover utility classes in the form of a region set R. This knowledge structure controls the primary task; for example, in heuristic search, R = {R} = {(r, u, e)} defines a discrete evaluation function H(r) = u.

The ideal R would be perfectly accurate and maximally compressed. Accuracy of utility u determines the quality of task performance. Appropriate compression of R characterizes the task domain concisely but adequately (see Figs. 3, 4), saving storage and time, both during task performance and during learning.

These goals of accuracy and economy are approached by the doubly layered learning system PLS2 (Fig. 1). PLS1 and PLS2 combine to become effective rules for generalization (induction), specialization (differentiation), and reorganization. The two layers support each other in various ways; for example, PLS2 stabilizes the perceptual system PLS1, and PLS1 maintains genotype diversity of the genetic system PLS2. In the following, we consider details first from the standpoint of PLS1, then from the perspective of PLS2.

4.1 PLS1 Revision and Differentiation

Even without a genetic component, PLS1 is a flexible learning system which can be employed in noisy domains requiring incremental learning. It can be used for simple concept learning, like the systems in [Di82], but most experiments have involved state space problem solving and game playing.¹² Here we examine PLS in the context of these difficult tasks.

12. These experiments have led to unique results, such as discovery of locally optimal evaluation functions (see [Re83a, Re83d]).


As described in §3.1, the main PLS1 knowledge structure is the region set R = {(r, u, e)}. Intermediate between basic data obtained during search and a general heuristic used to control search, R defines a feature space augmented and partitioned by u and e. Because R is improved incrementally, it is called the cumulative region set. PLS1 repeatedly performs two basic operations on R. One operation is correction or revision (of utility u and error e), and the other is specialization, differentiation, or refinement (of feature space cells r). These operators are detailed in [Re83a, Re83d]; here we simply outline their effects and note their limitations.

Revision of u and e. For an established region R = (r, u, e) ∈ R, PLS1 is able to modify u and to decrease e by using new data. This is accomplished in a rough fashion by comparing established values within all rectangles r with fresh values within the same r. It is difficult or impossible to learn the 'true' values of u, since data are acquired during performance of hard tasks, and these data are biased in unknown ways because of nontrivial search.
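As one hedged illustration of how the error e can weight old against new evidence, the sketch below uses inverse-variance averaging. This is an assumed form chosen for illustration, not the exact PLS1 update rule (which is detailed in [Re83a, Re83d]).

    def revise(u_old, e_old, u_new, e_new):
        # Weight each estimate by its reliability, taken here as 1/error^2.
        w_old, w_new = 1.0 / e_old**2, 1.0 / e_new**2
        u = (w_old * u_old + w_new * u_new) / (w_old + w_new)
        e = (w_old + w_new) ** -0.5      # combined error shrinks with evidence
        return u, e

    # A fresh, noisier estimate nudges u upward only slightly:
    print(revise(0.20, 0.05, 0.30, 0.10))   # -> (0.22, ~0.045)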

Refinement of R. Alternately performing, then learning, PLS1 acquires more and more detail about the nature of variation of utility u with features. This information accumulates in the region set R = {R} = {(r, u, e)}, where the primary effect of clustering is increasing resolution of R. The number, sizes, and shapes of rectangles in R reflect current knowledge resolution. As this differentiation continues in successive iterations of PLS1, attention focuses on more useful parts of feature space, and heuristic power improves.

Unfortunately, so does the likelihood of error. Further, errors are difficult to quantify, and hard to localize to individual regions.

In brief, while the incremental learning of PLS1 is powerful enough to learn locally optimal heuristics under certain conditions, and while PLS1 feedback is good enough to control and correct mild errors, the feedback can become unstable in unfavorable situations: instead of being corrected, errors can become more pronounced. Moreover, PLS1 is sensitive to parameter settings (see Appendix B). The system needs support.

4.2 PLS2 Genetic Operators

Qualities missing in PLS1 can be provided by PLS2. As §4.1 concluded, PLS1, with its single region set, cannot discover accurate values of utilities u. PLS2, however, maintains an entire population of region sets, which means that several regions in all cover any given feature space volume. The availability of comparable regions ultimately permits greater accuracy in u, and brings other benefits.

As §3.2 explained, a PLS2 genotype is the region set R = {R}, and each region R = (r, u, e) ∈ R is a compressed gene whose allele is the utility u. Details of an early version of PLS2 are given in [Re83c]. Those algorithms have been improved; the time complexity of the operators in recent program implementations is linear with population size K. The following discussion outlines the overall effects and properties of the various genetic operators (compare to the more usual GA of §2).

K-sexual mating is the operator analogous to crossover. Consider a population R of K different region sets R_i. Each set is composed of a number of regions R, which together cover feature space. A new region set R′ is formed by selecting individual regions (one at a time) from parents R_i, with probability proportional to merit (merit μ is the performance of R_i, defined at the end of §2). Selection of regions from the whole population of region sets continues until the feature space cover is approximately the average cover of the parents. This creates the offspring region set R′, which is generally not a partition.
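The following sketch shows the flavor of this operator under stated assumptions (two features, one merit value per parent set, rectangle area as the measure of cover); the actual PLS2 implementation is described in [Re83c].

    import random

    def area(region):
        (lo, hi, u, e) = region
        return (hi[0] - lo[0]) * (hi[1] - lo[1])   # 2-D cover, for simplicity

    def k_sexual_mating(parents, merits):
        # Target cover: the average cover of the K parent region sets.
        target = sum(sum(area(R) for R in P) for P in parents) / len(parents)
        pool = [(R, m) for P, m in zip(parents, merits) for R in P]
        offspring, cover = [], 0.0
        while cover < target:
            # Draw one region at a time, with probability proportional to merit.
            R, = random.choices([r for r, _ in pool],
                                weights=[m for _, m in pool], k=1)
            offspring.append(R)      # regions may overlap: the offspring is
            cover += area(R)         # generally not a partition until the
        return offspring             # reorganization step below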

Gene reorganization. For economy of storage and time, the offspring region set R′ is repartitioned, so that regions do not overlap in feature space.

Controlled mutation. Standard mutation operators alter an allele randomly. In contrast, the PLS2 operator analogous to mutation changes an allele according to evidence arising in the task domain. The controlled mutation operator for a region set R = {(r, u, e)} is the utility revision operator of PLS1: as described in §4.1, PLS1 modifies the utility u for each feature space cell r.

Genotype expansion. This operator is also provided by PLS1. Recall the discussion of §3.2 about the economy resulting from compressing genes (utility-feature vectors) into a region set R. The refinement operator was described in §4.1. This feature space refinement amounts to an expansion of the genotype R, and is carried out when data warrant the discrimination.

Both controlled mutation and genotype expansion promote genotype diversity. Thus PLS1 helps PLS2 to avoid premature convergence, a typical GA problem [Br81, Ma84].

4.3 Effectiveness and Efficiency

The power of PLS2 may be traced to certain aspects of the perceptual and genetic algorithms just outlined. Some existing and emerging principles of effective and efficient learning are briefly discussed below (see also [Re85a, Re85b, Re85c]).

Credit localization. The selection of regions for K-sexual mating may use a single merit value μ for each region R within a given set R. However, the value of μ can just as well be localized to single regions within R, by comparing R with similar regions in other sets. Since regions estimate an absolute quantity (task-related utility) in their own volume of feature space, they are independent of each other. Thus credit and blame may be assigned to feature space cells (i.e., to gene sequences).

Assignment of credit to individual regions within a cumulative set R is straightforward, but it would be difficult to do directly in the final evaluation function H, since the components of H, while appropriate for performance, omit information relevant to learning (compare Figs. 2, 4).¹³

Knowledge mediation. Successful systems tend to employ information structures which mediate data objects and the ultimate knowledge form. These mediating structures include means to record growing assurance of tentative hypotheses.

When used in heuristic search, the PLS region set mediates large numbers of states and a very concise evaluation function H.

13. There are various possibilities for the evaluation function H, but all contain less useful information than their determinant, the region set R. The simplest heuristic, used in [Re83a, Re83d], is H = b·f, where b is a vector of weights for the feature vector f. (This linear combination is used exclusively in experiments to be described in §5.) The value of b is found using regions as data in linear regression [Re83a, Re83b].


Retention and continual improvement of this mediating structure refines the credit assignment problem. This view is unlike that of [Di81, p.14; Di82]: learning systems often attempt to improve the control structure itself, whereas PLS acquires knowledge efficiently in an appropriate structure, and utilizes this knowledge by compressing it only temporarily for performance. In other words, PLS does not directly search rule space for a good H, but rather searches for good cumulative regions from which H is constructed.

Full but controlled use of every datum. Samuel's checker player permitted each state encountered to influence the heuristic H, and at the same time no one datum could overwhelm the system. The learning was stochastic, both conservative and economic. In this respect PLS2 is similar (although more automated).

Schemata in learning systems and genetic algorithms. A related efficiency in both Samuel's systems and PLS is like the schemata concept in a GA. In a GA, a single individual coded as a genotype (a string of digits) supports not only itself, but also all its substrings. Similarly, a single state arising in heuristic search contains information about every feature used to describe it. Thus each state can be used to appraise and weight each feature. (The effect is more pronounced when a state is described in more elementary terms and combinations of primitive descriptors are assessed; see [Re85c].)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re83a]. The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e., the performance of a whole region set, without localization of credit to individual regions. This measure μ was the average number of nodes developed D in a training sample of 8 fifteen puzzles, divided into the mean of all such averages in a population of K sets, i.e., μ = D̄/D, where D̄ is the average over the population (D̄ = (Σ_j D_j)/K).

Changing environment. For these experiments, successive rounds of training were repeated in incremental learning over several iterations or generations. The environment was altered in successive generations; it was specified as problem difficulty or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ..., dn).

Here d was static, one known to be a good progression based on previous experience with user training [Co84].¹⁴ In these experiments d was always (8, 14, 22, 30, ∞, …). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1. In this case, static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lag is detected and problem difficulty is reduced (i.e., dynamic training is needed).

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning are generated by the system. Each node can be considered a separate example for concept learning [Re83d]. [Co84] describes experiments with an automated trainer.



Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance, PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training, PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their μ's are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures, but the probability of finding solutions is only ½, then the average cost of useful information is 500 ÷ ½ = 1000.

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?
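A tiny worked sketch of this expected-cost measure follows; D = 500 and p = ½ come from the text, while the stricter criterion's probability is an invented value for contrast.

    D = 500                          # nodes developed by a viable control structure
    for criterion, p in [("finds a solution", 0.5),
                         ("within 25% of optimal", 0.25)]:   # 0.25 is illustrative
        print(criterion, D / p)      # expected nodes: 1000.0 and 2000.0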

5.3 Results

Table 1 shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 18 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e., required no more than 385 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e., D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values, the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table 1. Performance and cost for population sizes K from 1 to 19: proportions of H's within 100%, 50%, 25%, and 10% of optimal D; average, best, and population performance D̄, D_b, D_p; estimated costs of accuracy within 100% and within 10% (in thousands of nodes developed); and expected error in population performance at constant cost. The body of the table is not legible in this copy.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table 1. The sixth column gives the average D̄ for an H in the sample (of 38). The seventh column gives D_b for the best H_b in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br81]. The eighth column, however, is unusual: it indicates the population performance D_p, resulting when all regions from every set in the population are used together in a regression to determine H_p. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re83c]; cf. [Br81, Ho75]).

Several trends are apparent in Table 1. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, average, best, and population performance measures D̄, D_b, and D_p (columns 6-8) also improve with K. Perhaps most important is that the population performance D_p is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for H_p) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as H_p (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance D_p relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and traditional optimization. In [Re81] PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons, we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First, let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·f. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here the number of nodes developed, D) for various values of b. RSF utilizes just a single quantity (i.e., D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa63, Sa67], uncovers knowledge about every feature from every node (see §4.3). PLS1, then, might be expected to be more efficient than RSF. Experiments verify this [Re81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as √(KN), like RSF.
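In symbols (the notation a(·) for accuracy is mine, not the paper's), this first-approximation scaling argument reads:

    % RSF after N problems, versus PLS2 with K region sets of N problems each:
    \[
      a_{\mathrm{RSF}}(N) \propto \sqrt{N}, \qquad
      a_{\mathrm{PLS2}}(K, N) \propto \sqrt{K}\,\sqrt{N} = \sqrt{KN},
    \]
    % i.e., square-root growth in the total number of training problems.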

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect ('unfit') information. While it is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small.


Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4), and should exhibit accuracy improvement better than √N. Table 1 suggests this.

PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones. Accurate cumulative sets have a longer lifetime.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement, using piecewise linearity with little additional cost [Re83b]. PLS2 is more robust than PLS1. While the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy, all at once. PLS2 buffers inadequacies inherent in PLS1. Moreover, PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re83c]. This is very costly with RSF, and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re83a, Re83d]. Given a set of user-defined features, and some measure of the utility (e.g., probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm, having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency, or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re85a], but PLS2 does even better. Given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re83b].


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, ability to discover global optima, and efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g., credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g., full but controlled use of each datum).

REFERENCES

[An73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.

[Co84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Trans. on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.

[Di81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-801, 1981.

[Di82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-813, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R. and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.

[Ho75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int'l Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.

[Ho83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the Bacon system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology (as yet unpublished manuscript), Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.

[Re83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm80] Smith, S.F., A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A: GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An73]. (See 'induction'.) Improvements to basic clustering techniques generally use more than just the features of a datum ([An73, p.194] suggests 'external criteria'). External criteria in [Mic83b, Re76, Re83a, Re85b] involve prior specification of the forms clusters may take (this has been called 'conceptual clustering' [Mic83b]). Criteria in [Re76, Re83a, Re85b] are based on the data environment (see 'utility' below).¹⁵ This paper uses clustering to create economical, compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g., 'center control' or 'mobility' in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual of a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare 'learning system'), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See 'clustering' and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu78, Di81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g., PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re83d], and discovering a useful subset of features (cf. [Sa63]). In this paper the control structure H is a linear evaluation function [Ni80], and the rules are feature weights for H.

15 A new learnin~ system IRe SSc) introduces highermiddotdimensional clusterinr ror creation or structure

13

tion INiSOI and the rules are feature weights for H Search for accurate regions replaces direct search of rule space ie regions mediate data and H AB explained in sect3 sets of regions middotbecome compressed GA genotypes See also genetic algorithms middotPLS bull region and Fig I

Merit Also called paloff or fitness this is the measure used by a genetic algorithm to select parent genotypes lor prelerential reproducshytion of succesduJ individuals Compare middotutilitymiddot also see genetic algorithms

Object Objects are any data to be genshyeralized into categories Relationships usually depend on task domain See utility

PLS The pro6GiiliBtic learnin BIstem can learn what are sometimes called single conshycepts IDi82] but PLS is capable of much more difficult tasks involving noise management incremental learning and normaliution 01 biased data PLSt uniquely discovered locally optimal heuristics in search IRe 83al and PLS2 is the effective and efficient extension examined in this paper PLS manipulates middotregions (see below) using various inductive operations described in sect4

RegIon or Cell Depending on ones viewpoint the region is PLSs basic structure for clustering or for the genetic algorithm The region is a compressed representation 01 a utility sur race in augmented feature space it is also a compressed genotype representing a utility luncshytion to be optimized AB explained in IRe83d) the region representation is fully expressive proshyviding the features are See sect3 and Figs3amp4

Utll1ty u This is any measure of the useshylulness 01 an object in the perlormance or some task The utilitl provides a lid between the task domain and PLS generalization algorithms Utility can be a probability as in Fig 2 Comshypare merit See sect I 3

APPENDIX B PLSI LIMITATIONS

PLSI alone is inherently limited The problems relate to modi6cation of the main knowledge structure the cumulative region set R = (rue) As mentioned in sect41 R undergoes two basic alterations PLSI gradually changes the meanin of an established feature space recshytangle r by updating ita associated utility u (along with us error e) PLSI also incrementally

refines tbe feature space as rectangles r are conshytinually split

Both or these modifications (utility revision and region refinement) are largely directed by search data but the degree to which newer inrorshymation affects R depends on various choices or system parameters IRe 83al System parameters influence estimates of the error e and determine the degree of region re6nement These in turn affect the relative importance of new versus estashyblished knowledge

Consequently values or these parameters influence task performance For example there is a tradeol between utility revision and region refinement It rrampions are refined too quickly accuracy sulers (this is theoretically predictable) It instead utility revision predominates regions become inert (their estimated errors decline) but sometimes incorrectly18

There are several other problems including difficulties in training limitation in the utility revision algorithm and inaccurate estimation of various errors As a result utility estimations are imperfect and biased in unknown ways

Together the above uncertainties and senshysitivities explain the failure of PLSI always to locate an optimum with static training (Table f) The net effect is that PLSI works rairly well with no parameter tuning and unsophisticated trainshying and close to optimal with mild tuning and informed training (CoS4 as long as the features are well behaved

By nature however PLSI requires reatures exhibiting no worse than mild interactions This is a serious restriction since feature nonlinearity is prevalent On its own then PLSt is inherently limited There is simply no wall to learn utility accurately unless the elects of dilering heuristic runctions H are compared as in PLS2

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development implementashytion and testing of this system I appreciate the helpful suggestions from Chris Matheus Mike Mauldin and the Conference Reviewers

1amp Althourh lI7stem parameters are given by domain-independent statistieal analysis tunine these parameters nevertheless improves performanee in some eases (This is not required in PLS2)

14

Another successful approach to adaptation is genetic algorithms (GAs). Aside from their ability to discover global optima, GAs have several other important characteristics, including stability, efficiency, flexibility, and extensibility [Ho75, Ho81]. While the full behavior of genetic algorithms is not yet known in detail, certain characteristics have been established, and this approach compares very favorably with other methods of optimization [Be80, Br81, De80]. Because of their performance and potential, GAs have been applied to various AI learning tasks [Re83c, Sm80, Sm83].

In [Re83c] a combination of the above two approaches was described: the doubly layered learning system PLS2 (see Fig. 1).² PLS1, the lower level of PLS2, could be considered perceptual: it compresses goal-oriented information (task utility) into a generalized, economical, and useful form (regions; see Figs. 2, 4). The upper layer is genetic: a competition of parallel knowledge structures. In [Re83c] each of these components was argued to improve efficacy and efficiency.³

This paper extends and substantiates these claims, conceptually and empirically. The next section gives an example of a genetic algorithm which is oriented toward the current context. Section 3 describes the knowledge structure (regions) from two points of view: PLS1 and PLS2. Section 4 examines the synthesis of these two systems and considers some reasons for their efficiency. Sections 5 and 6 present and analyze the experimental results, which show the system to be stable, accurate, and efficient. The paper closes with a brief summary and a glossary of terms used in machine learning and genetic systems.

2. For the reader unfamiliar with learning system and other terminology, Appendix A provides brief explanations.

3. PLS2 is applicable to any domain for which features and 'usefulness' or utility of objects can be defined [Re83d]. An object can represent a physical entity or an operator over the set of entities. Domains can be simple (e.g., 'single concept' learning) or complex (e.g., expert systems). State-space problems and games have been tested in [Re83a, Re83d]. The PLS approach is uniform, and can be deterministic or probabilistic. The only real difficulty with a new domain is in constructing features which bear a smooth relationship to the utility (the system can evaluate and screen features presented to it).

2. GENETIC SYSTEMS: AN EXAMPLE

This section describes a simple GA, to introduce terminology and concepts, and to provide a basis for comparison with the more complex PLS2. The reader already familiar with GAs may wish to omit all but the last part of this section.

2.1 Optimization

Many problems can be regarded as function optimization. In an AI application this may mean discovery of a good control structure for executing some task. The function to be optimized is then some measure of task success, which we may call the performance μ. In the terminology of optimization, μ is the objective function. In the context of genetic systems, μ is the fitness, payoff, or merit.⁴

The merit μ depends on some control structure, the simplest example of which is a vector of weights b = (b1, b2, ..., bn). Frequently the analytic form of μ(b) is not known, so exact methods cannot be used to optimize it (this is the case with most AI problems). But what often is available (at some cost) is the value of μ for a given control structure. In our example, let us suppose that μ can be obtained for any desired value of b by testing system performance. If μ is a well behaved, smooth function of b, and if there is just one peak in the μ surface, then this local optimum is also a global optimum, which can be efficiently discovered using hill climbing techniques. However, the behavior of μ is often unknown, and may have numerous optima; in these cases a genetic adaptive algorithm is appropriate.

2.2 Genetic Algorithm

In a GA, a structure of interest, such as a weight vector b, is called a phenotype. Fig. 3 shows a simple example with just two weights b1 and b2. The phenotype is normally coded as a string of digits (usually bits) called the genotype B. A single digit is a gene; gene values are alleles. The position of a gene within the genotype is given by an index called the locus. Depending on the resolution desired, we might choose a greater or lesser number of sequential genes to code each bi. If we consider 5 bits to be

4. μ might also be called the 'utility', but we reserve this term for another kind of quality measure used by PLS1.

sufficient, the length of the genotype B will be L = 5n bits (see Fig. 3).

Instead of searching weight space directly for an optimal vector b, a GA searches gene space, which has dimensionality L (gene space is Hamming space if alleles are binary). A GA conducts this search in parallel, using a set of individual genotypes called a population or gene pool. By comparing the relative merits μ of individuals in a population, and by mating only the better individuals, a GA performs an informed search of gene space. This search is conducted iteratively, over repeated generations. In each new generation there are three basic operations performed: (1) selection of parents, (2) generation of offspring, and (3) replacement of individuals. (1) and (2) have been given more attention. Parent selection is usually stochastic, weighted in favor of individuals having higher μ values. Offspring generation relies on genetic operators which modify parent genotypes. Two natural examples are mutation (which alters one allele) and crossover (which slices two genotypes at a common locus and exchanges segments; see Fig. 3).
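To make the generational loop concrete, here is a minimal sketch (Python; all names are illustrative, merit evaluation is abstracted as merit_fn, and weights are assumed scaled to [0, 1]; this is not reconstructed PLS code):

    import random

    BITS = 5  # genes (bits) coding each weight, as in the running example

    def encode(b):
        """Code a phenotype (weight vector b) as a genotype B of L = 5n bits."""
        return "".join(format(round(w * (2**BITS - 1)), "05b") for w in b)

    def decode(B):
        """Recover the phenotype (weight vector) from the genotype."""
        return [int(B[i:i + BITS], 2) / (2**BITS - 1)
                for i in range(0, len(B), BITS)]

    def crossover(B1, B2):
        """Slice two genotypes at a common locus and exchange segments."""
        locus = random.randrange(1, len(B1))
        return B1[:locus] + B2[locus:], B2[:locus] + B1[locus:]

    def mutate(B, rate=0.01):
        """Alter single alleles at random."""
        return "".join(str(1 - int(c)) if random.random() < rate else c
                       for c in B)

    def generation(population, merit_fn):
        """One generation: (1) select parents in proportion to merit,
        (2) generate offspring by crossover and mutation, (3) replace."""
        merits = [merit_fn(decode(B)) for B in population]
        offspring = []
        while len(offspring) < len(population):
            p1, p2 = random.choices(population, weights=merits, k=2)
            c1, c2 = crossover(p1, p2)
            offspring += [mutate(c1), mutate(c2)]
        return offspring[:len(population)]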

Figure 3. Simple genetic system. The upper part of this diagram shows a small population of just seven individuals. Here the set of characteristics (the phenotype) is a simple two-element vector b. This is coded by the genotype B. Each individual is associated with its measured merit μ. On the basis of their μ values, pairs of individuals are stochastically chosen as parents for genetic recombination. Their genotypes are modified by crossover to produce two new offspring. (The population and offspring entries of the original figure are not legibly recoverable from this copy.)

Because the more successful parents are selected for mating, and because limited operations are performed on them to produce offspring, the effect is a combination of knowledge retention and controlled search. Holland proved that, using binary alleles, the crossover operator, and parent selection proportional to μ, a GA is K³ times more efficient than exhaustive search of gene space, where K is the population size [Ho75, Ho81]. Several empirical studies have verified the computational efficiency of GAs compared with alternative procedures for global optimization, and have discovered interesting properties of GAs, such as effects of varying K. For example, populations smaller than 50 can cause problems [Br81, De80].

2.3 Application in Heuristic Search

One AI use is search for solutions to problems or for wins in games [Ni80]. Here we wish to learn an evaluation function H as a combination of variables x1, x2, ..., xn called attributes or features (features are often used to describe states in search). In the simplest case H is expressed as the linear combination b1x1 + b2x2 + ... + bnxn = b·x, where the bi are weights to be learned. We want to optimize the weight vector b according to some measure μ of the performance when H is used to control search.

A rational way to define μ (which we shall use throughout this paper) is related to the average number D of states or nodes developed in solving a set of problems. Suppose Di is observed for a population of K heuristic functions defined by weight vectors bi. Since the performance improves with lower values of D, a good definition of the merit of Hi (i.e., of bi) is the relative performance measure μi = D̄/Di, where D̄ is the average over the population, i.e., D̄ = Σj Dj / K. This expression of merit could be used to assess genotypes Bi representing weight vectors bi, as depicted in Fig. 3.
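In code, this merit assignment is a short calculation over the observed node counts (a sketch; the names are ours):

    def relative_merits(D):
        """Merit of each heuristic: mu_i = Dbar / D_i, where D_i is the
        average number of nodes developed under H_i and Dbar is the
        population mean; developing fewer nodes earns higher merit."""
        Dbar = sum(D) / len(D)
        return [Dbar / Di for Di in D]

    relative_merits([500, 250, 1000])  # -> [1.17, 2.33, 0.58] (rounded)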

Instead of this simple genetic approach, however, PLS2 employs unusual genotypes and operators, some of which relate to PLS1. In the remaining sections of this paper we shall examine the advantages of the GA resulting from the combination of PLS1 with PLS2.

5. Notice that search takes place both at the level of the task domain (for good problem solutions) and at the level of the learning element (for a good control structure H).


3. PLS INFORMATION STRUCTURING: DUAL VIEWPOINT

The connection between PLS1 perceptual learning and PLS2 genetic adaptation is subtle and indirect. Basically, PLS1 deals with objects x (which can be just about anything) and their relationships to task performance. Let us call the usefulness of an object x in some task domain its utility u(x).

Since the number of objects is typically immense, even vast observation is incomplete, and generalization is required for prediction of u given a previously unencountered x. A significant step in generalization is usually the expression of x as a vector of high-level, abstract features x1, x2, ..., xn, so that x really represents not just one object but rather a large number of similar objects (e.g., in a board game x might be a vector of features such as piece advantage, center control, etc.). A further step in generalization is to classify or categorize x's which are similar for current purposes.⁶ Since the purpose is to succeed well in a task, PLS1 classifies x's having similar utilities u.

Class formation can be accomplished in several ways, depending on the model assumed. If the task domain and features permit, objects having similar utilities may be clustered in feature space, as illustrated in Figs. 2 and 4, giving a 'region set' R.⁷ Another model is the linear combination H = b·x of §2.

It is at this point that a GA like PLS2 can aid the learning process. Well-performing b's or R's may be selected according to their merit μ. Note that merit μ is an overall measure of the task performance, while utility u is a quality measure localized to individual objects.

The question now is what information structures to choose for representing knowledge about task utility. For many reasons, PLS incorporates the 'region set' (Fig. 4), which represents domain knowledge by associating an object with its utility. We examine the region set from two

6. Here 'to classify' means to form classes, categories, or concepts. This is difficult to automate.

7. PLS1 initiated what has become known as 'conceptual clustering', where not just feature values are considered, but also predetermined forms of classes (e.g., rectangles) and the whole data environment (e.g., utility). See [Re76, Re77, Re83a, Re85a, Re85b] and also Appendix A.

points of view: as a PLS1 knowledge structure, and as a PLS2 genetic structure.

Figure 4. Dual interpretation of a region set R. A region set is a partition of feature space (here there are six regions). Points are clustered into regions according to their utility u in some task domain (e.g., u = probability of contributing to task success; see Fig. 2). Here the u values are shown inside the rectangles. A region R is the triple (r, u, e), where e is the error in u. The region set R = {R} serves both as the PLS1 knowledge structure and as the PLS2 genotype. In PLS1, R is a discrete (step) function expressing variation of utility u with features xi. In PLS2, R is a compressed version of the detailed genotype illustrated in Fig. 5.

3.1 The Region as PLS1 Knowledge Structure

In a feature space representation,⁸ an object⁹ is a vector x = (x1, x2, ..., xn). In a problem or game the basic object is the state, frequently expressed as a vector of features such as piece advantage, center control, mobility, etc. Observations made during the course of even many problems or games normally cover just a fraction of feature space, and generalization is required for prediction.

In generalization learning, objects are abstracted to form classes, categories, or concepts. This may take the form of a partition of feature space, i.e., a set of mutually exclusive and exhaustive local neighborhoods called clusters or cells [An73, Di82]. Since the goal of clustering in PLS is to aid task performance, the basis for generalization is some measure of the worth, quality, or

8. Feature spaces are sometimes avoided because they cannot easily express structure. However, alternative representations, as normally used, are also deficient for realistic generalization learning. A new scheme mechanizes a very difficult inductive problem, feature formation [Re83d, Re85c].

9. The object or event could just as well be an operator to be applied to a state, or a state-operator pair. See [Re83d].

utility of a state or cell relative to the task. One measure of utility is the probability of contributing to a solution or win. In Figs. 2 and 4, probability classes are rectangular cells (for economy). The leftmost rectangle r has probability u = 0.20.¹⁰ The rectangle r is a category generalizing the conditions under which the utility u applies.

In PLS a rectangle is associated not just with its utility u, but also with the utility error e. This expression e of uncertainty in u allows quantification of the effect of noise, and provides an informed and concise means for weighting various contributions to the value of u during learning. The triple R = (r, u, e), called a region, is the main knowledge structure for PLS1. A set R = {R} of regions defines a partition in augmented feature space.

R may be used directly as a (discrete) evaluation or heuristic function H(x) = u(r) to assess state x ∈ r in search. For example, in Fig. 4 there are six regions, which differentiate states into six utility classes. Instead of forming a discrete heuristic, R may be used indirectly, as data for determining the weight vector b in a smooth evaluation function H = b·x (employing curve fitting techniques). We shall return to these algorithmic aspects of PLS in §4.
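Direct use of R is then just a lookup over rectangles. A sketch (Python; the region representation, and the error value in the example, are illustrative only):

    def H_discrete(x, regions):
        """Discrete heuristic H(x) = u(r) for the rectangle r containing
        state x; each region is (bounds, u, e), with one (lo, hi) bound
        per feature, and the regions partition feature space."""
        for bounds, u, e in regions:
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, bounds)):
                return u
        raise ValueError("state lies outside the partition")

    # Leftmost rectangle of Fig. 4 (cf. footnote 10): u = 0.20 on the
    # cell 0 <= x1 <= 4, 0 <= x2 <= 2; e = 0.05 is assumed here.
    regions = [([(0, 4), (0, 2)], 0.20, 0.05)]
    H_discrete([2, 1], regions)  # -> 0.20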

Figure 5. Definition of maximally detailed genotype U. If the number of points in feature space is finite, and a value of the utility is associated with each point, complete information can be captured in a detailed genotype U of concatenated utilities u1 u2 ... uL. Coordinates could be linearly ordered, as shown here for the two-dimensional case. U is the fully expanded genotype corresponding to the compressed version of Fig. 4.

10. This could be expressed in other ways. The production rule form is r → u. Using logic, r is represented (0 ≤ x1 ≤ 4) ∧ (0 ≤ x2 ≤ 2).

3.2 The Region Set as Compressed and Unrestricted PLS2 Genotype

Now let us examine these information structures from the genetic viewpoint. The weight vector b or evaluation function H could be considered a GA phenotype. What might the genotype be? One choice, a simple one, was described in §2 and illustrated in Fig. 3: here the genotype B is just a binary coding of b. A different possibility is one that captures exhaustive information about the relationship between utility u and feature vector x (see Fig. 5). In this case the gene would be (x, u). If the number of genes is finite, they can be indexed and concatenated to give a very detailed genotype U, which becomes a string of values u1 u2 ... uL coding the entire utility surface in augmented feature space.

This genotype U is unusual in some important ways. Let us compare it with the earlier example B of §2 (Fig. 3). B is simply a binary form of weight vector b. One obvious difference between B and U is that U is more verbose than B. This redundancy aspect will be considered shortly. The other important difference between B and U is that alleles within B may well interact (to express feature nonlinearity), but alleles uj within U cannot interact (since the uj express an absolute property of feature vector x, i.e., its utility for some task). As explained in the next section, this freedom from gene interdependence permits localization of credit.¹¹

The detailed genotype U codes the utility surface, which may be very irregular at worst, or very smooth at best. This surface may be locally well behaved (it may vary slowly in some volumes of feature space). In cases of local regularity, portions of U are redundant. As shown in Fig. 5, PLS2 compresses the genotype U into the region set R (examined in §3.1 from the PLS1 viewpoint). In PLS2 a single region R = (r, u, e) is a set of genes, the whole having just one allele u (we disregard the genetic coding of e). Unlike standard genotypes, which have a stationary locus for each gene and a fixed number of genes, the region set has no explicit loci, but rather a

11. While one of the strengths of a GA is its ability to manage interaction of variables (by 'co-adapting' alleles), PLS2 achieves efficient and concise knowledge representation and acquisition by flexible gene compression and by certain other methods examined later in this paper.


variable number of elements (regions), each representing a variable number of genes. A region compresses gene sets having similar utility, according to current knowledge.
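In one dimension, this compression can be sketched as merging runs of adjacent cells whose utilities agree within a tolerance (a deliberate simplification: actual PLS regions are feature space rectangles produced by clustering, not linear runs):

    def compress(U, tol=0.05):
        """Compress a detailed genotype U = (u1, ..., uL) into regions:
        each region ((first, last), u) is a set of genes sharing the
        single allele u (the mean utility over its cells)."""
        regions, start = [], 0
        for i in range(1, len(U) + 1):
            if i == len(U) or abs(U[i] - U[start]) > tol:
                run = U[start:i]
                regions.append(((start, i - 1), sum(run) / len(run)))
                start = i
        return regions

    compress([0.20, 0.21, 0.20, 0.50, 0.52, 0.90])
    # -> [((0, 2), 0.2033...), ((3, 4), 0.51), ((5, 5), 0.90)]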

4. KNOWLEDGE ACQUISITION: SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to provide barely adequate information about the utility surface. This compact representation results in economy of both space and time, and in effective learning. Some reasons for this power are considered.

The ultimate purpose of PLS is to discover utility classes in the form of a region set R. This knowledge structure controls the primary task; for example, in heuristic search, R = {R} = {(r, u, e)} defines a discrete evaluation function H(r) = u.

The ideal R would be perfectly accurate and maximally compressed. Accuracy of utility u determines the quality of task performance. Appropriate compression of R characterizes the task domain concisely but adequately (see Figs. 4 and 5), saving storage and time, both during task performance and during learning.

These goals of accuracy and economy are approached by the doubly layered learning system PLS2 (Fig. 1). PLS1 and PLS2 combine to become effective rules for generalization (induction), specialization (differentiation), and reorganization. The two layers support each other in various ways: for example, PLS2 stabilizes the perceptual system PLS1, and PLS1 maintains genotype diversity of the genetic system PLS2. In the following we consider details, first from the standpoint of PLS1, then from the perspective of PLS2.

4.1 PLS1: Revision and Differentiation

Even without a genetic component, PLS1 is a flexible learning system which can be employed in noisy domains requiring incremental learning. It can be used for simple concept learning like the systems in [Di82], but most experiments have involved state-space problem solving and game playing.¹² Here we examine PLS in the context of

12. These experiments have led to unique results, such as discovery of locally optimal evaluation functions (see [Re83a, Re83d]).

these difficult tasks

As described in §3.1, the main PLS1 knowledge structure is the region set R = {(r, u, e)}. Intermediate between basic data obtained during search and a general heuristic used to control search, R defines a feature space augmented and partitioned by u and e. Because R is improved incrementally, it is called the cumulative region set. PLS1 repeatedly performs two basic operations on R. One operation is correction or revision (of utility u and error e), and the other is specialization, differentiation, or refinement (of feature space cells r). These operators are detailed in [Re83a, Re83d]; here we simply outline their effects and note their limitations.

Revision of u and e. For an established region R = (r, u, e) ∈ R, PLS1 is able to modify u and to decrease e by using new data. This is accomplished in a rough fashion, by comparing established values within all rectangles r with fresh values within the same r. It is difficult or impossible to learn the 'true' values of u, since data are acquired during performance of hard tasks, and these data are biased in unknown ways because of nontrivial search.
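One standard form such a revision could take, shown purely as an assumed sketch (the actual PLS1 update in [Re83a] differs in detail), is to pool the established and fresh estimates in inverse proportion to their squared errors:

    def revise(u_old, e_old, u_new, e_new):
        """Blend the established utility of a rectangle r with a fresh
        estimate, weighting each inversely by its squared error; the
        pooled error e decreases as evidence accumulates."""
        w_old, w_new = 1 / e_old**2, 1 / e_new**2
        u = (w_old * u_old + w_new * u_new) / (w_old + w_new)
        e = (w_old + w_new) ** -0.5
        return u, e

    revise(0.20, 0.05, 0.30, 0.10)  # -> (0.22, 0.0447...)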

Refinement of R. Alternately performing then learning, PLS1 acquires more and more detail about the nature of variation of utility u with features. This information accumulates in the region set R = {(r, u, e)}, where the primary effect of clustering is increasing resolution of R. The number, sizes, and shapes of rectangles in R reflect current knowledge resolution. As this differentiation continues in successive iterations of PLS1, attention focuses on more useful parts of feature space, and heuristic power improves.

Unfortunately, so does the likelihood of error. Further, errors are difficult to quantify and hard to localize to individual regions.

In brief, while the incremental learning of PLS1 is powerful enough to learn locally optimal heuristics under certain conditions, and while PLS1 feedback is good enough to control and correct mild errors, the feedback can become unstable in unfavorable situations: instead of being corrected, errors can become more pronounced. Moreover, PLS1 is sensitive to parameter settings (see Appendix B). The system needs support.

4.2 PLS2: Genetic Operators

Qualities missing in PLS1 can be provided by PLS2. As §4.1 concluded, PLS1, with its single region set, cannot discover accurate values of utilities u. PLS2, however, maintains an entire population of region sets, which means that several regions in all cover any given feature space volume. The availability of comparable regions ultimately permits greater accuracy in u, and brings other benefits.

As §3.2 explained, a PLS2 genotype is the region set R = {R}, and each region R = (r, u, e) ∈ R is a compressed gene whose allele is the utility u. Details of an early version of PLS2 are given in [Re83c]. Those algorithms have been improved: the time complexity of the operators in recent program implementations is linear with population size K. The following discussion outlines the overall effects and properties of the various genetic operators (compare to the more usual GA of §2).

K-sexual mating is the operator analogous to crossover. Consider a population of K different region sets R. Each set is composed of a number of regions R, which together cover feature space. A new region set R' is formed by selecting individual regions (one at a time) from parents R, with probability proportional to merit μ (merit is the performance of R, defined at the end of §2). Selection of regions from the whole population of region sets continues until the feature space cover is approximately the average cover of the parents. This creates the offspring region set R', which is generally not a partition.
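A sketch of this operator (Python; regions are represented here as (area, u, e) tuples, where area is the region's feature space cover, and merit is taken per parent set, i.e., not yet localized to regions):

    import random

    def k_sexual_mating(parents, merits):
        """Build an offspring region set by drawing regions one at a
        time from the whole population, with probability proportional
        to the merit of the parent set each region comes from, until
        the offspring's cover reaches the parents' average cover."""
        target = (sum(sum(r[0] for r in R) for R in parents)
                  / len(parents))
        pool = [(r, m) for R, m in zip(parents, merits) for r in R]
        regions = [r for r, m in pool]
        weights = [m for r, m in pool]
        offspring, cover = [], 0.0
        while cover < target:
            r = random.choices(regions, weights=weights, k=1)[0]
            offspring.append(r)
            cover += r[0]
        return offspring  # generally not a partition; repartitioned next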

Gene reorganization. For economy of storage and time, the offspring region set R' is repartitioned so that regions do not overlap in feature space.

Controlled mutation. Standard mutation operators alter an allele randomly. In contrast, the PLS2 operator analogous to mutation changes an allele according to evidence arising in the task domain. The controlled mutation operator for a region set R = {(r, u, e)} is the utility revision operator of PLS1. As described in §4.1, PLS1 modifies the utility u for each feature space cell r.

Genotype expansion. This operator is also provided by PLS1. Recall the discussion of §3.2 about the economy resulting from compressing genes (utility-feature vectors) into a region set R. The refinement operator was described in §4.1. This feature space refinement amounts to an expansion of the genotype R, and is carried out when data warrant the discrimination.

Both controlled mutation and genotype expansion promote genotype diversity. Thus PLS1 helps PLS2 to avoid premature convergence, a typical GA problem [Br81, Ma84].

4.3 Effectiveness and Efficiency

The power of PLS2 may be traced to certain aspects of the perceptual and genetic algorithms just outlined. Some existing and emerging principles of effective and efficient learning are briefly discussed below (see also [Re85a, Re85b, Re85c]).

Credit localization. The selection of regions for K-sexual mating may use a single merit value μ for each region R within a given set R. However, the value of μ can just as well be localized to single regions within R, by comparing R with similar regions in other sets. Since regions estimate an absolute quantity (task-related utility) in their own volume of feature space, they are independent of each other. Thus credit and blame may be assigned to feature space cells (i.e., to gene sequences).

Assignment of credit to individual regions within a cumulative set R is straightforward, but it would be difficult to do directly in the final evaluation function H, since the components of H, while appropriate for performance, omit information relevant to learning (compare Figs. 2 and 4).¹³

Knowledge mediation. Successful systems tend to employ information structures which mediate data objects and the ultimate knowledge form. These mediating structures include means to record growing assurance of tentative hypotheses.

When used in heuristic search, the PLS region set mediates large numbers of states and a

13. There are various possibilities for the evaluation function H, but all contain less useful information than their determinant, the region set R. The simplest heuristic, used in [Re83a, Re83d], is H = b·f, where b is a vector of weights for the feature vector f. (This linear combination is used exclusively in experiments to be described in §5.) The value of b is found using regions as data in linear regression [Re83a, Re83b].


very concise evaluation function H. Retention and continual improvement of this mediating structure refines the credit assignment problem. This view is unlike that of [Di81, p.14; Di82]: learning systems often attempt to improve the control structure itself, whereas PLS acquires knowledge efficiently in an appropriate structure, and utilizes this knowledge by compressing it only temporarily for performance. In other words, PLS does not directly search rule space for a good H, but rather searches for good cumulative regions from which H is constructed.
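That temporary compression, fitting H = b·f from the cumulative regions, can be sketched as a weighted regression (assuming numpy; weighting each region by 1/e is our illustrative choice, suggested by, but not quoted from, [Re83a]):

    import numpy as np

    def fit_weights(centers, utilities, errors):
        """Fit b in H = b . f by least squares, one observation per
        region: the rectangle's center f, its utility u, and a weight
        1/e so that confident regions count more."""
        F = np.asarray(centers, dtype=float)   # one row per region
        u = np.asarray(utilities, dtype=float)
        w = np.sqrt(1.0 / np.asarray(errors, dtype=float))
        b, *_ = np.linalg.lstsq(F * w[:, None], u * w, rcond=None)
        return b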

Full but controlled use of every datum. Samuel's checker player permitted each state encountered to influence the heuristic H, and at the same time no one datum could overwhelm the system. The learning was stochastic, both conservative and economic. In this respect PLS2 is similar (although more automated).

Schemata in learning systems and genetic algorithms. A related efficiency in both Samuel's systems and PLS is like the schemata concept in a GA. In a GA, a single individual, coded as a genotype (a string of digits), supports not only itself but also all its substrings. Similarly, a single state arising in heuristic search contains information about every feature used to describe it. Thus each state can be used to appraise and weight each feature. (The effect is more pronounced when a state is described in more elementary terms and combinations of primitive descriptors are assessed; see [Re85c].)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re83a]. The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e., the performance of a whole region set, without localization of credit to individual regions. This measure μ was the average number of nodes developed D in a training sample of 8 fifteen puzzles, divided into the mean of all such averages in a population of K sets, i.e., μ = D̄/D, where D̄ is the average over the population (D̄ = Σj Dj / K).

Changing environment. For these experiments, successive rounds of training were repeated in incremental learning over several iterations or generations. The environment was altered in successive generations: it was specified as problem difficulty or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ..., dg).

Here d was static, one known to be a good progression based on previous experience with user training [Co84].¹⁴ In these experiments d was always (8, 14, 22, 30, ∞). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1. In this case static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lack is

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning are generated by the system. Each node can be considered a separate example for concept learning [Re83d]. [Co84] describes experiments with an automated trainer.


detected and problem difficulty is reduced (i.e., dynamic training is needed).

Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their μ's are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.
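Explicitly, with q = 1/2 and K = 7, the bound gives p ≤ q^K = (1/2)^7 = 1/128 ≈ 0.008, which is below 0.01 as claimed.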

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures, but the probability of finding solutions is only ½, then the average cost of useful information is 500/½ = 1000.

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?

5.3 Results

Table I shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that, of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e., required no more than 386 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e., D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table I: learning accuracy and cost versus population size K. The entries of this table are not legibly recoverable from this copy. As described in the text, rows correspond to population sizes K from 1 to 19; the column groups give (i) the proportion of H's within 100%, 50%, 25%, and 10% of optimal D, (ii) average, best, and population performance (D̄, Db, Dp), (iii) expected costs of accuracy within 100% and 10% of optimum, in thousands of nodes developed, and (iv) the expected error in Dp at constant cost.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table I. The sixth column gives the average D̄ for an H in the sample (of 38). The seventh column gives Db for the best Hb in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br81]. The eighth column, however, is unusual: it indicates the population performance Dp, resulting when all regions from every set in the population are used together in a regression to determine Hp. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re83c]; cf. [Br81, Ho75]).

Several trends are apparent in Table I. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, average, best, and population performance measures D̄, Db, and Dp (columns 6-8) also improve with K. Perhaps most important is that the population performance Dp is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for Hp) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as Hp (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance Dp relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and traditional optimization. In [Re81]

PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons, we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·x. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here number of nodes developed D) for various values of b. RSF utilizes just a single quantity (i.e., D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa63, Sa67], uncovers knowledge about every feature from every node (see §4.3). PLS1 then might be expected to be more efficient than RSF. Experiments verify this [Re81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it too with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as √N, like RSF.

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While it is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement


will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4), and should exhibit accuracy improvement better than √N. Table I suggests this.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

PLS1 versus PLS2. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones. Accurate cumulative sets have a longer lifetime.

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement using piecewise linearity with little additional cost [Re83b]. PLS2 is more robust than PLS1. While the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy, all at once. PLS2 buffers inadequacies inherent in PLS1. Moreover, PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re83c]. This is very costly with RSF, and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re83a, Re83d]. Given a set of user-defined features and some measure of the utility (e.g., probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency, or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re85a], but PLS2 does even better. Given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re83b].


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, ability to discover global optima, and efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g., credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g., full but controlled use of each datum).

REFERENCES

[An73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978).

[Co84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Transactions on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.

[Di81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-891, 1981.

[Di82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-913, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R. and Feigenbaum, E.A. (Ed.), Kaufmann, 1982.

[Ho75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int'l Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.

[Ho83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the BACON system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Ed.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology (as yet unpublished manuscript), Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S. et al. (Ed.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S. et al. (Ed.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.

[Re83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines, (to appear) 1985.

[Re85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University, (to appear) 1985.

[Re85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence, (to appear) 1985.

[Sa63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Ed.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm80] Smith, S.F., A Learning System Based on Genetic Adaptive Algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A: GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An73]. (See 'induction'.) Improvements to basic clustering techniques generally use more than just the features of a datum ([An73, p.194] suggests 'external criteria'). External criteria in [Mic83b, Re76, Re83a, Re85b] involve prior specification of the forms clusters may take (this has been called 'conceptual clustering' [Mic83b]). Criteria in [Re76, Re83a, Re85b] are based on the data environment (see 'utility' below).¹⁵ This paper uses clustering to create economical, compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g., 'center control' or 'mobility' in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual of a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare 'learning system'), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See 'clustering' and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu78, Di81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g., PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re83d] and discovering a useful subset of features (cf. [Sa63]). In this paper the control structure H is a linear evaluation

15. A new learning system [Re85c] introduces higher-dimensional clustering for creation of structure.


function [Ni80], and the rules are feature weights for H. Search for accurate regions replaces direct search of rule space, i.e., regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also 'genetic algorithm', 'PLS', 'region', and Fig. 1.

Merit μ. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare 'utility'; also see 'genetic algorithm'.

Object. Objects are any data to be generalized into categories. Relationships usually depend on task domain. See 'utility'.

PLS. The probabilistic learning system can learn what are sometimes called 'single concepts' [Di82], but PLS is capable of much more difficult tasks involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates 'regions' (see below) using various inductive operations described in §4.

Region or cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 4 and 5.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare 'merit'. See §§1, 3.

APPENDIX B PLSI LIMITATIONS

PLSI alone is inherently limited The problems relate to modi6cation of the main knowledge structure the cumulative region set R = (rue) As mentioned in sect41 R undergoes two basic alterations PLSI gradually changes the meanin of an established feature space recshytangle r by updating ita associated utility u (along with us error e) PLSI also incrementally

refines tbe feature space as rectangles r are conshytinually split

Both or these modifications (utility revision and region refinement) are largely directed by search data but the degree to which newer inrorshymation affects R depends on various choices or system parameters IRe 83al System parameters influence estimates of the error e and determine the degree of region re6nement These in turn affect the relative importance of new versus estashyblished knowledge

Consequently values or these parameters influence task performance For example there is a tradeol between utility revision and region refinement It rrampions are refined too quickly accuracy sulers (this is theoretically predictable) It instead utility revision predominates regions become inert (their estimated errors decline) but sometimes incorrectly18

There are several other problems including difficulties in training limitation in the utility revision algorithm and inaccurate estimation of various errors As a result utility estimations are imperfect and biased in unknown ways

Together the above uncertainties and senshysitivities explain the failure of PLSI always to locate an optimum with static training (Table f) The net effect is that PLSI works rairly well with no parameter tuning and unsophisticated trainshying and close to optimal with mild tuning and informed training (CoS4 as long as the features are well behaved

By nature however PLSI requires reatures exhibiting no worse than mild interactions This is a serious restriction since feature nonlinearity is prevalent On its own then PLSt is inherently limited There is simply no wall to learn utility accurately unless the elects of dilering heuristic runctions H are compared as in PLS2

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development implementashytion and testing of this system I appreciate the helpful suggestions from Chris Matheus Mike Mauldin and the Conference Reviewers

1amp Althourh lI7stem parameters are given by domain-independent statistieal analysis tunine these parameters nevertheless improves performanee in some eases (This is not required in PLS2)

14

sufficient, the length of the genotype B will be L = 5n bits (see Fig. 3).

Instead of searching weight space directly for an optimal vector b, a GA searches gene space, which has dimensionality L (gene space is Hamming space if alleles are binary). A GA conducts this search in parallel, using a set of individual genotypes called a population or gene pool. By comparing the relative merits μ of individuals in a population, and by mating only the better individuals, a GA performs an informed search of gene space. This search is conducted iteratively, over repeated generations. In each new generation there are three basic operations performed: (1) selection of parents, (2) generation of offspring, and (3) replacement of individuals. (1) and (2) have been given more attention. Parent selection is usually stochastic, weighted in favor of individuals having higher merit values. Offspring generation relies on genetic operators which modify parent genotypes. Two natural examples are mutation (which alters one allele) and crossover (which slices two genotypes at a common locus and exchanges segments; see Fig. 3).

Figure 3. Simple genetic system. The upper part of this diagram shows a small population of just seven individuals. Here the set of characteristics (the phenotype) is a simple two-element vector b, coded by the genotype B. Each individual is associated with its measured merit μ. On the basis of their merit values, pairs of individuals are stochastically chosen as parents for genetic recombination; their genotypes are modified by crossover to produce two new offspring.
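To fix ideas, the following minimal sketch (in modern Python) shows one generation of the simple GA just described: merit-proportional parent selection, crossover, and mutation, with wholesale generational replacement standing in for operation (3). All names and parameter values are illustrative assumptions, not part of PLS2.

    import random

    def select_parent(population, merits):
        """Stochastic selection, weighted in favor of higher merit."""
        return random.choices(population, weights=merits, k=1)[0]

    def crossover(a, b):
        """Slice two genotypes at a common locus and exchange segments."""
        locus = random.randrange(1, len(a))
        return a[:locus] + b[locus:], b[:locus] + a[locus:]

    def mutate(genotype, rate=0.01):
        """Alter single alleles (bits) with small probability."""
        return [g if random.random() > rate else 1 - g for g in genotype]

    def next_generation(population, merits):
        """One GA generation: select parents, produce offspring,
        replace the whole population (a simple replacement policy)."""
        offspring = []
        while len(offspring) < len(population):
            p1 = select_parent(population, merits)
            p2 = select_parent(population, merits)
            c1, c2 = crossover(p1, p2)
            offspring += [mutate(c1), mutate(c2)]
        return offspring[:len(population)]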

Because the more successful parents are selected for mating, and because limited operations are performed on them to produce offspring, the effect is a combination of knowledge retention and controlled search. Holland proved that, using binary alleles, the crossover operator, and parent selection proportional to merit, a GA is K³ times more efficient than exhaustive search of gene space, where K is the population size [Ho 75, Ho 81]. Several empirical studies have verified the computational efficiency of GAs compared with alternative procedures for global optimization, and have discovered interesting properties of GAs, such as effects of varying K. For example, populations smaller than 50 can cause problems [Br 81, De 80].

2.3 Application in Heuristic Search

One AI use is search: for solutions to problems, or for wins in games [Ni 80]. Here we wish to learn an evaluation function H as a combination of variables x1, x2, ..., xn, called attributes or features (features are often used to describe states in search). In the simplest case, H is expressed as the linear combination b1x1 + b2x2 + ... + bnxn = b·x, where the bi are weights to be learned. We want to optimize the weight vector b according to some measure of the performance when H is used to control search.

A rational way to define merit (which we shall use throughout this paper) is related to the average number D of states or nodes developed in solving a set of problems. Suppose D is observed for a population of K heuristic functions defined by weight vectors bi. Since the performance improves with lower values of D, a good definition of the merit of Hi (i.e., of bi) is the relative performance measure μi = D̄ / Di, where D̄ is the average over the population, i.e., D̄ = (Σj Dj) / K. This expression of merit could be used to assess genotypes B representing weight vectors bi, as depicted in Fig. 3.
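As a worked illustration of this definition (a sketch only; the numbers are invented):

    def merits(D):
        """Relative performance merit_i = Dbar / D_i for each heuristic,
        where D_i is the average number of nodes developed under H_i
        and Dbar is the mean of the D_i over the population."""
        Dbar = sum(D) / len(D)
        return [Dbar / Di for Di in D]

    # e.g. merits([400, 500, 1000]) -> [1.58..., 1.26..., 0.63...]
    # fewer nodes developed gives a merit above 1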

Instead of this simple genetic approach, however, PLS2 employs unusual genotypes and operators, some of which relate to PLS1. In the remaining sections of this paper we shall examine the advantages of the GA resulting from the combination of PLS1 with PLS2.

5. Notice that search takes place both at the level of the task domain (for good problem solutions) and at the level of the learning element (for a good control structure H).


3. PLS INFORMATION STRUCTURING: DUAL VIEWPOINT

The connection between PLS1 perceptual learning and PLS2 genetic adaptation is subtle and indirect. Basically, PLS1 deals with objects x (which can be just about anything) and their relationships to task performance. Let us call the usefulness of an object x in some task domain its utility u(x).

Since the number of objects is typically immense, even vast observation is incomplete, and generalization is required for prediction of u given a previously unencountered x. A significant step in generalization is usually the expression of x as a vector of high-level, abstract features x1, x2, ..., xn, so that x really represents not just one object, but rather a large number of similar objects (e.g., in a board game x might be a vector of features such as piece advantage, center control, etc.). A further step in generalization is to classify or categorize x's which are similar for current purposes.6 Since the purpose is to succeed well in a task, PLS1 classifies x's having similar utilities u.

Class formation can be accomplished in several ways, depending on the model assumed. If the task domain and features permit, objects having similar utilities may be clustered in feature space, as illustrated in Figs. 2 and 4, giving a "region set" R.7 Another model is the linear combination H = b·x of §2.

It is at this point that a GA like PLS2 can aid the learning process. Well-performing b's or R's may be selected according to their merit μ. Note that merit μ is an overall measure of the task performance, while utility u is a quality measure localized to individual objects.

The question now is what information structures to choose for representing knowledge about task utility. For many reasons, PLS incorporates the "region set" (Fig. 4), which represents domain knowledge by associating an object with its utility. We examine the region set from two points of view: as a PLS1 knowledge structure, and as a PLS2 genetic structure.

6. Here to classify means to form classes, categories, or concepts. This is difficult to automate.

7. PLS1 initiated what has become known as "conceptual clustering", where not just feature values are considered, but also predetermined forms of classes (e.g., rectangles) and the whole data environment (e.g., utility). See [Re 76, Re 77, Re 83a, Re 85a, Re 85b] and also Appendix A.

Figure 4. Dual interpretation of a region set R. A region set is a partition of feature space (here there are six regions). Points are clustered into regions according to their utility u in some task domain (e.g., u = probability of contributing to task success; see Fig. 2). Here the u values are shown inside the rectangles. A region R is the triple (r, u, e), where e is the error in u. The region set R = {R} serves both as the PLS1 knowledge structure and as the PLS2 genotype. In PLS1, R is a discrete (step) function expressing variation of utility u with features xi. In PLS2, R is a compressed version of the detailed genotype illustrated in Fig. 5.

3.1 The Region as PLS1 Knowledge Structure

In a feature space representation,8 an object is a vector x = (x1, x2, ..., xn). In a problem or game, the basic object is the state,9 frequently expressed as a vector of features such as piece advantage, center control, mobility, etc. Observations made during the course of even many problems or games normally cover just a fraction of feature space, and generalization is required for prediction.

In generalization learning, objects are abstracted to form classes, categories, or concepts. This may take the form of a partition of feature space, i.e., a set of mutually exhaustive local neighborhoods called clusters or cells [An 73, Di 82]. Since the goal of clustering in PLS is to aid task performance, the basis for generalization is some measure of the worth, quality, or utility of a state or cell relative to the task.

8. Feature spaces are sometimes avoided because they cannot easily express structure. However, alternative representations as normally used are also deficient for realistic generalization learning. A new scheme mechanizes one very difficult inductive problem, feature formation [Re 83d, Re 85c].

9. The object or event could just as well be an operator to be applied to a state, or a state-operator pair. See [Re 83d].

One measure of utility is the probability of contributing to a solution or win. In Figs. 2 and 4, probability classes are rectangular cells (for economy). The leftmost rectangle r has probability u = 0.20. The rectangle r is a category, generalizing the conditions under which the utility u applies.10

In PLS, a rectangle is associated not just with its utility u, but also with the utility error e. This expression e of uncertainty in u allows quantification of the effect of noise, and provides an informed and concise means for weighting various contributions to the value of u during learning. The triple R = (r, u, e), called a region, is the main knowledge structure for PLS1. A set R = {R} of regions defines a partition in augmented feature space.

R may be used directly as a (discrete) evaluation or heuristic function H = u(r), to assess a state x ∈ r in search. For example, in Fig. 4 there are six regions, which differentiate states into six utility classes. Instead of forming a discrete heuristic, R may be used indirectly, as data for determining the weight vector b in a smooth evaluation function H = b·x (employing curve fitting techniques). We shall return to these algorithmic aspects of PLS in §4.
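A minimal sketch of the region triple and its direct use as a discrete heuristic may help; the concrete representation (per-feature bounds, a linear scan for lookup) is an assumption for illustration, not the PLS1 implementation:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Region:
        bounds: List[Tuple[float, float]]  # rectangle r: (lo, hi) per feature
        u: float                           # estimated utility
        e: float                           # estimated error in u

        def contains(self, x):
            return all(lo <= xi <= hi
                       for xi, (lo, hi) in zip(x, self.bounds))

    def H_discrete(region_set, x, default=0.0):
        """Discrete evaluation function: utility of the region containing x."""
        for R in region_set:
            if R.contains(x):
                return R.u
        return default  # x falls outside the recorded cells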

Figure 5. Definition of maximally detailed genotype U. If the number of points in feature space is finite, and a value of the utility is associated with each point, complete information can be captured in a detailed genotype U of concatenated utilities u1 u2 ... uL. Coordinates could be linearly ordered, as shown here for the two-dimensional case. U is the fully expanded genotype corresponding to the compressed version of Fig. 4.

10. This could be expressed in other ways. The production rule form is r → u. Using logic, r is represented (0 ≤ x1 ≤ 4) ∧ (0 ≤ x2 ≤ 2).

3.2 The Region Set as Compressed and Unrestricted PLS2 Genotype

Now let us examine these information structures from the genetic viewpoint. The weight vector b or evaluation function H could be considered a GA phenotype. What might the genotype be? One choice, a simple one, was described in §2 and illustrated in Fig. 3: here the genotype B is just a binary coding of b. A different possibility is one that captures exhaustive information about the relationship between utility u and feature vector x (see Fig. 5). In this case the gene would be (x, u). If the number of genes is finite, they can be indexed and concatenated to give a very detailed genotype U, which becomes a string of values u1 u2 ... uL coding the entire utility surface in augmented feature space.
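The indexing step can be sketched as follows; the grid sizes and the row-major ordering are assumptions for illustration:

    # Sketch: a maximally detailed genotype U over a finite 2-D feature grid,
    # with coordinates linearly ordered as in Fig. 5.

    N1, N2 = 5, 3   # assumed finite numbers of values for features x1, x2

    def locus(x1, x2):
        """Linear index of the gene (x, u): each point gets a fixed locus."""
        return x1 * N2 + x2

    def detailed_genotype(utility):
        """U = u1 u2 ... uL, the utility surface written out in full."""
        return [utility(x1, x2) for x1 in range(N1) for x2 in range(N2)]

    # A region (r, u, e) then stands for every locus whose point lies in the
    # rectangle r, all sharing the single allele u: the compression of 3.2.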

This genotype U is unusual in some important ways. Let us compare it with the earlier example B of §2 (Fig. 3). B is simply a binary form of weight vector b. One obvious difference between B and U is that U is more verbose than B. This redundancy aspect will be considered shortly. The other important difference between B and U is that alleles within B may well interact (to express feature nonlinearity), but alleles uj within U cannot interact (since the uj express an absolute property of feature vector x, i.e., its utility for some task). As explained in the next section, this freedom from gene interdependence permits localization of credit.11

The detailed genotype U codes the utility surface, which may be very irregular at worst, or very smooth at best. This surface may be locally well behaved (it may vary slowly in some volumes of feature space). In cases of local regularity, portions of U are redundant. As shown in Fig. 5, PLS2 compresses the genotype U into the region set R (examined in §3.1 from the PLS1 viewpoint). In PLS2, a single region R = (r, u, e) is a set of genes, the whole having just one allele u (we disregard the genetic coding of e). Unlike standard genotypes, which have a stationary locus for each gene and a fixed number of genes, the region set has no explicit loci, but rather a variable number of elements (regions), each representing a variable number of genes. A region compresses gene sets having similar utility, according to current knowledge.

11. While one of the strengths of a GA is its ability to manage interaction of variables (by "co-adapting" alleles), PLS2 achieves efficient and concise knowledge representation and acquisition by flexible gene compression and by certain other methods examined later in this paper.



4. KNOWLEDGE ACQUISITION: SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to provide barely adequate information about the utility surface. This compact representation results in economy of both space and time, and in effective learning. Some reasons for this power are considered.

The ultimate purpose of PLS is to discover utility classes in the form of a region set R. This knowledge structure controls the primary task; for example, in heuristic search, R = {R} = {(r, u, e)} defines a discrete evaluation function H(r) = u.

The ideal R would be perfectly accurate and maximally compressed. Accuracy of utility u determines the quality of task performance. Appropriate compression of R characterizes the task domain concisely but adequately (see Figs. 3 and 4), saving storage and time, both during task performance and during learning.

These goals of accuracy and economy are approached by the doubly layered learning system PLS2 (Fig. 1). PLS1 and PLS2 combine to become effective rules for generalization (induction), specialization (differentiation), and reorganization. The two layers support each other in various ways: for example, PLS2 stabilizes the perceptual system PLS1, and PLS1 maintains genotype diversity of the genetic system PLS2. In the following we consider details, first from the standpoint of PLS1, then from the perspective of PLS2.

4.1 PLS1: Revision and Differentiation

Even without a genetic component, PLS1 is a flexible learning system which can be employed in noisy domains requiring incremental learning. It can be used for simple concept learning, like the systems in [Di 82], but most experiments have involved state-space problem solving and game playing.12 Here we examine PLS in the context of these difficult tasks.

12. These experiments have led to unique results, such as discovery of locally optimal evaluation functions (see [Re 83a, Re 83d]).


As described in §3.1, the main PLS1 knowledge structure is the region set R = {(r, u, e)}. Intermediate between basic data obtained during search and a general heuristic used to control search, R defines a feature space augmented and partitioned by u and e. Because R is improved incrementally, it is called the cumulative region set. PLS1 repeatedly performs two basic operations on R. One operation is correction or revision (of utility u and error e), and the other is specialization, differentiation, or refinement (of feature space cells r). These operators are detailed in [Re 83a, Re 83d]; here we simply outline their effects and note their limitations.

Revision of u and e. For an established region R = (r, u, e) ∈ R, PLS1 is able to modify u and to decrease e by using new data. This is accomplished in a rough fashion, by comparing established values within all rectangles r with fresh values within the same r. It is difficult or impossible to learn the "true" values of u, since data are acquired during performance of hard tasks, and these data are biased in unknown ways because of nontrivial search.

Refinement of R. Alternately performing, then learning, PLS1 acquires more and more detail about the nature of variation of utility u with features. This information accumulates in the region set R = {R} = {(r, u, e)}, where the primary effect of clustering is increasing resolution of R. The number, sizes, and shapes of rectangles in R reflect current knowledge resolution. As this differentiation continues in successive iterations of PLS1, attention focuses on more useful parts of feature space, and heuristic power improves.

Unfortunately, so does the likelihood of error. Further, errors are difficult to quantify, and hard to localize to individual regions.

In brief, while the incremental learning of PLS1 is powerful enough to learn locally optimal heuristics under certain conditions, and while PLS1 feedback is good enough to control and correct mild errors, the feedback can become unstable in unfavorable situations: instead of being corrected, errors can become more pronounced. Moreover, PLS1 is sensitive to parameter settings (see Appendix B). The system needs support.
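The following sketch suggests how the two operators might look, reusing the Region sketch of §3.1. The error-weighted blending rule, the split test, and the helper split_rectangle are all assumptions; the actual algorithms are given in [Re 83a, Re 83d].

    def revise(region, u_new, e_new):
        """Utility revision: blend fresh data into (u, e), trusting each
        source in inverse proportion to its estimated error (assumed rule)."""
        w_old, w_new = 1.0 / region.e, 1.0 / e_new
        region.u = (w_old * region.u + w_new * u_new) / (w_old + w_new)
        region.e = 1.0 / (w_old + w_new)  # error declines as evidence grows

    def maybe_refine(region, samples, threshold):
        """Refinement: split rectangle r when observed utilities inside it
        vary enough to warrant finer resolution. samples is an assumed list
        of (x, u) pairs observed within the rectangle."""
        spread = max(u for _, u in samples) - min(u for _, u in samples)
        if spread > threshold:
            return split_rectangle(region, samples)  # hypothetical helper
        return [region]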

4.2 PLS2 Genetic Operators

Qualities missing in PLS1 can be provided by PLS2. As §4.1 concluded, PLS1, with its single region set, cannot discover accurate values of utilities u. PLS2, however, maintains an entire population of region sets, which means that several regions in all cover any given feature space volume. The availability of comparable regions ultimately permits greater accuracy in u, and brings other benefits.

As §3.2 explained, a PLS2 genotype is the region set R = {R}, and each region R = (r, u, e) ∈ R is a compressed gene whose allele is the utility u. Details of an early version of PLS2 are given in [Re 83c]. Those algorithms have been improved; the time complexity of the operators in recent program implementations is linear in the population size K. The following discussion outlines the overall effects and properties of the various genetic operators (compare the more usual GA of §2).

K-sexual mating is the operator analogous to crossover. Consider a population of K different region sets R. Each set is composed of a number of regions R, which together cover feature space. A new region set R' is formed by selecting individual regions (one at a time) from parents R, with probability proportional to merit (merit μ is the performance of R, defined at the end of §2). Selection of regions from the whole population of region sets continues until the feature space cover is approximately the average cover of the parents. This creates the offspring region set R', which is generally not a partition.
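A sketch of this operator, again in terms of the Region sketch of §3.1 (the cover measure, total rectangle volume, is an assumption):

    import random

    def cover(region_set):
        """Assumed cover measure: total volume of the set's rectangles."""
        vol = 0.0
        for R in region_set:
            v = 1.0
            for lo, hi in R.bounds:
                v *= (hi - lo)
            vol += v
        return vol

    def k_sexual_mating(population, merit_values):
        """Draw regions from all K parents, parents chosen with probability
        proportional to merit, until the offspring's cover approximates the
        parents' average cover."""
        target = sum(cover(RS) for RS in population) / len(population)
        offspring = []
        while cover(offspring) < target:
            parent = random.choices(population, weights=merit_values, k=1)[0]
            offspring.append(random.choice(parent))
        return offspring  # generally not a partition; reorganized afterwards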

Gene reorganization. For economy of storage and time, the offspring region set R' is repartitioned, so that regions do not overlap in feature space.

Controlled mutation. Standard mutation operators alter an allele randomly. In contrast, the PLS2 operator analogous to mutation changes an allele according to evidence arising in the task domain. The controlled mutation operator for a region set R = {(r, u, e)} is the utility revision operator of PLS1: as described in §4.1, PLS1 modifies the utility u for each feature space cell r.

Genotype expansion. This operator is also provided by PLS1. Recall the discussion of §3.2 about the economy resulting from compressing genes (utility-feature vectors) into a region set R. The refinement operator was described in §4.1. This feature space refinement amounts to an expansion of the genotype R, and is carried out when data warrant the discrimination.

Both controlled mutation and genotype expansion promote genotype diversity. Thus PLS1 helps PLS2 to avoid premature convergence, a typical GA problem [Br 81, Ma 84].

4.3 Effectiveness and Efficiency

The power of PLS2 may be traced to certain aspects of the perceptual and genetic algorithms just outlined. Some existing and emerging principles of effective and efficient learning are briefly discussed below (see also [Re 85a, Re 85b, Re 85c]).

Credit localization. The selection of regions for K-sexual mating may use a single merit value μ for each region R within a given set R. However, the value of μ can just as well be localized to single regions within R, by comparing R with similar regions in other sets. Since regions estimate an absolute quantity (task-related utility) in their own volume of feature space, they are independent of each other. Thus credit and blame may be assigned to feature space cells (i.e., to gene sequences).
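One plausible reading of this comparison is sketched below; the agreement score is an assumption, since the paper defers the comparison rule to [Re 83c].

    def overlaps(r1, r2):
        """Two rectangles overlap if their intervals intersect per feature."""
        return all(lo1 <= hi2 and lo2 <= hi1
                   for (lo1, hi1), (lo2, hi2) in zip(r1.bounds, r2.bounds))

    def localized_credit(region, other_sets):
        """Judge a region by how well its u agrees with overlapping regions
        from other sets; inverse absolute deviation is an assumed score."""
        peers = [R for RS in other_sets for R in RS if overlaps(region, R)]
        if not peers:
            return 1.0  # no evidence either way
        consensus = sum(R.u for R in peers) / len(peers)
        return 1.0 / (1.0 + abs(region.u - consensus))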

Assignment of credit to individual regions within a cumulative set R is straightforward, but it would be difficult to do directly in the final evaluation function H, since the components of H, while appropriate for performance, omit information relevant to learning (compare Figs. 2 and 4).13

Knowledge mediation. Successful systems tend to employ information structures which mediate data objects and the ultimate knowledge form. These mediating structures include means to record growing assurance of tentative hypotheses.

When used in heuristic search, the PLS region set mediates large numbers of states and a very concise evaluation function H.

13. There are various possibilities for the evaluation function H, but all contain less useful information than their determinant, the region set R. The simplest heuristic, used in [Re 83a, Re 83d], is H = b·f, where b is a vector of weights for the feature vector f. (This linear combination is used exclusively in experiments to be described in §5.) The value of b is found using regions as data in linear regression [Re 83a, Re 83b].


Retention and continual improvement of this mediating structure refines the credit assignment problem. This view is unlike that of [Di 81, p.14; Di 82]: learning systems often attempt to improve the control structure itself, whereas PLS acquires knowledge efficiently in an appropriate structure, and utilizes this knowledge by compressing it, only temporarily, for performance. In other words, PLS does not directly search rule space for a good H, but rather searches for good cumulative regions from which H is constructed.

Full but controlled use of every datum. Samuel's checker player permitted each state encountered to influence the heuristic H, and at the same time no one datum could overwhelm the system. The learning was stochastic, both conservative and economic. In this respect PLS2 is similar (although more automated).

Schemata in learning systems and genetic algorithms. A related efficiency in both Samuel's systems and PLS is like the schemata concept in a GA. In a GA, a single individual coded as a genotype (a string of digits) supports not only itself but also all its substrings. Similarly, a single state arising in heuristic search contains information about every feature used to describe it. Thus each state can be used to appraise and weight each feature. (The effect is more pronounced when a state is described in more elementary terms and combinations of primitive descriptors are assessed; see [Re 85c].)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re 83a]. The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e., the performance of a whole region set, without localization of credit to individual regions. This measure μ was the average number of nodes developed D in a training sample of 8 fifteen puzzles, divided into the mean of all such averages in a population of K sets, i.e., μ = D̄/D, where D̄ is the average over the population (D̄ = (Σj Dj) / K).

Changing environment. For these experiments, successive rounds of training were repeated, in incremental learning over several iterations or generations. The environment was altered in successive generations: it was specified as problem difficulty, or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ..., dn).

Here d was static, one known to be a good progression based on previous experience with user training [Co 84].14 In these experiments d was always (8, 14, 22, 50, ∞). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1. In this case static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lack is detected and problem difficulty is reduced (i.e., dynamic training is needed).

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning are generated by the system. Each node can be considered a separate example for concept learning [Re 83d]. [Co 84] describes experiments with an automated trainer.



Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance, PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training, PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their merits μ are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures, but the probability of finding solutions is only 1/2, then the average cost of useful information is 500 / (1/2) = 1000.

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?

5.3 Results

Table 1 shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that, of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e., required no more than 386 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e., D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table 1. Success rates, performance, and costs as a function of population size K. Column groups: the proportion of successful H's within 100%, 50%, 25%, and 10% of optimal D; average, best, and population performance (D̄, Db, Dp); expected costs of accuracy within 100% and within 10% of optimum (in thousands of nodes developed); and the expected error in population performance at constant cost.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table 1. The sixth column gives the average D̄ for an H in the sample (of 38). The seventh column gives Db for the best Hb in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br 81]. The eighth column, however, is unusual: it indicates the population performance Dp, resulting when all regions from every set in the population are used together in a regression to determine Hp. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re 83c]; cf. [Br 81, Ho 75]).

Several trends are apparent in Table 1. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, average, best, and population performance measures D̄, Db, and Dp (columns 6-8) also improve with K. Perhaps most important is that the population performance Dp is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for Hp) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as Hp (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re 83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance Dp, relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and traditional optimization. In [Re 81] PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons, we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·f. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here the number of nodes developed, D) for various values of b. RSF utilizes just a single quantity (i.e., D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa 63, Sa 67], uncovers knowledge about every feature from every node (see §4.3). PLS1, then, might be expected to be more efficient than RSF. Experiments verify this [Re 81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as the square root of the total number of training problems, like RSF.

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While it is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re 83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4), and should exhibit accuracy improvement better than √N. Table 1 suggests this.

PLS1 versus PLS2. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones: accurate cumulative sets have a longer lifetime.

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement, using piecewise linearity, with little additional cost [Re 83b]. PLS2 is more robust than PLS1: while the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy, all at once. PLS2 buffers inadequacies inherent in PLS1. Moreover, PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re 83c]. This is very costly with RSF, and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re 83a, Re 83d]. Given a set of user-defined features and some measure of the utility (e.g., probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure: the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.



A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R, was originally used in the learning system PLS1 [Re 83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm, having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency, or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re 85a], but PLS2 does even better: given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re 83b].

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, ability to discover global optima, and efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g., credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g., full but controlled use of each datum).

REFERENCES

[An 73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.
[Be 80] Bethke, A.D., Genetic Algorithms as Function Optimizers, PhD Thesis, University of Michigan, 1980.
[Br 81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (PhD Dissertation), University of Alberta, 1981.
[Bu 78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.
[Co 84] Coles, D., and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.
[De 80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Trans. on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.
[Di 81] Dietterich, T.G., and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-801, 1981.
[Di 82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-813, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R., and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.
[Ho 75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[Ho 80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int. Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.
[Ho 81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.
[Ho 83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.
[La 83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the BACON system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.
[Ma 84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.
[Me 85] Medin, D.L., and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology, unpublished manuscript, Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.
[Mic 83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.
[Mic 83b] Michalski, R.S., and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.
[Mit 83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.
[Ni 80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.
[Re 76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.
[Re 77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.
[Re 81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (PhD thesis), University of Waterloo, 1981.
[Re 83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.
[Re 83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.
[Re 83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.
[Re 83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).
[Re 85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.
[Re 85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.
[Re 85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.
[Sa 63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A., and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.
[Sa 67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.
[Sm 80] Smith, S.F., A learning system based on genetic adaptive algorithms, PhD Dissertation, University of Pittsburgh, 1980.
[Sm 83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An 73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An 73, p.194] suggests "external criteria"). External criteria in [Mic 83b, Re 76, Re 83a, Re 85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic 83b]). Criteria in [Re 76, Re 83a, Re 85b] are based on the data environment (see "utility" below).15 This paper uses clustering to create economical, compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g., "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual or a population is called a phenotype. The phenotype is coded as a string of digits, called the genotype. A single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories, in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu 78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu 78, Di 81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g., PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di 82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re 83d] and discovering a useful subset of features (cf. [Sa 63]). In this paper the control structure H is a linear evaluation function [Ni 80], and the rules are feature weights for H. Search for accurate regions replaces direct search of rule space, i.e., regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.

15. A new learning system [Re 85c] introduces higher-dimensional clustering for creation of structure.



Merit μ. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called "single concepts" [Di 82], but PLS is capable of much more difficult tasks, involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re 83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below), using various inductive operations described in §4.

Region or cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype, representing a utility function to be optimized. As explained in [Re 83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 and 4.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §§1, 3.

APPENDIX B PLSI LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations. PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e). PLS1 also incrementally refines the feature space, as rectangles r are continually split.

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re 83a]. System parameters influence estimates of the error e and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.16

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect, and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table 1). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co 84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited. There is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)

14

3 PLS INFORMATION STRUCTURING DUAL VIEWPOINT

The connection behveea PLSI perceptual learning and PLS2 genetic adaptation is subtle and indirect BasicaUy PLSI deals witb object x (whicb caa be just aboat anything) aad their relationships to taak performance Let UI call the usefulness of an object x in some task domain ita utilitl u(x)

Since the number of objects is typically immense even vast observatioD is incomplete and generalization is required for prediction of u given a previously aneacountered x A significant step ia generalization is usually the expression of x aa a vector of high-level abstract features Xl X2 bullbullbull x so that x really represents not just one object but ratber a large number of similar objects (eg in a board game x might be a vector of reatures sach aa piece advantage center control etc) A further step in generalishyzation is to classif1l or categorize XiS which are similar for current purposese Since the purpose is to succeed well in a task PLSI cl3Ssi6es xs having similar utilities u

Class formation can be accomplished in several ways depending on the model assumed If the task domain and features permit objects having similar utilities may be eluterl in

feature space as illustrated in Figs 2t giving a middotregion set- RT Another model is the linear combination H == bt ot sect2

It is at this point that a GA like PLS2 can aid the learning process WelJ performing bs or Rs may be selected accord in to their merit 110 Note that merit 110 is an overall measure of the task performance while utility u is a quality measure localized to individual objects

The question now is what information structures to choose for representing knowledge about task utility For many reasons PLS incorshyporates the middotreion set- (Figbullbull) which represents domain knowledge by associatin an object with its utility We examine the region set from two

I Here t cttlfif means t frm classes cateories or concepts This is difficult to automate

1 PLSl initiated wbat has become known as conceptutl c1usterin - where not just reature valuea are considered but also predetermined rorms or classes (e rectlnles) and the whole data environment (eampshyutility) See [Re 78 Re 71 Re 83a Re SSa Re SSbl and also Appendix A

points of view as a PLSI knowledge structure and as a PLS2 genetie structure

0001

Fipre 4 Dual inLerpretation or a reion set R A repon set a part-it-loa or reatur pace (bere there are I repons) Poinu are clustered into reion ampccordinl to heir utility u in some task domain (e u - probabiJit7 or contributin to task success-see Fie 2) Here he u alues are bown inside the rectanshy

Ies A rei R is the triple (r u e) where e is the error in u The repon set R - R serves both as the PLSI knowledp IJtructur and as the PLS2 pnotypbull In PLS1 R is a discreLe (sLep) runction expresain varishyation or utility u witb reatures Xl In PLS2 R is a compressed nmon or the detailed enotype illustraLed in Fie5

31 The Region as PLSI Knowledge Strucshyture

In a feature pau representation an object

is a vector x == (Xl ~ x) In a problem or game the basic object is the tat trequently expressed as a vector oC features such as piece advantage center control mobility etc Obsershyvations made dUlin the course of even many problems or games normally cover just a traction of feature space and generalization is required for prediction

In neraliz4lion Iarnin objects are abstracted to form clan cotegoritll or conshycet This may take the form ot a partition of teature space ie a set of mutually exhaustive local neighborhoods called cute or cell (An 73 Di821 Since the goal of clustering ia PLS is to aid task performance the basis for generalishyzation is some measure otthe worth quality or

8. Feature spaces are sometimes avoided because they cannot easily express structure. However, alternative representations as normally used are also deficient for realistic generalization learning. A new scheme mechanizes a very difficult inductive problem, feature formation [Re 83d, Re 85c].

9. The object or event could just as well be an operator to be applied to a state, or a state-operator pair. See [Re 83d].

One measure of utility is the probability of contributing to a solution or win. In Figs. 2 and 4, probability classes are rectangular cells (for economy). The leftmost rectangle r has probability u = 0.20.¹⁰ The rectangle r is a category generalizing the conditions under which the utility u applies.

In PLS a rectangle is associated not just with its utility u, but also with the utility error e. This expression e of uncertainty in u allows quantification of the effect of noise, and provides an informed and concise means for weighting various contributions to the value of u during learning. The triple R = (r, u, e), called a region, is the main knowledge structure for PLS1. A set R = {R} of regions defines a partition in augmented feature space.

R may be used directly as a (discrete) evaluation or heuristic function H = u(r) to assess a state x ∈ r in search. For example, in Fig. 4 there are six regions, which differentiate states into six utility classes. Instead of forming a discrete heuristic, R may be used indirectly, as data for determining the weight vector b in a smooth evaluation function H = b·x (employing curve fitting techniques). We shall return to these algorithmic aspects of PLS in §4.
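To make the first, direct use concrete, the following minimal sketch shows a region set acting as the discrete heuristic; the representation and names are illustrative assumptions, not the original PLS implementation.

    # A region is (bounds, u, e); bounds holds one (lo, hi) interval per feature.
    def H_discrete(region_set, x):
        """Return u(r) for the region r whose rectangle contains x."""
        for bounds, u, e in region_set:
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, bounds)):
                return u
        return 0.0   # assumed default for states outside every region

    # The leftmost rectangle of Fig. 4: (0 <= x1 <= 4) and (0 <= x2 <= 2),
    # utility u = 0.20; the error value 0.05 is a placeholder.
    R = [(((0, 4), (0, 2)), 0.20, 0.05)]
    print(H_discrete(R, (3, 1)))   # -> 0.2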

Figure 5. Definition of the maximally detailed genotype U. If the number of points in feature space is finite, and a value of the utility is associated with each point, complete information can be captured in a detailed genotype U of concatenated utilities u1 u2 ... uL. Coordinates could be linearly ordered, as shown here for the two-dimensional case. U is a fully expanded genotype corresponding to the compressed version of Fig. 4.

10. This could be expressed in other ways. The production rule form is r → u. Using logic, r is represented (0 ≤ x1 ≤ 4) ∧ (0 ≤ x2 ≤ 2).

3.2 The Region Set as Compressed and Unrestricted PLS2 Genotype

Now let us examine these information structures from the genetic viewpoint. The weight vector b or evaluation function H could be considered a GA phenotype. What might the genotype be? One choice, a simple one, was described in §2 and illustrated in Fig. 3: here the genotype B is just a binary coding of b. A different possibility is one that captures exhaustive information about the relationship between utility u and feature vector x (see Fig. 5). In this case the gene would be (x, u). If the number of genes is finite, they can be indexed and concatenated to give a very detailed genotype U, which becomes a string of values u1 u2 ... uL coding the entire utility surface in augmented feature space.

This genotype U is unusual in some important ways. Let us compare it with the earlier example B of §2 (Fig. 3); B is simply a binary form of weight vector b. One obvious difference between B and U is that U is more verbose than B. This redundancy aspect will be considered shortly. The other important difference between B and U is that alleles within B may well interact (to express feature nonlinearity), but alleles uj within U cannot interact (since the uj express an absolute property of feature vector x, i.e. its utility for some task). As explained in the next section, this freedom from gene interdependence permits localization of credit.¹¹

The detailed genotype U codes the utility surface, which may be very irregular at worst, or very smooth at best. This surface may be locally well behaved (it may vary slowly in some volumes of feature space). In cases of local regularity, portions of U are redundant. As shown in Fig. 5, PLS2 compresses the genotype U into the region set R (examined in §3.1 from the PLS1 viewpoint). In PLS2 a single region R = (r, u, e) is a set of genes, the whole having just one allele u (we disregard the genetic coding of e). Unlike standard genotypes, which have a stationary locus for each gene and a fixed number of genes, the region set has no explicit loci, but rather a variable number of elements (regions), each representing a variable number of genes. A region compresses gene sets having similar utility, according to current knowledge.
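The compression itself can be pictured with a one-dimensional caricature. The run-length merging below over a linearly ordered U is only an illustration (PLS2 actually clusters rectangles in n-dimensional feature space), and the tolerance parameter is an assumption.

    def compress(U, tol=0.0):
        """Collapse a detailed genotype U = [u1, ..., uL] into regions
        (start, end, u) wherever utility is locally constant within tol."""
        regions, start = [], 0
        for i in range(1, len(U) + 1):
            if i == len(U) or abs(U[i] - U[start]) > tol:
                regions.append((start, i - 1, U[start]))
                start = i
        return regions

    U = [0.2, 0.2, 0.2, 0.5, 0.5, 0.9]   # six genes
    print(compress(U))                   # -> [(0, 2, 0.2), (3, 4, 0.5), (5, 5, 0.9)]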

11. While one of the strengths of a GA is its ability to manage interaction of variables (by "co-adapting" alleles), PLS2 achieves efficient and concise knowledge representation and acquisition by flexible gene compression and by certain other methods examined later in this paper.



4. KNOWLEDGE ACQUISITION: SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to provide barely adequate information about the utility surface. This compact representation results in economy of both space and time, and in effective learning. Some reasons for this power are considered.

The ultimate purpose of PLS is to discover utility classes in the form of a region set R. This knowledge structure controls the primary task; for example, in heuristic search, R = {R} = {(r, u, e)} defines a discrete evaluation function H(r) = u.

The ideal R would be perfectly accurate and maximally compressed. Accuracy of utility u determines the quality of task performance. Appropriate compression of R characterizes the task domain concisely but adequately (see Figs. 3, 4), saving storage and time, both during task performance and during learning.

These goals of accuracy and economy are approached by the doubly layered learning system PLS2 (Fig. 1). PLS1 and PLS2 combine to become effective rules for generalization (induction), specialization (differentiation), and reorganization. The two layers support each other in various ways: for example, PLS2 stabilizes the perceptual system PLS1, and PLS1 maintains genotype diversity of the genetic system PLS2. In the following we consider details, first from the standpoint of PLS1, then from the perspective of PLS2.

4.1 PLS1: Revision and Differentiation

Even without a genetic component, PLS1 is a flexible learning system which can be employed in noisy domains requiring incremental learning. It can be used for simple concept learning like the systems in [Di 82], but most experiments have involved state-space problem solving and game playing.¹² Here we examine PLS in the context of these difficult tasks.

12. These experiments have led to unique results, such as discovery of locally optimal evaluation functions (see [Re 83a, Re 83d]).


As described in §3.1, the main PLS1 knowledge structure is the region set R = {(r, u, e)}. Intermediate between basic data obtained during search and a general heuristic used to control search, R defines a feature space augmented and partitioned by u and e. Because R is improved incrementally, it is called the cumulative region set. PLS1 repeatedly performs two basic operations on R. One operation is correction or revision (of utility u and error e), and the other is specialization, differentiation, or refinement (of feature space cells r). These operators are detailed in [Re 83a, Re 83d]; here we simply outline their effects and note their limitations.

Revision of u and e. For an established region R = (r, u, e) ∈ R, PLS1 is able to modify u and to decrease e by using new data. This is accomplished in a rough fashion, by comparing established values within all rectangles r with fresh values within the same r. It is difficult or impossible to learn the "true" values of u, since data are acquired during performance of hard tasks, and these data are biased in unknown ways because of nontrivial search.
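A rough sketch of such a revision step follows, assuming a simple inverse-variance weighting of established and fresh estimates; the actual PLS1 operator [Re 83a] is more elaborate, in part because it must cope with the search bias just mentioned.

    def revise(u_old, e_old, u_fresh, e_fresh):
        """Blend an established utility estimate with a fresh one."""
        w_old, w_fresh = 1.0 / e_old ** 2, 1.0 / e_fresh ** 2
        u_new = (w_old * u_old + w_fresh * u_fresh) / (w_old + w_fresh)
        e_new = (1.0 / (w_old + w_fresh)) ** 0.5   # error shrinks as evidence accrues
        return u_new, e_new

    print(revise(0.20, 0.10, 0.30, 0.10))   # -> (0.25, 0.0707...)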

Refinement of R. Alternately performing, then learning, PLS1 acquires more and more detail about the nature of variation of utility u with features. This information accumulates in the region set R = {(r, u, e)}, where the primary effect of clustering is increasing resolution of R. The number, sizes, and shapes of rectangles in R reflect current knowledge resolution. As this differentiation continues in successive iterations of PLS1, attention focuses on more useful parts of feature space, and heuristic power improves.

Unfortunately, so does the likelihood of error. Further, errors are difficult to quantify and hard to localize to individual regions.

In brief, while the incremental learning of PLS1 is powerful enough to learn locally optimal heuristics under certain conditions, and while PLS1 feedback is good enough to control and correct mild errors, the feedback can become unstable in unfavorable situations: instead of being corrected, errors can become more pronounced. Moreover, PLS1 is sensitive to parameter settings (see Appendix B). The system needs support.

4.2 PLS2: Genetic Operators

Qualities missing in PLS1 can be provided by PLS2. As §4.1 concluded, PLS1, with its single region set, cannot discover accurate values of utilities u. PLS2, however, maintains an entire population of region sets, which means that several regions in all cover any given feature space volume. The availability of comparable regions ultimately permits greater accuracy in u, and brings other benefits.

As §3.2 explained, a PLS2 genotype is the region set R = {R}, and each region R = (r, u, e) ∈ R is a compressed gene whose allele is the utility u. Details of an early version of PLS2 are given in [Re 83c]. Those algorithms have been improved: the time complexity of the operators in recent program implementations is linear with population size K. The following discussion outlines the overall effects and properties of the various genetic operators (compare the more usual GA of §2).

K-sexual mating is the operator analogous to crossover. Consider a population of K different region sets R. Each set is composed of a number of regions R, which together cover feature space. A new region set R′ is formed by selecting individual regions (one at a time) from parents R, with probability proportional to merit μ (merit is the performance of R, defined at the end of §2). Selection of regions from the whole population of region sets continues until the feature space cover is approximately the average cover of the parents. This creates the offspring region set R′, which is generally not a partition.
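A sketch of the operator under stated assumptions: parent sets are drawn with probability proportional to merit, one region is copied at a time, and rectangle volume stands in for "feature space cover". Names and details are illustrative, not the PLS2 implementation.

    import random

    def cover(bounds):                       # volume of one rectangular cell
        v = 1.0
        for lo, hi in bounds:
            v *= hi - lo
        return v

    def k_sexual_mating(population, merits):
        """population: K region sets, each a list of (bounds, u, e) regions."""
        target = (sum(cover(b) for rs in population for b, _, _ in rs)
                  / len(population))         # average parental cover
        offspring, covered = [], 0.0
        while covered < target:
            parent = random.choices(population, weights=merits, k=1)[0]
            region = random.choice(parent)
            offspring.append(region)
            covered += cover(region[0])
        return offspring   # generally not a partition; hence reorganization, below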

Gene reorganization. For economy of storage and time, the offspring region set R′ is repartitioned, so that regions do not overlap in feature space.

Controlled mutation. Standard mutation operators alter an allele randomly. In contrast, the PLS2 operator analogous to mutation changes an allele according to evidence arising in the task domain. The controlled mutation operator for a region set R = {(r, u, e)} is the utility revision operator of PLS1: as described in §4.1, PLS1 modifies the utility u for each feature space cell r.

Genotype expansion. This operator is also provided by PLS1. Recall the discussion of §3.2 about the economy resulting from compressing genes (utility-feature vectors) into a region set R. The refinement operator was described in §4.1; this feature space refinement amounts to an expansion of the genotype R, and is carried out when data warrant the discrimination.

Both controlled mutation and genotype expansion promote genotype diversity. Thus PLS1 helps PLS2 to avoid premature convergence, a typical GA problem [Br 81, Ma 84].

4.3 Effectiveness and Efficiency

The power of PLS2 may be traced to certain aspects of the perceptual and genetic algorithms just outlined. Some existing and emerging principles of effective and efficient learning are briefly discussed below (see also [Re 85a, Re 85b, Re 85c]).

Credit localization. The selection of regions for K-sexual mating may use a single merit value μ for each region R within a given set R. However, the value of μ can just as well be localized to single regions within R, by comparing R with similar regions in other sets. Since regions estimate an absolute quantity (task-related utility) in their own volume of feature space, they are independent of each other. Thus credit and blame may be assigned to feature space cells (i.e. to gene sequences).

Assignment of credit to individual regions within a cumulative set R is straightforward, but it would be difficult to do directly in the final evaluation function H, since the components of H, while appropriate for performance, omit information relevant to learning (compare Figs. 2 and 4).¹³
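One way such localization could work, sketched under loud assumptions (the actual localized-credit scheme is described in [Re 83c]): since every region estimates the same absolute quantity, a region can be scored by its agreement with overlapping regions from other sets.

    def local_credit(region, other_regions, overlap):
        """region: (bounds, u, e); other_regions: regions from other sets;
        overlap(b1, b2) -> True if two cells share feature-space volume."""
        bounds, u, _ = region
        peers = [v for b, v, _ in other_regions if overlap(bounds, b)]
        if not peers:
            return 1.0                                # no evidence either way
        consensus = sum(peers) / len(peers)
        return 1.0 / (1.0 + abs(u - consensus))       # agreement earns credit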

Knowledge mediation. Successful systems tend to employ information structures which mediate data objects and the ultimate knowledge form. These mediating structures include means to record growing assurance of tentative hypotheses.

When used in heuristic search, the PLS region set mediates large numbers of states and a very concise evaluation function H.

13. There are various possibilities for the evaluation function H, but all contain less useful information than their determinant, the region set R. The simplest heuristic, used in [Re 83a, Re 83d], is H = b·x, where b is a vector of weights for the feature vector x. (This linear combination is used exclusively in the experiments to be described in §5.) The value of b is found using regions as data in linear regression [Re 83a, Re 83b].


Retention and continual improvement of this mediating structure refines the credit assignment problem. This view is unlike that of [Di 81, p.14; Di 82]: learning systems often attempt to improve the control structure itself, whereas PLS acquires knowledge efficiently in an appropriate structure, and utilizes this knowledge by compressing it, only temporarily, for performance. In other words, PLS does not directly search rule space for a good H, but rather searches for good cumulative regions from which H is constructed.
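Footnote 13 describes the indirect route from regions to a smooth H = b·x. A minimal sketch of that regression step, assuming region centres serve as data points, and ignoring the error weighting the real system applies [Re 83a, Re 83b]:

    import numpy as np

    def fit_weights(regions):
        """Least-squares fit of b so that b . centre(r) approximates u."""
        centres = np.array([[(lo + hi) / 2.0 for lo, hi in bounds]
                            for bounds, u, e in regions])
        utilities = np.array([u for bounds, u, e in regions])
        b, *_ = np.linalg.lstsq(centres, utilities, rcond=None)
        return b             # weight vector of the smooth heuristic H = b . x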

Full but controlled use of every datum. Samuel's checker player permitted each state encountered to influence the heuristic H, and at the same time no one datum could overwhelm the system. The learning was stochastic, both conservative and economic. In this respect PLS2 is similar (although more automated).

Schemata in learning systems and genetic algorithms. A related efficiency in both Samuel's systems and PLS is like the schemata concept in a GA. In a GA, a single individual coded as a genotype (a string of digits) supports not only itself but also all its substrings. Similarly, a single state arising in heuristic search contains information about every feature used to describe it. Thus each state can be used to appraise and weight each feature. (The effect is more pronounced when a state is described in more elementary terms and combinations of primitive descriptors are assessed; see [Re 85c].)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re 83a]. The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e. the performance of a whole region set, without localization of credit to individual regions. This measure μ was computed from the average number of nodes developed, D, in a training sample of 8 fifteen puzzles: D was divided into the mean of all such averages in the population of K sets, i.e. μ = D̄/D, where D̄ is the average over the population (D̄ = (1/K) Σ_j D_j).

Changing environment. For these experiments, successive rounds of training were repeated in incremental learning over several iterations or generations. The environment was altered in successive generations: it was specified as problem difficulty or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ..., dn).

Here d was static, one known to be a good progression based on previous experience with user training [Co 84]. In these experiments d was always (8, 14, 22, 30, ∞, ...). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1. In this case static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lack is detected and problem difficulty is reduced (i.e. dynamic training is needed).¹⁴

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning are generated by the system. Each node can be considered a separate example for concept learning [Re 83d]. [Co 84] describes experiments with an automated trainer.



Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance, PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training, PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their μ's are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.
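Spelling out the arithmetic behind the last claim:

    p ≤ q^K = (1/2)^7 = 1/128 ≈ 0.0078 < 0.01.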

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed, D, to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures, but the probability of finding solutions is only 1/2, then the average cost of useful information is 500 ÷ 1/2 = 1000.

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?

5.3 Results

Table 1 shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e. required no more than 385 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e. D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table 1. Accuracy and cost of learning versus population size K. Column groups: proportion of H's within 100%, 50%, 25%, and 10% of optimal D; average, best, and population performance (D, Db, Dp); estimated cost of accuracy within 100% and within 10% of optimum (thousands of nodes); and expected error in Dp at constant cost. The numeric entries are not legible in this copy.]
The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table 1. The sixth column gives the average D for an H in the sample (of 38). The seventh column gives Db for the best Hb in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br 81]. The eighth column, however, is unusual: it indicates the population performance Dp, resulting when all regions from every set in the population are used together in a regression to determine Hp. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re 83c]; cf. [Br 81, Ho 75]).

Several trends are apparent in Table 1. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, average, best, and population performance measures D, Db, and Dp (columns 6-8) also improve with K. Perhaps most important is that the population performance Dp is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for Hp) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as Hp (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re 83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance Dp relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and traditional optimization. In [Re 81] PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons, we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·x. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here the number of nodes developed, D) for various values of b. RSF utilizes just a single quantity (i.e. D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa 63, Sa 67], uncovers knowledge about every feature from every node (see §4.3). PLS1, then, might be expected to be more efficient than RSF. Experiments verify this [Re 81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as √N, like RSF.
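To spell out the approximation (writing m for the roughly equal number of regions per set, and holding the per-structure training size N fixed):

    accuracy ∝ √(K·m) ∝ √K          (pooling the population's regions)
    total problems solved = K·N  ⇒  accuracy ∝ √(K·N),

the same square-root growth in the total number of problems solved that RSF exhibits.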

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While this is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re 83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4) and should exhibit accuracy improvement better than √N. Table 1 suggests this.



PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second-layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones. Accurate cumulative sets have a longer lifetime.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re 83a, Re 83d]. Given a set of user-defined features and some measure of the utility (e.g. probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently.

From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement, using piecewise linearity with little additional cost [Re 83b]. PLS2 is more robust than PLS1. While the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy all at once; PLS2 buffers inadequacies inherent in PLS1. Moreover, PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re 83c]. This is very costly with RSF and seems infeasible with PLS1 alone.

Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re 83a]; PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency, or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re 85a], but PLS2 does even better: given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re 83b].


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, the ability to discover global optima, and the efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g. credit localization). The paper has also considered some simple but little-explored ideas for realizing these capabilities (e.g. full but controlled use of each datum).

REFERENCES

[An 73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be 80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br 81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu 78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.

[Co 84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De 80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Trans. on Systems, Man and Cybernetics SMC-10 (1980), 566-574.

[Di 81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-801, 1981.

[Di 82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-813, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R. and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.

[Ho 75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho 80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int. Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho 81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-defined Systems, 1981.

[Ho 83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La 83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the Bacon system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma 84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me 85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories and cognitive archeology, unpublished manuscript, Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic 83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic 83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit 83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni 80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re 76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re 77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re 81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re 83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.

[Re 83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re 83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re 83d] Rendell, L.A., Conceptual knowledge acquisition in search, Dept. of Computing and Information Science Report CIS-83-15, University of Guelph, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re 85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re 85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re 85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa 63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa 67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm 80] Smith, S.F., A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm 83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A: GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An 73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An 73, p.194] suggests "external criteria"). External criteria in [Mic 83b, Re 76, Re 83a, Re 85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic 83b]). Criteria in [Re 76, Re 83a, Re 85b] are based on the data environment (see "utility" below).¹⁵ This paper uses clustering to create economical, compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g. "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual or a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e. a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories, in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu 78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu 78, Di 81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g. PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di 82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re 83d] and discovering a useful subset of features (cf. [Sa 63]). In this paper the control structure H is a linear evaluation function [Ni 80], and the rules are feature weights for H.

15. A new learning system [Re 85c] introduces higher-dimensional clustering for creation of structure.


Search for accurate regions replaces direct search of rule space, i.e. regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.

Merit μ. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on the task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called "single concepts" [Di 82], but PLS is capable of much more difficult tasks involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re 83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below) using various inductive operations described in §4.

Region or cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re 83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 and 4.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and the PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §§1, 3.

APPENDIX B: PLS1 LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations: PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e); PLS1 also incrementally refines the feature space, as rectangles r are continually split.

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re 83a]. System parameters influence estimates of the error e, and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.¹⁶

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table 1). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co 84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited: there is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)

14

utilit of a state or ceO relative to the task One measure of utility is the probability ot contributshying to a solution or will Ia Figs 24 probability classes are rectangular cells (tor economy) The leftmost rectangle r has probability a 020 The rectangle r is a category generalizing the conditions under which the utility u applies

In PLS a rectangle is associated not just with its utility u but also with the utility error e This expression e ot uncertainty iD u allows quantification ot the elect of noise and provides an infomed and concise means for weighting various contributions to the value of u during learning The triple R - (rue) caUed amp

region is the main knowledge structure for PLSI A set R R or regions defines amp partition ill augmented reature space

R may be used directly as amp (discrete) evalualion or ileuriflic rUlction H == u(r) to assess state x Erin search For example in Fig4 there are six regions which difrerentiate states into si( utility classes Instead of forming a discrete heuristic R may be used indirectly as data (or determining the weight vector b in a smooth evaluation (unction H == bx (employing cune fitting techniques) We shall return to these algorithmic aspects of PLS in 54

Figure 5 Definition ot muimally detailed genotype U If the number of points in reature space is finite and a value of the utility is usoeiated with each point comshyplete information can be captured in a detailed genoshytype U of concatenated utilities U I u bullbull UL Coordishynates could be linearly ordered as shown here ror the two dimensional case U is an tully expanded lenotype correspondinl to the compreSled version or Fit 4

10 This could be expressed in other ways The production rule rorm is r -u Usin 10lie r is represented (0 S XI S 4) n (0 S x S 2)

31 The Region Set at Compressed and Unreatrleted PLSJ Genotype

Now let us examine these inrormation structures from the genetic viewpoint The weight vector b or evaluation function H could be considered a GA phenotype What might the genotype ber One choice a simple one was described in sect2 and illustrated in Fig3 here the genotype B is just a binary coding or b A difFerent possibility is one that captures exhausshytive inrormation about the relationship between utility u and reature vector x (see FigS) In this case the gene would be (x u) It the number of genes is bite they can be indexed and conshycateoated to give a very detailed genotype U which becomes a striDg or values U 1U2 bullbullbull uL codshying the entire utility surface in augmented feature space

This genotype U is unusual in some imporshytant ways Let us compare it with the earlier example B of sect2 (Fig3) B is simply a binary form or weight vector b One obvious difrerence between B and U is that U is more verbose than B This redundancy aspect will be considered shortly The other important diJrerence between Band U is that alleles within B may weD interact (to express feature nonlinearity) but alleles Uj

within U Cllnnot interact (since the Uj express an absolute property or reature vector x ie its utilshyity for some task) As explained in the next secshytion this rreedom rrom gene interdependence permits localization or credit11

The detaned genotype U codes the utility surface which may be very irregular at worst or very smooth at best This surrace may be locally well behaved (it may vary slowly iD some volumes or reature space) [n cases or local regushylarity portions or U are redundant As shown in FigS PLS2 compresses the genotype U into the region set R (examined in 531 from the PLSI viewpoil1 t) [n PLS2 a sin gle region R == (r u e ) is a et or genes the whole having just one allele u (we disregard the genetic coding or e) Unlike standard genotypes which have a stationary locus ror each gene and a fixed Dumber or genes the regioD set has no explicit loci but rather a

11 While one of the strengths of a GA is its abilshyity to manage interaetion ot nriable (by middotcoshyadaptinc alleles) PLS2 achieves efficient and concise knowledre representation and acquisition by ftexible sene compression and by certain other mdhods examshyined later in this paper

6

variable number 01 elements (regions) eacb representinamp a variable number of lenes A relion compresses lene seta having similar utility according to current knowledge

4 KNOWLEDGE ACQUISITION SYNERGIC LEARNING ALGORITHMS

In this section we examine how R is used to proshyvide barely adequate information about the utilshyity surface This compact representation resulta in ecoDomy of both 8pace aDd time and in eftective -learning Some reasons for this power are considered

The ultimate purpose of PLS is to discover utility classes in the form 01 a region set R This knowledge structure controls the primary task lor example in heuristic search R R=II =II

(rue) defines a discrete evaluation lunction H(r) =- u

The ideal R would be perfectly accurate and maximally compre Accuracy of utility u determines the quaJitl of task performance Appropriate compression of R characterizes the task domain conciselJ but adequately (see Figs 3 4) saving storage and time both during task performance and during learning

These goals 01 accuracl and economl are approached bl the doublJ lalered learninl 8Ylshytem PLS2 Fig 1) PLSI and PLS2 combine to become eftective rules lor generalization (inducshytion) specializatioD (differentiation) and reorganshyization The two layers support each other in various ways for example PLS2 stabilizes the pershyceptual system PLSl and PLSI maintains genoshytype diversity of the genetic system PLS2 In the tollowing we consider details amprBt trom the standpoint of PLSl then Irom the perspective 01 PLS2

41 PLSI Revision and DlfIerentlatlon

Even without a genetic component PLSI is a ftexible learning system which can be employed in noisy domains requiring incremental learning It can be used lor simple concept learnin like the systems in [Di82J but most experiments have involved state space problem solvin and game playing12 Here we examine PLS in the context of

12 These experimenLs have led to unique reaulLs 8uch as discovery ot locally optimal evaluation runcshytions (see [Re S3amp Re 83dl)

these difficult tasks

Ns described in sect31 the main PLSI knowledge strucfare is the region set R == ( r I u e) Intermediate between basic data obtained duriD search and a general heuristic ased to control search R defines a feature space augmen ted and partitioned by u and e Because R is improved incrementally it is called the cumulatil1e reiDn d PLSI repeatedly perrorms two basic operationl on R One operation is correction or relinDn (of utility u and error e) and the other is specialization difterentiation or refinement (or leature space cells r) These operators are detailed in (Re83a Re83dJ here we simply outUne their eftecta and note their limshyitations

Revlalon 01 a and e For an established region R (rue) E R PLSI is able to modify u and to decrease e by using new data This is accomplished in a rough fashion by comparing established values within aU rectangles r with fresh values within the same r It is difficult or impossible to learn the -true- values of u since data are acquired during performance of hard tasks and these data are biased in unknown ways because of nontrivial search

Reflnemen or R Alternately perrorming then learning PLSI acquires more and more detail about the natare or variation or utility u with reatures This information accumulates in the region set R c R III ( r a e ) where the primary eftect 01 clustering a is increasing resolushytion of R The number sizes and shapes 01 recshytanles in R reflect current knowledge resolution As this differentiation continues in successive iterations of PLSl attention locuses on more useshyrul parts 01 feature space and heuristic power improves

Unfortunately so does the likelihood of error Further errors are difficult to quantily and hard to localbe to individual regions

In brier while the incremental learnin 01 PLSI is powerful enoagh to learn locally optimal heuristics under certain conditions and while PLSI feedback is good enough to control and correct mild errors the leedback can become unstable in ualayorable situations instead of beinl corrected errors can become more proshyDounced Moreover PLSI is sensitive to parameshyter settings (see Appendix B) The system needs support

42 PLS2 Genetic Operatou

Qualities missing in PLSI can be provided by PLS2 M bull1 concluded PLSl with its single region set cannot discover accurate values of utilities u PLS2 however maintains an entire population of region seta which means that several regions in aU cover any given feature space volume The availability of comparable regions ultimately permits greater accuracy in u and brings other benefits

As sect32 explained a PLS2 genotype is the region set R = R and each region R = (rue) ( R is a compressed gene whose aDele is the utility u Details of an early version or PLS2 are given in (Re83cl Those algorithms have been improved the time complexity of the operashytors in recent program implementations is linear with population sile K The rollowing discussion outlines the overall eJlects and properties or the various genetic operators (compare to the more usual GA of sect2)

K-aexual mating is the operator analoshygous to crossover Consider a population R of K difterent region sets R Each set is composed of a number of regions R which together cover feature space A new region set Rt is formed by selecting individual regions (one at a time) from parents R with probability proportional to merit (merit is the performance of R defined at the end of sect2) Selection of regions rrom the whole population of region sets continues until the feature space cover is approximately the average cover or the parents This creates the oftspring region set Rt which is generally not a partition

Gene reorganization For economy of storale and time the olfsprinl region set Rt is repartitioned so that regions do not overlap in feature space

Controlled mutation Standard mutashytion operators alter an allele randomly III conshytrast the PLS2 operator analogous to mutation changes aD allele accordinl to evidence arising in the task domaia The controlled mutation operator ror a region set R = ( r Ii e) is the utility revision operator of PLSI M described in sect4I PLSI modifies the utility u for each feature space ceO r

Genotype expansion This operator is also provided by PLSI Recall the discussion of sect32 about the economy resultinl from compressshying genes (utility-feature vectors) into a region

set R The rerinement operator was described in sect41 This feature space refiDement amounta to an expansion or the lenotype R and is carried out when data warrant the discrimination

Both controlled mutation and genotype expansion promote genotype diversity Thus PLSI helps PLS2 to avoid premature convergence a typical GA problem [BrSl MaSI

43 Efteettveneaa and Effielency

The power of PLS2 may be traced to cershytain aspects of the perceptual and genetic algoshyrithms just outlined Some existing and emergshying principles or eftective and efficient learning are briefly discussed below (see also (ReSSa ReSSb ReSSel)

Credit loeaIlzatlon The selection of regions for Kmiddotsexual mating may use a single merit value for each region R within a given set R However the value of can just as well be localized to single regions within R t by comshyparing R with similar regions in other seta Since regions estimate an absolute quantity (taskshyrelated utility) in their own volume or feature space they are independent or each other Thus credit and blame may be assigned to feature space cells (ie to gene sequences)

Assignment or credit to individual regions within a cumulative set R is straightforward but it would be difticult to do directly in the final evaluation flluction H since the components of H whiJe appropriate ror performance omit inforshymation relevant to learning (compare Figs 2 )bull3

Knowledge mediation successrul sysshytems tend to employ inrormation structures which mediate data objects and the ultimate knowledge form These mediating structures include means to record growing assurance or tentative hypotheses

Whell used in heuristic sea~ch the PLS region set mediates large numbers of states and a

13 There are various pOSIIibilities lor the evalu tion function R but all contain len useful inlormation than their determinant the resion Bet R The simshyplest heuristic used in (Re 83a Re 83dJ is H - bl where b is a vector of weisbts for the feature vector f (This linear combination is used exclusively in experishyments to be described in 55) The value of b is found usinr regions as data in linear regression (Re 83a Re83bJ

7

very concise evaluation runction H Retention and continual improvemen or this mediatin structure relines the credit assignment problem This view is unlike that or (Oi81p14 Oi821 learning systems orten attempt to improve the control structure itself whereas PLS acquires knowledge efficiently in an appropriate structure and utilizes this knowledge by compressing it only temporarily ror performance In other words PLS does not directly search rule space for a good H but rather searches for good cumul tive regions rrom whicb H is constructed

Full but controlled use of ever) datum Samuels checker player permitted each state encountered to inBuence the beuristic H and at tbe same time no one datum could overwhelm the system The learning was stochastic botb conservative and economic In this respect PLS2 is similar (although more automated)

Schemata In learning systems and genetic algorithms A related efficiency in both Samuels systems and PLS is like tbe scheshymata concept in a GA In a GA a single indivishydual coded as a genotype (a string of digits) supports not only itselr but also aU its subshystrings Similarly a single state arising in heurisshytic search contains information about every feature used to describe it Thus each state can be used to appraise and weight each reature (The effect is more pronounced when a state is described in more elementary terms and combishynations or primitive descriptors are assessed -see (ReSSel)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re83a].

The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e., the performance of a whole region set, without localization of credit to individual regions. This measure μ was the average number of nodes developed D over a training sample of 8 fifteen puzzles, divided into the mean of all such averages in a population of K sets; i.e., μ = D*/D, where D* is the average over the population (D* = Σ_j D_j / K).
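In code, this global merit is just the arithmetic below (names are illustrative; the numbers in the comment are a made-up example, not experimental data):

    def global_merits(avg_nodes):
        """avg_nodes[j] = D_j, set j's mean nodes developed over its
        training sample of eight fifteen-puzzles; merit is the population
        mean divided by the set's own average, so cheaper searches score
        higher."""
        d_star = sum(avg_nodes) / len(avg_nodes)
        return [d_star / d for d in avg_nodes]

    # e.g. global_merits([500, 400, 1000]) -> approx. [1.27, 1.58, 0.63]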

Changing environment. For these experiments, successive rounds of training were repeated in incremental learning over several iterations or generations. The environment was altered in successive generations: it was specified as problem difficulty or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ...).

Here d was static, one known to be a good progression based on previous experience with user training [Co84]. In these experiments d was always (8, 14, 22, 50, ∞, ∞, ...). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.
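A hedged sketch of such training-problem generation, assuming a caller-supplied neighbors() move generator for the fifteen puzzle; representing the fully random case by a long random walk is an approximation made by this sketch.

    import math, random

    def scramble(goal, moves, neighbors):
        """Random walk `moves` steps back from the goal; the resulting
        problem has solution depth at most `moves`."""
        state = goal
        for _ in range(moves):
            state = random.choice(neighbors(state))
        return state

    def training_sample(d, generation, goal, neighbors, size=8, long_walk=200):
        """Produce `size` problems at difficulty d[generation]; math.inf
        stands for the fully random case."""
        depth = d[min(generation, len(d) - 1)]
        walk = long_walk if math.isinf(depth) else depth
        return [scramble(goal, walk, neighbors) for _ in range(size)]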

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1. In this case static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lack is detected and problem difficulty is reduced (i.e., dynamic training is needed).

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning is generated by the system. Each node can be considered a separate example for concept learning [Re83d]. [Co84] describes experiments with an automated trainer.

Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their μ's are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures but the probability of finding solutions is only 1/2, then the average cost of useful information is 500/(1/2) = 1000.
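These two quantities are simple to check numerically; the following merely reproduces the arithmetic of the argument, not system output:

    def catastrophe_probability(q, k):
        """All K parallel PLS1 runs fail independently."""
        return q ** k

    def expected_cost(nodes, p_success):
        """Expected nodes per useful result, D/p."""
        return nodes / p_success

    print(catastrophe_probability(0.5, 7))  # 0.0078125 < 0.01
    print(expected_cost(500, 0.5))          # 1000.0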

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?

5.3 Results

Table I shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that, of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e., required no more than 386 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e., D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table I. Success proportions, performance, and costs as a function of population size K (rows: K = 1 through 19). Column groups: population size K; proportion of H's within 100%, 50%, 25%, and 10% of optimal D; average, best, and population performance (D, D_b, D_p); expected costs of accuracy within 100% and within 10% of optimum, in thousands of nodes developed; and expected error in D_p at constant cost. The individual entries are illegible in this copy.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table I. The sixth column gives the average D for an H in the sample (of 38). The seventh column gives D_b for the best H_b in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br81]. The eighth column, however, is unusual: it indicates the population performance D_p, resulting when all regions from every set in the population are used together in a regression to determine H_p. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re83c]; cf. [Br81, Ho75]).

Several trends are apparent in Table I. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, the average, best, and population performance measures D, D_b, and D_p (columns 6-8) also improve with K. Perhaps most important is that the population performance D_p is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for H_p) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as H_p (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re83b].
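For illustration, H_p can be obtained by pooling every region from every set and running one regression, reusing the fit_weights sketch given with footnote 13; this composition is an assumption of the sketch, not the reported procedure.

    def population_heuristic(population):
        """Pool all regions from all K sets and regress once (column 8)."""
        pooled = [region for region_set in population for region in region_set]
        return fit_weights(pooled)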

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance D_p relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and standard optimization.

In [Re81] PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·f. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here number of nodes developed D) for various values of b. RSF utilizes just a single quantity (i.e., D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa63, Sa67], uncovers knowledge about every feature from every node (see §4.3). PLS1, then, might be expected to be more efficient than RSF. Experiments verify this [Re81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as √N, like RSF.

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While it is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4) and should exhibit accuracy improvement better than √N. Table I suggests this.

PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones. Accurate cumulative sets have a longer lifetime.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re83a, Re83d]. Given a set of user-defined features and some measure of the utility (e.g., probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.3). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently.

From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement using piecewise linearity with little additional cost [Re83b]. PLS2 is more robust than PLS1. While the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy all at once. PLS2 buffers inadequacies inherent in PLS1. Moreover, PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re83c]. This is very costly with RSF and seems infeasible with PLS1 alone.

Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re85a], but PLS2 does even better. Given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re83b].


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, ability to discover global optima, and efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g., credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g., full but controlled use of each datum).

REFERENCES

[An73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br81] Brindle, A., Genetic Algorithms for Function Optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.

[Co84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Transactions on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.

[Di81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-891, 1981.

[Di82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-913, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R. and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.

[Ho75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int. Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.

[Ho83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the Bacon system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology (as yet unpublished manuscript), Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S. et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S. et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20 (1983), 369-392.

[Re83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa67] Samuel, A.L., Some studies in machine learning using the game of checkers. II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm80] Smith, S.F., A Learning System Based on Genetic Adaptive Algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A: GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An73, p.194] suggests "external criteria"). External criteria in [Mic83b, Re76, Re83a, Re85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic83b]). Criteria in [Re76, Re83a, Re85b] are based on the data environment (see "utility" below). This paper uses clustering to create economical compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g., "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual of a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu78, Di81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g., PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re83d] and discovering a useful subset of features (cf. [Sa63]). In this paper the control structure H is a linear evaluation function [Ni80], and the rules are feature weights for H.

15. A new learning system [Re85c] introduces higher-dimensional clustering for creation of structure.


Search for accurate regions replaces direct search of rule space, i.e., regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.

Merit. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called single concepts [Di82], but PLS is capable of much more difficult tasks involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below) using various inductive operations described in §4.

Region or cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 and 4.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §1 and §3.

APPENDIX B: PLS1 LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations. PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e). PLS1 also incrementally refines the feature space as rectangles r are continually split.

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re83a]. System parameters influence estimates of the error e and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table I). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited. There is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)




Inductton Induction or eneralization learnin is an important means ror knowledge acquisition Information is actually created as data are compressed into c48fe or cateories in order to predict future events efficiently and effectively Induction may create reature space neighborhoods or clusterl See -clustering and sect41

Learning System Buchanan et al present a general model which distinguishes comshyponents of a learning system [Bu 78) The perforshymance element PE is guided by a control trucshyture H Based on observation or the PE the crishytic assesses H possibly localizing credit to parts or H (Bu 7S Di SI] The learning element LE uses thia inrormatioD to improve H for the next round of task perrormance Lavered systems have multiple PEs critics and LEs (eg PLS2 uses PLSJ as its PE -see FilI) Just as a PE searches for its goal in problem space the LE searches in rule pace DiS2) for an optimal H to control the PE-

To facilitate this higher goal PLS2 uses an intermediate knowledge structure which divides feature space into reion relating feature values to object utiitll (Re83d] and discovering a userul subset or features (ef (Sa63l) In this paper the control structure H is a linear evaluation funcshy

15 A new learnin~ system IRe SSc) introduces highermiddotdimensional clusterinr ror creation or structure

13

tion INiSOI and the rules are feature weights for H Search for accurate regions replaces direct search of rule space ie regions mediate data and H AB explained in sect3 sets of regions middotbecome compressed GA genotypes See also genetic algorithms middotPLS bull region and Fig I

Merit Also called paloff or fitness this is the measure used by a genetic algorithm to select parent genotypes lor prelerential reproducshytion of succesduJ individuals Compare middotutilitymiddot also see genetic algorithms

Object Objects are any data to be genshyeralized into categories Relationships usually depend on task domain See utility

PLS The pro6GiiliBtic learnin BIstem can learn what are sometimes called single conshycepts IDi82] but PLS is capable of much more difficult tasks involving noise management incremental learning and normaliution 01 biased data PLSt uniquely discovered locally optimal heuristics in search IRe 83al and PLS2 is the effective and efficient extension examined in this paper PLS manipulates middotregions (see below) using various inductive operations described in sect4

RegIon or Cell Depending on ones viewpoint the region is PLSs basic structure for clustering or for the genetic algorithm The region is a compressed representation 01 a utility sur race in augmented feature space it is also a compressed genotype representing a utility luncshytion to be optimized AB explained in IRe83d) the region representation is fully expressive proshyviding the features are See sect3 and Figs3amp4

Utll1ty u This is any measure of the useshylulness 01 an object in the perlormance or some task The utilitl provides a lid between the task domain and PLS generalization algorithms Utility can be a probability as in Fig 2 Comshypare merit See sect I 3

APPENDIX B PLSI LIMITATIONS

PLSI alone is inherently limited The problems relate to modi6cation of the main knowledge structure the cumulative region set R = (rue) As mentioned in sect41 R undergoes two basic alterations PLSI gradually changes the meanin of an established feature space recshytangle r by updating ita associated utility u (along with us error e) PLSI also incrementally

refines tbe feature space as rectangles r are conshytinually split

Both or these modifications (utility revision and region refinement) are largely directed by search data but the degree to which newer inrorshymation affects R depends on various choices or system parameters IRe 83al System parameters influence estimates of the error e and determine the degree of region re6nement These in turn affect the relative importance of new versus estashyblished knowledge

Consequently values or these parameters influence task performance For example there is a tradeol between utility revision and region refinement It rrampions are refined too quickly accuracy sulers (this is theoretically predictable) It instead utility revision predominates regions become inert (their estimated errors decline) but sometimes incorrectly18

There are several other problems including difficulties in training limitation in the utility revision algorithm and inaccurate estimation of various errors As a result utility estimations are imperfect and biased in unknown ways

Together the above uncertainties and senshysitivities explain the failure of PLSI always to locate an optimum with static training (Table f) The net effect is that PLSI works rairly well with no parameter tuning and unsophisticated trainshying and close to optimal with mild tuning and informed training (CoS4 as long as the features are well behaved

By nature however PLSI requires reatures exhibiting no worse than mild interactions This is a serious restriction since feature nonlinearity is prevalent On its own then PLSt is inherently limited There is simply no wall to learn utility accurately unless the elects of dilering heuristic runctions H are compared as in PLS2

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development implementashytion and testing of this system I appreciate the helpful suggestions from Chris Matheus Mike Mauldin and the Conference Reviewers

1amp Althourh lI7stem parameters are given by domain-independent statistieal analysis tunine these parameters nevertheless improves performanee in some eases (This is not required in PLS2)

14

4.2 PLS2 Genetic Operators

Qualities missing in PLS1 can be provided by PLS2. As §4.1 concluded, PLS1, with its single region set, cannot discover accurate values of utilities u. PLS2, however, maintains an entire population of region sets, which means that several regions in all cover any given feature space volume. The availability of comparable regions ultimately permits greater accuracy in u, and brings other benefits.

As §3.2 explained, a PLS2 genotype is the region set R = {(r, u, e)}, and each region (r, u, e) ∈ R is a compressed gene whose allele is the utility u. Details of an early version of PLS2 are given in [Re83c]. Those algorithms have been improved; the time complexity of the operators in recent program implementations is linear with population size K. The following discussion outlines the overall effects and properties of the various genetic operators (compare the more usual GA of §2).
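For concreteness, the structures just named can be sketched in a few lines of Python. The triple (r, u, e) follows Appendix B, while the names Region, RegionSet and contains are illustrative only, not taken from the PLS implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """A compressed gene: feature space rectangle r, utility u (the allele), error e."""
    r: List[Tuple[float, float]]   # per-feature (low, high) bounds
    u: float                       # estimated utility of objects falling in r
    e: float                       # estimated error of u

# A genotype is a region set: a collection of regions covering feature space.
RegionSet = List[Region]

def contains(region: Region, f: List[float]) -> bool:
    """True if feature vector f falls inside the region's rectangle."""
    return all(lo <= fi <= hi for (lo, hi), fi in zip(region.r, f))
```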

K-sexual mating is the operator analogous to crossover. Consider a population of K different region sets. Each set is composed of a number of regions which together cover feature space. A new region set R′ is formed by selecting individual regions (one at a time) from parents, with probability proportional to merit μ (merit is the performance of a region set, defined at the end of §2). Selection of regions from the whole population of region sets continues until the feature space cover is approximately the average cover of the parents. This creates the offspring region set R′, which is generally not a partition.
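A sketch of this operator, under two simplifying assumptions: the cover of a set is measured as the total volume of its rectangles, and each parent carries a single global merit, as in the experiments of §5.1. It builds on the Region structures sketched above.

```python
import random
from typing import List

def volume(region: Region) -> float:
    """Volume of a region's feature space rectangle."""
    v = 1.0
    for lo, hi in region.r:
        v *= hi - lo
    return v

def cover(region_set: RegionSet) -> float:
    """Total feature space volume covered (overlapping volume counted twice)."""
    return sum(volume(R) for R in region_set)

def k_sexual_mating(population: List[RegionSet], merits: List[float]) -> RegionSet:
    """Form an offspring set by drawing regions one at a time from all K
    parents, with probability proportional to parental merit, until the
    offspring's cover approximates the parents' average cover."""
    target = sum(cover(rs) for rs in population) / len(population)
    offspring: RegionSet = []
    while cover(offspring) < target:
        parent = random.choices(population, weights=merits, k=1)[0]
        offspring.append(random.choice(parent))
    return offspring   # generally not a partition; see gene reorganization
```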

Gene reorganization. For economy of storage and time, the offspring region set R′ is repartitioned, so that regions do not overlap in feature space.

Controlled mutation. Standard mutation operators alter an allele randomly. In contrast, the PLS2 operator analogous to mutation changes an allele according to evidence arising in the task domain. The controlled mutation operator for a region set R = {(r, u, e)} is the utility revision operator of PLS1, as described in §4.1: PLS1 modifies the utility u for each feature space cell r.
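PLS1's actual revision formula, with its error statistics, is given in [Re83a]; the incremental update below is only a stand-in showing the shape of the operator, and the step size is an illustrative parameter.

```python
def revise_utility(region: Region, observed_u: float, weight: float = 0.1) -> None:
    """Controlled mutation: move the allele u toward evidence from the task
    domain, and let the estimated error decline as evidence accumulates.
    (Illustrative update only; PLS1's own formula appears in [Re83a].)"""
    region.u += weight * (observed_u - region.u)
    region.e *= 1.0 - weight / 2.0
```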

Genotype expansion. This operator is also provided by PLS1. Recall the discussion of §3.2 about the economy resulting from compressing genes (utility-feature vectors) into a region set R. The refinement operator was described in §4.1; this feature space refinement amounts to an expansion of the genotype R, and is carried out when data warrant the discrimination.

Both controlled mutation and genotype expansion promote genotype diversity. Thus PLS1 helps PLS2 to avoid premature convergence, a typical GA problem [Br81, Ma84].

4.3 Effectiveness and Efficiency

The power of PLS2 may be traced to certain aspects of the perceptual and genetic algorithms just outlined. Some existing and emerging principles of effective and efficient learning are briefly discussed below (see also [Re85a, Re85b, Re85c]).

Credit localization. The selection of regions for K-sexual mating may use a single merit value μ for all regions within a given set R. However, the value of μ can just as well be localized to single regions within R, by comparing them with similar regions in other sets. Since regions estimate an absolute quantity (task-related utility) in their own volume of feature space, they are independent of each other. Thus credit and blame may be assigned to feature space cells (i.e. to gene sequences).

Assignment of credit to individual regions within a cumulative set R is straightforward, but it would be difficult to do directly in the final evaluation function H, since the components of H, while appropriate for performance, omit information relevant to learning (compare Figs. 2 and 3).

Knowledge mediation. Successful systems tend to employ information structures which mediate data objects and the ultimate knowledge form. These mediating structures include means to record growing assurance of tentative hypotheses.

When used in heuristic search, the PLS region set mediates large numbers of states and a very concise evaluation function H.13 Retention and continual improvement of this mediating structure refines the credit assignment problem. This view is unlike that of [Di81, p.14; Di82]: learning systems often attempt to improve the control structure itself, whereas PLS acquires knowledge efficiently in an appropriate structure and utilizes this knowledge by compressing it, only temporarily, for performance. In other words, PLS does not directly search rule space for a good H, but rather searches for good cumulative regions from which H is constructed.

13. There are various possibilities for the evaluation function H, but all contain less useful information than their determinant, the region set R. The simplest heuristic, used in [Re83a, Re83d], is H = b·f, where b is a vector of weights for the feature vector f. (This linear combination is used exclusively in the experiments to be described in §5.) The value of b is found using regions as data in linear regression [Re83a, Re83b].
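Footnote 13 admits a compact sketch: the weights b can be fit by least squares, using regions as the regression data. Taking each rectangle's center as its feature vector, unweighted, is an assumption made here for brevity; [Re83a, Re83b] give the actual procedure. This builds on the Region structure sketched in §4.2.

```python
import numpy as np

def fit_heuristic_weights(regions: RegionSet) -> np.ndarray:
    """Fit b in H(f) = b . f by linear regression, one data point per region:
    the rectangle's center as predictor, its utility u as response."""
    X = np.array([[(lo + hi) / 2.0 for lo, hi in R.r] for R in regions])
    y = np.array([R.u for R in regions])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b
```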

Full but controlled use of every datum. Samuel's checker player permitted each state encountered to influence the heuristic H, and at the same time no one datum could overwhelm the system. The learning was stochastic, both conservative and economic. In this respect PLS2 is similar (although more automated).

Schemata in learning systems and genetic algorithms. A related efficiency in both Samuel's systems and PLS is like the schemata concept in a GA. In a GA, a single individual, coded as a genotype (a string of digits), supports not only itself but also all its substrings. Similarly, a single state arising in heuristic search contains information about every feature used to describe it. Thus each state can be used to appraise and weight each feature. (The effect is more pronounced when a state is described in more elementary terms and combinations of primitive descriptors are assessed; see [Re85c].)

5. EXPERIMENT AND ANALYSIS

PLS2 is designed to work in a changing environment of increasingly difficult problems. This section describes experimental evidence of effective and efficient learning.

5.1 Experimental Conditions

Tame features. The features used for these experiments were the four of [Re83a]. The relationship between utility and these features is fairly smooth, so the full capability of a GA is not tested, although the environment was dynamic.

Definition of merit. As §4 described, PLS2 chooses regions from successful cumulative sets and recombines them into improved sets. For the experiments reported here, the selection criterion was the global merit μ, i.e. the performance of a whole region set, without localization of credit to individual regions. This measure μ was computed from D̄, the average number of nodes developed in a training sample of 8 fifteen puzzles: D̄ was divided into the mean D̿ of all such averages in the population of K sets, i.e. μ = D̿/D̄, where D̿ = (Σ D̄)/K.
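In code, this definition of global merit is direct; a minimal sketch, assuming the per-set averages D̄ have already been measured on the training sample:

```python
from typing import List

def global_merits(avg_nodes_per_set: List[float]) -> List[float]:
    """Global merit of each region set: the population mean of the per-set
    average node counts, divided by the set's own average. Sets that solve
    training problems more cheaply than average receive merit above 1."""
    d_pop = sum(avg_nodes_per_set) / len(avg_nodes_per_set)
    return [d_pop / d for d in avg_nodes_per_set]
```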

Changing environment. For these experiments, successive rounds of training were repeated in incremental learning over several iterations or generations. The environment was altered in successive generations: it was specified as problem difficulty, or depth d (defined as the number of moves from the goal in sample problems). As a sequence of specifications of problem difficulty, this becomes a training difficulty vector d = (d1, d2, ..., dn).

Here d was static, one known to be a good progression based on previous experience with user training [Co84]. In these experiments d was always (8, 14, 22, 30, ∞, ...). An integer means random production of training problems subject to this difficulty constraint, while ∞ demands production of fully random training instances.

5.2 Discussion

Before we examine the experiments themselves, let us consider potential differences between PLS1 and PLS2 in terms of their effectiveness and efficiency. We also need a criterion for assessing differences between the two systems.

Vulnerability of PLS1. With population size K = 1, PLS2 degenerates to the simpler PLS1.14 In this case static training can result in utter failure, since the process is stochastic and various things can go wrong (see Appendix B). The worst is failure to solve any problems in some generation, and consequent absence of any new information. If the control structure H is this poor, it will not improve unless the lack is detected and problem difficulty is reduced (i.e. dynamic training is needed).

14. PLS and similar systems for problems and games are sometimes neither fully supervised nor fully unsupervised. The original PLS1 was intermediate in this respect: training problems were selected by a human, but from each training instance a multitude of individual nodes for learning are generated by the system. Each node can be considered a separate example for concept learning [Re83d]. [Co84] describes experiments with an automated trainer.

Even without this catastrophe, PLS1 performs with varying degrees of success, depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance, PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training, PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their merits μ are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures, but the probability of finding solutions is only ½, then the average cost of useful information is 500 / ½ = 1000.
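Both quantities, the catastrophe bound q^K and the expected cost D/p, can be checked numerically; a minimal sketch reproducing the paper's own examples:

```python
def catastrophe_bound(q: float, k: int) -> float:
    """Upper bound q**k on the probability that all k parallel PLS1 runs fail."""
    return q ** k

def expected_cost(nodes: float, p_success: float) -> float:
    """Expected cost of useful information: nodes developed, divided by the
    probability that the effort produces a viable control structure."""
    return nodes / p_success

assert catastrophe_bound(0.5, 7) < 0.01     # q = 1/2, K = 7
assert expected_cost(500, 0.5) == 1000.0    # D = 500 at p = 1/2
```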

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?

5.3 Results

Table 1 shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that, of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e. required no more than 385 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e. D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table 1. Performance and cost as a function of population size K (K = 1, 2, 4, 7, 12, 15, 19). Column groups: proportion of H's within 100%, 50%, 25% and 10% of optimal D; average, best and population performance (D̄, Db, Dp); expected costs of attaining 100% and 10% accuracy, in thousands of nodes developed; and expected error in population performance at constant cost. The individual entries are not legible in the available copy.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table 1. The sixth column gives the average D̄ for an H in the sample (of 38). The seventh column gives Db for the best Hb in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br81]. The eighth column, however, is unusual: it indicates the population performance Dp, resulting when all regions from every set in the population are used together in a regression to determine Hp. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re83c]; cf. [Br81, Ho75]).

Several trends are apparent in Table 1. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, the average, best, and population performance measures D̄, Db and Dp (columns 6-8) also improve with K. Perhaps most important is that the population performance Dp is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for Hp) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, an arbitrary H, is likely not as good as Hp (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance Dp relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and traditional optimization. In [Re81], PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons, we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First, let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·f. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here the number of nodes developed, D) for various values of b. RSF utilizes just a single quantity (i.e. D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa63, Sa67], uncovers knowledge about every feature from every node (see §4.3). PLS1, then, might be expected to be more efficient than RSF. Experiments verify this [Re81].

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as the square root of the total training effort KN, like RSF.

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While this is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4) and should exhibit accuracy improvement better than √N. Table 1 suggests this.

PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically, by comparing region sets and dismissing aberrant ones. Accurate cumulative sets have a longer lifetime.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

Inherent differences in capability. RSF, PLS1 and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement, using piecewise linearity, with little additional cost [Re83b]. PLS2 is more robust than PLS1: while the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy, all at once; PLS2 buffers inadequacies inherent in PLS1. Moreover PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re83c]. This is very costly with RSF and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re83a, Re83d]. Given a set of user-defined features and some measure of their utility (e.g. probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency, or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re85a], but PLS2 does even better: given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this population performance is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re83b].


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability (the ability to discover global optima), and the efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g. credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g. full but controlled use of each datum).

REFERENCES

[An73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be80] Bethke, A.D., Genetic algorithms as function optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.

[Co84] Coles, D., and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Transactions on Systems, Man and Cybernetics SMC-10 (1980), 566-574.

[Di81] Dietterich, T.G., and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-801, 1981.

[Di82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-813, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R., and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.

[Ho75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int. Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.

[Ho83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the BACON system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me85] Medin, D.L., and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology (as yet unpublished manuscript), Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic83b] Michalski, R.S., and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20 (1983), 369-392.

[Re83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re83c] Rendell, L.A., A doubly layered, genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A., and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm80] Smith, S.F., A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An73, p.194] suggests "external criteria"). External criteria in [Mic83b, Re76, Re83a, Re85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic83b]). Criteria in [Re76, Re83a, Re85b] are based on the data environment (see "utility" below).15 This paper uses clustering to create economical, compressed genetic structures (genotypes).

15. A new learning system [Re85c] introduces higher-dimensional clustering for creation of structure.

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g. "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual or a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e. a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.
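For contrast with the compressed genotypes of PLS2, here is a minimal sketch of the conventional GA just described, over bit-string genotypes; merit-proportional selection, one-point crossover, and the mutation rate are generic illustrative choices, not details of any system cited here:

```python
import random
from typing import Callable, List

def ga_generation(pop: List[List[int]],
                  merit: Callable[[List[int]], float],
                  p_mut: float = 0.01) -> List[List[int]]:
    """One generation: merit-proportional parent selection, one-point
    crossover, and random mutation of single genes (bits)."""
    weights = [merit(g) for g in pop]
    next_pop: List[List[int]] = []
    while len(next_pop) < len(pop):
        a, b = random.choices(pop, weights=weights, k=2)
        cut = random.randrange(1, len(a))                 # one-point crossover
        child = a[:cut] + b[cut:]
        child = [g ^ 1 if random.random() < p_mut else g  # flip bit with prob p_mut
                 for g in child]
        next_pop.append(child)
    return next_pop
```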

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories, in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning System. Buchanan et al. present a general model which distinguishes components of a learning system [Bu78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu78, Di81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g. PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re83d] and discovering a useful subset of features (cf. [Sa63]). In this paper the control structure H is a linear evaluation function [Ni80], and the rules are feature weights for H. Search for accurate regions replaces direct search of rule space, i.e. regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.

Merit. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on the task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called "single concepts" [Di82], but PLS is capable of much more difficult tasks, involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below) using various inductive operations described in §4.

Region or Cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 & 4.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §§1 and 3.

APPENDIX B PLS1 LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations. PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e). PLS1 also incrementally refines the feature space, as rectangles r are continually split.

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re83a]. System parameters influence estimates of the error e and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.16

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table 1). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited. There is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.




11

Additional refinements or PLS2 may further increase efficiency and power These include rewarding meritorious regions 10 they become immune to damage Future experiments will investigate nonlinear capability ability to disshycover lobal optima and efficiency and effectiveness of focalized credit auinment

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (eg credit localization) The paper has also considered some simple but little explored ideas for realizin these capabilishyties (eg tull but controlled use or each datum)

REFERENCES

(An 73) Ande~bert MR auter Aul IJr Applicamiddot tiIJn AcademicPress 1973

(Be 80J Bethke AD Genetic elgIJritlm bull bullbull uchu IJptimizer PhD Thesis Universitl of Michian 1980

[Br 81) Brindle A Genetic algorithms for function optimintion CS Department Report TR81middot2 (PhD Dissertation) Universitl of Alberta 1081

[Bu 78) Buchanan BG Johnson CR Mitchell TM and Smith RG Models or learnin systems in Belser J (Ed) Encdopclli 0 CIJmper Science 11 TcdnolIJfJ 11 (1978) 201

(Co 8) Coles D and Rendell LA Some issues in trainin learn in systems and an autonomoul desip PrIJc FitA Bicftaiel CIJerece cAe Cbullbullbullli SIJeiet IJ ComputetilJel Stli IJICeUiuce 1984

(De 80) DeJonl KA Adaptive system delip A enetic approach IEEE Trdibullbull a Stem Mbullbull nl C6enadiCl SMC-IO (1080) 566-S7(

(Di81) Dietterich TG and Buchanan BG The role of the critic in learnin systems Stanford Universitl Report STANmiddotC8-81middot801 1081

(Di82) Dietterich TG London B Clarkson K and Drome1 G Learninl and inductin inference STANmiddot C8-82-813 Stanford Universitl also Chapter XIV or Tlc HI6t1o tI Arlifidcl InteUlgenee Cohen PR and Feienbaum EA (Ed) Kaufmann 1082

(Ho 75] Holland JH AIIptti Nturel nll Arlificiel Stem Universitl or Michitan Press 1875

(H0801 Holland JH Adapive a1orithms ror discovshyerin and usin eneral patterns in powin knowlede bases IntI loumel Oil Polic AelbullbullbullIIormtiIJ Stem 4 2 (1080) 217middot240

(Ho 81) Holland JH Genetic a1prithms and adaptampshytion Proc NATO Ad Ret In Alpti CIJntrIJI 0

Ulefillel Sem 1081

(8083) Holland JH Eecapinl brit~lenelS Proc Sccod latemetael Mdae Lunai Workdop 1883 02-05

La83 Lanlley P Bradshaw GL and Simon HA Rediscoverinl chemistry with the Bacon system in Michalsk~ RS Carbonell JG and Mi~chell TM (Ed) MuMne Lunalg Aft Arliiciel Intelligente Appro Tiolamp 1983 307middot329

[Ma8 Mauldin ML Maintainin diversitl in renetic search Proc FnrlA NtiIJael CIJaereftce Oft Arlificicl InteUiece 10S4 247middot250 ~

(Me 851 Medin DL and Wattenmaker WD Catepl1 cohesiveness ~heoriel and copitive archeotshyOQ (as et unpublished manuscript) Dept of PsycholshyOQ Universit of Dlinois at Urbana Champaip 1985

[Mic 831o Michalsk~ RS A theory and methodology or inductive learninl Arlificiel laeUgece RO 2 (1883) 1I1middotlali reprinted in Michalski RS et aI (Ed) M4chine Lumi A Arlificiel InteUigcace Appr04cl Tioa 198383-134

[Mic 83b) Michalski RS an d Stepp RE Le arning rrom observation Conceptual clusteriD iD Michalski RS e~ al (Ed) Madiae Lunaia Aa Arlificial Intelmiddot ligence ApproA TioS 1083331middot383

(Mit 83) MitcheD TM Learnin and problem solvinr Proc EigAtA Ilterulibullbullcl loirampl CfJcrete on Arlificilll lteUieaee UI83 ll3G-lISI

(Ni80) Nilsson NJ Prieipl Arlijieiel Intellimiddot eee Tiolamp 1080

(ae 76) Rendell LA A method ror automatic eneramiddot tion of heuristics ror statemiddotspace problem Dept of Computer Science C8-76-10 Universit of Waterloo 1076

(ae 771 Rendell LA A locall optimal solution or the fifteen punle produced by an automatic evaluation runction senerator Dept or Computer Science C8-77middot 38 Universitl of Waterloo 1977

(ae 81] Rendell LA An adaptive plan for statemiddotspace problems Dept of Computer Science C8-81-13 (PhD thesis) Universit of Waterloo 1081

(ae 83aJ Rendell LA A new basis for state-space learDin systems and a successful implementation ArtajicielateUieee 10 (1883) ( 360-302

(Re 83bJ Rendell LA A learninl II)stem which accommodatet future interactions Prtle EiAtA Intermiddot eli1Il I Cbullbullcreaee Arliiciel Ifttdlilence 1883 6072

IRe 83c) RendeU LA A doubly layered sendic penetrance learnins stem Proc Tlir~ NotiouJ Ooeruce ()Il ArlificiJ l_teUigecc 1083 343-341

(ae 83dl Rendel~ LA Conceptual knowlede acquisishyion in Harch University or Guelph Report 015-83-15 Dept or Computins and Inrormation Science Guelph Canada 1083 (to appear in Bole L (ed) Kutllle~ge B4e4 Le4mi Stem SpringerVerias)

IRe SSa) Rendell LA Utility pattern u criteria for efficient seneraliution learninr Proc Jg8500_erec bullbull Ittlligerat Sytem lid M4diflu (to appear) lOSS

(ae SSbl Rendell LA A scientific approach to applied induction Proc 1985 International Machine Learninr Workshop Rutrers University (to appear) 1085

(ae SSc) Rendell LA Substantial constructive inducshytion ulinS layered information compression Tractable leature formation in search Proc Hitlt ltcm4tiobull J Joit Ooerucc o Artificial Intelligence (to appear) 1985

(Sa63) Samuel AL Some Itudies in machine learnshyins usin the 3me of checkers in Feigenbaum EA and Feldman J (Ed) Oomputer 4114 Thugll McGraw-Hili 1963 1H05

(Sa 61) Samue~ ALbull Some studie in machine learnshyinS usinr the same of checkers U-recent progress IBM J Ru IUI4 Duelop 11 (1961) 601-617

(Sm 80) Smitb SF A learnins stem based on lenetic adaptive alrorithms PhD Dissertation Univershysity of Pittsburgh 1980

[Sm 831 Smith SF Flexible learninr of problem 1I01vshyiDe heuristics throurb adaptive Harch Proc EiglUl Iteruttoul Joit OOflerenc 0_ ArlificiJ I_tellishyuce 1983422middot425

APPENDIX A GLOSSARY OF TERMS

Clustering Cluter anailln has lonl been used as a tool ror induction in statiatics and patshytern recognition (An 73) (See middotinduction) Improvements to basic clustering techniques lenshyerally use more than just the reatures or a datum ([An 73 p194) suggests -external criteria) External criteria in [Mi83 Re76 Re83a Re85b] involve prior sped6cation of the forms clusters may take (this has been called middotconceptual clusshytering (Mi83l) Criteria in (Re76 Re83a Re85bl are based on the data enviroDment (see

-utilit) below)ls This paper uses clustering to create economical compressed genetic structures (enotIlPu)

Feature A feature is an attribute ormiddot proshyperty of an object Features are usually quite abstract (el -center control or -mobility) in a board game The utility (see below) varies smoothly with a feature

Genetic algorithm In a GA a the charshyacter of an individual or a population is called a phenotype The phenotype ia coded as a string of digits called the enotvpe A single digit is a ene Instead of searcbing rule space directly (compare -learning system) a GA searches gene space (ie a GA searches for good genes in the population or genotypes) This search uses the merit of individual genotypes selecting the more successrul individuals to undergo genetic operations ror the production of offspring See sect2 and Fig3

Inductton Induction or eneralization learnin is an important means ror knowledge acquisition Information is actually created as data are compressed into c48fe or cateories in order to predict future events efficiently and effectively Induction may create reature space neighborhoods or clusterl See -clustering and sect41

Learning System Buchanan et al present a general model which distinguishes comshyponents of a learning system [Bu 78) The perforshymance element PE is guided by a control trucshyture H Based on observation or the PE the crishytic assesses H possibly localizing credit to parts or H (Bu 7S Di SI] The learning element LE uses thia inrormatioD to improve H for the next round of task perrormance Lavered systems have multiple PEs critics and LEs (eg PLS2 uses PLSJ as its PE -see FilI) Just as a PE searches for its goal in problem space the LE searches in rule pace DiS2) for an optimal H to control the PE-

To facilitate this higher goal PLS2 uses an intermediate knowledge structure which divides feature space into reion relating feature values to object utiitll (Re83d] and discovering a userul subset or features (ef (Sa63l) In this paper the control structure H is a linear evaluation funcshy

15 A new learnin~ system IRe SSc) introduces highermiddotdimensional clusterinr ror creation or structure

13

tion INiSOI and the rules are feature weights for H Search for accurate regions replaces direct search of rule space ie regions mediate data and H AB explained in sect3 sets of regions middotbecome compressed GA genotypes See also genetic algorithms middotPLS bull region and Fig I

Merit Also called paloff or fitness this is the measure used by a genetic algorithm to select parent genotypes lor prelerential reproducshytion of succesduJ individuals Compare middotutilitymiddot also see genetic algorithms

Object Objects are any data to be genshyeralized into categories Relationships usually depend on task domain See utility

PLS The pro6GiiliBtic learnin BIstem can learn what are sometimes called single conshycepts IDi82] but PLS is capable of much more difficult tasks involving noise management incremental learning and normaliution 01 biased data PLSt uniquely discovered locally optimal heuristics in search IRe 83al and PLS2 is the effective and efficient extension examined in this paper PLS manipulates middotregions (see below) using various inductive operations described in sect4

RegIon or Cell Depending on ones viewpoint the region is PLSs basic structure for clustering or for the genetic algorithm The region is a compressed representation 01 a utility sur race in augmented feature space it is also a compressed genotype representing a utility luncshytion to be optimized AB explained in IRe83d) the region representation is fully expressive proshyviding the features are See sect3 and Figs3amp4

Utll1ty u This is any measure of the useshylulness 01 an object in the perlormance or some task The utilitl provides a lid between the task domain and PLS generalization algorithms Utility can be a probability as in Fig 2 Comshypare merit See sect I 3

APPENDIX B PLSI LIMITATIONS

PLSI alone is inherently limited The problems relate to modi6cation of the main knowledge structure the cumulative region set R = (rue) As mentioned in sect41 R undergoes two basic alterations PLSI gradually changes the meanin of an established feature space recshytangle r by updating ita associated utility u (along with us error e) PLSI also incrementally

refines tbe feature space as rectangles r are conshytinually split

Both or these modifications (utility revision and region refinement) are largely directed by search data but the degree to which newer inrorshymation affects R depends on various choices or system parameters IRe 83al System parameters influence estimates of the error e and determine the degree of region re6nement These in turn affect the relative importance of new versus estashyblished knowledge

Consequently values or these parameters influence task performance For example there is a tradeol between utility revision and region refinement It rrampions are refined too quickly accuracy sulers (this is theoretically predictable) It instead utility revision predominates regions become inert (their estimated errors decline) but sometimes incorrectly18

There are several other problems including difficulties in training limitation in the utility revision algorithm and inaccurate estimation of various errors As a result utility estimations are imperfect and biased in unknown ways

Together the above uncertainties and senshysitivities explain the failure of PLSI always to locate an optimum with static training (Table f) The net effect is that PLSI works rairly well with no parameter tuning and unsophisticated trainshying and close to optimal with mild tuning and informed training (CoS4 as long as the features are well behaved

By nature however PLSI requires reatures exhibiting no worse than mild interactions This is a serious restriction since feature nonlinearity is prevalent On its own then PLSt is inherently limited There is simply no wall to learn utility accurately unless the elects of dilering heuristic runctions H are compared as in PLS2

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development implementashytion and testing of this system I appreciate the helpful suggestions from Chris Matheus Mike Mauldin and the Conference Reviewers

1amp Althourh lI7stem parameters are given by domain-independent statistieal analysis tunine these parameters nevertheless improves performanee in some eases (This is not required in PLS2)

14

detected, and problem difficulty is reduced (i.e., dynamic training is needed).

Even without this catastrophe, PLS1 performs with varying degrees of success depending on the sophistication of its training and other factors (explained in Appendix B). With minimal human guidance PLS1 always achieves a good evaluation function H, although not always an optimal one. With static training, PLS1 succeeds reasonably about half the time.

Stability of PLS2. In contrast, one would expect PLS2 to have a much better success rate. Since PLS1 is here being run in parallel (Fig. 1), and since PLS2 should reject hopeless cases (their merits are small), a complete catastrophe (all H's failing) should occur with probability p ≤ q^K, where q is the probability of PLS1 failure and K is population size. If q is even as large as one half, but K is 7 or more, the probability p of catastrophe is less than 0.01.

Cost versus benefit: a measure. Failure plays a part in costs, so PLS2 may have an advantage. The ultimate criterion for system quality is cost effectiveness: is PLS2 worth its extra complexity? Since the main cost is in task performance (here problem solving), the number of nodes developed D to attain some performance is a good measure of the expense.

If training results in catastrophic failure, however, all effort is wasted, so a better measure is the expected cost D/p, where p is the probability of success. For example, if D = 500 for viable control structures but the probability of finding solutions is only one half, then the average cost of useful information is 500/(1/2) = 1000.

To extend this argument, probability p depends on what is considered a success. Is success the discovery of a perfect evaluation function H, or is performance satisfactory if D departs from optimal by no more than 25%?
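These two quantities are easy to compute; the following sketch (in Python, with purely illustrative values) reproduces the bound and the worked example above:

    # Catastrophe bound and expected cost, as defined in the text.

    def catastrophe_bound(q: float, k: int) -> float:
        # Probability that all K parallel PLS1 runs fail: p <= q**K.
        return q ** k

    def expected_cost(d: float, p: float) -> float:
        # Expected cost D/p of useful information when training may fail.
        return d / p

    # Even a pessimistic PLS1 failure rate q = 0.5 gives, for K = 7,
    # a catastrophe probability below 0.01:
    print(catastrophe_bound(0.5, 7))   # 0.0078125

    # The worked example: D = 500 nodes, success probability 1/2.
    print(expected_cost(500, 0.5))     # 1000.0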

5.3 Results

Table I shows performances and costs with various values of K. Here p is estimated using roughly 38 trials of PLS1 in a PLS2 context (if K = 1, 38 distinct runs; if K = 2, 19 runs; etc.). Since variances in D are high, performance tests were made over a random sample of 50 puzzles. This typically gives 95% confidence of D ± 40.

Accuracy of learning. Let us first compare results of PLS1 versus PLS2 for four different success criteria. We consider the learning to be successful if the resulting heuristic H approaches optimal quality within a given margin (of 100%, 50%, 25%, and 10%).

Columns two to five in the table (the second group) show the proportion of H's giving performance D within a specified percentage of the best known D (the best D is around 350 nodes for the four features used). For example, the last row of the table shows that of the 38 individual control structures H tested in (two different) populations of size 19, all 38 were within 100% of optimal D (column two). This means that all developed no more than 700 nodes before a solution was found. Similarly, column five in the last row shows that 0.21 of the 38 H's, or 8 of them, were within 10%, i.e., required no more than about 385 nodes developed.

Cost of accuracy. Columns ten and eleven (the two rightmost columns of the fourth group) show the estimated costs of achieving performance within 100% and within 10% of optimum, respectively. The values are based on the expected total number of nodes required (i.e., D/p), with adjustments in favor of PLS1 for extra PLS2 overhead. (The unit is one thousand nodes developed.) As K increases, the cost of a given accuracy first increases. Nevertheless, with just moderate K values the genetic system becomes cheaper, particularly for an accuracy of 10%.

[Table I. Performance and cost as a function of population size K (K ranging from 1 to 19): proportions of H's within 100%, 50%, 25%, and 10% of optimal D; average, best, and population performance (D, Db, Dp); estimated costs of accuracy within 100% and within 10% (in thousands of nodes developed); and expected error in Dp at constant cost.]

The expected cost benefit is not the only advantage of PLS2.

Average performance, best performance, and population performance. Consider now the third group of columns in Table I. The sixth column gives the average D for an H in the sample (of 38). The seventh column gives Db for the best Hb in the sample. These two measures, average and best performance, are often used in assessing genetic systems [Br 81]. The eighth column, however, is unusual: it indicates the population performance Dp resulting when all regions from every set in the population are used together in a regression to determine Hp. This is sensible because regions are independent and estimate the utility, an absolute quantity ([Re 83c]; cf. [Br 81, Ho 75]).
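The paper does not spell out this regression; a minimal sketch of the idea, assuming each region is represented by the midpoint of its rectangle together with its estimated utility, and using ordinary least squares to fit the feature weights b of Hp:

    import numpy as np

    def population_weights(region_sets):
        # Pool the regions of every cumulative set R in the population and
        # fit feature weights b for the linear heuristic H = b . f by
        # least squares on (region center, utility) pairs.
        centers = np.array([c for R in region_sets for (c, u) in R])
        utilities = np.array([u for R in region_sets for (c, u) in R])
        b, *_ = np.linalg.lstsq(centers, utilities, rcond=None)
        return b   # weights defining the population heuristic Hp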

Several trends are apparent in Table I. First, whether the criterion is 100%, 50%, 25%, or 10% of optimum (columns 2-5), the proportion of good H's increases steadily as population size K rises. Similarly, the average, best, and population performance measures D, Db, and Dp (columns 6-8) also improve with K. Perhaps most important is that the population performance Dp is so reliably close to best, even with these low K values. This means that the whole population of regions can be used (for Hp) without independent verification of performance. In contrast, individual H's would require additional testing to discover the best (column 7), and the other alternative, any H, is likely not as good as Hp (columns 6 and 8). Furthermore, the entire population of regions can become an accurate source of massive data for determining an evaluation function capturing feature interaction [Re 83b].

This accuracy advantage of PLS2 is illustrated in the final column of the table, where, for a constant cost, rough estimates are given of the expected error in population performance Dp relative to the optimal value.

It is interesting that such small populations improve performance markedly; usually population sizes are 50 or more.

6. EFFICIENCY AND CAPABILITY

Based on these empirical observations for PLS2, on other comparisons for PLS1, and on various conceptual differences, general properties of three competing methods can be compared: PLS1, PLS2, and standard optimization. In [Re 81] PLS1 was found considerably more efficient than standard optimization, and the suggestion was made that PLS1 made better use of available information. By studying such behaviors and underlying reasons we should eventually identify principles of efficient learning. Some aspects are considered below.

Traditional optimization versus PLS1. First let us consider efficiency of search for an optimal weight vector b in the evaluation function H = b·f. One good optimization method is response surface fitting (RSF). It can discover a local optimum in weight space by measuring and regressing the response (here the number of nodes developed, D) for various values of b. RSF utilizes just a single quantity (i.e., D) for every problem solved. This seems like a small amount of information to extract from an entire search, since a typical one may develop hundreds or thousands of nodes, each possibly containing relevant information. In contrast to this traditional statistical approach, PLS1, like [Sa 63, Sa 67], uncovers knowledge about every feature from every node (see §4.3). PLS1 then might be expected to be more efficient than RSF. Experiments verify this [Re 81].
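For concreteness, here is a sketch of one RSF step under common simplifying assumptions (a separable quadratic response model; this is a standard choice, not a detail taken from [Re 81]):

    import numpy as np

    def rsf_step(weight_samples, responses):
        # Fit a quadratic response surface D(b) to observed (b, D) pairs,
        # using one scalar response D per problem solved, and return the
        # stationary point of the fit as the next weight-vector estimate.
        B = np.asarray(weight_samples)   # shape (m, n): sampled weight vectors
        D = np.asarray(responses)        # shape (m,): nodes developed per run
        n = B.shape[1]
        X = np.hstack([np.ones((len(B), 1)), B, B ** 2])  # 1, linear, square terms
        coef, *_ = np.linalg.lstsq(X, D, rcond=None)
        lin, quad = coef[1:n + 1], coef[n + 1:]
        # Minimum of sum_i (quad_i b_i^2 + lin_i b_i) at b_i = -lin_i / (2 quad_i).
        return -lin / (2 * quad)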

Traditional optimization versus PLS2. As shown in §5, PLS2 is more efficient still. We can compare it, too, with RSF. The accuracy of RSF is known to improve with √N, where N is the number of data (here the number of problems solved). As a first approximation, a parallel method like PLS2 should also cause accuracy to increase with the square root of the number of data, although the data are now regions instead of D values. If roughly the same number of regions is present in each individual set R of a population of size K, accuracy must therefore improve as √K. Since each of these K structures requires N problems in training, the accuracy of PLS2 should increase as the square root of the total number of problems solved, like RSF.
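In symbols, this first-approximation scaling argument (a rough sketch, not a derivation from the paper) is

    \[
      \sigma_{\mathrm{RSF}} \propto \frac{1}{\sqrt{N}}, \qquad
      \sigma_{\mathrm{PLS2}} \propto \frac{\sigma_{R}}{\sqrt{K}}
                             \propto \frac{1}{\sqrt{KN}},
    \]

where σ_R ∝ 1/√N is the error of a single cumulative set trained on N problems, so the error again falls as the square root of the total number of training problems, KN.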

Obviously, though, PLS2 involves much more than blind parallelism: a genetic algorithm extracts accurate knowledge and dismisses incorrect (unfit) information. While it is impossible for PLS1 alone, PLS2 can refine merit by localizing credit to individual regions [Re 83c]. Planned experiments with this should show further increases in efficiency, since the additional cost is small. Another inexpensive improvement will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4) and should exhibit accuracy improvement better than √N. Table I suggests this.

PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically by comparing region sets and dismissing aberrant ones: accurate cumulative sets have a longer lifetime.

This ability to discriminate merit and retain successful data will likely be accentuated with the localization of credit to individual regions (see §4.2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, it will be more immune to damage. A fit region will have a very long lifespan.

Inherent differences in capability. RSF, PLS1 and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically). PLS1, on the other hand, can provide some performance improvement, using piecewise linearity with little additional cost [Re 83b]. PLS2 is more robust than PLS1: while the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy all at once; PLS2 buffers inadequacies inherent in PLS1. Moreover PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re 83c]. This is very costly with RSF and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re 83a, Re 83d]. Given a set of user-defined features and some measure of the utility (e.g., probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure: the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re 83a]. PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.

These improvements may be viewed as effecting greater efficiency or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re 85a], but PLS2 does even better: given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable, and can use the whole population of regions reliably to create a highly informed heuristic (this "population performance" is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re 83b].

Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so they become immune to damage. Future experiments will investigate nonlinear capability, the ability to discover global optima, and the efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve efficiency and effectiveness of learning (e.g., credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g., full but controlled use of each datum).

REFERENCES

[An 73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be 80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br 81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu 78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 201.

[Co 84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De 80] DeJong, K.A., Adaptive system design: a genetic approach, IEEE Transactions on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.

[Di 81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-801, 1981.

[Di 82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, STAN-CS-82-813, Stanford University; also Chapter XIV of The Handbook of Artificial Intelligence, Cohen, P.R. and Feigenbaum, E.A. (Eds.), Kaufmann, 1982.

[Ho 75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho 80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, Int. Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho 81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Adv. Res. Inst. on Adaptive Control of Ill-Defined Systems, 1981.

[Ho 83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La 83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the Bacon system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma 84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me 85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology, unpublished manuscript, Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic 83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic 83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S., et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit 83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni 80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re 76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science CS-76-10, University of Waterloo, 1976.

[Re 77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science CS-77-38, University of Waterloo, 1977.

[Re 81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re 83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.

[Re 83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re 83c] Rendell, L.A., A doubly layered, genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re 83d] Rendell, L.A., Conceptual knowledge acquisition in search, University of Guelph Report CIS-83-15, Dept. of Computing and Information Science, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re 85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re 85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re 85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa 63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa 67] Samuel, A.L., Some studies in machine learning using the game of checkers II: recent progress, IBM J. Res. and Develop. 11 (1967), 601-617.

[Sm 80] Smith, S.F., A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm 83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An 73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An 73, p.194] suggests "external criteria"). External criteria in [Mic 83b, Re 76, Re 83a, Re 85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic 83b]). Criteria in [Re 76, Re 83a, Re 85b] are based on the data environment (see "utility" below).15 This paper uses clustering to create economical, compressed genetic structures (genotypes).

15. A new learning system [Re 85c] introduces higher-dimensional clustering for creation of structure.

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g., "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual or a population is called a phenotype. The phenotype is coded as a string of digits called the genotype. A single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.
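A minimal generational GA, for concreteness (an illustrative sketch only; PLS2's operators, described in the body of the paper, are richer than the plain mutation and crossover assumed here):

    import random

    def ga_generation(population, merit, crossover, mutate):
        # One generation: select the fitter half as parents, then produce
        # offspring by crossover and mutation until the population is refilled.
        parents = sorted(population, key=merit, reverse=True)[:len(population) // 2]
        offspring = []
        while len(offspring) < len(population):
            a, b = random.sample(parents, 2)
            offspring.append(mutate(crossover(a, b)))
        return offspring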

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning System. Buchanan et al. present a general model which distinguishes components of a learning system [Bu 78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu 78, Di 81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g., PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di 82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions, relating feature values to object utility [Re 83d] and discovering a useful subset of features (cf. [Sa 63]). In this paper the control structure H is a linear evaluation function [Ni 80], and the rules are feature weights for H. Search for accurate regions replaces direct search of rule space, i.e., regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.
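Such a linear H is trivial to express; a one-line sketch (the feature extractor f is assumed supplied by the task domain):

    def H(state, b, f):
        # Linear evaluation function H(state) = b . f(state): b holds the
        # learned feature weights, f maps a state to its feature values.
        return sum(w * x for w, x in zip(b, f(state)))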

Merit. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called "single concepts" [Di 82], but PLS is capable of much more difficult tasks involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re 83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below) using various inductive operations described in §4.

Region or Cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re 83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 & 4.

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §1.3.

APPENDIX B PLS1 LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations: PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e); PLS1 also incrementally refines the feature space as rectangles r are continually split.
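In outline (a sketch under assumed representations; the actual revision weights and splitting criteria are those of [Re 83a], not shown here), the two alterations look like this:

    from dataclasses import dataclass

    @dataclass
    class Region:
        lower: tuple   # lower corner of the feature-space rectangle r
        upper: tuple   # upper corner of r
        u: float       # estimated utility of objects falling in r
        e: float       # estimated error of u

    def revise_utility(region, u_new, weight):
        # Utility revision: move u toward new evidence, shrinking its error.
        region.u += weight * (u_new - region.u)
        region.e *= 1.0 - weight

    def refine(region, dim, at):
        # Region refinement: split the rectangle r along feature `dim` at `at`.
        upper_left = list(region.upper); upper_left[dim] = at
        lower_right = list(region.lower); lower_right[dim] = at
        return (Region(region.lower, tuple(upper_left), region.u, region.e),
                Region(tuple(lower_right), region.upper, region.u, region.e))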

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re 83a]. System parameters influence estimates of the error e and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.16

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table I). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co 84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited: there is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)


Utll1ty u This is any measure of the useshylulness 01 an object in the perlormance or some task The utilitl provides a lid between the task domain and PLS generalization algorithms Utility can be a probability as in Fig 2 Comshypare merit See sect I 3

APPENDIX B PLSI LIMITATIONS

PLSI alone is inherently limited The problems relate to modi6cation of the main knowledge structure the cumulative region set R = (rue) As mentioned in sect41 R undergoes two basic alterations PLSI gradually changes the meanin of an established feature space recshytangle r by updating ita associated utility u (along with us error e) PLSI also incrementally

refines tbe feature space as rectangles r are conshytinually split

Both or these modifications (utility revision and region refinement) are largely directed by search data but the degree to which newer inrorshymation affects R depends on various choices or system parameters IRe 83al System parameters influence estimates of the error e and determine the degree of region re6nement These in turn affect the relative importance of new versus estashyblished knowledge

Consequently values or these parameters influence task performance For example there is a tradeol between utility revision and region refinement It rrampions are refined too quickly accuracy sulers (this is theoretically predictable) It instead utility revision predominates regions become inert (their estimated errors decline) but sometimes incorrectly18

There are several other problems including difficulties in training limitation in the utility revision algorithm and inaccurate estimation of various errors As a result utility estimations are imperfect and biased in unknown ways

Together the above uncertainties and senshysitivities explain the failure of PLSI always to locate an optimum with static training (Table f) The net effect is that PLSI works rairly well with no parameter tuning and unsophisticated trainshying and close to optimal with mild tuning and informed training (CoS4 as long as the features are well behaved

By nature however PLSI requires reatures exhibiting no worse than mild interactions This is a serious restriction since feature nonlinearity is prevalent On its own then PLSt is inherently limited There is simply no wall to learn utility accurately unless the elects of dilering heuristic runctions H are compared as in PLS2

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development implementashytion and testing of this system I appreciate the helpful suggestions from Chris Matheus Mike Mauldin and the Conference Reviewers

1amp Althourh lI7stem parameters are given by domain-independent statistieal analysis tunine these parameters nevertheless improves performanee in some eases (This is not required in PLS2)

14

Future versions of PLS2 will attempt to reward good regions by decreasing their estimated errors. Even without these refinements, PLS2 retains meritorious regions (§4) and should exhibit an accuracy improvement better than √N; Table 1 suggests this.
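The √N reference is the standard error-averaging bound. If the N parallel PLS1 runs produced independent, unbiased utility estimates (an idealization assumed only for this illustration), simple averaging would reduce the standard error by exactly the factor √N:

```latex
\sigma_{\bar{u}} \;=\; \mathrm{SD}\!\left(\frac{1}{N}\sum_{i=1}^{N} u_i\right) \;=\; \frac{\sigma}{\sqrt{N}}
```

Selection can do better than this bound because aberrant estimates are dismissed rather than averaged in.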

PLS2 versus PLS1. As discussed in §4.1 and Appendix B, PLS1 is limited, necessitating human tuning for optimum performance. In contrast, the second-layer learning system PLS2 requires little human intervention. The main reason is that PLS2 stabilizes knowledge automatically by comparing region sets and dismissing aberrant ones; accurate cumulative sets have a longer lifetime.

This ability to discriminate merit and retain successful data will likely be accentuated by the localization of credit to individual regions (see §2). Another improvement is to alter dynamically the error of a region (estimated by PLS1) as a function of its merit (found by PLS2). This will have the effect of protecting a good region from imperfect PLS1 utility revision: once some parallel PLS1 has succeeded in discovering an accurate value, that value will be more immune to damage. A fit region will have a very long lifespan.
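A minimal sketch of this merit-based protection follows; the field names and the linear shrink schedule are assumptions of the illustration, not PLS1's actual error statistics:

```python
# Dynamically shrink a region's estimated error e as a function of its
# merit (found by PLS2).  Since PLS1's utility revision trusts low-error
# values, a fit region's utility u is then largely undisturbed.

def protect(region, merit, best_merit):
    """Reduce region.error for meritorious regions (linear schedule assumed)."""
    if best_merit > 0:
        region.error *= 1.0 - 0.5 * (merit / best_merit)  # halve e at best merit

# e.g.: for r in genotype: protect(r, merit_of(genotype), best_merit)
```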

Inherent differences in capability. RSF, PLS1, and PLS2 can be characterized differently. From the standpoint of time costs, given a challenging requirement such as the location of a local optimum within 10%, the ordering of these methods in terms of efficiency is RSF ≤ PLS1 ≤ PLS2. In terms of capability the same relationship holds. RSF cannot handle feature interactions without a more complex model (which would increase its cost drastically); PLS1, on the other hand, can provide some performance improvement using piecewise linearity at little additional cost [Re83b]. PLS2 is more robust than PLS1: while the original system is somewhat sensitive to training and parameters, PLS2 provides stability, using competition to overcome deficiencies, obviate tuning, and increase accuracy all at once. PLS2 buffers inadequacies inherent in PLS1. Moreover PLS2, being genetically based, may be able to handle highly interacting features and discover global optima [Re83c]; this is very costly with RSF and seems infeasible with PLS1 alone.

7. SUMMARY AND CONCLUSIONS

PLS2 is a general learning system [Re83a, Re83d]. Given a set of user-defined features and some measure of the utility (e.g. probability of success in task performance), PLS2 forms and refines an appropriate knowledge structure, the cumulative region set R, relating utility to feature values and permitting noise management. This economical and flexible structure mediates data objects and abstract heuristic knowledge.

Since individual regions of the cumulative set R are independent of one another, both credit localization and feature interaction are possible simultaneously. Separating the task control structure H from the main store of knowledge R allows straightforward credit assignment to this determinant R of H, while H itself may incorporate feature nonlinearities without being responsible for them.

A concise and adequate embodiment of current heuristic knowledge, the cumulative region set R was originally used in the learning system PLS1 [Re83a]; PLS1 is the only system shown to discover locally optimal evaluation functions in an AI context. Clearly superior to PLS1, its genetic successor PLS2 has been shown to be more stable, more accurate, more efficient, and more convenient. PLS2 employs an unusual genetic algorithm having the cumulative set R as a compressed genotype. PLS2 extends PLS1's limited operations of revision (controlled mutation) and differentiation (genotype expansion) to include generalization and other rules (K-sexual mating and genotype reorganization). Credit may be localized to individual gene sequences.
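The flavor of these genetic operations can be suggested in code; everything below (the value of K, the credit function, the selection rule) is an assumption of the sketch rather than PLS2's actual rule set:

```python
# A genotype is a cumulative region set; a "gene" is one region.
# K-sexual mating forms one offspring set from K parents, preferring
# gene sequences with high localized credit.

def k_sexual_mating(parents, credit, size):
    pool = [region for parent in parents for region in parent]
    pool.sort(key=credit, reverse=True)      # credit localized per region
    return pool[:size]                       # compressed offspring genotype

# e.g.: child = k_sexual_mating([R1, R2, R3], credit=lambda r: r.merit, size=len(R1))
```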

These improvements may be viewed as effecting greater efficiency or as allowing greater capability. Compared with a traditional method of optimization, PLS1 is more efficient [Re85a], but PLS2 does even better: given a required accuracy, PLS2 locates an optimum with lower expected cost. In terms of capability, PLS2 insulates the system from the inherent inadequacies and sensitivities of PLS1. PLS2 is much more stable and can use the whole population of regions reliably to create a highly informed heuristic (this "population performance" is not meaningful in standard genetic systems). This availability of massive data has important implications for feature interaction [Re83b].
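For concreteness, "population performance" can be pictured as pooling every region of every genotype into one evaluation; the merit-weighted combination below is an assumed rule, not the paper's:

```python
def population_heuristic(population, merits):
    """population: list of region sets; merits: one merit per set.
    Returns a function estimating the utility of a feature vector x."""
    def evaluate(x):
        num = den = 0.0
        for regions, merit in zip(population, merits):
            for r in regions:
                if r.contains(x):            # assumed Region method
                    num += merit * r.utility
                    den += merit
        return num / den if den else 0.0
    return evaluate
```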


Additional refinements of PLS2 may further increase efficiency and power. These include rewarding meritorious regions so that they become immune to damage. Future experiments will investigate nonlinear capability, the ability to discover global optima, and the efficiency and effectiveness of localized credit assignment.

This paper has quantitatively affirmed some principles believed to improve the efficiency and effectiveness of learning (e.g. credit localization). The paper has also considered some simple but little explored ideas for realizing these capabilities (e.g. full but controlled use of each datum).

REFERENCES

[An 73] Anderberg, M.R., Cluster Analysis for Applications, Academic Press, 1973.

[Be 80] Bethke, A.D., Genetic Algorithms as Function Optimizers, Ph.D. Thesis, University of Michigan, 1980.

[Br 81] Brindle, A., Genetic algorithms for function optimization, CS Department Report TR81-2 (Ph.D. Dissertation), University of Alberta, 1981.

[Bu 78] Buchanan, B.G., Johnson, C.R., Mitchell, T.M., and Smith, R.G., Models of learning systems, in Belzer, J. (Ed.), Encyclopedia of Computer Science and Technology 11 (1978), 24-51.

[Co 84] Coles, D. and Rendell, L.A., Some issues in training learning systems and an autonomous design, Proc. Fifth Biennial Conference of the Canadian Society for Computational Studies of Intelligence, 1984.

[De 80] De Jong, K.A., Adaptive system design: a genetic approach, IEEE Trans. on Systems, Man, and Cybernetics SMC-10 (1980), 566-574.

[Di 81] Dietterich, T.G. and Buchanan, B.G., The role of the critic in learning systems, Stanford University Report STAN-CS-81-891, 1981.

[Di 82] Dietterich, T.G., London, B., Clarkson, K., and Dromey, G., Learning and inductive inference, Report STAN-CS-82-913, Stanford University, 1982; also Chapter XIV of Cohen, P.R. and Feigenbaum, E.A. (Eds.), The Handbook of Artificial Intelligence, Kaufmann, 1982.

[Ho 75] Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

[Ho 80] Holland, J.H., Adaptive algorithms for discovering and using general patterns in growing knowledge bases, International Journal on Policy Analysis and Information Systems 4, 2 (1980), 217-240.

[Ho 81] Holland, J.H., Genetic algorithms and adaptation, Proc. NATO Advanced Research Institute on Adaptive Control of Ill-Defined Systems, 1981.

[Ho 83] Holland, J.H., Escaping brittleness, Proc. Second International Machine Learning Workshop, 1983, 92-95.

[La 83] Langley, P., Bradshaw, G.L., and Simon, H.A., Rediscovering chemistry with the BACON system, in Michalski, R.S., Carbonell, J.G., and Mitchell, T.M. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 307-329.

[Ma 84] Mauldin, M.L., Maintaining diversity in genetic search, Proc. Fourth National Conference on Artificial Intelligence, 1984, 247-250.

[Me 85] Medin, D.L. and Wattenmaker, W.D., Category cohesiveness, theories, and cognitive archeology, unpublished manuscript, Dept. of Psychology, University of Illinois at Urbana-Champaign, 1985.

[Mic 83a] Michalski, R.S., A theory and methodology of inductive learning, Artificial Intelligence 20, 2 (1983), 111-161; reprinted in Michalski, R.S. et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 83-134.

[Mic 83b] Michalski, R.S. and Stepp, R.E., Learning from observation: conceptual clustering, in Michalski, R.S. et al. (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, 1983, 331-363.

[Mit 83] Mitchell, T.M., Learning and problem solving, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 1139-1151.

[Ni 80] Nilsson, N.J., Principles of Artificial Intelligence, Tioga, 1980.

[Re 76] Rendell, L.A., A method for automatic generation of heuristics for state-space problems, Dept. of Computer Science Report CS-76-10, University of Waterloo, 1976.

[Re 77] Rendell, L.A., A locally optimal solution of the fifteen puzzle produced by an automatic evaluation function generator, Dept. of Computer Science Report CS-77-38, University of Waterloo, 1977.

[Re 81] Rendell, L.A., An adaptive plan for state-space problems, Dept. of Computer Science Report CS-81-13 (Ph.D. thesis), University of Waterloo, 1981.

[Re 83a] Rendell, L.A., A new basis for state-space learning systems and a successful implementation, Artificial Intelligence 20, 4 (1983), 369-392.

[Re 83b] Rendell, L.A., A learning system which accommodates feature interactions, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983.

[Re 83c] Rendell, L.A., A doubly layered genetic penetrance learning system, Proc. Third National Conference on Artificial Intelligence, 1983, 343-347.

[Re 83d] Rendell, L.A., Conceptual knowledge acquisition in search, Report CIS-83-15, Dept. of Computing and Information Science, University of Guelph, Guelph, Canada, 1983 (to appear in Bolc, L. (Ed.), Knowledge Based Learning Systems, Springer-Verlag).

[Re 85a] Rendell, L.A., Utility patterns as criteria for efficient generalization learning, Proc. 1985 Conference on Intelligent Systems and Machines (to appear), 1985.

[Re 85b] Rendell, L.A., A scientific approach to applied induction, Proc. 1985 International Machine Learning Workshop, Rutgers University (to appear), 1985.

[Re 85c] Rendell, L.A., Substantial constructive induction using layered information compression: tractable feature formation in search, Proc. Ninth International Joint Conference on Artificial Intelligence (to appear), 1985.

[Sa 63] Samuel, A.L., Some studies in machine learning using the game of checkers, in Feigenbaum, E.A. and Feldman, J. (Eds.), Computers and Thought, McGraw-Hill, 1963, 71-105.

[Sa 67] Samuel, A.L., Some studies in machine learning using the game of checkers. II: Recent progress, IBM Journal of Research and Development 11 (1967), 601-617.

[Sm 80] Smith, S.F., A learning system based on genetic adaptive algorithms, Ph.D. Dissertation, University of Pittsburgh, 1980.

[Sm 83] Smith, S.F., Flexible learning of problem solving heuristics through adaptive search, Proc. Eighth International Joint Conference on Artificial Intelligence, 1983, 422-425.

APPENDIX A GLOSSARY OF TERMS

Clustering. Cluster analysis has long been used as a tool for induction in statistics and pattern recognition [An73]. (See "induction".) Improvements to basic clustering techniques generally use more than just the features of a datum ([An73, p. 194] suggests "external criteria"). External criteria in [Mic83b, Re76, Re83a, Re85b] involve prior specification of the forms clusters may take (this has been called "conceptual clustering" [Mic83b]). Criteria in [Re76, Re83a, Re85b] are based on the data environment (see "utility" below).15 This paper uses clustering to create economical, compressed genetic structures (genotypes).

Feature. A feature is an attribute or property of an object. Features are usually quite abstract (e.g. "center control" or "mobility" in a board game). The utility (see below) varies smoothly with a feature.

Genetic algorithm. In a GA, the character of an individual or a population is called a phenotype. The phenotype is coded as a string of digits called the genotype; a single digit is a gene. Instead of searching rule space directly (compare "learning system"), a GA searches gene space (i.e., a GA searches for good genes in the population of genotypes). This search uses the merit of individual genotypes, selecting the more successful individuals to undergo genetic operations for the production of offspring. See §2 and Fig. 3.
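For concreteness, one generation of a conventional GA over digit-string genotypes might look as follows (a textbook sketch, nothing PLS-specific):

```python
import random

def ga_step(population, merit, mutation_rate=0.01):
    """Select parents by merit, recombine, and mutate digit strings."""
    weights = [merit(g) for g in population]
    offspring = []
    while len(offspring) < len(population):
        a, b = random.choices(population, weights=weights, k=2)   # selection
        cut = random.randrange(1, len(a))                         # 1-point crossover
        child = a[:cut] + b[cut:]
        child = "".join(random.choice("0123456789")               # mutation
                        if random.random() < mutation_rate else ch
                        for ch in child)
        offspring.append(child)
    return offspring
```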

Induction. Induction or generalization learning is an important means for knowledge acquisition. Information is actually created as data are compressed into classes or categories in order to predict future events efficiently and effectively. Induction may create feature space neighborhoods or clusters. See "clustering" and §4.1.

Learning system. Buchanan et al. present a general model which distinguishes components of a learning system [Bu78]. The performance element PE is guided by a control structure H. Based on observation of the PE, the critic assesses H, possibly localizing credit to parts of H [Bu78, Di81]. The learning element LE uses this information to improve H for the next round of task performance. Layered systems have multiple PEs, critics, and LEs (e.g. PLS2 uses PLS1 as its PE; see Fig. 1). Just as a PE searches for its goal in problem space, the LE searches in rule space [Di82] for an optimal H to control the PE.

To facilitate this higher goal, PLS2 uses an intermediate knowledge structure which divides feature space into regions relating feature values to object utility [Re83d] and discovering a useful subset of features (cf. [Sa63]). In this paper the control structure H is a linear evaluation function [Ni80], and the rules are feature weights for H. Search for accurate regions replaces direct search of rule space, i.e., regions mediate data and H. As explained in §3, sets of regions become compressed GA genotypes. See also "genetic algorithm", "PLS", "region", and Fig. 1.
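The model can be summarized as a loop; the function names are illustrative stand-ins, not from [Bu78]:

```python
def learning_loop(H, perform, critic, learn, rounds):
    """Generic PE / critic / LE cycle in the style of [Bu 78].
    perform(H): run the performance element under control structure H;
    critic(trace): assess H, possibly localizing credit to parts of H;
    learn(H, assessment): produce an improved H for the next round."""
    for _ in range(rounds):
        trace = perform(H)
        assessment = critic(trace)
        H = learn(H, assessment)
    return H
```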

Merit. Also called payoff or fitness, this is the measure used by a genetic algorithm to select parent genotypes for preferential reproduction of successful individuals. Compare "utility"; also see "genetic algorithm".

Object. Objects are any data to be generalized into categories. Relationships usually depend on the task domain. See "utility".

PLS. The probabilistic learning system can learn what are sometimes called "single concepts" [Di82], but PLS is capable of much more difficult tasks involving noise management, incremental learning, and normalization of biased data. PLS1 uniquely discovered locally optimal heuristics in search [Re83a], and PLS2 is the effective and efficient extension examined in this paper. PLS manipulates "regions" (see below) using various inductive operations described in §4.

Region or cell. Depending on one's viewpoint, the region is PLS's basic structure for clustering or for the genetic algorithm. The region is a compressed representation of a utility surface in augmented feature space; it is also a compressed genotype representing a utility function to be optimized. As explained in [Re83d], the region representation is fully expressive, providing the features are. See §3 and Figs. 3 and 4.
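A region might be rendered as the triple (r, u, e) of Appendix B; the rectangle encoding and the default for uncovered space are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """A feature-space rectangle r with estimated utility u and error e:
    one compressed gene."""
    bounds: List[Tuple[float, float]]   # (low, high) interval per feature
    utility: float                      # u
    error: float                        # e

    def contains(self, x: List[float]) -> bool:
        return all(lo <= v <= hi for v, (lo, hi) in zip(x, self.bounds))

def region_utility(regions: List[Region], x: List[float]) -> float:
    """Look up the utility of feature vector x in a region set."""
    for r in regions:
        if r.contains(x):
            return r.utility
    return 0.0                          # assumed default for uncovered space
```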

Utility u. This is any measure of the usefulness of an object in the performance of some task. The utility provides a link between the task domain and PLS generalization algorithms. Utility can be a probability, as in Fig. 2. Compare "merit". See §§1 and 3.

15. A new learning system [Re85c] introduces higher-dimensional clustering for creation of structure.

APPENDIX B PLS1 LIMITATIONS

PLS1 alone is inherently limited. The problems relate to modification of the main knowledge structure, the cumulative region set R = {(r, u, e)}. As mentioned in §4.1, R undergoes two basic alterations. PLS1 gradually changes the meaning of an established feature space rectangle r by updating its associated utility u (along with u's error e). PLS1 also incrementally refines the feature space, as rectangles r are continually split.
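The two alterations can be sketched as follows, reusing the Region class from the Appendix A sketch; the update weight and the midpoint split are illustrative assumptions, not PLS1's actual statistics:

```python
def revise(region, successes, trials):
    """Utility revision: fold new search data into (u, e)."""
    sample_u = successes / trials
    w = min(1.0, trials / 100.0)             # assumed trust in the new sample
    region.utility = (1 - w) * region.utility + w * sample_u
    region.error *= 1 - 0.5 * w              # estimated error declines with data

def refine(region, axis):
    """Region refinement: split rectangle r along one feature axis."""
    lo, hi = region.bounds[axis]
    mid = (lo + hi) / 2.0
    def with_interval(interval):
        bounds = list(region.bounds)
        bounds[axis] = interval
        return Region(bounds, region.utility, region.error)
    return with_interval((lo, mid)), with_interval((mid, hi))
```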

Both of these modifications (utility revision and region refinement) are largely directed by search data, but the degree to which newer information affects R depends on various choices of system parameters [Re83a]. System parameters influence estimates of the error e and determine the degree of region refinement. These in turn affect the relative importance of new versus established knowledge.

Consequently, values of these parameters influence task performance. For example, there is a tradeoff between utility revision and region refinement. If regions are refined too quickly, accuracy suffers (this is theoretically predictable). If instead utility revision predominates, regions become inert (their estimated errors decline), but sometimes incorrectly.16

There are several other problems, including difficulties in training, limitations in the utility revision algorithm, and inaccurate estimation of various errors. As a result, utility estimations are imperfect and biased in unknown ways.

Together, the above uncertainties and sensitivities explain the failure of PLS1 always to locate an optimum with static training (Table 1). The net effect is that PLS1 works fairly well with no parameter tuning and unsophisticated training, and close to optimal with mild tuning and informed training [Co84], as long as the features are well behaved.

By nature, however, PLS1 requires features exhibiting no worse than mild interactions. This is a serious restriction, since feature nonlinearity is prevalent. On its own, then, PLS1 is inherently limited: there is simply no way to learn utility accurately unless the effects of differing heuristic functions H are compared, as in PLS2.

16. Although system parameters are given by domain-independent statistical analysis, tuning these parameters nevertheless improves performance in some cases. (This is not required in PLS2.)

ACKNOWLEDGEMENTS

I would like to thank Dave Coles for his lasting enthusiasm during the development, implementation, and testing of this system. I appreciate the helpful suggestions from Chris Matheus, Mike Mauldin, and the Conference Reviewers.
