
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-4, NO. 4, JULY 1974 323

Learning Automata - A Survey

KUMPATI S. NARENDRA, SENIOR MEMBER, IEEE, AND M. A. L. THATHACHAR

Abstract--Stochastic automata operating in an unknown random environment have been proposed earlier as models of learning. These automata update their action probabilities in accordance with the inputs received from the environment and can improve their own performance during operation. In this context they are referred to as learning automata. A survey of the available results in the area of learning automata has been attempted in this paper. Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata. Utilization of learning automata in parameter optimization and hypothesis testing is discussed, and potential areas of application are suggested.

I. INTRODUCTION

IN CLASSICAL deterministic control theory, the control of a process is always preceded by complete knowledge of the characteristics of the process; the mathematical description of the process is assumed to be known, and the inputs to the process are deterministic functions of time. Later developments in stochastic control theory took into account uncertainties that might be present in the process; stochastic control was effected by assuming that the probabilistic characteristics of the uncertainties are known. Frequently, the uncertainties are of a higher order, and even the probabilistic characteristics such as the distribution functions may not be completely known. It is then necessary to make observations on the process as it is in operation and gain further knowledge of the process. In other words, a distinctive feature of such problems is that there is little a priori information, and additional information is to be acquired on line. One viewpoint is to regard these as problems in learning.

Learning is defined as any relatively permanent change in behavior resulting from past experience, and a learning system is characterized by its ability to improve its behavior with time, in some sense tending towards an ultimate goal. In mathematical psychology, models of learning systems [GB1], [GL1] have been developed to explain behavior patterns among living organisms. These models in turn have lately been adapted to synthesize engineering systems, which can be considered to show learning behavior. Tsypkin [GT1] has recently argued that seemingly diverse problems in pattern recognition, identification, and filtering can be treated in a unified manner as problems in learning using probabilistic iterative methods.

Viewed in a purely mathematical context, the goal of a learning system is the optimization of a functional not known explicitly, as, for example, the mathematical expectation of a random functional with a probability distribution function not known in advance. An approach that has been used in the past is to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hillclimbing techniques [GT1]. An alternative approach gaining attention recently is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata [LN2]. The following example of the learning process of a student with a probabilistic teacher illustrates the automaton approach.

Consider a situation in which a question is posed to the student and a finite set of alternative answers is provided. The student selects one of the alternatives, following which the teacher responds in a binary manner indicating whether the selected answer is right or wrong. The teacher, however, is probabilistic; there is a nonzero probability of eliciting either response for any of the answers selected by the student. The saving feature of the situation is that the teacher's negative responses have the least probability for the correct answer. Under these circumstances the interest is in finding the manner in which the student should plan a choice of a sequence of alternatives and process the information obtained from the teacher so that he learns the correct answer.

In stochastic automata models the stochastic automaton corresponds to the student, and the random environment in which it operates represents the probabilistic teacher. The actions (or states) of the stochastic automaton are the various alternative answers that are provided. The responses of the environment for a particular action of the stochastic automaton are the teacher's probabilistic responses. The problem is to obtain the optimal action that corresponds to the correct answer.

The stochastic automaton attempts a solution of this problem as follows. To start with, no information as to which action is the optimal one is assumed, and equal probabilities are attached to all the actions.

Manuscript received January 15, 1974; revised February 13, 1974. This work was supported by the National Science Foundation under Grant GK-20580.
K. S. Narendra is with the Becton Center, Yale University, New Haven, Conn.
M. A. L. Thathachar is with the Becton Center, Yale University, New Haven, Conn., on leave from the Indian Institute of Science, Bangalore, India.


One action is selected at random, the response of the environment to this action is observed, and based on this response the action probabilities are changed. Now a new action is selected according to the updated action probabilities, and the procedure is repeated. A stochastic automaton acting in this manner to improve its performance is referred to as a learning automaton in this paper.

Stochastic hillclimbing methods (such as stochastic approximation) and stochastic automata methods represent two distinct approaches to the learning problem. Though both approaches involve iterative procedures, updating at every stage is done in the parameter space in the first method and in the probability space in the second. It is, of course, possible that they lead to equivalent descriptions in some examples. The automata methods have two distinct advantages over stochastic hillclimbing methods in that the action space need not be a metric space (i.e., no concept of neighborhood is needed), and since at every stage any element of the action set can be chosen, a global rather than local optimum can be obtained.

Experimental simulation of automata methods carried out during the last few years has indicated the feasibility of the automaton approach in the solution of interesting examples in parameter optimization, hypothesis testing, and game theory. The automaton approach also appears appropriate in the study of hierarchical systems and in tackling certain nonstationary optimization problems. Furthermore, several other avenues to learning can be interpreted as iterative procedures in the probability space, and the learning automaton provides a natural mathematical model for such situations and serves as a unifying theme among diverse techniques [GM3].

Previous studies on learning automata have led to a certain understanding of the basic issues involved and have provided guidelines for the design of algorithms. An appreciation of the fundamental problems in the field has also taken place. It appears that research in this area has reached a stage where the power and applicability of the approach needs to be made widely known in order that it can be fully exploited in solving problems in relevant areas. In this paper we review recent results in the area of learning automata, reexamine some of the theoretical questions that arise, and suggest potential areas where the available results may find application.

Historical Development

Historically, the first learning automata models were developed in mathematical psychology. Early work in this area has been well documented in the book by Bush and Mosteller [GB1]. More recent results can be found in Atkinson et al. [GA1]. A rigorous mathematical framework has been developed for the study of learning problems by Iosifescu and Theodorescu [GI1] as well as by Norman [GN1].

Tsetlin [DT1] introduced the concept of using deterministic automata operating in random environments as models of learning. A great deal of work in the Soviet Union and elsewhere has followed the trend set by his pioneering paper. No attempt, however, has been made in this paper to review all these studies.

Varshavskii and Vorontsova [LV1] observed that the use of stochastic automata with updating of action probabilities could reduce the number of states in comparison with deterministic automata. This idea has proved to be very fruitful and has been exploited in a series of investigations, the results of which form the subject of this paper.

Fu and his associates [LF1]-[LF6] were among the first to introduce stochastic automata into the control literature. A variety of applications to parameter optimization, pattern recognition, and game theory were considered by this school. McLaren [LM1] explored the properties of linear updating schemes and suggested the concept of a "growing" automaton [LM2]. Chandrasekaran and Shen [LC1]-[LC3] made useful studies of nonlinear updating schemes, nonstationary environments, and games of automata. Tsypkin and Poznyak [LT1] attempted to unify the updating schemes by focusing attention on an inverse optimization problem. The present authors and their associates [LS1], [LS2], [LV3]-[LV10], [LN1], [LN2], [LL1]-[LL5] have studied the theory and applications of learning automata and also carried out simulation studies in the area.

The survey papers on learning control systems by Sklansky [GS1] and Fu [GF1] have devoted part of their attention to learning automata. The topic also finds a place in some books and collections of articles on learning systems [GM2], [GF2], [LF6]. The literature on the two-armed bandit problem is relevant in the present context but is not referred to in detail as the approach taken is rather different [LC5], [LW2]. References to other contributions will be made at appropriate points in the body of the paper.

Organization

This paper has been divided into nine sections. Following the introduction, the basic concepts and definitions of stochastic automata and random environments are given in Section II. The possible ways in which the behavior of learning automata can be judged are defined in Section III. Section IV deals with reinforcement schemes (or updating algorithms) and their properties and includes a discussion of convergence. Section V describes collective behavior of automata in terms of games between automata and multilevel structures of automata. Nonstationary environments are briefly considered in Section VI. Possible uses of learning automata in optimization and hypothesis testing form the subject matter of Section VII. A short description of the fields of application of learning automata is given in Section VIII. A comprehensive bibliography is provided in the Reference section and is divided into three subsections dealing with 1) general references in the literature pertinent to the topic considered, 2) some important papers on deterministic automata that provided the impetus for stochastic automata models, and 3) publications wholly devoted to learning automata.


Fig. 1. Stochastic automaton. (Input x ∈ {0,1}; stochastic automaton; action (output) α ∈ {α1, α2, ..., αr}.)

Fig. 2. Environment. (Penalty probability set {c1, c2, ..., cr}; input α ∈ {α1, ..., αr}; output (response) x ∈ {0,1}.)

Fig. 3. Learning automaton. (Stochastic automaton {p, A}, with action α ∈ {α1, ..., αr} and input x ∈ {0,1}, in feedback with an environment having penalty probability set {c1, c2, ..., cr}.)

II. STOCHASTIC AUTOMATA AND RANDOM ENVIRONMENTS

Stochastic Automaton

A stochastic automaton is a sextuple {x, φ, α, p, A, G} where x is the input set, φ = {φ1, φ2, ..., φs} is the set of internal states, α = {α1, α2, ..., αr} with r ≤ s is the output or action set, p is the state probability vector governing the choice of the state at each stage (i.e., at each stage n, p(n) = (p1(n), p2(n), ..., ps(n))), A is an algorithm (also called an updating scheme or reinforcement scheme) which generates p(n + 1) from p(n), and G: φ → α is the output function. G could be a stochastic function, but there is no loss of generality in assuming it to be deterministic [GP1]. In this paper G is taken to be deterministic and one-to-one (i.e., r = s, and states and actions are regarded as synonymous) and s < ∞. Fig. 1 shows a stochastic automaton with its inputs and actions.

It may be noted that the states of a stochastic automaton correspond to the states of a discrete-state discrete-parameter Markov process. Occasionally, it may be convenient to regard the pi(n) themselves as states of a continuous-state Markov process.

Environment

Only an environment (also called a medium) with random response characteristics is of interest in the problems considered. The environment (shown in Fig. 2) has inputs α(n) ∈ {α1, ..., αr} and outputs (responses) belonging to a set x. Frequently the responses are binary, {0,1}, with zero being called the nonpenalty response and one the penalty response. The probability of emitting a particular output symbol (say, 1) depends on the input and is denoted by ci (i = 1, ..., r). The ci are called the penalty probabilities. If the ci do not depend on n, the environment is said to be stationary; otherwise it is nonstationary. It is assumed that the ci are unknown initially; the problem would be trivial if they were known a priori.

Learning Automaton (Stochastic Automaton in a Random Environment)

Fig. 3 represents a feedback connection of a stochastic automaton and an environment. The actions of the automaton in this case form the inputs to the environment. The responses of the environment in turn are the inputs to the automaton and influence the updating of the action probabilities. As these responses are random, the action probability vector p(n) is also random.

In psychological learning experiments the organism under study is said to learn when it improves the probability of correct response as a result of interaction with its environment. Since the stochastic automaton being considered in this paper behaves in a similar fashion, it appears proper to refer to it as a learning automaton. Thus a learning automaton is a stochastic automaton that operates in a random environment and updates its action probabilities in accordance with the inputs received from the environment so as to improve its performance in some specified sense.

In the context of psychology, a learning automaton may be regarded as a model of the learning behavior of the organism under study and the environment as controlled by the experimenter. In an engineering application such as the control of a process, the controller corresponds to the learning automaton, while the rest of the system with all uncertainties constitutes the environment.

It is useful to note the distinction between several models based on the nature of the input to the learning automaton. If the input set is binary, e.g., {0,1}, the model is known as a P-model. On the other hand it is called a Q-model if the input set is a finite collection of distinct symbols as, for example, obtained by quantization, and an S-model if the input set is an interval [0,1]. Each of these models appears appropriate in certain situations.

A remark on the terminology is relevant here. Following Tsetlin [DT1], deterministic automata operating in random environments have been proposed as models of learning behavior. Thus they are also contenders to the term "learning automata." However, in the view of the present authors the stochastic automaton with updating of action probabilities is a general model from which the deterministic automaton can be obtained as a special case having a 0-1 state transition matrix, and it appears reasonable to apply the term learning automaton to the more general model. In cases where it is felt necessary to emphasize the learning properties of a deterministic automaton one can use a qualifying term such as "deterministic learning automaton." It may also be noted that the learning automata of this paper have been referred to as "variable-structure stochastic automata" in earlier literature [LV1].

III. NORMS OF BEHAVIOR OF LEARNING AUTOMATA

The basic operation carried out by a learning automaton is the updating of the action probabilities on the basis of the responses of the environment. A natural question here is to examine whether the updating is done in such a manner as to result in a performance compatible with intuitive notions of learning.
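The operating cycle just described (the feedback connection of Fig. 3) can be made concrete with a short sketch. The Python fragment below is illustrative only and not from the paper; the class and function names are assumptions, and the update rule is left abstract until the reinforcement schemes of Section IV.

```python
import random

class PModelEnvironment:
    """Stationary P-model environment: for action i it emits the penalty
    response 1 with probability c_i and the nonpenalty response 0 otherwise."""
    def __init__(self, penalty_probs):
        self.c = list(penalty_probs)          # unknown to the automaton

    def respond(self, action):
        return 1 if random.random() < self.c[action] else 0

def operate(env, update, p, steps):
    """Feedback loop of Fig. 3: select an action from p(n), observe the
    environment's response, and let `update` produce p(n + 1)."""
    for _ in range(steps):
        action = random.choices(range(len(p)), weights=p)[0]
        x = env.respond(action)
        p = update(p, action, x)
    return p
```

A concrete `update` is sketched later, alongside the linear schemes of Section IV.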


One quantity useful in judging the behavior of a learning automaton is the average penalty received by the automaton. At a certain stage n, if the action αi is selected with probability pi(n), the average penalty conditioned on p(n) is

M(n) = E{x(n) | p(n)} = Σ_{i=1}^{r} pi(n) ci.    (1)

If no a priori information is available, and the actions are chosen with equal probability (i.e., at random), the value of the average penalty is denoted by M0 and is given by

M0 = (c1 + c2 + ... + cr) / r.    (2)

A learning automaton should at least do better than pure chance; hence the average penalty should be made less than M0, at least asymptotically. Such a behavior is called expediency and is defined as follows [DT1], [LC1].

Definition 1: A learning automaton is called expedient¹ if

lim_{n→∞} E[M(n)] < M0.    (3)

When a learning automaton is expedient it only does better than one which chooses actions in a purely random manner. It would be desirable if the average penalty could be minimized by a proper selection of the actions. In such a case the learning automaton is called optimal. From (1) it can be seen that the minimum value of M(n) is min_i {ci}.

Definition 2: A learning automaton is called optimal if

lim_{n→∞} E[M(n)] = c_l    (4)

where

c_l = min_i {ci}.

Optimality implies that asymptotically the action associated with the minimum penalty probability is chosen with probability one. While optimality appears a very desirable property, certain conditions in a given situation may preclude its achievement. In such a case one would aim at a suboptimal performance. One such property is given by ε-optimality [LV4].

Definition 3: A learning automaton is called ε-optimal if

lim_{n→∞} E[M(n)] < c_l + ε    (5)

can be obtained for any arbitrary ε > 0 by a suitable choice of the parameters of the reinforcement scheme. ε-optimality implies that the performance of the automaton can be made as close to the optimal as desired.

It is possible that the preceding properties hold only when the penalty probabilities ci satisfy certain restrictions, for example, that they should lie in certain intervals. In such cases the properties are said to be conditional.

In practice, the penalty probabilities are often completely unknown, and it would be necessary to have desirable performance whatever be the values of ci, that is, in all stationary random media. The performance would also be superior if the decrease of E[M(n)] is monotonic. Both these requirements are considered in the following definition [LL3].

Definition 4: A learning automaton is said to be absolutely expedient if

E[M(n + 1) | p(n)] < M(n)    (6)

for all n, all pk(n) ∈ (0,1) (k = 1, ..., r), and all possible values² of ci (i = 1, ..., r). Absolute expediency implies that M(n) is a supermartingale and that E[M(n)] is strictly monotonically decreasing with n in all stationary random environments. If M(n) < M0 initially, absolute expediency implies expediency. It is thus a stronger requirement on the learning automaton. Furthermore, it can be shown that absolute expediency implies ε-optimality in all stationary random environments [LL4]. It is not at present known whether the reverse implication is true. However, every learning automaton presently known to be ε-optimal in all stationary media is also absolutely expedient. Hence ε-optimality and absolute expediency will be treated as synonymous in the sequel.

The definitions in this section have been given with reference to a P-model but can be applied with minor changes to Q- and S-models [LV3], [LV8], [LC1].

IV. REINFORCEMENT SCHEMES

Having decided on the norms of behavior of learning automata we can now focus attention on the means of achieving the desired performance. It is evident from the description of the learning automaton that the crucial factor that affects the performance is the reinforcement scheme for the updating of the action probabilities. It thus becomes necessary to relate the structure of a reinforcement scheme and the performance of the automaton using the scheme. In general a reinforcement scheme can be represented by

p(n + 1) = T[p(n), α(n), x(n)]    (7)

where T is an operator; α(n) and x(n) represent the action of the automaton and the input to the automaton at instant n, respectively. One can classify the reinforcement schemes either on the basis of the property exhibited by a learning automaton using the scheme (as, for example, the automaton being expedient or optimal) or on the basis of the nature of the functions appearing in the scheme (as, for example, linear, nonlinear, or hybrid). If p(n + 1) is a linear function of the components of p(n), the reinforcement scheme is said to be linear; otherwise it is nonlinear. Sometimes it is advantageous to update p(n) according to different schemes depending on the intervals in which the value of p(n) lies; in such a case the combined reinforcement scheme is called a hybrid scheme.

¹ Since pi(n), lim_{n→∞} pi(n), and consequently M(n) are, in general, random variables, the expectation operator is needed in the definition to represent the average penalty.
² It is usually assumed that the set {ci} has unique maximum and minimum elements.
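As a small numerical illustration of (1) and (2) (a sketch, not part of the paper; the function names are invented), the conditional average penalty M(n) and the pure-chance value M0 can be computed directly from the action probabilities and the penalty probabilities, which are of course unknown to the automaton itself:

```python
def average_penalty(p, c):
    """M(n) = sum_i p_i(n) * c_i, eq. (1); p and c are equal-length lists."""
    return sum(pi * ci for pi, ci in zip(p, c))

def pure_chance_penalty(c):
    """M0 = (c_1 + ... + c_r) / r, eq. (2)."""
    return sum(c) / len(c)

# Example: for c = [0.2, 0.6, 0.7], M0 = 0.5.  An expedient automaton must
# drive E[M(n)] below 0.5; an optimal one drives it towards min(c) = 0.2.
```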


The basic idea behind any reinforcement scheme is rather simple. If the learning automaton selects an action αi at instant n and a nonpenalty input occurs, the action probability pi(n) is increased, and all the other components of p(n) are decreased. For a penalty input, pi(n) is decreased, and the other components are increased. These changes in pi(n) are known as reward and penalty, respectively. Occasionally the action probabilities may be retained at the previous values, in which case the status quo is known as "inaction."

In general, when the action at n is αi,

pj(n + 1) = pj(n) - fj(p(n)),  for x(n) = 0
pj(n + 1) = pj(n) + gj(p(n)),  for x(n) = 1,    (j ≠ i)    (8a)

The algorithm for pi(n + 1) is to be fixed so that the pk(n + 1) (k = 1, ..., r) add to unity. Thus

pi(n + 1) = pi(n) + Σ_{j≠i} fj(p(n)),  for x(n) = 0
pi(n + 1) = pi(n) - Σ_{j≠i} gj(p(n)),  for x(n) = 1    (8b)

where the nonnegative³ continuous functions fj(·) and gj(·) are such that pk(n + 1) ∈ (0,1) for all k = 1, ..., r whenever every pk(n) ∈ (0,1). The latter requirement is necessary to prevent the automaton from getting trapped prematurely in an absorbing barrier.

Varshavskii and Vorontsova [LV1] were the first to suggest such reinforcement schemes for two-state automata and thus set the trend for later developments. They considered two schemes, one linear and the other nonlinear, in terms of updating of the state-transition probabilities. Fu, McLaren, and McMurtry [LF1], [LF2] simplified the procedure by considering updating of the total action probabilities as dealt with here.

Linear Schemes

The earliest known scheme can be obtained by setting

fj(p) = a pj,   gj(p) = b/(r - 1) - b pj,   for all j = 1, ..., r    (9)

where 0 < a, b < 1.⁴ This is known as a linear reward-penalty (denoted LR-P) scheme. Early studies of the scheme, principally dealing with the two-state case, were made by Bush and Mosteller [GB1] and Varshavskii and Vorontsova [LV1]. McLaren [LM1] made a detailed investigation of the multistate case, and this work was continued by Chandrasekaran and Shen [LC1] as well as by Viswanathan and Narendra [LV9]. Norman [LN4] established several results pertaining to the ergodic character of the scheme.

It is known that an automaton using the LR-P scheme is expedient in all stationary random environments. Expressions for the rate of learning and the variance of the action probabilities are also available.

By setting

fj(p) = a pj,   gj(p) ≡ 0,   for all j    (10)

we get the linear reward-inaction (LR-I) scheme. This scheme was considered first in mathematical psychology [GB1] but was later independently conceived and introduced into the engineering literature by Shapiro and Narendra [LS1], [LS2].

The characteristic of the scheme is that it ignores penalty inputs from the environment so that the action probabilities remain unchanged under these inputs. Because of this property a learning automaton using the scheme has been called a "benevolent automaton" by Tsypkin and Poznyak [LT1].

The LR-I scheme was originally reported to be optimal in all stationary random environments, but it is now known that it is only ε-optimal [LV4], [LL4]. It is significant, however, that replacing the penalty by inaction in the LR-P scheme totally changes the performance from expediency to ε-optimality.

Other possible combinations such as the linear reward-reward, penalty-penalty, and inaction-penalty schemes have been considered in [LV9], but these are, in general, inferior to the LR-I and LR-P schemes. The effect of varying the parameters a and b with n has also been studied in [LV9].

Nonlinear Schemes

As mentioned earlier, the first nonlinear scheme for a two-state automaton was proposed by Varshavskii and Vorontsova [LV1] in terms of transition probabilities. The total-probability version of the scheme corresponds to the choice

gj(p) = fj(p) = a pj(1 - pj),   j = 1,2.    (11)

This scheme is ε-optimal in a restricted random environment satisfying either c1 < 1/2 < c2 or c2 < 1/2 < c1. Chandrasekaran and Shen [LC1] have studied nonlinear schemes with power-law nonlinearities. Several nonlinear schemes, which are ε-optimal in all stationary random environments, have been suggested by Viswanathan and Narendra [LV9] as well as by Lakshmivarahan and Thathachar [LL1], [LL3]. A simple scheme of this type for the two-state case is

fj(p) = a pj²(1 - pj),   gj(p) = b pj(1 - pj),   j = 1,2    (12)

where 0 < a < 4 and 0 < b < 1.

A combination of linear and nonlinear terms often appears advantageous [LL3]. Extensive simulation results on a variety of schemes utilizing several possible combinations of reward, penalty, and inaction are available in [LV10].

³ The nonnegativity condition need be imposed only if the "reward" character of fj(·) and the "penalty" character of gj(·) are to be preserved.
⁴ gj(·) for this scheme is not nonnegative for all values of pj.
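The general scheme (8) with the linear choices (9) and (10) is easy to state in code. The following Python fragment is an illustrative sketch, not from the paper; it performs one LR-P update of the whole probability vector and reduces to the LR-I scheme when b = 0.

```python
def linear_update(p, i, x, a, b):
    """One step of the linear reward-penalty (LR-P) scheme, eqs. (8)-(9).

    p    : current action probabilities (list summing to 1)
    i    : index of the action selected at instant n
    x    : environment response, 0 = nonpenalty (reward), 1 = penalty
    a, b : reward and penalty parameters in (0, 1); b = 0 gives LR-I.
    """
    r = len(p)
    q = list(p)
    if x == 0:                              # reward: f_j(p) = a * p_j
        for j in range(r):
            if j != i:
                q[j] = p[j] - a * p[j]
        q[i] = p[i] + a * (1.0 - p[i])      # keeps the components summing to 1
    else:                                   # penalty: g_j(p) = b/(r-1) - b * p_j
        for j in range(r):
            if j != i:
                q[j] = p[j] + b / (r - 1) - b * p[j]
        q[i] = (1.0 - b) * p[i]
    return q
```

Used as the `update` argument of the loop sketched in Section II, this reproduces the expedient (b > 0) and ε-optimal (b = 0) behavior discussed above.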


A result that unifies most of the preceding reinforcement schemes has been reported in [LL3] and is given by the following.

Theorem: A necessary and sufficient condition for the learning automaton using (8) to be absolutely expedient is

f1(p)/p1 = f2(p)/p2 = ... = fr(p)/pr = λ(p)
g1(p)/p1 = g2(p)/p2 = ... = gr(p)/pr = μ(p)    (13)

where λ(·) and μ(·) are arbitrary continuous functions satisfying⁵

0 < λ(p) < 1
0 < μ(p) < min_j (pj / (1 - pj))    (14)

for all j = 1, ..., r and all pj ∈ (0,1).

In simple terms, the theorem suggests that to obtain absolute expediency one type of updating should be made for the probability of the action selected by the automaton at the instant considered and a different type of updating for all the other action probabilities. No distinction should be made among the actions not selected by the automaton, in the sense that the ratio pj(n + 1)/pj(n) should be the same for all these actions.

All the schemes known so far, which are ε-optimal in all stationary random environments, are also absolutely expedient; hence the functions appearing in these schemes satisfy the conditions of the theorem. The theorem also provides guidelines for the choice of new schemes.

Convergence of Action Probabilities

So far only the gross behavior of the automaton based on changes in the average penalty has been considered. It is of interest to probe deeper into the nature of asymptotic behavior of the action probability vector p(n).

There are two distinct types of convergence associated with the reinforcement schemes. In one case the distribution functions of the sequence of action probabilities converge to a distribution function at all points of continuity of the latter function. This mode of convergence occurs typically in the case of expedient schemes such as the LR-P scheme. A different mode of convergence occurs in the case of ε-optimal schemes (such as the LR-I scheme). Here it can be proved using the martingale convergence theorem that the sequence of action probabilities converges to a limiting random variable with probability one. Thus a stronger mode of convergence is exhibited by ε-optimal schemes [LN4]. The difference between the two modes of convergence can be understood by the fact that expedient schemes (such as LR-P) generate Markov processes that are ergodic but have no absorbing barrier, whereas ε-optimal schemes result in Markov processes with more than one absorbing barrier. Initial values of action probabilities do not affect the asymptotic behavior in the case of expedient schemes, whereas they play a crucial role in the case of ε-optimal schemes.

It has been shown that when the LR-P scheme is used, p(n) converges to a random variable with a continuous distribution of which the mean and variance can be computed. The variance can be made as small as desired by a proper choice of the parameters of the scheme [LM1], [LC1].

In ε-optimal schemes the action probability vector p(n) converges to the set of absorbing states with probability one. As there are at least two such states and only one of the states is the desired one (i.e., the state associated with the minimum penalty probability), one can only say that p(n) converges to the desired state with a positive probability. It is important to quantify this probability.

For simplicity, consider a two-state case. If c1 < c2, we require p1(n) → 1, and if c1 > c2, p1(n) → 0. When an ε-optimal scheme is used the only conclusion that can be drawn is p1(n) → {0,1} with probability one; hence, the desirable event happens with a specific probability. Furthermore, this probability depends on the initial value p1(0). In order to gain confidence in the use of the schemes the probability of convergence to the desired state has to be determined as a function of the initial probability.

For fixing ideas, let us assume c1 < c2. It is necessary to find a function⁶ γ(p) defined by

γ(p) = Pr [p1(∞) = 1 | p1(0) = p].

The available results in the theory of stochastic stability are not of much use here as they treat, for the most part, the case of one absorbing state, whereas there are two such states here. Pioneering work on the present problem has been done by Norman [LN3], who has shown that γ(p) satisfies a functional equation

Uγ(p) = γ(p)

with suitable boundary conditions at p = 0 and p = 1, where the operator U is defined by

Uγ(p) = E[γ(p1(n + 1)) | p1(n) = p].

However, this functional equation is extremely difficult to solve. Hence the next best thing that can be done is to establish upper and lower bounds on γ(p). These can be computed by finding two functions ψ1(p) and ψ2(p) such that

Uψ1(p) ≤ ψ1(p)

and

Uψ2(p) ≥ ψ2(p)

for all p ∈ [0,1] with appropriate boundary conditions. Satisfaction of these inequalities yields

ψ2(p) ≤ γ(p) ≤ ψ1(p).

The functions ψ1(·) and ψ2(·) are called superregular and subregular, respectively, and are comparatively easy to find as it involves only the establishment of inequalities. Use of exponential functions appears particularly appropriate here. With

ψi(p) = [exp(xi p) - 1] / [exp(xi) - 1],   i = 1,2,

Norman [LN3] has established tight bounds for γ(p) in the case of the LR-I scheme with two states. The technique has been extended to cover nonlinear schemes and multistate cases [LL4].

⁵ These conditions broadly arise because the pj(n) are probabilities and are required to lie in (0,1) and sum to unity. For a more precise statement see [LL3].
⁶ p represents a scalar in the remainder of this subsection.
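Conditions (13) and (14) of the theorem above translate directly into an update rule: every non-selected probability is scaled by the same factor on reward and on penalty. The sketch below is illustrative (the constant λ and the safety factor applied to μ are assumptions kept within the stated bounds, not choices made in the paper); taking μ ≡ 0 recovers the LR-I scheme.

```python
def absolutely_expedient_update(p, i, x, lam=0.1, mu_fraction=0.5):
    """Update obeying the ratio conditions (13): f_j(p) = lam*p_j and
    g_j(p) = mu*p_j for every j != i, with mu kept inside the bound (14)."""
    r = len(p)
    q = list(p)
    # Bound (14): 0 < mu < min_j p_j / (1 - p_j); stay strictly below it.
    mu = mu_fraction * min(pj / (1.0 - pj) for pj in p)
    if x == 0:                          # nonpenalty: shrink the others, grow p_i
        for j in range(r):
            if j != i:
                q[j] = p[j] - lam * p[j]
        q[i] = p[i] + lam * (1.0 - p[i])
    else:                               # penalty: grow the others, shrink p_i
        for j in range(r):
            if j != i:
                q[j] = p[j] + mu * p[j]
        q[i] = p[i] - mu * (1.0 - p[i])
    return q
```

Both branches preserve the sum of the components and keep them in (0,1), which is the role of bounds (14).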


Comments on Convergence Results: Recognizing the importance of the study of convergence of the action probabilities, several researchers have attempted to simplify the procedures involved. Some intuitively attractive ideas were employed extensively in this context, but on closer scrutiny they have been found to be incorrect. The mechanism of convergence now appears to be more complex than what was thought of earlier. Attention will be drawn to some of these fallacious arguments in the following.

1) In the case of a two-state automaton, suppose p1(n) satisfies

Δp1(n) = E[p1(n + 1) - p1(n) | p1(n)] > 0,   for all p1(n) ∈ (0,1)
        = 0,   for p1(n) = 0 and 1.

It follows immediately that lim_{n→∞} p1(n) ∈ {0,1} and that E[p1(n)] is strictly monotonically increasing. It has further been contended [LS1], [LL1] that since E[p1(n)] is bounded above by unity, E[p1(n)] → 1 as n → ∞. This in turn implies that p1(n) → 1 with probability one. However, the conclusion that E[p1(n)] → 1 is not necessarily true even though E[p1(n)] is strictly monotonically increasing. Neither is the conclusion about convergence of p1(n) to unity with probability one.

2) A stability argument to be described has been widely used [LV1], [LC1] in gaining insight into the nature of convergence. Let p1i (i = 1,2,...) be the roots (in the unit interval) of the martingale equation

Δp1(n) = E[p1(n + 1) - p1(n) | p1(n)] = 0.

The roots satisfying

dΔp1/dp1 |_{p1 = p1i} < 0

are called stable roots and the rest unstable. It is then argued, on the basis of regarding Δp1 as an increment in p1(n), that p1(n) converges to the set of stable roots with probability one. When Δp1(n) is sign definite the argument reduces to that in the previous comment 1).

On deeper probing, no justification of this argument has been found. It does not generally appear possible to prove convergence with probability one when there are roots of the martingale equation other than those corresponding to the absorbing states. Indeed, what conclusions can be drawn in such a situation is at present unclear. The only situation which can be handled presently (following Norman's approach outlined earlier) when absorbing states are present is the case when all the roots of the martingale equation correspond to these absorbing states.

In view of the preceding clarifications, some of the conclusions drawn in the earlier literature have to be modified. The nonlinear schemes regarded as optimal in restricted environments by Varshavskii and Vorontsova [LV1] as well as by Chandrasekaran and Shen [LC1] are now seen to be only ε-optimal. Similarly the LR-I scheme of Shapiro and Narendra [LS1] and the nonlinear schemes given by Viswanathan and Narendra [LV9] and Lakshmivarahan and Thathachar [LL1] are only ε-optimal in all stationary random environments. The nonlinear schemes in [LC1] where there are roots of the martingale equation other than those corresponding to the absorbing states have to be studied in greater detail in order to make definitive statements about their convergence. In connection with the LR-P scheme, the asymptotic values of the state probabilities given in [LC1] actually refer to their expectations.

Some General Remarks on Reinforcement Schemes

1) All the schemes available in the literature so far are either expedient or ε-optimal. Vorontsova [LV2] stated sufficient conditions for a scheme to be optimal for a two-state automaton operating in continuous time. It is not known at present whether these conditions ensure optimality for the discrete-time automaton of our interest.

2) The question whether expedient or ε-optimal schemes are to be used in a particular situation is of considerable interest. While ε-optimal schemes show definite advantages in stationary random environments, the situation is not clear in the case of more general environments. A detailed comparison of various schemes has been made by Viswanathan and Narendra [LV6].

3) Only schemes suitable for the P-model have been discussed here. With appropriate modifications each of these schemes can be made suitable for Q- and S-models [LC1], [LV3], [LV8].

4) For the sake of simplicity, examples mostly of two-state schemes have been given. These can, however, be extended in a simple manner for use with multistate automata [LV9], [LL4].

5) Several linear and nonlinear schemes have been discussed in this section. Often improved rates of convergence can be obtained by a combination of linear and nonlinear terms [LL3], [LL4]. ε-optimal schemes are usually slow when the action probabilities are close to their final values. The convergence can be speeded up by using a hybrid scheme constructed from a judicious combination of an ε-optimal scheme and an expedient scheme [LV10].

6) In any particular scheme, the speed of convergence can be increased by changing the values of the parameters in the scheme. However, this is almost invariably accompanied by an increase in the probability of convergence to the undesired action(s) [LL4]. We meet the classical problem of speed versus accuracy. To score on both counts, it appears that a careful selection of the nonlinear terms is necessary. Development of an analytical measure of the speed of convergence is necessary for further progress in this regard.

7) While the reinforcement scheme updates the action probabilities, a simultaneous estimation of the penalty probabilities of the environment using the Bayesian technique is proposed by Lakshmivarahan and Thathachar [LL2]. This leads to a fast estimate of the optimal action and also provides a confidence level on this estimate.
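Since the functional equation for γ(p) is rarely solvable in closed form, the probability of converging to the desired action, and the speed-versus-accuracy tradeoff of remark 6), are often examined by direct simulation. The following Monte Carlo sketch is illustrative only (names and default parameter values are assumptions); it estimates γ(p) for a two-action LR-I automaton by repeated runs.

```python
import random

def lri_step(p1, chose_first, x, a):
    """LR-I update of p1 for a two-action automaton: move only on reward."""
    if x == 1:                      # penalty: inaction
        return p1
    if chose_first:                 # action 1 rewarded
        return p1 + a * (1.0 - p1)
    return (1.0 - a) * p1           # action 2 rewarded

def estimate_gamma(p0, c1, c2, a, runs=500, steps=2000):
    """Monte Carlo estimate of gamma(p0) = Pr[p1(n) -> 1 | p1(0) = p0]."""
    hits = 0
    for _ in range(runs):
        p1 = p0
        for _ in range(steps):
            first = random.random() < p1
            c = c1 if first else c2
            x = 1 if random.random() < c else 0
            p1 = lri_step(p1, first, x, a)
        hits += p1 > 0.5            # crude proxy for absorption at p1 = 1
    return hits / runs

# With c1 < c2, smaller values of a push the estimate of gamma(p0) towards 1
# but require many more steps: the speed-versus-accuracy problem of remark 6).
```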


Fig. 4. Game between learning automata. (Two learning automata A1 and A2 supply their actions to a referee (or environment), whose responses x(1) and x(2) form the inputs to the respective automata.)

V. INTERACTION OF AUTOMATA

The discussions made thus far have been centered around a single stochastic automaton interacting with a random environment. We shall now consider interactions between several automata. In particular, two types of interactions are of interest. In one case several automata are operating together in an environment either in a competitive or a cooperative manner so that we have a game situation. In the other case we consider a hierarchical system where there are various levels of automata, and there is interaction between different levels. Many interesting aspects of interaction, mostly with reference to deterministic automata, have been brought out in a series of papers by Varshavskii [DV5]-[DV8].

Games of Automata

Consider two automata operating in the same environment (Fig. 4). Each automaton selects its actions without any knowledge of the operation of the other automaton. For each pair of actions selected by the automata, the environment responds in a random manner with outcomes that form zero-sum inputs to the two automata. The action probabilities of the two automata are now independently updated according to suitable reinforcement schemes, and the procedure is repeated. The interest here is in the asymptotic behavior of the action probabilities and inputs.

This problem may be regarded as a game between the two automata. In this context several quantities can be redefined as follows. The event corresponding to the selection of a pair of actions by the automata is called a play. A game consists of a sequence of plays. Each action chosen by an automaton in a play is called a strategy. The environment is known as a referee, and the input received by each automaton corresponds to the payoff. Since the sum of the payoffs is zero, the game corresponds to a zero-sum two-person game. The asymptotic value of the expectation of the payoff to one of the players is called the value of the game.

In the classical theory of games developed by Von Neumann [GM1], [GL2], [GV1] it is assumed that the players are provided with complete information regarding the game, such as the number of players, their possible strategies, and the payoff matrix. The players can, therefore, compute their optimal strategies using this prior information. In the automata games being considered here such prior knowledge is not available, and the automata have to choose their strategies on-line. The payoff function is a random variable with an unknown distribution.

The concept of automata games was originally suggested by Krylov and Tsetlin [DK1], who considered competitive games of deterministic automata. These results were extended by the introduction of learning automata by Chandrasekaran and Shen [LC3], who used the LR-P and nonlinear schemes. More recently Viswanathan and Narendra [LV7] discussed such games using ε-optimal schemes. It was demonstrated in [LV7] that when the payoff matrix has a unique saddle point, the value of the game coincides with that obtained in the deterministic situation, even though the characteristics of the game have to be learned as the game evolves. Since the automata operate entirely on the basis of their own strategies and the corresponding response of the referee, without any knowledge of the other automata participating in the game, this result also has implications in the context of decentralized control systems.

The preceding discussion pertains to competitive game problems, because each player, in trying to maximize his gain, is also attempting to minimize the gain of the other player. When the participants of a game cooperate with each other, we have a cooperative game. Stefanyuk and Vaisboard [DS1], [DV1], [DV2] considered cooperative games using deterministic automata. Viswanathan and Narendra [LV7] have shown that the value of the game can be made as close to the Von Neumann value as desired by using ε-optimal schemes when the payoff matrix has a unique saddle point.

The results available so far on learning automata games are few and limited. It appears that there is a broad scope for further study.
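A competitive game of the kind shown in Fig. 4 is straightforward to simulate. The sketch below is illustrative (the payoff description and the parameter values are assumptions, not taken from the paper); two LR-I automata independently update their strategy probabilities from zero-sum binary payoffs supplied by a random referee.

```python
import random

def lri(p, i, x, a):
    """LR-I update on a probability list: change p only on reward (x = 0)."""
    if x == 1:
        return p
    q = [(1.0 - a) * pj for pj in p]
    q[i] = p[i] + a * (1.0 - p[i])
    return q

def play_game(penalty_matrix, a=0.05, plays=20000):
    """penalty_matrix[i][j] is the probability that player A is penalized
    (and player B rewarded) when A plays row i and B plays column j."""
    pa = [1.0 / len(penalty_matrix)] * len(penalty_matrix)
    pb = [1.0 / len(penalty_matrix[0])] * len(penalty_matrix[0])
    for _ in range(plays):
        i = random.choices(range(len(pa)), weights=pa)[0]
        j = random.choices(range(len(pb)), weights=pb)[0]
        a_penalized = random.random() < penalty_matrix[i][j]
        pa = lri(pa, i, 1 if a_penalized else 0, a)   # zero-sum inputs:
        pb = lri(pb, j, 0 if a_penalized else 1, a)   # one's penalty is the other's reward
    return pa, pb
```

When the underlying matrix has a unique saddle point, both strategy vectors tend to concentrate on the saddle-point strategies, in line with the result of [LV7] quoted above.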


Multilevel Automata

A multilevel system of automata consists of several levels, each comprising many automata. Each action of an automaton at a certain level triggers automata at the next lower level, and thus the complete system has a tree structure. At each stage, decisions are made at various levels and communicated to lower levels in the hierarchy. The purpose of organizing such a multilevel system may be to achieve a performance that cannot be obtained using a single automaton or to overcome the high dimensionality of the decision space. The main problem in such multilevel systems is the coordination of activities at different levels or, in other words, to ensure convergence of the entire structure towards the desired behavior.

A two-level system has been proposed by Narendra and Viswanathan [LN1] for the optimization of the performance of automata operating in a periodic random environment. It is assumed that the penalty probabilities ci(n) are periodic functions of n with an unknown period. The upper bound on the period is known.

The first level consists of one automaton in which the actions correspond to the various possible values of the period. The second level consists of a number of automata, one automaton corresponding to each of these first-level actions. Following the selection of a period T(n) by the first-level automaton, the first T(n) automata in the second level are initiated to operate in the environment for one cycle. The average output of the environment in this cycle is used as the input to the first level to make the next selection of the value of the period. It has been shown through simulations [LN1] that the use of ε-optimal schemes leads to the convergence of the period to the true value.

VI. NONSTATIONARY ENVIRONMENTS

Most of the available work relates to the behavior of learning automata in stationary environments. The problem of behavior in nonstationary environments appears difficult, and only a few and specialized results are known [LL6], [LC4]. The interaction of automata in game situations and in hierarchical systems is one such special case. Some of the other known results will be described in this section.

Tsetlin [DT1] considered a composite environment, which switches between a number of stationary environments in accordance with a Markov chain, and investigated the behavior of deterministic automata [DT2]. Varshavskii and Vorontsova [LV1] applied the learning automaton to the same problem and showed that expediency can be obtained for the entire range of possible values of the parameter of the switching transition probability matrix.

In a very limited situation where the Markov chain governing switching of the environment reaches its stationarity quickly in comparison with the time taken for the convergence of the reinforcement scheme, the problem essentially reduces to the operation in a stationary environment; hence, ε-optimal schemes would perform well.

When the penalty probabilities of the environment vary "slowly" in time, an ε-optimal scheme tends to lock on to a certain action, thereby losing its ability to change. As studied by Chandrasekaran and Shen [LC2], if the frequency of variation is sufficiently small, then the LR-P scheme, which is expedient, seems to function satisfactorily. If prior information about the rate of variation is available, the performance of ε-optimal schemes can be improved by introducing reinitialization of the action probabilities (that is, resetting them to be equal) at regular intervals of time. Another possible approach appears to be to bound the action probabilities so that they fall short of attaining the absorbing barriers and are thus free to change according to changes in the environment. A comparison of the performances of the expedient linear scheme and ε-optimal schemes with the suggested modification is not available at the present time.

Periodic Environments

If it is known a priori that the penalty probabilities of the environment vary periodically in time with a common period T, the period T may be divided into N intervals. A system of N automata with a suitable arrangement can then be used so that one automaton is in operation at any instant of time and each automaton operates only once in every period. This is equivalent to each automaton operating in a stationary environment, so that over many cycles of operation the automata converge to the desired actions. In the case in which the environment is known to be periodic but the value of the period is unknown, a two-level automaton can be used as already described in Section V.

VII. UTILIZATION OF AUTOMATA

The learning automata discussed in earlier sections can be utilized to perform certain specific functions in a systems context. In particular, they may be used as optimizers or as decision-making devices and consequently can prove useful in a large class of practical problems.

Parameter Optimization

As remarked earlier, many problems of adaptive control, pattern recognition, filtering, and identification can, under proper assumptions, be regarded as parameter optimization problems. It appears that a learning automaton can be fruitfully applied to solve such problems especially under noisy conditions when the a priori information is small. In fact, the possibility of using a stochastic automaton as a model for a learning controller provided the first motivation for studies in this area.

Given a system with only noisy measurements g(α,ω) on the performance function I(α) = E{g(α,ω)}, where α is an m-vector of parameters and ω is the measurement noise, the parameter optimization problem is to determine the optimal vector of parameters α_opt such that the system performance function is extremized. It is assumed that an analytical solution is not possible because of lack of sufficient information concerning the structure of the system and its performance function or because of mathematical intractability. The performance function I(α) may also be multimodal.

When this problem is tackled by gradient methods, both deterministic and stochastic, a search in the parameter space is carried out resulting in convergence to a local optimum. In the automaton approach the adaptive controller is the automaton in which the actions correspond to different values of α. The automaton updates the probabilities of the various parameter values based on the measurements of the performance function.
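A single-automaton optimizer over a discretized parameter set can be sketched briefly. The fragment below is illustrative (the names, the normalization used to produce a binary response, and the parameter values are assumptions); it uses the LR-I update of Section IV and converts the noisy measurement into a P-model penalty by normalizing it with known bounds.

```python
import random

def optimize(parameter_grid, measure, bounds, a=0.02, steps=20000):
    """Learning-automaton parameter optimizer (illustrative sketch).

    parameter_grid : candidate parameter values (the automaton's actions)
    measure        : function returning a noisy measurement g(alpha, omega)
    bounds         : (low, high) known bounds on the measurement, used to
                     normalize it to [0, 1] and turn it into a random binary
                     penalty for a P-model automaton (smaller is better here)
    """
    low, high = bounds
    r = len(parameter_grid)
    p = [1.0 / r] * r
    for _ in range(steps):
        i = random.choices(range(r), weights=p)[0]
        g = measure(parameter_grid[i])
        penalty_prob = min(1.0, max(0.0, (g - low) / (high - low)))
        x = 1 if random.random() < penalty_prob else 0
        if x == 0:                           # LR-I: update only on reward
            p = [(1.0 - a) * pj for pj in p]
            p[i] += a                        # net effect: p_i -> p_i + a(1 - p_i)
    return parameter_grid[max(range(r), key=lambda k: p[k])]
```

For a maximization problem the normalization would simply be reversed.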


parameter value has a certain probability of being selected number (say, r) of regions v1,v2, * ,vr. There is one autom-and only these probabilities are updated. Thus the learning aton at the first level having r actions each corresponding toautomaton has the desired flexibility not to get locked on one region, and this automaton acts as a supervisor govern-to a local optimum, and this difference makes automata ing the choice of one of the regions. When discretization ofmethods particularly attractive for use in systems having the parameter space is permissible, the second level has rmultimodal performance criteria. automata, each automaton having control over one region.

Several studies have been made on applying automata If, on the contrary, the parameter space is to be treated asmethods to parameter optimization. When a P-model continuous, the second level consists of a local search suchautomaton is used, one problem is to relate the performance as stochastic approximation. In this case it is necessary thatmeasurement, which is usually a continuous real variable, the number of regions be large enough so that each regionto a binary-valued response of the environment. Fu and has at most one local extremum if the two-level system isMcLaren [LF2] avoided this by defining the penalty- to act as a global optimizer. Narendra and Viswanathanstrength (S-) model where the environmental response can [LV5] demonstrated through simulation that the two-leveltake any value in [0,1]. McMurtry and Fu [LM3] applied system exhibits a faster convergence rate than a singlea learning automaton using the LR-P scheme to find the automaton.global minimum of a multimodal surface and showed that 2) Cooperative Games of Automata: The results ofthe automaton chooses the global minimum with the cooperative games of automata can be naturally applied tohighest probability. The use of c-optimal schemes such as the parameter optimization problem as mentioned inthe LR-I scheme was advocated by Shapiro and Narendra Section V. Consider the problem where there are m[LSI], [LWI], who reported success on the difficult parameters each of which is discretized into r values. Theproblem of handling a relatively flat performance surface parameter space now has rm points. If r and m are large, thealong with the superposition of a high variance noise. use of a single automaton controller would lead to a slowWhen the bounds on the performance measurements are rate of convergence. Instead, if the controller consists of m

A problem of higher complexity was considered by Jarvis [LJ1], [LJ2], who studied, through simulation, the operation of a learning automaton using the L_R-P scheme as a global optimizer in a nonstationary environment. A pattern recognizer was used for sensing the changes in the environment.

The restriction that the set of parameter values considered must be finite is sometimes undesirable. To overcome this, McLaren [LM2] has proposed a "growing automaton" where the number of actions of the automaton can grow countably to infinity. A comparison of several on-line learning algorithms, which include the growing automaton algorithm, was recently made by Saridis [GS2].

Problem of High Dimensionality: A basic problem associated with the use of automata methods in parameter optimization is that of high dimensionality. The problem is caused by the fact that the number of control actions of the automaton rapidly increases with the number of parameters and the "fineness" of the grid employed in the search. The speed of convergence of the updating schemes is to a large extent dependent upon the number of control actions, and thus when the parameter space contains a large number of points the convergence is too slow to be of any practical use. In case the parameter space is continuous, the automata method cannot be applied directly.

There are two methods of overcoming this problem of high dimensionality, as presented in the following discussion. Both the methods employ several interacting automata.

1) Two-Level Controller: The controller has a two-level structure here. The parameter space is divided into a finite number (say, r) of regions v_1, v_2, ..., v_r. There is one automaton at the first level having r actions, each corresponding to one region, and this automaton acts as a supervisor governing the choice of one of the regions. When discretization of the parameter space is permissible, the second level has r automata, each automaton having control over one region. If, on the contrary, the parameter space is to be treated as continuous, the second level consists of a local search such as stochastic approximation. In this case it is necessary that the number of regions be large enough so that each region has at most one local extremum if the two-level system is to act as a global optimizer. Narendra and Viswanathan [LV5] demonstrated through simulation that the two-level system exhibits a faster convergence rate than a single automaton.
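A minimal sketch of this two-level organization is given below, assuming for simplicity that both levels are variable-structure automata updated by the same reward-inaction rule. The class and function names, the discretization, and the normalization are illustrative assumptions and do not reproduce the scheme of [LV5].

```python
import random

class LriAutomaton:
    """Minimal variable-structure automaton with a linear reward-inaction
    update (illustrative; the learning parameter is an assumption)."""

    def __init__(self, n_actions, a=0.05):
        self.p = [1.0 / n_actions] * n_actions
        self.a = a

    def choose(self):
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, chosen, beta):
        # beta in [0, 1]; beta = 0 is the most favourable response
        step = self.a * (1.0 - beta)
        for i in range(len(self.p)):
            self.p[i] += step * (1.0 - self.p[i]) if i == chosen else -step * self.p[i]


def two_level_search(regions, cost, steps=5000, seed=1):
    """A supervisor automaton picks one of the r regions; the automaton
    attached to that region picks one of its grid points.  Both levels are
    updated with the same normalized cost.  `regions` is a list of lists of
    candidate parameter values (a hypothetical discretization)."""
    random.seed(seed)
    supervisor = LriAutomaton(len(regions))
    region_automata = [LriAutomaton(len(r)) for r in regions]
    lo, hi = float("inf"), float("-inf")
    for _ in range(steps):
        r = supervisor.choose()
        k = region_automata[r].choose()
        J = cost(regions[r][k])
        lo, hi = min(lo, J), max(hi, J)
        beta = 0.0 if hi == lo else (J - lo) / (hi - lo)
        supervisor.update(r, beta)
        region_automata[r].update(k, beta)
    return supervisor.p, [a.p for a in region_automata]
```

In the sketch the second level is discretized; replacing the per-region automata by a local stochastic-approximation step would correspond to the continuous case described above.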

2) Cooperative Games of Automata: The results of cooperative games of automata can be naturally applied to the parameter optimization problem, as mentioned in Section V. Consider the problem where there are m parameters, each of which is discretized into r values. The parameter space now has r^m points. If r and m are large, the use of a single automaton controller would lead to a slow rate of convergence. Instead, if the controller consists of m automata, each of which chooses the value of one parameter, then the number of probabilities to be updated at each stage would only be rm (r probabilities for each of the m automata). The m automata used in this manner could be regarded as playing a cooperative game with the common object of extremizing the performance function. It follows from the results of Viswanathan and Narendra [LV5], [LV7] that if the performance function is unimodal, the use of ε-optimal schemes by the m-automaton controller leads to convergence of the parameter values to the optimum with as high a probability as desired. Further research is needed to extend this approach to multimodal search problems.
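The saving can be seen from a small example: with m = 2 parameters discretized into r = 10 levels each, a single composite automaton would maintain 10^2 = 100 action probabilities, whereas two cooperating automata maintain only 20. The sketch below is illustrative only and does not follow [LV5] or [LV7] in detail; it lets each automaton choose its own parameter value and gives all of them the same normalized cost as a common payoff.

```python
import random

def cooperative_search(levels, cost, a=0.05, steps=20000, seed=2):
    """One automaton per parameter; automaton j keeps len(levels[j]) action
    probabilities, so only r1 + r2 + ... + rm numbers are stored and updated
    instead of r1*r2*...*rm.  All automata receive the same normalized cost
    (a common payoff) and apply a reward-inaction update."""
    random.seed(seed)
    probs = [[1.0 / len(vals)] * len(vals) for vals in levels]
    lo, hi = float("inf"), float("-inf")
    for _ in range(steps):
        # each automaton independently chooses a level of "its" parameter
        choice = [random.choices(range(len(p)), weights=p)[0] for p in probs]
        x = [vals[c] for vals, c in zip(levels, choice)]
        J = cost(x)
        lo, hi = min(lo, J), max(hi, J)
        beta = 0.0 if hi == lo else (J - lo) / (hi - lo)
        step = a * (1.0 - beta)
        for p, c in zip(probs, choice):      # common payoff: same update for every player
            for i in range(len(p)):
                p[i] += step * (1.0 - p[i]) if i == c else -step * p[i]
    return probs

# hypothetical example: two parameters with ten levels each (20 probabilities
# rather than the 100 a single composite automaton would need)
grid = [[i / 9 for i in range(10)], [i / 9 for i in range(10)]]
final = cooperative_search(grid, lambda x: (x[0] - 0.4) ** 2 + (x[1] - 0.7) ** 2)
```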


Statistical Decision-Making

As the learning automaton selects the desirable action from the set of all actions on the basis of probabilistic inputs from the environment, one can regard the automaton as a device for making decisions under uncertainty. It can thus be expected to be used as a tool for solving statistical decision problems.

Many problems in control and communication can be posed as the fundamental problem of deciding between two hypotheses H1 and H2 on the basis of noisy observations x(n). The conditional densities p(x|H1) and p(x|H2) are given, and the problem is to decide whether H1 or H2 is true so that the probabilities of making errors of the two kinds are less than prespecified values.

In order to apply the learning automaton approach to this situation it is necessary to make certain identifications. The two hypotheses H1 and H2 are made to correspond to the two actions of an automaton, and the observations x(n) are regarded as the responses from the environment. Binary responses required for a P-model are obtained by using a threshold. As no a priori information on the true hypothesis is available, the initial action probabilities are set at 0.5 each, and the automaton is allowed to operate according to an ε-optimal scheme. The hypothesis corresponding to the action in which the probability attains unity is taken as the true one. A design procedure for choosing the threshold and the parameters of the reinforcement scheme so as to satisfy any prespecified bounds on the error probabilities has been worked out by Lakshmivarahan and Thathachar [LL5]. Extension to multiple-hypothesis testing is also straightforward [LL4].
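A minimal sketch of these identifications is given below, with an assumed pair of Gaussian densities, an assumed threshold, and an assumed stopping level standing in for the design procedure of [LL5]; none of the numerical values are taken from the references.

```python
import random

def automaton_hypothesis_test(observe, threshold, a=0.02, max_steps=20000, seed=3):
    """Two actions correspond to H1 and H2.  The automaton starts from
    p = [0.5, 0.5], each observation x(n) is reduced to a binary (P-model)
    response by comparison with a threshold, and a reward-inaction update
    is applied.  Densities, threshold, learning parameter, and stopping
    level are illustrative assumptions."""
    random.seed(seed)
    p = [0.5, 0.5]                                   # no a priori preference
    for _ in range(max_steps):
        i = random.choices([0, 1], weights=p)[0]     # 0 -> decide H1, 1 -> decide H2
        x = observe()
        favours_h2 = x > threshold                   # thresholded observation
        if (i == 1) == favours_h2:                   # reward the supported action
            p[i] += a * (1.0 - p[i])
            p[1 - i] -= a * p[1 - i]
        if max(p) > 0.999:                           # assumed stopping level near unity
            break
    return ("H2" if p[1] > p[0] else "H1"), p

# hypothetical example: x ~ N(0, 1) under H1 and x ~ N(1, 1) under H2, truth = H2
decision, final_p = automaton_hypothesis_test(lambda: random.gauss(1.0, 1.0), threshold=0.5)
```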

Testing Reinforcement Schemes

Attempts to develop new learning models in recent years have resulted in a variety of linear and nonlinear reinforcement schemes with adjustable parameters. It has consequently become increasingly difficult to compare two schemes for any given application. A novel approach suggested by Viswanathan and Narendra [LV6] is to set up a competitive game between two automata using the given schemes. A suitable game matrix with a unique saddle point is chosen, and the automata are allowed to play the game. The behavior of the two automata may then be used to determine the effects of parameters as well as the superiority of one of the schemes. Several such comparisons have been made in [LV6], and the superiority of ε-optimal schemes over expedient schemes in stationary environments has been demonstrated.
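The mechanics of such a comparison can be sketched as follows. A 2 x 2 game matrix with a unique saddle point is assumed, and, since the purpose is only to illustrate the set-up, the two "schemes" being compared are simply reward-inaction updates with different learning parameters rather than the schemes actually studied in [LV6].

```python
import random

# Entry (i, j) is the probability that automaton A is rewarded (and B penalized)
# when A plays row i and B plays column j.  The matrix is an assumed example
# with a unique saddle point at (row 0, column 1), of value 0.6.
GAME = [[0.8, 0.6],
        [0.3, 0.2]]

def lri(p, chosen, rewarded, a):
    """Reward-inaction update: probabilities change only on a reward."""
    if rewarded:
        for i in range(len(p)):
            p[i] = p[i] + a * (1.0 - p[i]) if i == chosen else p[i] - a * p[i]

def compare_by_game(a_A, a_B, steps=30000, seed=4):
    """Let two automata play the zero-sum game above and return their final
    mixed strategies; quicker, more decisive convergence to the saddle-point
    strategies is read as evidence in favour of the corresponding scheme."""
    random.seed(seed)
    pA, pB = [0.5, 0.5], [0.5, 0.5]
    for _ in range(steps):
        i = random.choices([0, 1], weights=pA)[0]
        j = random.choices([0, 1], weights=pB)[0]
        a_rewarded = random.random() < GAME[i][j]
        lri(pA, i, a_rewarded, a_A)          # A is rewarded with probability GAME[i][j]
        lri(pB, j, not a_rewarded, a_B)      # B is rewarded exactly when A is not
    return pA, pB

print(compare_by_game(0.01, 0.05))
```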

VIII. APPLICATIONS

As mentioned in earlier sections, the theory of learning automata is still in its infancy, and the few significant results that are available were obtained only in recent years. Naturally, the application of these newly developed concepts to real-world problems has lagged somewhat, and only a few attempts have so far been reported.

An early application of the automaton approach was studied by Riordon [LR1], [LR2] in connection with the control of heat treatment regarded as a Markov process with unknown dynamics. The automaton acts as the adaptive controller in modeling the process as well as generating the appropriate control signals. Glorioso et al. [LG1], [LG2] have discussed adaptive routing in large communication networks using heuristic ideas of probabilistic iteration. Recently these ideas have been extended by Narendra, Tripathi, and Mason [LN5] to the telephone traffic routing problem for a real large-scale telephone network in British Columbia. Li and Fu [LL8] have applied a stochastic automaton to select features in agricultural data obtained by remote sensing. Several interesting applications to numerical analysis, random counters, and modeling appear to be contained in a recent Russian publication edited by Lorentz and Levin [LL7].

Other attempts have been made in the Soviet Union to apply automaton methods to a variety of situations, such as reliability [DG1], power regulation [DS2], switching circuits [DV3], queuing systems [DV4], and detection [DR1]. While most of these studies are concerned with deterministic automata models, there does not appear to be any reason why learning automata cannot be used in similar situations.

Learning automata provide a novel and computationally attractive mode of attacking a large class of problems involving uncertainties of a high order. As such they constitute an alternative approach to the well-known parameter optimization method using stochastic approximation. It is the opinion of the authors that a judicious combination of the two approaches will find increasing application in many practical problems in the future.

ACKNOWLEDGMENT

The authors would like to thank R. Viswanathan and S. Lakshmivarahan, who worked with them on several aspects of learning automata.

REFERENCES

General

[GA1] R. C. Atkinson, G. H. Bower, and E. J. Crothers, An Introduction to Mathematical Learning Theory. New York: Wiley, 1965.
[GB1] R. R. Bush and F. Mosteller, Stochastic Models for Learning. New York: Wiley, 1958.
[GF1] K. S. Fu, "Learning control systems-Review and outlook," IEEE Trans. Automat. Contr., vol. AC-15, pp. 210-221, Apr. 1970.
[GF2] K. S. Fu, Ed., Pattern Recognition and Machine Learning. New York: Plenum, 1971.
[GI1] M. Iosifescu and R. Theodorescu, Random Processes and Learning. New York: Springer, 1969.
[GL1] R. D. Luce, Individual Choice Behavior. New York: Wiley, 1959.
[GL2] R. D. Luce and H. Raiffa, Games and Decisions. New York: Wiley, 1957.
[GM1] J. McKinsey, Introduction to the Theory of Games. New York: McGraw-Hill, 1952.
[GM2] J. M. Mendel and K. S. Fu, Eds., Adaptive, Learning and Pattern Recognition Systems. New York: Academic, 1970.
[GM3] J. M. Mendel, "Reinforcement learning models and their applications to control problems," Learning Systems-A Symposium of the 1973 Joint Automatic Control Conf. (ASME publication), pp. 3-18, 1973.
[GN1] M. F. Norman, Markov Processes and Learning Models. New York: Academic, 1972.
[GP1] A. Paz, Introduction to Probabilistic Automata. New York: Academic, 1971.
[GS1] J. Sklansky, "Learning systems for automatic control," IEEE Trans. Automat. Contr., vol. AC-11, pp. 6-19, Jan. 1966.
[GS2] G. N. Saridis, "On-line learning control algorithms," Learning Systems-A Symposium of the 1973 Joint Automatic Control Conf. (ASME publication), pp. 19-52, 1973.
[GT1] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems. New York: Academic, 1971.
[GV1] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton, N.J.: Princeton Univ., 1947.


Deterministic Automata

[DG1] S. L. Ginzburg and M. L. Tsetlin, "Some examples of simulation of the collective behavior of automata," Probl. Peredachi Informatsii, vol. 1, no. 2, pp. 54-62, 1965.
[DK1] V. Yu. Krylov and M. L. Tsetlin, "Games between automata," Avtomat. Telemekh., vol. 24, pp. 975-987, July 1963.
[DR1] L. E. Radyuk and A. F. Terpugov, "Effectiveness of applying automata with linear tactic in signal detection systems," Avtomat. Telemekh., vol. 32, pp. 99-107, Apr. 1971.
[DS1] V. L. Stefanyuk, "Example of a problem in the joint behavior of two automata," Avtomat. Telemekh., vol. 24, pp. 781-784, June 1963.
[DS2] V. L. Stefanyuk and M. L. Tsetlin, "Power regulation in a group of radio stations," Probl. Peredachi Informatsii, vol. 3, no. 4, pp. 49-57, 1967.
[DT1] M. L. Tsetlin, "On the behavior of finite automata in random media," Avtomat. Telemekh., vol. 22, pp. 1345-1354, Oct. 1961.
[DT2] H. Tsuji, M. Mizumoto, J. Toyoda, and K. Tanaka, "An automaton in the non-stationary random environment," Inform. Sci., vol. 6, no. 2, pp. 123-142, Apr. 1973.
[DV1] B. M. Vaisbord, "Game of two automata with differing memory depths," Avtomat. Telemekh., vol. 29, pp. 91-102, 1968.
[DV2] B. M. Vaisbord, "Game of many automata with various depths of memory," Avtomat. Telemekh., vol. 29, pp. 52-58, Dec. 1968.
[DV3] V. I. Varshavskii, M. V. Meleshina, and M. L. Tsetlin, "Behavior of automata in periodic random media and the problem of synchronization in the presence of noise," Probl. Peredachi Informatsii, vol. 1, no. 1, pp. 65-71, 1965.
[DV4] V. I. Varshavskii, M. V. Meleshina, and M. L. Tsetlin, "Priority organization in queuing systems using a model of collective behavior," Probl. Peredachi Informatsii, vol. 4, no. 1, pp. 73-76, 1968.
[DV5] V. I. Varshavskii, "Collective behavior and control problems," in Machine Intelligence 3, D. Michie, Ed. Edinburgh: Edinburgh Univ., 1968.
[DV6] V. I. Varshavskii, "The organization of interaction in collectives of automata," in Machine Intelligence 4, B. Meltzer and D. Michie, Eds. Edinburgh: Edinburgh Univ., 1969.
[DV7] V. I. Varshavskii, "Some effects in the collective behavior of automata," in Machine Intelligence 7, B. Meltzer and D. Michie, Eds. Edinburgh: Edinburgh Univ., 1972.
[DV8] V. I. Varshavskii, "Automata games and control problems," presented at the 5th World Congr. IFAC, Paris, France, June 12-17, 1972.

Learning Automata

[LC1] B. Chandrasekaran and D. W. C. Shen, "On expediency and convergence in variable-structure automata," IEEE Trans. Syst. Sci. Cybern., vol. SSC-4, pp. 52-60, Mar. 1968.
[LC2] B. Chandrasekaran and D. W. C. Shen, "Adaptation of stochastic automata in nonstationary environments," Proc. NEC, vol. 23, pp. 39-44, 1967.
[LC3] B. Chandrasekaran and D. W. C. Shen, "Stochastic automata games," IEEE Trans. Syst. Sci. Cybern., vol. SSC-5, pp. 145-149, Apr. 1969.
[LC4] L. D. Cockrell and K. S. Fu, "On search techniques in switching environment," Proc. 9th Symposium Adaptive Processes, Austin, Tex., Dec. 1970.
[LC5] T. M. Cover and M. E. Hellman, "The two-armed bandit problem with time-invariant finite memory," IEEE Trans. Inform. Theory, vol. IT-16, pp. 185-195, Mar. 1970.
[LF1] K. S. Fu and G. J. McMurtry, "A study of stochastic automata as models of adaptive and learning controllers," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 65-8, 1965.
[LF2] K. S. Fu and R. W. McLaren, "An application of stochastic automata to the synthesis of learning systems," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 65-17, 1965.
[LF3] K. S. Fu, "Stochastic automata, stochastic languages and pattern recognition," J. Cybern., vol. 1, no. 3, pp. 31-49, July-Sept. 1971.
[LF4] K. S. Fu and T. J. Li, "Formulation of learning automata and automata games," Inform. Sci., vol. 1, no. 3, pp. 237-256, July 1969.
[LF5] K. S. Fu and T. J. Li, "On stochastic automata and languages," Inform. Sci., vol. 1, no. 4, pp. 403-419, Oct. 1969.
[LF6] K. S. Fu, "Stochastic automata as models of learning systems," in Computer and Information Sciences II, J. T. Tou, Ed. New York: Academic, 1967.
[LG1] R. M. Glorioso, G. R. Grueneich, and J. C. Dunn, "Self organization and adaptive routing for communication networks," 1969 EASCON Rec., pp. 243-250.
[LG2] R. M. Glorioso and G. R. Grueneich, "A training algorithm for systems described by stochastic transition matrices," IEEE Trans. Syst., Man, Cybern., vol. SMC-1, pp. 86-87, Jan. 1971.
[LG3] B. R. Gaines, "Stochastic computing systems," in Advances in Information Sciences 2. New York: Plenum, 1969, pp. 37-172.
[LJ1] R. A. Jarvis, "Adaptive global search in a time-invariant environment using a probabilistic automaton," Proc. IREE, Australia, pp. 210-226, 1969.
[LJ2] R. A. Jarvis, "Adaptive global search in a time-variant environment using a probabilistic automaton with pattern recognition supervision," IEEE Trans. Syst. Sci. Cybern., vol. SSC-6, pp. 209-217, July 1970.
[LL1] S. Lakshmivarahan and M. A. L. Thathachar, "Optimal nonlinear reinforcement schemes for stochastic automata," Inform. Sci., vol. 4, pp. 121-128, 1972.
[LL2] S. Lakshmivarahan and M. A. L. Thathachar, "Bayesian learning and reinforcement schemes for stochastic automata," Proc. Int. Conf. Cybernetics and Society, Washington, D.C., Oct. 1972.
[LL3] S. Lakshmivarahan and M. A. L. Thathachar, "Absolutely expedient learning algorithms for stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 281-286, May 1973.
[LL4] S. Lakshmivarahan, "Learning algorithms for stochastic automata," Ph.D. dissertation, Dep. Elec. Eng., Indian Institute of Science, Bangalore, India, Jan. 1973.
[LL5] S. Lakshmivarahan and M. A. L. Thathachar, "Hypothesis testing using variable-structure stochastic automata," in 1973 IEEE Decision and Control Conf., Preprints, San Diego, Calif.
[LL6] G. Langholtz, "Behavior of automata in a nonstationary environment," Electron. Lett., vol. 7, no. 12, pp. 348-349, June 17, 1971.
[LL7] A. A. Lorentz and V. I. Levin, Eds., Probabilistic Automata and Their Applications (in Russian). Riga, Latvia, USSR: Zinatne, 1971.
[LL8] T. J. Li and K. S. Fu, "Automata games, stochastic automata and formal languages," Purdue Univ., Lafayette, Ind., Tech. Rep. TR-EE 69-1, Jan. 1969.
[LM1] R. W. McLaren, "A stochastic automaton model for synthesis of learning systems," IEEE Trans. Syst. Sci. Cybern., vol. SSC-2, pp. 109-114, Dec. 1966.
[LM2] R. W. McLaren, "A stochastic automaton model for a class of learning controllers," in 1967 Joint Automatic Control Conf., Preprints.
[LM3] G. J. McMurtry and K. S. Fu, "A variable-structure automaton used as a multimodal search technique," IEEE Trans. Automat. Contr., vol. AC-11, pp. 379-387, July 1966.
[LM4] L. G. Mason, "Self optimizing allocation systems," Ph.D. dissertation, University of Saskatchewan, Saskatoon, Sask., Canada, 1972.
[LM5] L. G. Mason, "An optimal learning algorithm for S-model environments," IEEE Trans. Automat. Contr., vol. AC-18, pp. 493-496, Oct. 1973.
[LN1] K. S. Narendra and R. Viswanathan, "A two-level system of stochastic automata for periodic random environments," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, pp. 285-289, Apr. 1972.
[LN2] K. S. Narendra and R. Viswanathan, "Learning models using stochastic automata," in Proc. 1972 Int. Conf. Cybernetics and Society, Washington, D.C., Oct. 9-12.
[LN3] M. F. Norman, "On linear models with two absorbing barriers," J. Math. Psychol., vol. 5, pp. 225-241, 1968.
[LN4] M. F. Norman, "Some convergence theorems for stochastic learning models with distance diminishing operators," J. Math. Psychol., vol. 5, pp. 61-101, 1968.
[LN5] K. S. Narendra, S. S. Tripathi, and L. G. Mason, "Application of learning automata to telephone traffic routing problems," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-60, Jan. 1974.
[LR1] J. S. Riordon, "Optimal feedback characteristics from stochastic automaton models," IEEE Trans. Automat. Contr., vol. AC-14, pp. 89-92, Feb. 1969.
[LR2] J. S. Riordon, "An adaptive automaton controller for discrete-time Markov processes," Automatica (IFAC Journal), vol. 5, no. 6, pp. 721-730, Nov. 1969.
[LS1] I. J. Shapiro and K. S. Narendra, "Use of stochastic automata for parameter self-optimization with multimodal performance criteria," IEEE Trans. Syst. Sci. Cybern., vol. SSC-5, pp. 352-360, Oct. 1969.
[LS2] I. J. Shapiro, "The use of stochastic automata in adaptive control," Ph.D. dissertation, Dep. Eng. and Applied Sci., Yale Univ., New Haven, Conn., 1969.
[LS3] Y. Sawaragi and N. Baba, "A consideration on the learning behavior of variable structure stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 644-647, Nov. 1973.
[LT1] Ya. Z. Tsypkin and A. S. Poznyak, "Finite learning automata," Eng. Cybern., vol. 10, pp. 478-490, May-June 1972.
[LV1] V. I. Varshavskii and I. P. Vorontsova, "On the behavior of stochastic automata with variable structure," Avtomat. Telemekh., vol. 24, pp. 353-360, Mar. 1963.
[LV2] I. P. Vorontsova, "Algorithms for changing stochastic automata transition probabilities," Probl. Peredachi Informatsii, vol. 1, no. 3, pp. 122-126, 1965.
[LV3] R. Viswanathan and K. S. Narendra, "Stochastic automata models with applications to learning systems," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, Jan. 1973.
[LV4] R. Viswanathan and K. S. Narendra, "A note on the linear reinforcement scheme for variable-structure stochastic automata," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, pp. 292-294, Apr. 1972.
[LV5] R. Viswanathan, "Learning automata: models and applications," Ph.D. dissertation, Dep. Eng. and Applied Sci., Yale Univ., New Haven, Conn., 1972.
[LV6] R. Viswanathan and K. S. Narendra, "Comparison of expedient and optimal reinforcement schemes for learning systems," J. Cybern., vol. 2, no. 1, pp. 21-37, 1972.
[LV7] R. Viswanathan and K. S. Narendra, "Competitive and cooperative games of variable-structure stochastic automata," in Joint Automatic Control Conf., Preprints, Aug. 1972; also available as Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-44, Nov. 1971.
[LV8] R. Viswanathan and K. S. Narendra, "Application of stochastic automata models to learning systems with multimodal performance criteria," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-40, June 1971.
[LV9] R. Viswanathan and K. S. Narendra, "Expedient and optimal variable-structure stochastic automata," Dunham Lab., Yale Univ., New Haven, Conn., Tech. Rep. CT-31, Apr. 1970.
[LV10] R. Viswanathan and K. S. Narendra, "Simulation studies of stochastic automata models, Part I-Reinforcement schemes," Becton Center, Yale Univ., New Haven, Conn., Tech. Rep. CT-45, Dec. 1971.
[LW1] I. H. Witten, "Comments on 'Use of stochastic automata for parameter self-optimization with multimodal performance criteria,'" IEEE Trans. Syst., Man, Cybern., vol. SMC-2, Apr. 1972.
[LW2] I. H. Witten, "Finite-time performance of some two-armed bandit controllers," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 194-197, Mar. 1973.