Decentralized Pursuit Learning Automata in Batch Mode

VIDYA BHUSHAN SINGH
Department of Computer and Information Science
Indiana University Purdue University Indianapolis
Email: [email protected]

SNEHASIS MUKHOPADHYAY
Department of Computer and Information Science
Indiana University Purdue University Indianapolis
Email: [email protected]

MEGHNA BABBAR-SEBENS
Department of Earth Sciences
Indiana University Purdue University Indianapolis
Email: [email protected]

Abstract—Learning Automata (LA) and Genetic Algorithms (GA) have long been used to solve problems in different domains. However, LA is often criticized for its slow rate of convergence, and both LA and GA can get stuck in local optima. In this paper we solve multi-objective problems using LA in batch mode to make the learning faster and more accurate. We use the Decentralized Pursuit Learning Automaton (DPLA) as the LA and NSGA2 as the GA. For problems where evaluation of the fitness function is the bottleneck, such as SWAT simulations, evaluating individuals in parallel can give considerable speed-up. In the multi-objective LA, different weight pairs and individual designs can be evaluated independently, so we created parallel versions of both algorithms to make them practically faster in learning and computation, and extended the parallelization concept with batch-mode learning.

I. INTRODUCTION

Learning Automata have long been used for solving problems [3] such as identical pay-off games. In this paper we use Learning Automata to solve multi-objective problems, as done in [1]. The learning automaton we use is the Decentralized Pursuit Learning gaming algorithm from [1]. It is an indirect learning method that tries to model the environment. Learning automata have also been applied to ground-water monitoring problems in environmental sciences [7]. We use NSGA2 [5] as the genetic algorithm, one of the most popular multi-objective GAs available today. We implemented a multi-objective version of the LA by assigning multiple weights to the different objective functions and evaluating each weight pair over a period of time, called an era, as done in [1]; this gives results comparable to the GA. Since the objective functions and their weight pairs, i.e. eras, are independent of each other, we can use PRAM (Parallel Random Access Machine) based computers to run all the multi-objective eras in parallel. Based on the available computational power, we can use finer-grained weight pairs, i.e. smaller slices of eras, for more precision and accuracy. GA is often criticized as slow when the evaluation of the fitness function is expensive, and the fitness function we use in this work is very slow. To make the work faster, we created a distributed version of NSGA2, called Dist. NSGA2, based on the work of [18]. We implemented the parallelism at two levels, one at the processor level and the other at the machine level. To run the tests we initially used six recent computers; later we got access to the IU FutureGrid project's Tempest, a Windows HPC cluster [24], which helped us run our experiments faster. We ran tests for NSGA2, Dist. NSGA2, DPLA, Dist. DPLA, and RLBM Type I, Type II and Type III.

II. RELATED WORK

Reinforcement learning is used by [3] and [19] as the LA learning technique. LA takes a long time to converge to a good solution, so multiple studies have investigated faster learning approaches via parallelization [2]. These parallelization techniques were applied to common pay-off games, parameterized learning automata, and pattern classification problems. The motivation in [2] was that N independent agents acting simultaneously should speed up a process roughly by a factor of N, similar to a population in GA. However, that parallelization was tested for direct learning algorithms such as the L_RI algorithm. In our work, the N LAs take M independent actions at a time. Combining all the automata and their actions creates a set of actions to be performed by the system; this is like an individual in GA or one design of the SWAT model. In batch mode, we create multiple such individuals. As the execution and reward for each individual are independent of each other, a PRAM-based computer can be used to evaluate them independently. Batch-mode learning is a popular learning technique in neural networks. In batch mode [11], the learning of the neural network is done by averaging over all training patterns before changing the weights. Choosing a very small learning parameter is not realistic, since the smaller the learning parameter, the longer it takes to converge; the learning parameter can therefore be chosen based on the available resources and time. Batch-mode learning is also used in text categorization [13], where a batch of text documents is used instead of just one. Batch-mode learning is used in medical image classification [15] and content-based image retrieval [17], where instead of selecting a single unlabeled example, a number of unlabeled examples are selected for manual labeling. A discriminative batch-mode active learning method is presented in [16]. Researchers have also tried combining LA and GA [22] to escape the problem of local optima. In StGA [23], the authors created a small number of actions of the learning automata, which were sampled to construct a population, from which the sampling action was done adaptively by genetic operations. Some authors have also tried combining GA with other algorithms, such as simulated annealing, to solve NP-hard problems [25].

III. METHODOLOGY

The methodology used here is similar to the one used by [1], as we are extending the same work for the Eagle Creek watershed in central Indiana, USA, located 10 miles northwest of downtown Indianapolis, shown in Fig. 1. One of the objectives is to reduce flooding by creating small wetlands throughout Eagle Creek while using the minimum area. The entire Eagle Creek watershed contains a total of 2953 potential wetlands. A distributed hydrology model was built using SWAT (Soil and Water Assessment Tool) [20], [21]. The potential wetlands are grouped into 108 aggregated wetlands. For a binary problem of choosing or not choosing each wetland, the total search space is $2^{108}$ designs, which is computationally infeasible, so it was divided into 8 regions. Currently we ran the tests on region 7, shown in Fig. 2, to test our algorithms.

Fig. 1. Eagle Creek Watershed and its counties, reservoir, streams and 130 sub-basins.

The multiple objectives were to minimize the root-mean-square error between the flows in the streams and the flows when all the wetlands were installed, while using the minimum area. Learning automata are implemented as in [1] by assigning one learning automaton to each sub-basin, which decides whether or not it will participate in the system. It is similar to an identical pay-off game model. Two scaling parameters, $S^i_{Area}$ and $S^i_{Flow}$, are used to scale the values of area and flow for region $i$, as in [1]. $S^i_{Area}$ is the total area of all the aggregated wetlands, as shown below:

$$S^i_{Area} = \sum_j Area_{ij} \qquad (1)$$

Fig. 2. Left figure shows the 130 sub-basins and 2953 potential wetland polygons in the 8 regions (pink polygons) divided for optimization. Right figure shows the enlarged view of potential wetlands (blue polygons) in the watershed area surrounded by the black box in the left figure.

$S^i_{Flow}$ is calculated from the baseline flow dataset ($Baseline$) and a run of the SWAT program, using the following equations:

$$S^i_{Flow} = \sum_{region} \ln\left(1 + [Baseline - Output_{noWetlands}]^2\right) \qquad (2)$$

$$P^i_{Flow} = 1 - \frac{\sum_{region} \ln\left(1 + [Baseline - Output_{subsetOfWetlands}]^2\right)}{S^i_{Flow}} \qquad (3)$$

$$P^i_{Area} = 1 - \frac{\sum_{region} \left(flag_{ij} \times Area_{ij}\right)}{S^i_{Area}} \qquad (4)$$

where $flag_{ij}$ is 1 if wetland $j$ in region $i$ is installed and 0 otherwise. The common pay-off is calculated using:

$$P^i_{Total} = W_{Area} \times P^i_{Area} + W_{Flow} \times P^i_{Flow} \qquad (5)$$
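To make the pay-off computation concrete, the following is a minimal Python sketch of equations (1)-(5) for a single region. It is illustrative only: the array names (`flags`, `areas`, `baseline`, `output_no_wetlands`, `output_subset`) and the default weights are assumptions of this sketch, and the division by the scaling constants reflects the stated role of $S^i_{Area}$ and $S^i_{Flow}$ as scaling parameters rather than an exact reproduction of the SWAT setup in [1].

```python
import numpy as np

def region_payoff(flags, areas, baseline, output_no_wetlands, output_subset,
                  w_area=0.5, w_flow=0.5):
    """Scaled common pay-off for one region, following Eqs. (1)-(5).

    All arguments are NumPy arrays (flags: 0/1 per wetland, areas: wetland
    areas, the rest: flow time series); the weights are illustrative.
    """
    flags = np.asarray(flags, dtype=float)
    areas = np.asarray(areas, dtype=float)
    # Eq. (1): area scaling constant = total area of all aggregated wetlands.
    s_area = np.sum(areas)
    # Eq. (2): flow scaling constant from the no-wetlands SWAT run.
    s_flow = np.sum(np.log(1.0 + (baseline - output_no_wetlands) ** 2))
    # Eq. (3): scaled flow objective (closer to baseline -> larger P_flow).
    p_flow = 1.0 - np.sum(np.log(1.0 + (baseline - output_subset) ** 2)) / s_flow
    # Eq. (4): scaled area objective (less installed area -> larger P_area).
    p_area = 1.0 - np.sum(flags * areas) / s_area
    # Eq. (5): weighted common pay-off shared by all automata in the region.
    return w_area * p_area + w_flow * p_flow
```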

In the DPLA algorithm only one individual design is created at random in each iteration. The design is created by randomly selecting the actions of the automata based on their action probability vectors. We call it an individual, as in GA. In the distributed version of DPLA, called Dist. DPLA, all the weight pairs are run separately, as they are independent of each other. Applying the same parallelization concept together with the batch-mode technique, we created different distributed versions of DPLA, which we call RLBM (Reinforcement Learning in Batch Mode), in which multiple individual designs are created per iteration. We can gain accuracy because more individuals are evaluated and learning is done from multiple individuals; it is similar to taking the opinion of many individuals instead of just one. The opinion can be taken in many different ways, and we tried three different ways of taking the opinion and doing the learning. For each weight pair, all the LAs together create a set of designs, instead of just one as in the previous work. The multiple designs are similar to the multiple individuals of a GA. There are multiple ways we can use this set of designs to do the learning:


1) Type I: Do the learning based on the average of all the solutions.

2) Type II: Do multiple learning steps instead of just one (converges faster, but compromises accuracy).

3) Type III: Do the learning based on the best solution from all the individual designs (optimal weight pair). In our current research, we tried Type I, Type II and Type III.

IV. THE ALGORITHMS

We are using the DPLA algorithm as proposed by [1].

A. The DPL game algorithm proceeds as follows

1) At every time step, the $i$th automaton chooses action $\alpha_i$ at instant $k$, based on the probability distribution $p_i(k)$.

2) Each automaton $i$ obtains a common pay-off $\beta(k)$ based on the set of all actions.

3) Based on the pay-off, each automaton $i$ updates its own $R$, $Z$ and $\hat{D}$ matrices,

where $R$, $Z$ and $\hat{D}$ are vectors used, respectively, for the total reinforcement received, the number of times a particular action has been chosen, and the estimate that models the environment and drives the learning.

The action probability vector is updated as
$$p(k + 1) = p(k) + \lambda\,(e_{M(k)} - p(k)),$$
where $0 < \lambda < 1$ is the learning parameter, $e_{M(k)}$ is a unit vector whose entry is 1 for the action corresponding to the maximum element of $\hat{D}$ and 0 for all other actions, and the index $M(k)$ is determined by $\hat{D}_{iM(k)}(k) = \max_j \hat{D}_{ij}(k)$.
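As an illustration, the following is a minimal Python sketch of one pursuit automaton and its update step. The class name, the two-action default, and tie-breaking by `argmax` are assumptions of this sketch rather than details specified in [1]; the $R$, $Z$ and $\hat{D}$ vectors and the probability update follow the description above.

```python
import numpy as np

class PursuitAutomaton:
    """One decentralized pursuit learning automaton with r actions."""

    def __init__(self, r=2, lam=0.01):
        self.r = r                    # number of actions (2: participate or not)
        self.lam = lam                # learning parameter, 0 < lambda < 1
        self.p = np.full(r, 1.0 / r)  # action probability vector p(k)
        self.R = np.zeros(r)          # total reinforcement received per action
        self.Z = np.zeros(r)          # number of times each action was chosen
        self.D = np.zeros(r)          # D-hat: estimated pay-off per action

    def choose_action(self, rng):
        # Step 1: sample an action according to p(k).
        return int(rng.choice(self.r, p=self.p))

    def update(self, action, beta):
        # Steps 2-3: fold the common pay-off beta(k) into R, Z and D-hat.
        self.R[action] += beta
        self.Z[action] += 1
        self.D[action] = self.R[action] / self.Z[action]
        # Pursuit update: move p towards the unit vector e_M(k) located at
        # the index of the current maximum of D-hat.
        e = np.zeros(self.r)
        e[np.argmax(self.D)] = 1.0
        self.p += self.lam * (e - self.p)
```

In the decentralized game, one such automaton is assigned to each sub-basin, and all of them update from the same common pay-off $\beta(k)$.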

We created the distributed version of the algorithm in the following three ways:

B. Reinforcement Learning in Batch Mode, Type I

1) For each weight pair of the multi-objective problem, run in parallel.

2) Create a set of $n$ individual designs at instant $k$; in each individual design,

3) the $i$th automaton randomly chooses action $\alpha_i$ based on the probability distribution $p_i(k)$.

4) Evaluate the individual designs in parallel.

5) Take the average over all the individual designs.

6) Each automaton $i$ obtains a common pay-off $\beta(k)$ based on the set of all actions.

7) The action that was chosen most often is rewarded.

8) Based on the pay-off, each automaton $i$ updates its own $R$, $Z$ and $\hat{D}$ matrices.

The action probability vector is updated as in the algorithm above. The number $n$ must be odd to make sure that one action is always selected most often; in our tests we used $n = 7$, based on the computational power available to us. A minimal sketch of one such iteration is given below.
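The following Python sketch shows one Type I iteration, reusing the `PursuitAutomaton` class sketched above. The function `evaluate_design`, which stands in for a full SWAT run returning the common pay-off of one design, is a hypothetical placeholder; it should be a module-level (picklable) function for the process pool to work.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def rlbm_type1_step(automata, evaluate_design, n=7, rng=None):
    """One RLBM Type I iteration: n designs, averaged pay-off, majority vote."""
    rng = rng or np.random.default_rng()
    # Steps 2-3: each design is one action per automaton, sampled from p(k).
    designs = [[a.choose_action(rng) for a in automata] for _ in range(n)]
    # Step 4: evaluate the designs in parallel; each evaluation is an
    # independent (SWAT) run, so a process pool can execute them concurrently.
    with ProcessPoolExecutor() as pool:
        payoffs = list(pool.map(evaluate_design, designs))
    # Step 5: average pay-off over all n designs.
    beta = float(np.mean(payoffs))
    # Steps 6-8: reward the action each automaton chose most often
    # (n is odd, so with two actions there is always a strict majority).
    for i, automaton in enumerate(automata):
        votes = [design[i] for design in designs]
        majority_action = max(set(votes), key=votes.count)
        automaton.update(majority_action, beta)
```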

C. Reinforcement Learning in Batch Mode, Type II

1) For each weight pair of the multi-objective problem, run in parallel.

2) Create a set of $n$ individual designs at instant $k$; in each individual design,

3) the $i$th automaton randomly chooses action $\alpha_i$ based on the probability distribution $p_i(k)$.

4) Evaluate the individual designs in parallel.

5) Use all the individuals to do the learning one by one, randomly or sequentially.

6) Each automaton $i$ obtains a common pay-off $\beta(k)$ based on the set of all actions.

7) Based on the pay-off, each automaton $i$ updates its own $R$, $Z$ and $\hat{D}$ matrices.

The action probability vector is updated as in the algorithm above. A sketch of this variant follows.
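Under the same assumptions as before, a Type II iteration differs only in the learning step: instead of averaging, every design triggers its own pursuit update.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def rlbm_type2_step(automata, evaluate_design, n=7, rng=None):
    """One RLBM Type II iteration: every design triggers its own update."""
    rng = rng or np.random.default_rng()
    designs = [[a.choose_action(rng) for a in automata] for _ in range(n)]
    with ProcessPoolExecutor() as pool:
        payoffs = list(pool.map(evaluate_design, designs))
    # Steps 5-7: learn from every design one by one (here in generated order),
    # so the action probabilities are updated n times per iteration.
    for design, beta in zip(designs, payoffs):
        for automaton, action in zip(automata, design):
            automaton.update(action, beta)
```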

D. Reinforcement Learning in Batch Mode, Type III

1) For each weight pair of the multi-objective problem, run in parallel.

2) Create a set of $n$ individual designs at instant $k$; in each individual design,

3) the $i$th automaton randomly chooses action $\alpha_i$ based on the probability distribution $p_i(k)$.

4) Evaluate the individual designs in parallel.

5) Use a set of the best individuals, or the average of the best individuals, to do the learning one by one, randomly or sequentially.

6) Each automaton $i$ obtains a common pay-off $\beta(k)$ based on the set of all actions.

7) Based on the pay-off, each automaton $i$ updates its own $R$, $Z$ and $\hat{D}$ matrices.

The action probability vector is updated as in the algorithm above. In our current experiments we used the three best individuals based on their pay-off. A sketch of this variant follows.
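Again under the same assumptions, a Type III iteration learns only from the best designs; the sketch below keeps the top $k = 3$ designs by pay-off, matching the choice used in our experiments.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def rlbm_type3_step(automata, evaluate_design, n=7, k=3, rng=None):
    """One RLBM Type III iteration: learn only from the k best designs."""
    rng = rng or np.random.default_rng()
    designs = [[a.choose_action(rng) for a in automata] for _ in range(n)]
    with ProcessPoolExecutor() as pool:
        payoffs = list(pool.map(evaluate_design, designs))
    # Step 5: keep the k designs with the highest pay-off and learn from
    # them one by one, as in Type II but restricted to the best designs.
    best = sorted(zip(payoffs, designs), key=lambda pair: pair[0], reverse=True)[:k]
    for beta, design in best:
        for automaton, action in zip(automata, design):
            automaton.update(action, beta)
```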

Using Type I, we can get more accuracy because more individuals are evaluated and learning is done from multiple individuals. It is similar to taking the opinion of many individuals instead of just one. Since we take the average pay-off, the final reward is given to the action of each automaton that is selected most often. Each automaton is capable of performing only two actions [1], so, as in a voting system, the most-selected action over the design set is rewarded. The process runs for multiple generations, i.e. iterations, until some convergence criterion is met. The learning parameter and convergence criterion can also be varied to trade accuracy for speed.

Using Type II, we get faster convergence. Learning is performed from the opinions of multiple individuals, which are siblings, instead of just one. Instead of taking their average opinion, the automata are rewarded directly by every individual. It is similar to taking everybody's opinion and considering all opinions to be equally true. The problem is that some opinions might not be correct and some will be completely false, but this helps keep the search from getting trapped in local optima, as there will always be some automaton that takes the system away from a local optimum. There will be some loss of accuracy because multiple learning steps are performed using the same action probabilities; instead of getting one feedback, the automata get multiple feedbacks, and in this way the convergence measure moves faster. To compensate for the loss of accuracy, we can decrease the learning step size.


Using Type III, we get a compromise between speed and accuracy; it is a hybrid of Type I and Type II. The opinion can be taken as the average of the best individuals or from all of the best individuals. The selection of the best individuals is the main task in this algorithm. In our current experiments we used the top three based on pay-off, but other techniques can also be used to find the best set of individuals. To perform the learning, the individuals can be selected randomly from all the individuals generated, or the learning can simply be performed in the order the individuals were generated, i.e. sequentially.

V. RESULTS AND DISCUSSION

We ran different experiments based on the above methodology for region 7 of the Eagle Creek watershed project.

A. Optimal Solution

Since the search space for region 7 was small, we exhaustively evaluated all possible solutions, as shown in Fig. 3. For a large search space we would not have an optimal solution to compare against, though the algorithms would be the same. From those solutions, using the non-dominated sorting algorithm of NSGA2, we generated the optimal Pareto front, denoted as Optimal. Based on this we compared the results of NSGA2, Dist. NSGA2, DPLA, Dist. DPLA, and Dist. RLBM (Reinforcement Learning in Batch Mode) Type I, Type II and Type III.

Fig. 3. All Solutions

B. NSGA2 Vs Dist NSGA2

Test results for the NSGA2 and Dist. NSGA2 algorithms are shown in Fig. 4. The tests were done with twenty-eight individuals over 20 generations. Dist. NSGA2 was much faster in execution because an entire generation was evaluated at a time, whereas in NSGA2 the individuals were evaluated in sequence. We evaluated two different SWAT models; one took 1 min 20 sec per run while the other took 7 min. As shown in Fig. 4, Dist. NSGA2 gave results similar to NSGA2. On the FutureGrid cluster at full capacity, the first SWAT model can theoretically achieve a 448X speed-up and the other a 358.4X speed-up. Recently we ran another set of SWAT simulations using Dist. NSGA2 that finished in two days, which is equivalent to 146 days of sequential processing. So Dist. NSGA2 is practically much better suited for obtaining results quickly.

Fig. 4. NSGA2 Vs Dist. NSGA2

C. DPLA

Tests were performed for DPLA and Dist. DPLA. In Dist. DPLA all the weight pairs were run separately, as they were independent of each other, which gave almost a 9X speed-up over the original sequential DPLA algorithm. As shown in Fig. 5, Dist. DPLA gave better results than DPLA at some points, while overall the results are almost the same for both.

Fig. 5. DPLA

D. NSGA2 Vs DPLA

Comparing the results of NSGA2 with DPLA, as shown in Fig. 6, we can see that NSGA2 gave better results than DPLA, but the running time of NSGA2 is much higher in this case. When we use Dist. NSGA2, its running time was much less than that of DPLA, while Dist. DPLA was much faster than all the others. Based on the above observations and applying the concept of batch-mode learning, the following set of RLBM algorithms has been tested.


Fig. 6. NSGA2 Vs DPLA


E. RLBM

As we can see in Fig. 7, based on the points which lie close to the optimal front, Type I gave better results than the other types, while the results of Type III are in between Type I and Type II.

Fig. 7. RLBM

F. DPLA vs RLBM

Here we compare the test results for DPLA, Dist. RLBM Type I and Dist. RLBM Type II, as shown in Fig. 8. We can see that most results of the RLBM types are better than DPLA. As shown in Fig. 9 and Fig. 10, when we compare the number of iterations and the time to convergence, we find that RLBM Type II was the fastest of all while RLBM Type I was the slowest.

G. Comparing All

Overall, all the algorithms gave comparable results, as shown in Fig. 11. Some are better in convergence time and some are better in accuracy.

Fig. 8. DPLA vs RLBM

Fig. 9. No. Of Iterations

VI. CONCLUSION

In this paper, we successfully tested the performance of the batch-mode learning technique in decentralized, distributed pursuit learning automata in several different ways.

VII. FUTURE WORK

In future work, we can try the following different ways of combining GAs and LAs.

A. Different ways of combining GA and LA

1) LA calling GA: LA will be the main algorithm and GA will be called inside the LA.

Fig. 10. Time to Converge


Fig. 11. Compare All

2) LA helping GA: LA will assist GA in finding the best solutions.

3) LA as an additional fitness function in GA: LA can be used as an additional fitness function.

4) LA and GA: Both are run in parallel and the result will be based on the better one.

5) LA vs GA: LA and GA are applied alternately to the same set of individual designs.

B. Batch-mode active learning can be applied to real-time optimization problems such as interactive algorithms.

We can continue implementing batch-mode learning techniques in other ways. The multi-objective version can be applied to more than two objectives. Based on the available computational resources, the slicing of the weights can be made finer to get more accuracy. We can run these stochastic algorithms on powerful machines such as supercomputers or GPUs to attain more precision and accuracy at very high speed.

ACKNOWLEDGMENT

This project was funded by National Science Foundation Grant No. 1014693.

REFERENCES

[1] Tilak, O.; Babbar-Sebens, M.; Mukhopadhyay, S., "Decentralized and partially decentralized reinforcement learning for designing a distributed wetland system in watersheds," Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on, pp. 271-276, 9-12 Oct. 2011.

[2] Thathachar, M.A.L.; Arvind, M.T., "Parallel algorithms for modules of learning automata," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 28, no. 1, pp. 24-33, Feb 1998.

[3] K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Prentice-Hall, 1989.

[4] C. M. Fonseca and P. J. Fleming, "Genetic algorithms for multiobjective optimization: Formulation, discussion and generalization," Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 416-423, 1993.

[5] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multi-objective genetic algorithm: NSGA-II," IEEE Trans. Evol. Comput., vol. 6, pp. 182-197, 2002.

[6] M. Babbar-Sebens and B. Minsker, "Case-based micro interactive genetic algorithm (CBMIGA) for interactive learning: Methodology and application to groundwater monitoring design," Environmental Modelling and Software, vol. 25, pp. 1176-1187, 2010.

[7] M. Babbar-Sebens and S. Mukhopadhyay, "Reinforcement learning for human-machine collaborative optimization: Application in ground water monitoring," Proceedings of the IEEE Systems, Man, and Cybernetics (SMC) Conference, pp. 3563-3568, 2009.

[8] M. Babbar-Sebens and B. Minsker, "Standard interactive genetic algorithm (SIGA): A comprehensive optimization framework for long-term groundwater monitoring design," J. of Water Resources Planning and Management, pp. 538-547, 2008.

[9] O. Tilak, R. Martin, and S. Mukhopadhyay, "A decentralized indirect method for learning automata games," IEEE Systems, Man, and Cybernetics B, accepted and in print, 2011.

[10] O. J. Tilak and S. Mukhopadhyay, "Decentralized and partially decentralized reinforcement learning for distributed combinatorial optimization problems," Ninth International Conference on Machine Learning and Applications (ICMLA), pp. 389-394, 2010.

[11] Heskes, T.; Wiegerinck, W., "A theoretical comparison of batch-mode, on-line, cyclic, and almost-cyclic learning," Neural Networks, IEEE Transactions on, vol. 7, no. 4, pp. 919-925, July 1996.

[12] Damien Ernst, Pierre Geurts, and Louis Wehenkel, "Tree-based batch mode reinforcement learning," J. Mach. Learn. Res. 6 (December 2005), pp. 503-556, 2005.

[13] Steven C. H. Hoi, Rong Jin, Michael R. Lyu, "Large-scale text categorization by batch mode active learning," Proceedings of the 15th International Conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland.

[14] Hoi, S.C.H.; Rong Jin; Lyu, M.R., "Batch mode active learning with applications to text categorization and image retrieval," Knowledge and Data Engineering, IEEE Transactions on, vol. 21, no. 9, pp. 1233-1248, Sept. 2009.

[15] S.C.H. Hoi, R. Jin, J. Zhu and M.R. Lyu, "Batch mode active learning and its application to medical image classification," Proc. 23rd Int'l Conf. Machine Learning, pp. 417-424, 2006.

[16] Y. Guo and D. Schuurmans, "Discriminative batch mode active learning," Proc. Conf. Advances in Neural Information Processing Systems (NIPS '07), 2007.

[17] Hoi, S.C.H.; Rong Jin; Jianke Zhu; Lyu, M.R., "Semi-supervised SVM batch mode active learning for image retrieval," Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-7, 23-28 June 2008.

[18] Durillo, J.J.; Nebro, A.J.; Luna, F.; Alba, E., "A study of master-slave approaches to parallelize NSGA-II," Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1-8, 14-18 April 2008.

[19] O. Tilak, S. Mukhopadhyay, M. Tuceryan, and R. Raje, "A novel reinforcement learning framework for sensor subset selection," in Proc. IEEE ICNSC, Chicago, IL, 2010, pp. 95-100.

[20] J. G. Arnold, R. Srinivasan, R. S. Muttiah, and J. R. Williams, "Large area hydrologic modeling and assessment. Part I: Model development," J. Am. Water Resour. Assoc., vol. 34(1), pp. 73-89, 1998.

[21] S. L. Neitsch, J. G. Arnold, J. R. Kiniry, and J. R. Williams, "Soil and water assessment tool - theoretical documentation - version 2005," Grassland, Soil and Water Research Laboratory, Agricultural Research Service and Blackland Research Center, Texas Agricultural Experiment Station, Temple, TX, 2005.

[22] Howell, M.N.; Gordon, T.J.; Brandao, F.V., "Genetic learning automata for function optimization," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 32, no. 6, pp. 804-815, Dec 2002.

[23] M. Munetomi, Y. Takai, and Y. Sato, "StGA: An application of genetic algorithm to stochastic learning automata," Syst. Comput. Jpn., vol. 27, pp. 68-78, Sept. 1996.

[24] This material is based upon work supported in part by the National Science Foundation under Grant No. 0910812, the FutureGrid project.

[25] Feng-Tse Lin; Cheng-Yan Kao; Ching-Chi Hsu, "Applying the genetic approach to simulated annealing in solving some NP-hard problems," Systems, Man and Cybernetics, IEEE Transactions on, vol. 23, no. 6, pp. 1752-1767, Nov/Dec 1993.
