Swarm Intelligence Approaches for Grid Load Balancingsiludwig/Publish/papers/JGC2011.pdf · Swarm Intelligence Approaches for Grid Load Balancing ... and scheduling are key Grid services,

J Grid Computing (2011) 9:279–301DOI 10.1007/s10723-011-9180-5

Swarm Intelligence Approaches for Grid Load Balancing

Simone A. Ludwig · Azin Moallem

Received: 3 August 2010 / Accepted: 2 February 2011 / Published online: 22 February 2011© Springer Science+Business Media B.V. 2011

Abstract With the rapid growth of data andcomputational needs, distributed systems andcomputational Grids are gaining more and moreattention. The huge amount of computations aGrid can fulfill in a specific amount of time can-not be performed by the best supercomputers.However, Grid performance can still be improvedby making sure all the resources available in theGrid are utilized optimally using a good loadbalancing algorithm. This research proposes twonew distributed swarm intelligence inspired loadbalancing algorithms. One algorithm is based onant colony optimization and the other algorithmis based on particle swarm optimization. A sim-ulation of the proposed approaches using a Gridsimulation toolkit (GridSim) is conducted. Theperformance of the algorithms are evaluated us-ing performance criteria such as makespan andload balancing level. A comparison of our pro-posed approaches with a classical approach calledState Broadcast Algorithm and two random ap-proaches is provided. Experimental results showthe proposed algorithms perform very well in aGrid environment. Especially the application ofparticle swarm optimization, can yield better per-

S. A. Ludwig (B) · A. MoallemDepartment of Computer Science, North DakotaState University, Fargo, ND, USAe-mail: [email protected]

formance results in many scenarios than the antcolony approach.

Keywords Ant colony optimization ·Particle swarm optimization

1 Introduction

The computational speed of individual computershas increased by about one million times in thepast fifty years. However, they are still not fastenough for more and ever more scientific prob-lems. For example, in a few physics applications,data is produced in large quantities. The analysisof this data would need much more computationalpower than presently available when run on su-percomputers. Therefore, in the mid 1990s IanFoster and Carl Kesselman proposed a distrib-uted computing infrastructure for advanced sci-ence and engineering, which they called the Grid.The vision behind the Grid is to supply computingand data resources over the Internet seamlessly,transparently and dynamically when needed, suchas the power Grid supplies electricity to end users.

The resource management system is the centralcomponent of a Grid system. Its basic responsi-bilities are to accept requests from users, matchuser requests to available resources for whichthe user has permission to use and schedule thematched resources [12]. To be able to fully benefit

280 S.A. Ludwig, A. Moallem

from such Grid systems, resource managementand scheduling are key Grid services, where issuesof task allocation and load balancing representa common challenge for most Grids [27]. In acomputational Grid, at a given time, the task is toallocate the user defined jobs efficiently both bymeeting the deadlines and making use of all theavailable resources [10].

Grid systems are classified into two categories:compute and data Grids. In compute Grids themain resource that is being managed by the re-source management system is compute cycles (i.e.processors), while in data Grids the focus is tomanage data distributed over geographical loca-tions. The architecture and the services providedby the resource management system are affectedby the type of Grid system it is deployed in. Re-sources which are to be managed could be hard-ware (computation cycle, network bandwidth anddata stores) or software resources (applications)[12].

In traditional computing systems, resourcemanagement is a well-studied problem. Resourcemanagers such as batch schedulers, workflowengines, and operating systems exist for manycomputing environments. These resource man-agement systems are designed to work under theassumption that they have complete control of aresource and thus can implement the mechanismsand policies needed for the effective use of thatresource. Unfortunately, this assumption does notapply to the Grid. When dealing with the Gridwe must develop methods for managing Grid re-sources across separately administered domains,with the resource heterogeneity, loss of absolutecontrol, and inevitable differences in policy that isthe result of heterogeneity. The underlying Gridresource set is typically heterogeneous [9].

The term “load balancing” refers to the tech-nique that tries to distribute work load betweenseveral computers, network links, CPUs, harddrives, or other resources, in order to get opti-mal resource utilization, throughput, or response.The load balancing mechanism aims to equallyspread the load on each computing node, maxi-mizing their utilization and minimizing the totaltask execution time. In order to achieve thesegoals, the load balancing mechanism should be“fair” in distributing the load across the com-

puting nodes; by being “fair” we mean that thedifference between the “heaviest-loaded” nodeand the “lightest-loaded” node should be mini-mized [20].

Load balancing has always been an issue sincethe emergence of distributed systems. In a dis-tributed system there might be scenarios in whicha task waits for a service at the queue of oneresource, while at the same time another resourcewhich is capable of serving the task is idle. Thepurpose of a load balancing algorithm is to pre-vent these scenarios as much as possible [16].

For parallel applications, load balancing at-tempts to distribute the computational load acrossmultiple processors or machines as evenly aspossible with the objective to improve per-formance. Generally, a load balancing schemeconsists of three phases: information collection,decision making and data migration. During theinformation collection phase, the load balancergathers the information of the distribution ofworkload and the state of computing environmentand detects whether there is a load imbalance. Thedecision making phase focuses on calculating anoptimal data distribution, while the data migrationphase transfers the excess amount of workloadfrom one overloaded processor to another under-loaded processor [15].

Load balancing algorithms can be classifiedinto sub categories from various perspectives.They can be divided into static, dynamic or adap-tive algorithms. In static algorithms, the decisionsrelated to balancing the load are made at compiletime. This means, these decisions are made whenresource requirements are estimated [30]. On theother hand, a load balancer with dynamic load bal-ancing allocates/re-allocates resources at runtimeand uses the system-state information to make itsdecisions. Adaptive load balancing algorithms area special class of dynamic algorithms. They adapttheir activities by dynamically changing their para-meters, or even their policies, to suit the changingsystem state [24]. Derbal [32] proposes a scalabledistributed Entropy-based scheduling approachthat utilizes a Markov chain model to capturethe dynamics of the service capacity state of Gridservices. Another example is proposed in [34] withthe introduction of the Generational Schedulingwith Task Replication algorithm. This algorithm

Swarm Intelligence Approaches for Grid Load Balancing 281

adapts to changes in performance by reschedulingtasks.

Furthermore, methods used in load balancingcan be divided into three classes, i.e., centralized,distributed (decentralized) and hierarchical [12].In a centralized approach, all jobs are submittedto a single scheduler. This single scheduler is re-sponsible for scheduling the jobs on the availableresources. Since all the scheduling informationis available at once, the scheduling decisions areoptimal but this approach is not very scalable in aGrid system [12]. As the size of the Grid increases,keeping all the information about the state ofall the resources is a bottleneck. Therefore, scal-ability is an issue in centralized approaches, inaddition to the single point of failure problem.

In a decentralized model there is no centralscheduler and scheduling is done by the resourcerequesters and owners independently. This ap-proach is scalable, distributed in nature, andsuits Grid systems well. But individual schedulersshould cooperate with each other in schedulingdecisions and the schedule generated may notbe the optimal schedule. This category of loadbalancing is perfect for peer-to-peer architecturesand dynamic environments. Based on whetheror not schedulers cooperate with each other, de-centralized approaches can be further classifiedas cooperative or non-cooperative [12]. A generaland extensible scheduling architecture that ad-dresses isssues such as resource utilitzation, re-sponse time, global and local allocation policies,and scalability are addressed in [33].

In a hierarchical model, the schedulers are or-ganized in a hierarchy. High level resource en-tities are scheduled at higher levels and lowerlevel smaller sub-entities are scheduled at lowerlevels of the scheduler hierarchy. This model is acombination of the above two models [12].

Each of these classes has its advantages anddisadvantages according to a number of factors,e.g., the size of a system, dynamic behavior, etc.[31]. However, all centralized approaches have thefollowing common disadvantages:

1. A central scheduler (load balancer) needs cur-rent knowledge about the entire state of thesystem at each point in time. This makes it

scale poorly with the growth in the size ofthe system.

2. Failure of the scheduler results in failure ofthe whole system, while in a distributed ap-proach only some of the work is lost.

3. Distributed schedulers are much more dy-namic and flexible to changes than centralizedapproaches, because they do not need thestate of the system at each step in order toperform their job.

There has been a great effort in recent years indeveloping distributed load balancing algorithms,while trying to minimize all the communicationneeds resulting from the distributed nature. In thisresearch, we have focused on designing distrib-uted load balancing algorithms with the inspira-tion taken from swarm intelligence.

Swarm intelligence approaches are increasinglybeing used to solve optimization problems. Theyhave proven themselves to be good candidatesin these areas. The notion of complex collectivebehavior emerging from the behavior of many rel-atively simple units, and the interactions betweenthem, is fundamental to the field of swarm intel-ligence. The understanding of such systems offersnew ideas in creating artificial systems which arecontrolled by such emergent collective behav-ior; in particular, the exploitation of this conceptmight lead to completely new approaches for themanagement of distributed systems, such as loadbalancing in Grids [23].

As swarm intelligence techniques have provedto be useful in optimization problems they aregood candidates for load balancing, where the aimis to minimize the load difference between theheaviest and lightest node. The benefit of thesetechniques stems from their capability in search-ing large search spaces very efficiently, whicharise in many combinatorial optimization prob-lems [26]. Load balancing is known to be NP-complete when aiming to solve the problem usinga single processor, therefore the use of heuristicsis definitely necessary in order to cope in practicewith this difficulty [10].

This research proposes, implements and com-pares two new approaches for distributed loadbalancing inspired by Ant-Colony and Parti-cle Swarm Optimization [17]. There are several


objectives a good load balancer should addresssuch as fairness, robustness and distribution; a de-tailed description of each is provided in Section 3.These requirements are addressed with the designof our algorithms. In the Ant-Colony approacheach job submitted to the Grid invokes an antand the ant searches through the network to findthe best node to deliver the job to. Ants leaveinformation related to the nodes they have seen aspheromone in each node which helps other ants tofind lighter resources more easily. In the particleswarm approach, each node in the network isconsidered to be a particle and tries to optimizeits load locally by sending or receiving jobs to andfrom its neighbors. This process being done locallyfor each node, results in a move toward the globaloptima in the overall network.

The remainder of this paper is organized as fol-lows: Section 2 is dedicated to related work. Loadbalancing algorithms that are decentralized aresummarized. The requirements for the design ofthe distributed load balancing algorithms and thebenefits are discussed in Section 3. It is followedby the proposed approaches which are describedin detail. Section 4 focuses on the setup of the sim-ulation and the experimental results. Performancecriteria and environmental settings are introducedin this section, and a thorough comparison of theperformance of the algorithms with other classicalapproaches is provided. Finally, Sections 5 and 6are dedicated to conclusion and future work.

2 Related Work

This section will provide an account of relatedwork only of decentralized approaches. Researchin the area of distributed load balancing is diverse,and many researchers have used Ant colony forrouting and load balancing.

2.1 Classical Approaches

There are several classical approaches in the areaof load balancing which have been around sincethe emergence of networks. In sender-initiatedalgorithms, load distributing activity is initiatedby an overloaded node (sender) trying to senda task to an underloaded node (receiver) [24].

In receiver-initiated algorithms, load distributingactivity is initiated from an underloaded node(receiver), which tries to get a task from an over-loaded node (sender) [24]. A stable symmetricallyinitiated adaptive algorithm uses the informationgathered during polling (instead of discarding it,as the previous algorithms do) to classify thenodes in the system as sender/overloaded, re-ceiver/underloaded, or OK (nodes having man-ageable load). The information about the state ofthe nodes is maintained at each node by a datastructure composed of a senders list, a receiverslist, and an OK list. These lists are maintainedusing an efficient scheme and list-manipulativeactions, such as moving a node from one list toanother, or determining to which list a node be-longs. These actions impose a small and constantoverhead, irrespective of the number of nodes inthe system. Consequently, this algorithm scaleswell to large distributed systems [24].

The Random approach is a simple schedulingalgorithm in which the jobs being sent to theGrid are assigned randomly to different resources.Although, obviously this approach does not makea very good load balancing algorithm but it hassome benefits. It poses no decision making over-head on the system and it gives a good benchmarkin order to compare and see how our proposedalgorithms improve the performance of load bal-ancing compared to a single random assignment.

The other approach we use to evaluate the per-formance of our proposed algorithms is the StateBroadcast Algorithm (SBA). This algorithm iscommon in networking, and is based on broadcastmessages which are exchanged between resources.Whenever the state of a node changes, due to thearrival or departure of a task, the node broadcastsa status message that describes its new state. Thisinformation policy enables each node to hold itsown updated copy of the System State Vector(SSV) and guarantees that all the copies are iden-tical. When a job is sent to a resource at the time ofscheduling, the resource searches through its ownstate vector to find the best resource available todeliver the job to at that particular time. SBA isa good benchmark to evaluate the performance ofour algorithms as it resembles central approachesin which the status of the whole Grid is known atthe time of scheduling, although being technically


a distributed approach. SBA performs like centralapproaches, which by nature always outperformdistributed ones [31], however, it has its disadvan-tages as mentioned earlier.

2.2 Ant Colony Optimization Approaches

Ant colony optimization has been widely used inboth routing and load balancing [14]. Ant ColonyOptimization (ACO) is considered a subset ofsocial insect system approaches. The main ideaunderlying this approach is the indirect communi-cation ability of ants depositing pheromone trails,which are then used by other ants.

One of the research work which is very similarto the ant colony algorithm we propose in thispaper is the Messor system [18]. Montresor et al.have used an ant colony approach to develop aframework called Anthill, which provides an en-vironment for designing and implementing peer-to-peer systems. They have developed Messor asa distributed load balancing application based onAnthill and have performed simulations to showhow well Messor works. In the algorithm, theauthors propose ants that can be in one of the twostates: Search-Max or Search-Min. In the Search-Max state the ants try to find an overloaded nodein the network and in the Search-Min state theysearch for underloaded nodes. Finally, the antsswitch jobs between overloaded and underloadednodes and hence achieve the balancing of theload. However, the authors have not addressedthe problem of topology changes in the networkand do not provide evidence to show how goodtheir approach is in comparison to other distrib-uted load balancing approaches.

In [4], a very similar approach to Messor is pro-vided. In this work agent-based self-organizationis proposed to perform complementary load bal-ancing for batch jobs with no explicit executiondeadlines. In particular, an ant-like self-organizingmechanism is introduced and is shown to be ableto yield good results in achieving overall Gridload balancing through a collection of very simplelocal interactions. Ant-like agents move throughthe network to find the most overloaded and un-derloaded nodes, but the difference to previousresearch is they only search 2m + 1 steps beforemaking a decision and then they balance the

load. Different performance optimization strate-gies are carried out, however, they do not comparetheir results with other distributed load balancingstrategies.

Salehi and Deldari [19], have done simi-lar research to [18] and [4] with some smallmodifications. They present an ecosystem of in-telligent, autonomous and cooperative ants. Theants in this environment can reproduce offspringswhen they realize that the system is unbalanced.They may also commit suicide when the equilib-rium in the environment is reached. The ants wan-der m steps instead of 2m + 1 and they balancek overloaded nodes and k underloaded nodes in-stead of one at a time. A new concept called Antlevel load balancing is presented for improving theperformance of the mechanism. When the antsmeet each other at the same node they exchangethe information they carry with them, and con-tinue on their way.

Sim et al. [14, 25], present a Multiple AntColony Optimization (MACO) for load balancingcircuit-switched networks. In MACO, more thanone colony of ants are used to search for optimalpaths and each colony of ants deposits a differenttype of pheromone represented by a differentcolour. MACO optimizes the performance of acongested network by routing calls via severalalternative paths to prevent possible congestionalong an optimal path.

Another related and similar research to theant colony approach we propose in this paper isdone by Al-Dahoud et al. [2]. In their research,each node sends a coloured colony through thenetwork; this approach helps in preventing antsof the same nest from following the same route,and hence, enforcing them to be distributed allover the nodes in the network. However, theauthors’ experimental results are confined to asmall number of nodes and all the jobs have thesame properties.

Martin Heusee et al. [11], have used multi-agent systems which have some similarity to antsto solve the problem of routing and load balancingin dynamic communication networks. They haveproposed two kinds of routing agents dependingon when the distance vector update occurs. Theupdate can be performed while agents are findingtheir way to their destination (forward routing)


or when they backtrack their way back to theirsource (backward routing).

Other similar research which benefits from theAnt colony’s approach mostly focus on load bal-ancing in routing problems [22] and [14]. In [14],the research provides a survey of four differentrouting algorithms: ABC, Ant-Net, ASGA. ABCis an Ant-Based Control system. A network witha typical distribution of calls between nodes issimulated, and nodes with an excess of trafficcan become congested and can cause calls tobe lost. Using the ant colony concept, the antsmove randomly between nodes, selecting a pathat each intermediate node based on the distribu-tion of simulated pheromones at each node. Asthey move, they deposit simulated pheromonesas a function of their distance from their sourcenode, and the congestion encountered on theirway [22]. In AntNet, they have applied ideas ofthe ant colony paradigm to solve the routing prob-lem in datagram networks. Ants collect informa-tion about the congestion status of the followedpaths and leave this information locally in thenodes. On the way back from the destination tothe source, the local visiting table of each visitednodes are modified accordingly [5]. ASGA inte-grates ant colony systems with genetic algorithms.Each agent in the ASGA system encodes twoparameters - the sensitivity to link and sensitiv-ity to pheromone parameters. Each agent in thepopulation has to solve the problem using an antsystem and each agent has a fitness according tothe solution found [29].

2.3 Summary

Existing decentralized approaches, which aremostly based on Ant colonies, are not accom-panied with various performance measures tostate how they perform in different scenarios andsituations. Still, decentralized approaches in theGrid infrastructure are fewer in number thanapproaches designed for networks and peer-to-peer systems. In this research, we introduce twonew load balancing algorithms, one based on Antcolony optimization and the other based on par-ticle swarm optimization. The Ant Colony ap-proach is similar to some approaches we reviewedin this section, while the particle swarm approach

is a completely new design. We will investigate,amongst other measures, in particular their per-formance in different scenarios to gain a goodunderstanding of their responsiveness.

3 Approaches

3.1 Requirements

In designing each load balancing algorithm severalimportant characteristics should be kept in mind.A list of these requirements is provided here:

– Optimum resource utilization. A load balanc-ing algorithm should optimize the utilizationof resources by optimizing time or cost relatedto these resources. Since the Grid environ-ment provides a dynamic search space, thisoptimality is inevitably a partial optimality ofthe performance.

– Fairness. A load balancing algorithm is said tobe fair, meaning that the difference betweenthe heaviest loaded node and lightest loadednode in the network is minimized, keeping inmind that the search space is dynamic. Theload is defined by the number of jobs assignedto each resource relative to its computationalpower.

– Flexibility. It means that as the topology of thenetwork or the Grid changes, the algorithmshould be flexible enough to adhere to thechanges in the network.

– Robustness. Robustness refers to the fact thatwhen failures in the system occur the algo-rithm should have a way to deal with thefailure and be able to cope with the situation,i.e. not to break down because of a failure. Onthe contrary, the algorithm should be able todeal with the problem.

– Distribution. Distribution for managing re-sources and running the load balancing algo-rithm has the benefit of leaving out the singlepoint of failure which centralized approachesare affected by.

– Simplicity. By simplicity we try to point outboth the size of single software units whichare being transferred among resources in theGrid, and also the overhead that these units


bring to resources in order to make load bal-ancing decisions. The size of software units areimportant as they take up bandwidth whenthey want to transfer themselves between re-sources. Since their units are being executedin Grid nodes, there is a preference to keepnecessary computations as simple as possible.

3.2 Proposed Approaches

In this research, we are suggesting a new approachfor applying Ant colony optimization to the prob-lem of load balancing. In the previous approaches,ants act independently from jobs being submittedwhile in our approach there is a close bindingbetween jobs and load balancing ants. On theother hand, particle swarm has not been used fordistributed load balancing in the Grid before andwe are proposing a new way to design our loadbalancing algorithms.

3.2.1 Ant Colony Load Balancing: AntZ

In this section, a new load balancing algorithmwhich is developed based on the concepts of antcolony optimization is described. This algorithm(AntZ) is developed by merging the idea of howants cluster objects with their ability to leavetrails on their paths so that it can be a guide forother ants passing their way. We are using theinspiration of how ants are able to cluster objectsusing an inverse version to spread the jobs in theGrid. In particular, we are making use of the mainalgorithm proposed by Montresor et. al [18], aswell as the adoption of the number of steps be-fore decision making as proposed by Cao [27]. Inaddition, we add the decay rate from the classicalant approach, as well as we introduce a mutationrate which is specifically added to the problem ofload balancing.

A pseudo-code of the AntZ approach is pro-vided in Algorithm 3.1. AntZ is a distributedalgorithm and each ant can be considered as anagent working independently. The pseudo-codeaddresses the main functions that an ant performsduring its life cycle. Collectively, all the ants showthe desired behaviour by following these steps.

As shown in the pseudo-code, when a job issubmitted to a local node in the Grid an ant is

initialized and starts working. In each iteration,the ant collects the load information of the node itis visiting (getNodeLoadInformation()) and addsit to its history. The ant also updates the loadinformation table of the visited nodes (localLoad-Table.update()). This load information table of anode contains information of its own load, butalso load information of other nodes, which wereadded to the table when ants visited the node.

When moving to the next node the ant has twochoices. One choice is to move to a random nodewith a probability of mutation rate (mutRate). Theother choice is to use the load table informationin the node to choose where to go. The mutationrate decreases with a DecayRate factor as timepasses, thus, the ant will be more dependent toload information than to random choice. This iter-ative process is repeated until the finishing criteriais met which is a predefined number of steps.Finally, the ant delivers its job to the node andfinishes its task.

When an ant visits a node, it updates the node’sload information table with the information ofother nodes, but at the same time collects theinformation already provided by the table of thatnode, if information exists. The load informationtable acts as a pheromone trail an ant leaves whileit is moving, in order to guide other ants to choosebetter paths rather than wandering randomly inthe network. Entries of each local table are thenodes that ants have visited on their way to delivertheir jobs together with their load information.

Reading the information in the load table ineach node and choosing a direction, which isrepresented as the chooseNextStep() procedure inAlgorithm 3.1, the ant uses a simple policy. Itchooses the lightest loaded node in the table. Thecorresponding pseudo-code is provided in Algo-rithm 3.2. As shown in the algorithm, each entryof the load information table is being evaluatedand compared to whether the current load ofthe visited node is smaller than any other nodeprovided in the load information table. The antthen chooses the node with the smallest load,and in case of a tie, the ant chooses one with anequal probability.

Since the number of jobs submitted to thenetwork increases, the ants can take up a hugeamount of bandwidth of the network, thus moving


ants should be as simple and small-sized as possi-ble. To account for this, instead of carrying the jobwhile the ant is searching for a “light” node, it cansimply carry the source node information to whichthe job was delivered to, including a unique job idof the source node. Thus, whenever an ant reachesits destination the job can be downloaded from thesource as necessary.

The algorithm has some parameters which canbe set according to the specific scheduling require-ments (i.e. size of the network, job specifications,etc.). The effect of these parameters and theirvalues on the performance of the algorithms areinvestigated. One of the parameters is MaxStepswhich defines how many steps an ant should bemoving around until it delivers the assigned jobto a node in the Grid. If the ant wanders toomany steps before delivering the job, it causesan increase in the execution time of each job,and hence, it decreases the performance. On theother hand, if the ant gives up too quickly with-out moving around then the pheromone (loadtable information) which it leaves behind de-creases, and this in turn decreases the perfor-mance of the algorithm. In addition, the ant mightnot have enough time to encounter a good andlight node. Thus, all these parameters should bechosen carefully.

Another effective parameter which influencesthe performance of the AntZ algorithm is Mu-tRate. As the ants are moving and they are us-ing the load table information to decide whichway to go, they sometimes randomly choose anarbitrary node in the Grid to move toward to.The probability of choosing their way randomlyis controlled by MutRate. MutRate decreases withthe decay rate, DecayRate, while the ant is aliveand is searching. This parameter, DecayRate, canalso have an effect on the performance of theAntZ algorithm.

3.2.2 Particle Swarm Optimization: ParticleZ

Particle swarm optimization has roots in twomethodologies. Obvious is its relation to swarmintelligence in general, and to bird flocking, fishschooling, and swarming theory in particular. Itis also related to evolutionary computation, andhas ties to both genetic algorithms (GA) andevolutionary programming. The system is initial-ized with a population of random solutions andsearches for the optimum solution by updatingitself through generations. However, unlike GA,particle swarm optimization (in its standard form)has no evolutionary operators such as crossoverand mutation. In particle swarm optimization, thepotential solutions, called particles, fly throughthe problem space by following the current opti-mum particles [8]. Relationships, similarities anddifferences between particle swarm optimizationand GA are briefly reviewed in [13].

In a particle swarm optimization system, mul-tiple candidate solutions coexist and collaboratesimultaneously. Each solution candidate, called aparticle, flies in the problem search space (similarto the search process for food of a bird swarm)


looking for the optimal position to land. A par-ticle, as time passes through its quest, adjusts itsposition, according to its own experience, as wellas according to the experience of neighboring par-ticles [6]. There are two main characteristics foreach particle in a particle swarm optimization al-gorithm: its position which defines where the par-ticle lies relative to other solutions in the searchspace; and its velocity, which defines the directionand how fast the particle should move to improveits fitness. As in any evolutionary algorithm, thefitness of a particle is a number representing howclose that particle is to the optimum point com-pared to other particles in the search space.

One of the advantages of the particle swarmoptimization technique over other social behaviorinspired techniques is its implementation simplic-ity. As there are very few parameters to adjustin a particle swarm optimization approach, it issimpler than other evolutionary techniques.

Two factors characterize the particle’s status inthe search space: its position and its velocity. Them-dimensional position for the ith particle in thekth iteration can be denoted as:

xi(k) = (xi1(k), xi2(k), ..., xim(k)).

Similarly, the velocity (i.e., distance change) isalso an m-dimensional vector, for the ith particlein the kth iteration and can be described as:

vi(k) = (vi1(k), vi2(k), ..., vim(k)).

The particle updating mechanism for a particlecan be formulated as in (1) and (2):

vk+1id = w ∗ vk

id + c1 ∗ r1 ∗ [pb − xkid]

+c2 ∗ r2 ∗ [gb − xkid] (1)

xk+1id = xk

id + vk+1id (2)

In which vkid, called the velocity for particle i

in the kth iteration, represents the distance to betravelled by this particle from its current position,xk

id represents the particle position in the kth it-eration, pb represents its best previous position(i.e. its experience), and gb represents the bestposition among all particles in the population. r1

and r2 are two random functions with a range [0,1],

having similar or different distributions. c1 andc2 are positive constant parameters called accel-eration coefficients (which control the maximumstep size of the particle). The inertia weight w, isa user specified parameter that controls, togetherwith c1 and c2, the impact of previous historicalvalues of particle velocities on its current velocity.A larger inertia weight pressures towards globalexploration (searching new area), while a smallerinertia weight pressures towards fine-tuning thecurrent search area. Suitable selection of the in-ertia weight and acceleration coefficients can pro-vide a balance between the global and the localsearch. The random values involved, prevent theoptimization from being caught in a local optima.A detailed analysis on the effect of parameterselection on the convergence of particle swarmoptimization is provided in [28].

Using the idea of particle swarm optimization, anew approach for balancing the load in the Grid isproposed. In the ParticleZ algorithm, all the nodesin the Grid are considered as a flock or group ofswarms and each node in the Grid is a particle inthis flock.

Following the analogy from the particle swarmoptimization perspective, the position of eachnode in the flock can be determined by its load.This definition helps as we search in the loadsearch space and try to minimize the load, thuseach node in this search space takes a posi-tion according to its load. The velocity of eachparticle and its position can be defined by theload difference the node has, compared to itsother neighbor nodes. Since the particles are try-ing to balance the load, they can move towardeach other by the changes they make to theirposition (i.e. load), this change in each parti-cle’s position can be achieved by exchanging jobsbetween them. The larger their difference is, thefaster they will move toward each other, with alarger velocity.

The different phases of the ParticleZ algorithmconsist of a job submission, a queuing, a nodecommunication and a job exchange phase. Takinginto account that all nodes are exchanging theirloads in parallel, and the dynamic nature of theenvironment, the network reaches a local opti-mum quickly. Thus, each node submits some jobsto one of its neighbors, which has the minimum


load among all. If all its neighbors are busierthan the node itself, no job is submitted from thecurrent node.

The pseudo-code describing this scenario canbe seen in Algorithm 3.3. This is the pseudo-codeof each individual particle (resource), which runsthe ParticleZ algorithm. As can be seen, if thereare any jobs in the queue waiting to be executedthe node tries to submit them to a lighter node inits neighborhood, and hence achieves a fair loaddistribution among resources.

In exchanging load from a heavier loaded nodeto a lighter loaded node, attention must be paidnot to burden the lighter node, so that it ex-ceeds the load of the second lightest node amongneighbours. If this happens, the distribution of theload is not fair, but a load imbalance is created.To tackle this problem, we define a thresholdvariable, which defines how much load exchangecan happen between the nodes. It is calculatedby subtracting lightestLoad from secondLightest-Load among neighbours and the load exchangetakes place as long as the velocity is greater thanthe threshold value.

There are some issues related to the parti-cle swarm optimization, which are necessary tobe addressed. In the algorithm we propose, aparticle only moves toward its best local neigh-bour, while in the classical particle swarm opti-mization algorithm particles keep track of theirbest global solutions so far. The reason we have

not included the history of each particle is thatwe are dealing with a dynamic environment inwhich the problem being solved is changing all thetime as users are submitting new jobs randomly;thus, the global best solution that the particlehas seen is most likely not valid anymore in thisdynamic environment.

Equation (3) for updating the velocity ofeach particle, which was introduced in Section3.2.2, takes the following form in our designof ParticleZ:

vk+1id = gb − xk

id (3)

As mentioned earlier, we are dealing with anenvironment which is changing dynamically (i.e.the search space is changing), thus the use ofthe past experience of each particle is not useful;therefore, we assign zero to c1 in order to omitthe effect of the past history of the particle. Alsoagain, because of the dynamicity of the problem,the previous velocity should not effect our deci-sion, therefore, we assign a value of zero to w aswell. On the other hand, we want to use neighbourparticles to identify and decide which one is betterto share the work load with, and therefore, wehave used a value of one for c2.

In (4), the formula for updating a particle’sposition is shown, which is the same as the onewe introduced in Section 3.2.2. As mentioned,the position of a particle (xid) is its load valueand it changes while the resource submits jobs toits neighbours.

xk+1id = xk

id + vk+1id (4)

4 Experimental Setup and Results

4.1 Setup

4.1.1 GridSim Toolkit

The GridSim toolkit used as the simulation envi-ronment is a java-based discrete-event Grid sim-ulation toolkit. The toolkit supports modellingand simulation of heterogeneous Grid resources(time-shared and space-shared), users and appli-cation models. It also provides primitives for the


creation of application tasks, mapping of tasks toresources, and their management [3].

Given the implementation nature of the pro-posed algorithms, a time-shared policy was usedfor the AntZ algorithm, and a space-shared policywas taken for the ParticleZ algorithm.

The GridSim toolkit supports the modellingand simulation of a wide range of heteroge-neous resources, such as single or multiproces-sor, shared and distributed memory machineslike PCs, workstations, SMPs (Symmetric Multi-processing), and clusters with different capabili-ties and configurations. It can also be used for themodelling and simulation of application schedul-ing on various classes of parallel and distributedcomputing systems such as clusters, Grids, andP2P networks.

The following are the reasons why the GridSimtoolkit was chosen to simulate and evaluate ourscheduling algorithms [3]:

– It allows modelling of heterogeneous types ofresources.

– Resource capability can be defined in the formof MIPS (Million Instructions Per Second)and SPEC (Standard Performance EvaluationCorporation) benchmark.

– Application tasks can be heterogeneous andthey can be CPU or I/O intensive.

– There is no limit on the number of applicationjobs that can be submitted to a resource.

– Network speed between resources can bespecified.

– It supports simulation of both static and dy-namic schedulers.

– Statistics of all or selected operations can berecorded. These statistics can then be fur-ther analyzed using GridSim statistics analysismethods.

4.1.2 System Model

For experimental purposes we assume that theGrid consists of a set of resources connectedvia different communication links with differentspeeds. In general, each resource may containmultiple computing nodes (machines), and eachcomputing node (machine) may have a single ormultiple Processing Elements (PEs). The compu-

tational power or the speed of each processor isdefined by the number of Cycles Per Unit Time(CPUT). It is actually the GridSim framework’sability that provides us with the definition of thecomputational power of PEs in CPUT.

Generally, each resource may consist of oneor several machines and each machine by itselfcan have one or multiple processing elements.Processors in each computing node can be hetero-geneous, thus, they may have different processingpower. In our simulations, without loss of gen-erality and to emphasize on the basic ideas ofthe algorithms, we assume each resource consistsof one machine and each machine is equippedwith one or several processors (the variations ofthis random number for experiments are providedlater). The processors in the same or differentcomputing nodes have different processing power.

At any one time, a computing node may havebackground workload associated with it, whichwill affect the completion time of the Grid jobsassigned. The GridSim toolkit provides us withthe ability to define the background workloadaccording to historical and statistical informationfor each node. As such, each resource has a back-ground load associated which is taken from theaverage load that the resource has experienced atsimilar times (such as working days or weekends).

4.1.3 Application Model

For our application model, we assume that taskswhich are submitted to the Grid (or the applica-tion which is being run) consists of a set of inde-pendent tasks with no required order of execution.The tasks are of different computational sizes,meaning each task requires a different computa-tion time and data transmission time for comple-tion. The tasks can also have different input andoutput size requirements.

The length of each task is presented in Mil-lions of Instructions (MI). Tasks can be classifiedinto one of two categories: data-intensive andcomputationally intensive tasks. In this research,we are concerned with computationally intensivetasks as they are more common in today’s realworld applications and the waste of computationalpower of resources is, in general, more costly thantheir memory.


4.1.4 Network Topology

In order to account for different network topolo-gies, a random connection graph for the specificnumber of resources was generated. First, a Min-imum Spanning Tree with all the resources iscreated, then, random links are added to the treeto generate the final topology of the Grid. Thus,we have control on the number of links and thetopology of our Grid in different simulations. Forexample, when a resource tries to find its neigh-bours it sends a message in order to retrieve a listof its connected resources.

4.1.5 Performance Evaluation Criteria

In this section we define our performance eval-uation criteria which are used to evaluate theperformance of our algorithms. The criteria in-clude makespan and the load balancing level. Inaddition, two classical algorithms for comparisonpurposes are discussed and used. For all measure-ments taken, we have used an average of thirtyruns in order to guarantee statistical correctness.

Makespan One of the most common measuresin evaluating the performance of a load bal-ancing algorithm is measuring the makespan.The makespan is the “total application execu-tion time”. The total application execution timeis measured from the time the first job is sentto the Grid, until the last job comes out of theGrid. As we generate Gridlets (term defined inGridSim to represent jobs) and topologies ran-domly, although every simulation yields roughlythe same result, each single simulation is differentfrom another one; thus, we have used an aver-age makespan in order to account for realisticconditions.

Load Balancing Level For each resource in theGrid, the load related to that resource is depen-dent on the number of jobs which are assigned tothe node njobs by the Grid scheduler and the powerof its processing elements pi. Equation (5), showshow the resource load lr is calculated.

lr = njobs∑max

i=1 pi(5)

The total load l can be calculated using (6).According to this equation when the resource loadlr increases, it results in an increase in the load l,and a decrease in lr decreases l. The load l is avalue between 0 and 1, where 0 identifies that aresource is not busy and 1 represents a resourcebeing busy.

l = 1 − 1lr

(6)

One of the aims of a load balancing algorithmis to minimize the variations in workloads onall machines. Regarding this, the standard devi-ation in workload is often taken as the perfor-mance measure of a load balancing algorithm. Thesmaller the standard deviation, the better the loadbalancing scheme is. By looking at the changesin the standard deviation of the workload withrespect to time, it is easier to visualize the effectof load balancing upon the time of the system [7].Equation (7), shows the standard deviation of theload in the system.

d =√

∑ni=1(l − li)2

n(7)

In the equation, l is the average load of thesystem and li is the load of the ith resource.

We define the load balancing level (LBL) b ofthe system to be a measure of how good the loadbalancing algorithm is. The load balancing levelof the system is defined in (8). The most effectiveload balancing is achieved when b equals to 100%which implies that d should be zero or closeto zero.

b = (1 − d) ∗ 100% (8)

Comparison Against Classical Approaches Wehave implemented two common classical ap-proaches (Random and State Broadcast Algo-rithm) in order to evaluate the performance ofour algorithms and discuss their benefits overclassical ones. Both algorithms were describedin Section 2.1.

4.2 Results

In order to evaluate the performance of our algo-rithms, we first investigate the effect of different


Fig. 1 Effect of thechange in wandering stepson AntZ makespan

Average Makespan20000

15000

10000

5000

01 2 3 4 5 6 7 8 9 10

Mak

esp

an (

s)Number of Wandering steps

Fig. 2 Effect of thechange in wandering stepson AntZ communicationnumber

Fig. 3 Effect of decayrate on AntZ makespan

Decay Rate

13000

14000

15000

12000

11000

10000

90000 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Mak

esp

an (

s)

Decay rate

Fig. 4 Effect of linknumber on ParticleZmakespan

Makespan

5500

6000

6500

5000

4500

400099 149 199 249 299 349 399 449 499

Mak

esp

an (

s)

Number of links


Fig. 5 Effect of linknumber on ParticleZcommunication number

values for the parameters of each algorithm, andthen we investigate a set of experiments to mea-sure the criteria we introduced in the previoussection. As described earlier, ParticleZ is im-plemented with a space-shared FCFS policy in-side the resources and the AntZ is accompaniedwith a time-shared Round-robin policy to sched-ule the jobs when they are received by a re-source. In all the experiments we have comparedour algorithms with both the Random and theSBA approach.

4.2.1 AntZ Parametric Measurement Ef fects

We investigate algorithm-specific performancemeasures and their effect on the algorithms inthe next set of experiments. First, we investigatethe effect of wandering steps on the performanceof the AntZ algorithm. We have a one hundrednode Grid with one thousand jobs being sent tothe Grid. Figure 1 shows that as we increase thenumber of steps an ant wanders until it deliversthe job to its destination, the makespan of thealgorithm improves, but this increase is larger atthe beginning but later on the rate drops to a greatextent until it becomes stable.

After about 5 or 6 steps the increase in wander-ing steps does not seem to have an effect on theperformance of the algorithm. The reason behindthis phenomenon is that although increasing thenumber of wandering steps seems to have a posi-

Table 1 Grid resource characteristics

Number of machines per resource 1Number of PEs per machine 1–5PE ratings 10 or 50 MIPSBandwidth 1,000 or 5,000 B/S

tive effect on the performance of the algorithm, astables are updated more frequently and ants havemore time to decide which way to go, but on theother hand, it increases the delay before the jobsare being delivered to the resources, and this delayhas a negative effect on the performance.

Figure 2, shows how increasing the numberof wandering steps can effect the communicationoverhead, which is introduced to the system. Thefigure shows that while we increase the wanderingsteps, the communication overhead also increaseslinearly.

In another experiment we measure howdifferent values of the decay rate can effect theperformance of the AntZ algorithm. As you re-member, while the ant is moving we decrease itsmutation rate by a factor; this factor is called thedecay rate. By running this experiment we canfind out what the best decay rate for a set ofspecific attributes of a Grid and its jobs is. Theresults are shown in Fig. 3. For the set of attributeswe have, 0.2 is the best decay rate while the muta-tion rate is set to be 0.5 for this experiment.

4.2.2 ParticleZ Parametric Measurement Ef fects

In the next set of experiments we measure theeffect of different ParticleZ parameter settings on

Table 2 Scheduling parameters and their values

Number of resources 100Number of Gridlets 1,000ParticleZ link number 149AntZ wander number 4AntZ mutation rate 0.5AntZ decay rate 0.2


Table 3 Gridlet characteristics

Length 0–50,000 MIFile size 100 + (10–40%) MBOutput size 250 + (10–50%) MB

the performance of this algorithm. One of theparameters which can effect the performance ofParticleZ is the number of links that connect re-sources together. As each particle (resource) com-municates with its neighbours to find the lightestnode, the number of neighbours can effect theperformance of the algorithm.

Figure 4 shows the effect of increasing the num-ber of links and the connectivity of the resourceson ParticleZ’s makespan. Although it is betterto communicate with more resources before ex-changing jobs, however, it is not always good ascommunication with more resources adds an extratime overhead, which prevents a significant im-provement in the performance of the system.

Figure 5 shows the effect of increasing the num-ber of links on the communication overhead of theParticleZ algorithm. As can be seen in the figure,it has a linear growth with an increasing numberof links.

4.2.3 Makespan and Simulation Time

The characteristics of the resources as Grid re-sources we have used for all of the followingexperiments are shown in Table 1. There is onemachine for each Grid resource and each machinehas a random number of PEs ranging between 1and 5. Each PE has a different processing power.Without loss of generality, we set the local loadfactor for resources to be zero; this does not

effect the performance measure of the algorithms.Setting it to zero helps us analyze the effect andbehaviour of the algorithms better.

For the next set of experiments, we comparethe makespan of the different algorithms. The val-ues for the different parameters of each algorithmare shown in Table 2.

As said earlier, the Gridlets which are sent tothe Grid are supposed to be independent of eachother. The characteristics of the Gridlets sent tothe Grid to compare the makespan of differentalgorithms are shown in Table 3.

Figure 6 shows a comparison between themakespan of the different algorithms with pa-rameter specifications described above. The ex-perimental results show SBA is performing bestamongst all. This is expected as the SBA is keep-ing track of the state of all the resources at eachpoint in time, which enables it to make optimaldecisions at each point in time than all the otherapproaches. After SBA, ParticleZ wins the com-petition by having the second smallest makespan.Comparing ParticleZ and AntZ with each other,ParticleZ performs better than AntZ by a fac-tor of 1.72; also, ParticleZ performs better thanRandom-SpaceShared by a factor of 3.42, andAntZ performs better than Random-TimeSharedby a factor of 1.83.

One important but hidden drawback that SBAis subjected to is related to the overall cpu cyclesand the time it takes for the Grid to execute it.Since there is a copy of the system state vector inall machines, each machine is using some time andcpu cycles to search the state vector individuallyin order to schedule each task. This causes a lotof cpu cycles to be wasted, but as we are runninga parallel platform this disadvantage cannot be

Fig. 6 Comparing themakespan of differentapproaches


Fig. 7 Simulation timerelated to each algorithmin milliseconds

observed. This negative effect is shown in Fig. 7and the time shown in this figure shows only thesimulation time of each algorithm. As the simula-tion is measured on a single machine, the effectof parallelism is discarded and SBA, althoughhaving a very low makespan, actually takes longerto run and it is because all these wasted secondscan not be seen in the previous figure becauseof parallelism. This effect is even worse as thenumber of resources grow in the Grid. Pleasenote that this figure only shows the simulationtime and does not count for different job lengths,etc.

Another drawback related to SBA is the num-ber of communications it takes. Figure 8 showsthe number of extra communications of each algo-rithm to achieve the load balancing. For ParticleZ,each communication message a node sends to itsneighbours to acquire their load status and itsresponse, and each job exchange between tworesources is considered as communication. ForAntZ, each ant taking a step while searching forthe best node to deliver the job to, is consideredas a communication. Finally for SBA, each broad-cast message a resource sends to other resourcesis considered as communication overhead. Thenumbers shown in the figure are the average val-ues of thirty runs with the same parameter settingsas described earlier. As shown in the figure, AntZ

has a higher communication overhead comparedto ParticleZ. Obviously, the other two randomapproaches have no communication overhead atall, thus, they are not shown in the figure. SBA hasthe highest number of communications by a factorof around 1300. This large number of communica-tions can be a bottleneck for the network and inscenarios with congested networks the probabilityof messages being lost increases.

4.2.4 Scalability

In the next experiment we investigate how faireach of the algorithms is. Table 4 shows the loadbalancing level of the system described earlier in(8) along with their standard deviation from sev-eral runs. The closer the value approaches 100%,the better the load balancing of the algorithms is.It means that the load is spread more fairly amongall the resources. According to the experimentalresults both, ParticleZ and SBA, have the bestload balancing levels. AntZ along with the otherrandom approaches ranks third in spreading theload uniformly among resources.

As can be seen in Fig. 9, all the algorithmsshow a linear growth in response to the increasingnumber of jobs. However, SBA along with theproposed approaches show a much slower growthcompared to the random approaches. Among

Fig. 8 Communicationoverhead related to eachalgorithm


Table 4 Average load balancing level of the system fordifferent algorithms

Algorithm Average LBL (%)

ParticleZ-SpaceShared 81±2.09SBA 80±0.47Random-SpaceShared 67±1.29AntZ-TimeShared 65±0.70Random-TimeShared 62±0.95

them, ParticleZ and SBA are quite close to eachother. Table 5 shows each algorithm with its pre-diction trend line for a 100 node Grid. As canbe seen, ParticleZ and SBA have smaller slopesamong all other approaches.

In the next set of experiments we investigatethe effect of increasing the number of jobs onthe performance of the algorithms. Thus, we keepa fixed number of resources and run the experi-ments while we increase the number of jobs beingsent to the Grid. The specifications and parametersettings of the algorithms and the system are listedin Tables 1–3.

In Fig. 10, we investigate the effect of increasingthe length of jobs on the performance of the algo-rithms. Length of the jobs is defined in Millions ofInstructions (MIs) in GridSim. Parameter settingsto run this experiment are the same as described inTables 1–3. We increase the length of the Gridletsby adding 250,000 MIs at each step and investigateits effect on the makespan. The numbers at thebottom of the figure show the execution timefor each algorithm. As can be seen, the growth

Table 5 Predicting execution time based on number ofjobs

Algorithm Prediction trend line (in seconds)

SBA 762.5 * njobs + 808.5ParticleZ-SpaceShared 906.7 * njobs + 1782AntZ-TimeShared 2,478 * njobs + 1,291Random-TimeShared 5,518 * njobs − 291.2Random-SpaceShared 6,069 * njobs − 1,419

is linear for all the approaches and the resultsshow the best performance is achieved by boththe ParticleZ and the SBA algorithm. AntZ ranksthird and the other two random approaches, asexpected, do not respond well to the larger lengthsof Gridlets, but for small Gridlet lengths they canperform comparably to others.

4.2.5 Resource Ef fect and Injection Points

Figure 11 shows how increasing the number ofresources, while keeping the same number ofjobs being sent to the Grid constant, decreasesthe performance of the Grid in terms of an in-crease in execution time. In this experiment, 3,000jobs are sent to the Grid with varying numberof resources, and as can be seen increasing thenumber of resources has a decreasing effect onthe execution time. ParticleZ and SBA are per-forming better when we have a small numberof resources (50) and a large number of jobscompared to the number of resources (3,000).As the number of resources increases the per-

Fig. 9 Effect of the increase in number of jobs on performance of the algorithms


Fig. 10 Effect of the increase in job length on performance of the algorithms

formance, the difference between the algorithmsdecreases.

One of the very interesting performance ques-tions which arises in a distributed algorithm likeAntZ and ParticleZ is: how do the algorithmsrespond if all the jobs are injected from a singlepoint in the Grid? From the AntZ’s perspective itwill take longer to build the load table informa-tion, and from the ParticleZ’s perspective it willhave a negative effect as the jobs will need moretime to be spread evenly. We have investigatedthis effect to see how much it will slow downor have a negative effect on the performance ofthe algorithms.

The random approaches obviously performvery poorly if we send all jobs to one node. Figure12 shows AntZ copes better than ParticleZ in re-sponse to all the jobs being sent to one node in theGrid. The reason lies in the mutation factor whichis incorporated inside AntZ. With the mutation,

an ant moves randomly from time to time whichhelps to build up the load tables more quickly toovercome the negative effect. It can be inferredfrom the figure, that ParticleZ’s performance de-creases by a factor of 2.4 for a one hundrednode network with Gridlets of a length between0 and 50,000. On the other hand, AntZ’s perfor-mance decreases by a factor of 1.36 in the samescenario setting.

4.2.6 Resource Heterogenity

In analyzing the performance of the introducedalgorithms one characteristic that is of impor-tance in real world scenarios is how each algo-rithm responds to different heterogeneity of jobsand resources. Figure 13 shows a comparison ofdifferent makespans for both high and low re-source heterogeneity as well as high and low jobheterogeneity. In this analysis, high resource het-

Fig. 11 Effect ofincreasing number ofresources on executiontime


Fig. 12 Effect of singleand random injectionpoints on theperformance of thealgorithms

erogeneity is simulated by each resource havinga random number of PEs between 6 and 20. Thelow resource heterogeneity is defined by havingthe number of PEs between 1 and 5. High jobheterogeneity is simulated by having the length ofthe Gridlets between 0 and 500,000 MIs, while lowjob heterogeneity refers to the condition wherebyGridlet lengths are between 40,000 to 50,000 MIs.

By analyzing the results in Fig. 13 we can seewhen both resource heterogeneity and job het-erogeneity are low, ParticleZ outperforms all theother approaches. SBA performs very close toParticleZ, while the AntZ approach ranks third.

We can see a similar trend in high job hetero-geneity with low resource heterogeneity. Whenhaving a high resource heterogeneity, SBA out-performs all other approaches with both high andlow job heterogeneity, however, ParticleZ is veryclose to SBA in terms of performance. The otherthree approaches have higher execution times.

4.3 Summary and Discussion

We investigated several algorithm-related para-metric effects for both the AntZ and Parti-cleZ algorithms. We investigated the effect of

different wandering steps on the execution timeand the communication overhead of the AntZalgorithm. The results show as we increase thenumber of wandering steps the performance ofthe AntZ improves, but there is a limit to thisimprovement after which the performance re-mains the same, although the number of wander-ing steps increases. We also studied the effect ofdifferent decay rates on the performance of theAntZ, and identified the best decay rate for oursimulation setting.

For the ParticleZ algorithm, we investigatedthe effect of different link numbers on both theexecution time and the communication overheadof the algorithm. The communication overheadgrows linearly by increasing the number of linkswhile the makespan decreases.

Analyzing the results of the makespan showsthat SBA has the smallest makespan among alland ParticleZ performs better than AntZ in thisregard. Although SBA has the smallest makespanamong all the approaches, comparing its simula-tion time with others reveals that there are manycomputational activities going on in parallel in allmachines to execute SBA, which although it doesnot effect the overall makespan, it increases the

Fig. 13 Effect ofheterogeneity of jobs andresources on themakespan (measured inseconds)


computational complexity for the overall Grid,and therefore, makes SBA the worst approachamong all in terms of computational complexity.

Comparing the number of communicationseach algorithm is concerned with, SBA shows thelargest growth in communication cost comparedto the other two approaches, whereby ParticleZinvolves the smallest number of communicationsamong all.

ParticleZ also wins the competition among allother approaches regarding the “fairness” mea-sure, as it has the highest load balancing levelamongst all other approaches.

Looking at the scalability of the algorithms, allapproaches show a linear growth in response toan increase in the number of jobs. ParticleZ andSBA have the smallest gradient and are very closeto each other, AntZ ranks third.

Regarding an increase in the length of jobs,all approaches show a linear increase; however,ParticleZ along with SBA are best among all. Fur-thermore, an increase in the number of resourcesdecreases the makespan.

Overall, ParticleZ proves to perform slightlybetter than AntZ in many regards. On the otherhand, looking at the results it shows that Parti-cleZ has most of the advantages of SBA with-out having its disadvantages. However, there isone drawback associated with ParticleZ. Whenjobs are sent to the Grid and are submitted toone or a small number of resources and are notspread throughout the Grid, ParticleZ’s perfor-mance decreases more than AntZ’s performance.The reason is the mutation factor incorporatedwithin AntZ, which makes it better to deal withsuch situations.

5 Conclusion

In this research we have investigated the use ofswarm intelligence inspired techniques in design-ing distributed Grid load balancing algorithms.Specifically, we have taken inspiration from socialinsect systems and sociological behaviour of birdsand school of fishes.

The contributions of the designed algorithms(AntZ and ParticleZ) can be categorized as fol-lows: (1) Many centralized load balancing ap-

proaches have been developed and applied tothe Grid, even those with inspiration taken fromswarm intelligence techniques such as genetic al-gorithms and Tabu search, however, the central-ized approaches have many drawbacks as we pre-viously outlined. Research using swarm intelli-gence techniques for distributed load balancinghas only started to be investigated. (2) Althougha variety of ant colony inspired approaches havebeen used for distributed load balancing, thereis no comparison of these approaches with anyother distributed swarm intelligence technique. Inthis research, we compared the performance ofthe ant colony approach with another swarm in-telligence technique, particle swarm optimization.We performed measurements to compare the twoalgorithms in order to identify which are moreeffective and under which conditions. We alsocompared the performance of the algorithms withother classical techniques. (3) Particle swarm op-timization has been used to address the problemof centralized load balancing [1, 6, 21], but it hasnever been used for distributed load balancingin a dynamic environment such as the Grid. (4)Most of the research and experimental results,especially in the area of distributed load balancingand ant colony, have used their own developedinfrastructure to simulate the performance of theirapproaches, thus, the question remains how wellthey would perform in a real world environment.We have used a real world simulation platform,GridSim, which provides us with a reliable toolkitby allowing evaluations to be done under realisticconditions.

This research investigated two different ap-proaches (inspired by ant colony and particleswarm optimization) for developing load bal-ancing algorithms and it shows the benefits ofswarm intelligence techniques in the distributedGrid load balancing domain. Furthermore, weshowed, although particle swarm optimization hasnot been used widely in designing distributed loadbalancing algorithms, it performs quite well andit even outperforms the ant colony approach inmany scenarios. One of the important charac-teristics of the designed algorithms compared tocentral approaches is their responsiveness to scal-ability of the Grid. In centralized approaches, anincrease in the number of resources in the Grid


can always be a problem as the information ofall the resources has to be stored and must berecalled at all time, but our distributed approacheswork quite well with a large number of resourcesand jobs. This drawback was seen by running theSBA simulations with a large number of resourcesand examining the simulation time.

The advantages of our proposed algorithms canbe summarized as follows: 1) Looking at the sim-ulation results, the algorithms show good perfor-mance results and optimized resource utilization.2) The algorithms have proved to be “fair” com-pared to a random and SBA approach. ParticleZhas a load balancing level of 81%, SBA has a loadbalancing level of 80%, AntZ achieves a load bal-ancing level of 65% and the random approacheshave a load balancing level close to 65%. 3) BothParticleZ and AntZ are flexible approaches indealing with the changes that happen in the Grid.4) Both proposed approaches are distributed innature. As the algorithms have taken inspirationfrom sociological systems, being distributed is aninherent part and we used this ability in design-ing the approaches. 5) Both algorithms are verysimple which is a benefit for a distributed system.In the AntZ approach, the ants which have tomove among resources to find the best resourceto deliver the job to, are very small in size andperform small computations in each resource. TheParticleZ algorithm implements simple computa-tions as it only sends small messages and has tochoose the lightest resource amongst all neigh-bour resources. 6) Looking at the scalability of thealgorithms they show linear growth in responseto both an increase in the number of jobs and anincrease in the length of jobs.

In conclusion, we can say classical approachessuch as Random and SBA, although suitablefor small sized networks, are not efficient forlarge Grids.

6 Future Work

The algorithms in their current state do not ad-dress the problem of dynamic resource failure inthe Grid. A mechanism should be in place thatprevents Gridlet loss while any resource in theGrid shuts down. Another issue which is worth

addressing is the special scenario when all re-sources in the Grid are too busy to take newjobs on. The question arises what should be donewith new jobs that are submitted to the Grid. An-other issue worth investigating is that although wehave simulated the algorithms within a simulationframework similar to a real world scenario; it maystill need some small modification. For example,we have not considered issues related to secu-rity in this research. One of the steps which canbe taken toward adding security is limiting antsfrom performing particular actions in differentresources.

One of the important issues in large-scale Gridsand peer-to-peer systems is resource failures andthe robustness of the system. As the size of theGrids is continually increasing, the probability ofresource failures also increases. As such, devel-oping fault tolerant algorithms which are ableto deal with these failures are gaining more andmore attention. Failures which happen in a Gridenvironment can be divided into two categories.

A resource may shutdown manually, thus, it cansend a notice message or perform some additionalsteps before shutting down. Or the resources mayfail suddenly without any notice. Thus, we needto incorporate a mechanism to deal with bothcases of resource failures in our system withouteffecting jobs submitted by users.

The ParticleZ algorithm can deal with failuresmore easily. At the time a resource wants toshare its workload with other resources, it sim-ply sends a message and queries about its avail-able neighbours, therefore, whenever a resourcebreaks down it is automatically eliminated fromthis process. Thus, in case of ParticleZ, a messagesent to the user reporting on the uncompleted jobswould suffice.

However, when a resource fails without furthernotice the situation is more complex. One possiblesolution is the following. When a job is sent to theGrid by a user, the worst case execution time willbe estimated for that job. This predicted time rep-resents the worst case in which the user must havereceived the results of its job submission. An eventis then scheduled for the predicted time. At thisspecific time, the user will check whether the jobresult was returned; if the job result has alreadycome back successfully, no further actions will be


taken; otherwise, the job will be resubmitted andthe whole process repeats.

Another important issue is to pay attentionthat the nodes may not be dedicated nodes in theGrid, and each may have their own backgroundworkload. Thus, the local load of each resourceshould be incorporated in the load calculationequation, which effects the decision making of thealgorithms accordingly.

In this research, we have simulated the pro-posed algorithms with a simulation platform de-veloped for the Grid, and the results proved tobe promising. The next step would be to applythe algorithms in a real world Grid or incorporatethe algorithm in existing Grid toolkits such as theSun Grid Engine or Globus toolkit to confirm thesimulation results.

References

1. Abraham, A., Liu, H., Zhang, W., TChang, G.:Scheduling jobs on computational Grids using fuzzyparticle swarm algorithm. In: Proceedings of 10th Inter-national Conference on Knowledge-Based and Intel-ligent Information and Engineering Systems, pp. 500–507 (2006)

2. Al-Dahoud, A., Belal, M.: Multiple ant colonies forload balancing in distributed systems. IN: Proceedingsof The first International Conference on ICT and Ac-cessibility (2007)

3. Buyya, R., Murshed, M.M.: Gridsim: a toolkit for themodeling and simulation of distributed resource man-agement and scheduling for Grid computing. Concurr.Comput.: Pract. Exp. 14, 1175–1220 (2002)

4. Cao, J.: Self-organizing agents for Grid load balancing.In: Proceedings of the Fifth IEEE/ACM InternationalWorkshop on Grid Computing, pp. 388–395 (2004)

5. Di Caro, G., Dorigo, M.: Antnet: distributed stigmer-getic control for communications networks. J. Artif.Intell. Res. 9, 317–365 (1998)

6. Chen, T., Zhang, B., Hao, X., Dai, Y.: Task schedul-ing in Grid based on particle swarm optimization.In: ISPDC ’06: Proceedings of the Fifth InternationalSymposium on Parallel and Distributed Computing,pp. 238–245. IEEE Computer Society, Washington(2006)

7. Chow, K.P., Kwok, Y.K.: On load balancing for dis-tributed multiagent computing. IEEE Trans. ParallelDistrib. Syst. 13(8), 787–801 (2002)

8. Eberhart, R., Kennedy, J.: A new optimizer using par-ticle swarm theory. In: Proceedings of the Sixth Inter-national Symposium on Micro Machine and HumanScience, MHS ’95., pp. 39–43 (1995)

9. Foster, I., Kesselman, C.: The Grid in a Nutshell. In:Nabrzyski, J., Schopf, J.M. (eds.) Grid Resource Man-agement: State of the Art and Future Trends, pp. 3–13.Kluwer Academic Publisher, Boston (2004)

10. Grosan, C., Abraham, A., Helvik, B.: Multiobjectiveevolutionary algorithms for scheduling jobs on compu-tational Grids. In: Guimares, N., Isaias, P. (eds.) ADISInternational Conference, Applied Computing. Sala-manca, Spain (2007)

11. Heusse, M., Guerin, S., Snyers, D., Kuntz, P.: Anew distributed and adaptive approach to rout-ing and load balancing in dynamic communicationnetworks, 4 pp. Technical Report (1998)

12. Kandagatla, C.: Survey and taxonomy of grid resourcemanagement system. http://www.cs.utexas.edu/users/browne/cs395f2003/projects/KandagatlaReport.pdf(2003)

13. Kennedy, J., Eberhart, R.: Particle swarm optimization.In: Proceedings of IEEE International Conference onNeural Networks, vol. 4, pp. 1942–1948 (1995)

14. Kwang, M.S., Sun, H.W.: Ant colony optimization forrouting and load-balancing: survey and new directions.IEEE Trans. Syst. Man Cybern. Part A 33(5), 560–572(2003)

15. Li, Y., Lan, Z.: A Survey of Load Balancing in GridComputing. Springer, Berlin (2005)

16. Livny, M., Melman, M.: Load balancing in homoge-neous broadcast distributed systems. SIGMETRICSPerform. Eval. Rev. 11(1), 47–55 (1981)

17. Moallem, A., Ludwig, S.A.: Using techniques for dis-tributed Grid job scheduling. In: Proceedings of the24th Annual ACM Symposium on Applied Computing(2009)

18. Montresor, A., Meling, H., Babaoglu, O.: Messor:Load-Balancing through a Swarm of AutonomousAgents. Springer, Berlin

19. Salehi, M.A., Deldari, H.: Grid load balancing using anecho system of intelligent ants. In: PDCN’06: Proceed-ings of the 24th IASTED International Conferenceon Parallel and Distributed Computing and Networks,vol.52, 47 pp. ACTA Press, Anaheim, CA, USA (2006)

20. Salleh, S., Zomaya, A.Y.: Scheduling in Parallel Com-puting Systems: Fuzzy and Annealing Techniques. TheSpringer International Series in Engineering and Com-puter science (1999)

21. Salman, A., Ahmad, I., Al-Madani, S.: Particle swarmoptimization for task assignment problem. Micro-process. Microsyst. 26, 363–371 (2002)

22. Schoonderwoerd, R., Holland, O., Bruten, J.: Ant-likeagents for load balancing in telecommunications net-works. In: AGENTS ’97: Proceedings of the first inter-national conference on Autonomous agents, pp. 209–216. ACM, New York (1997)

23. Schoonderwoerd, R., Holland, O.E., Bruten, J.L.,Rothkrantz, L.J.M.: Ant-based load balancing intelecommunications networks. Adapt. Behav. 2, 169–207 (1996)

24. Shivaratri, N.G., Krueger, Ph., Singhal, M.: Load dis-tributing for locally distributed systems. Computer25(12), 33–44 (1992)

http://www.cs.utexas.edu/users/browne/cs395f2003/projects/KandagatlaReport.pdf

http://www.cs.utexas.edu/users/browne/cs395f2003/projects/KandagatlaReport.pdf


25. Sim, K.M., Sun, W.H.: Multiple Ant Colony Optimiza-tion for Load Balancing. Springer, Berlin (2003)

26. Subrata, R., Zomaya, A.Y.: A comparison of threeartificial life techniques for reporting cell planning inmobile computing. IEEE Trans. Parallel Distrib. Syst.14(2), 142–153 (2003)

27. Subrata, R., Zomaya, A.Y., Landfeldt, B.: Ar-tificial life techniques for load balancing in computa-tional Grids. J. Comput. Syst. Sci. 73(8), 1176–1190(2007)

28. Trelea, I.C.: The particle swarm optimization algo-rithm: convergence analysis and parameter selection.Inf. Process. Lett. 85(6), 317–325 (2003)

29. White, T., Pagurek, B.: Asga: improving the ant sys-tem by integration with genetic algorithms. In: Uni-versity of Wisconsin, pp. 610–617. Morgan Kaufmann(1998)

30. Yagoubi, B., Slimani, Y.: Dynamic load balancing strat-egy for Grid computing. in: Proceedings of WorldAcademy of Science, Engineering and Technology(2006)

31. Zhu, W., Sun, Ch., Shieh, C.: Comparing the per-formance differences between centralized load bal-ancing methods. IEEE International Conference onSystems, Man, and Cybernetics, vol. 3, pp. 1830–1835(1996)

32. Derbal, Y.: Entropic Grid scheduling. J Grid Comput-ing 4, 373–394 (2006)

33. Ranganathan, K., Foster, I.: Simulation studies of com-putation and data scheduling algorithms for data Grids.J Grid Computing 1, 53–62 (2003)

34. Lucchese, F., Huerta, E.J., Yero, Sambatti, F.S.: Anadaptive scheduler for Grids. J Grid Computing 4,1–17 (2006)

Documents

Swarm Intelligence Approaches for Grid Load Balancingsiludwig/Publish/papers/JGC2011.pdf · Swarm Intelligence Approaches for Grid Load Balancing ... and scheduling are key Grid services,