



J. Parallel Distrib. Comput. 70 (2010) 707–718


Partition oriented frame based fair scheduler
Arnab Sarkar *, P.P. Chakrabarti, Sujoy Ghose
Computer Science & Engineering Department, Indian Institute of Technology, Kharagpur, WB 721 302, India

Article info

Article history:
Received 15 August 2008
Received in revised form 30 June 2009
Accepted 7 March 2010
Available online 24 April 2010

Keywords:
Proportional fairness
ERfair scheduling
Partitioned scheduling
Real time
Task migration
Worst-fit decreasing (WFD) partitioning
Frame based scheduling

Abstract

Proportionate fair schedulers provide an effective methodology for scheduling recurrent real-time tasks on multiprocessors. However, a drawback in these schedulers is that they ignore a task's affinity towards the processor where it was executed last, causing frequent inter-processor task migrations which ultimately result in increased execution times. This paper presents the Partition Oriented Frame Based Fair Scheduler (POFBFS), an efficient proportional fair scheduler for periodic firm and soft real-time tasks that ensures a bounded number of task migrations. Experimental results reveal that POFBFS can achieve a 3 to 100 times reduction in the number of migrations suffered with respect to the General-ERfair algorithm (for a set of 25 to 100 tasks running on 2 to 8 processors) while simultaneously maintaining high fairness accuracy.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Real-time embedded systems that concurrently run a mix of different independent applications such as real-time audio processing, streaming video, interactive gaming, web browsing, telnet, etc. are becoming more and more commonplace [27,28,33]. A characteristic feature of all these applications is that they not only demand that deadlines be met, but also require CPU reservation to ensure a minimum guaranteed quality of service [25,13,14,23,31]. These demands are generally of the form: reserve X units of time for application A out of every Y time units. Proportionate fair (Pfair) scheduling (PF [6], PD [8], PD2 [3], ERfair [2]) forms an important class of techniques to handle this problem. Consider a set of tasks {T1, T2, ..., Tn}, with each task Ti

having a computation requirement of ei time units, required to be completed within a period of pi time units from the start of the task. Proportional fair schedulers need to manage their task allocation and preemption in such a way that not only are all task deadlines met, but each task is also executed at a consistent rate proportional to its task weight (or task utilization) ei/pi. More formally, let the start time of a task Ti be si. Then proportional fairness guarantees the following for every task Ti: at the end of any time slot t, si ≤ t ≤ si + pi, at least (ei/pi)·(t − si) of the total execution requirement of ei must be completed. Obviously, for such a criterion to be feasible in a system of m identical processors, we must have ∑_{i=1}^{n} ei/pi ≤ m. Also, since we usually consider discrete time lines, appropriate integral values must be used while examining fairness. Scheduling algorithms that ensure proportional fairness must also be able to handle situations where new tasks are dynamically added and existing tasks are deleted.

* Corresponding author.
E-mail addresses: [email protected], [email protected] (A. Sarkar), [email protected] (P.P. Chakrabarti), [email protected] (S. Ghose).

Typically, Pfair algorithms divide the tasks into equal sized subtasks. Subtasks are scheduled appropriately to ensure fairness. To achieve this, the scheduling bandwidth (earliest and latest slots) of each subtask is determined. At every time slot, an appropriate subtask from the set of runnable tasks is chosen. One popular choice is to select the subtask whose deadline (latest possible slot) is the earliest [2]. Various tie-breaking rules have been proposed to achieve fairness [3,8].
However, in spite of attractive features like flexible resource management, dynamic load distribution, fault resilience, etc. [32], actual implementations of these Pfair schedulers are rare, primarily due to their global scheduling policy which allows a task to execute on any processor when resuming after having been preempted. Thus, Pfair schedulers, being generally ignorant of a task's affinity towards the processor where it was executed last, are applicable only in tightly coupled, shared-memory UMA (Uniform Memory Access) multiprocessor architectures. When the given architecture is more loosely coupled with separate local and shared memories and the shared memory is insufficient to accommodate the states of all tasks, Pfair fails due to restrictively high migration overheads.
Inter-processor task migrations primarily effect two types of overheads, namely, cache-miss related overheads and task transfer related overheads. Cache-miss related overheads refer to the


delay suffered by resumed threads of execution due to compulsory and conflict misses while populating the caches with their evicted working sets after a migration takes place. Task transfer related overheads refer to the time spent by the operating system to transfer the complete state of a thread from the processor where it had been executing to the processor where it will execute next after a migration. Obviously, the more loosely coupled a system, the higher will be the transfer overhead. These expensive overheads underline the importance of devising suitable scheduling techniques that attempt to maximize the time for which a task executes on a particular processor, especially in real-time systems where time is at a premium. This area has attracted considerable interest over the years [1,7,9,11,12,21,26,29]. An overview of the different approaches that have been adopted to avoid inter-processor task migrations is given in the next section.
The work presented in this paper is motivated by the fact that there exists a large class of systems which need to support a dynamic mix of coexisting, possibly misbehaving (by trying to take more time than they are actually stipulated to), independent applications with different firm and soft real-time requirements [18,19,28]. Missed deadlines are undesirable but not catastrophic. Thus, in these systems, while maintaining fairness is important, a slight deviation from perfect fairness can be tolerated if it substantially reduces the inter-processor migration overhead. This is more so because the speedup obtained through reduced scheduler overheads not only provides more bandwidth for useful applications, but (as discussed later in Section 6.3.2) also effectively enhances the actual system fairness in terms of real time.
With this motivation in mind, we have developed the Partition

Oriented Frame Based Fair Scheduler (POFBFS), a frame based proportional fair scheduler that provides high fairness accuracy while allowing only a bounded number of task migrations for a set of periodic firm and soft real-time tasks.
POFBFS works as follows: the idea is to define a frame/window of a certain specific size (consisting of a certain number of time slots) and to allocate shares (of time slots) to each task in proportion to their weights ei/pi within the frame. A two phased mechanism is then used to partition the task set among the m available processors. The first phase allocates tasks to individual processors using the worst-fit decreasing (WFD) [22,24] heuristic such that the sum of task shares in each processor is less than the frame size G, and simultaneously builds a sorted list ΛMGR of tasks that cannot receive their full share on any single processor. The second phase allocates the tasks in list ΛMGR by partitioning the share of each such task across more than one processor. Having assigned the tasks to the available processors, the task shares are executed using an ERfair scheduler in each processor. After execution completes inside a frame, each task is put in an appropriate future frame such that ERfairness [2] of the system remains preserved at frame boundaries. Experimental results reveal that POFBFS can achieve a 3 to 100 times reduction in the number of migrations suffered with respect to the General-ERfair algorithm (for a set of 25 to 100 tasks running on 2 to 8 processors) while simultaneously maintaining high fairness accuracy. The use of a frame based approach for task grouping was earlier found to be useful in reducing the scheduling complexity of proportionally fair uniprocessor schedulers [30]. The application of a similar technique for global task partitioning in multiprocessors by POFBFS for migration aware scheduling highlights the general utility of such a frame based approach. However, the proposed approach is effective for independent task systems. For non-independent task systems, which are represented by task graphs, there exists a large body of work [16,17].
This paper is organized as follows. In the next section, we review

the important approaches targeted at avoiding inter-processor task migrations. In Section 3, we introduce some important terminology and definitions. Section 4 describes the working of the POFBFS algorithm along with illustrative examples and then presents the algorithm in detail. We present an analysis of the algorithm in Section 5. Experimental results are presented in Section 6. We conclude in Section 7.

2. Related work

The traditional approach to avoiding migration has been to adopt a partition oriented scheduling approach where, once a task is allocated to a processor, it is exclusively executed on that processor [5,10,22]. One of its major problems though is that none of the algorithms here can achieve an overall system utilization (∑_{i=1}^{n} ei/pi) greater than (m + 1)/2 in a system of m processors in the worst case [4]. That is, no more than 50% of the system capacity can be used in order to ensure that all deadlines are met. However, this worst case condition may be relaxed by bounding the maximum weight of any individual task under a certain value. Lopez et al. in [21] proved that if all the tasks have a weight under a value α, the worst case achievable utilization becomes: Uwc(m, β) = (βm + 1)/(β + 1), where β = ⌊1/α⌋. Thus, as α approaches 0, Uwc(m, β) approaches m, and when α = 1, Uwc(m, β) becomes (m + 1)/2.
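For reference, the bound above is easy to evaluate numerically. The helper below is our own illustration (not code from [21]); it simply plugs m and α into the quoted formula.

    from math import floor

    def u_wc(m, alpha):
        # Lopez et al. bound: U_wc(m, beta) = (beta*m + 1)/(beta + 1), beta = floor(1/alpha)
        beta = floor(1 / alpha)
        return (beta * m + 1) / (beta + 1)

    print(u_wc(4, 1.0))   # 2.5, i.e. (m + 1)/2 when alpha = 1
    print(u_wc(4, 0.1))   # 41/11 ~= 3.73, approaching m as alpha -> 0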

The middle ground between the two extremes of no migration and unrestricted migration is formed by algorithms allowing restricted inter-processor migration of real-time tasks, and this area has also attracted attention in the last few years [1,5,12,15]. Anderson et al. in [1] have provided an interesting restricted migration oriented algorithm that provides bounded deadline tardiness and does not pose any restriction on the overall system utilization. The algorithm is also important from the standpoint that it ensures that over a long range of time, all tasks execute at their prescribed rates (given by their weights); thus, it takes the first steps at attempting to develop a partition oriented real-time rate based scheduler. However, being based on EDF [20], a non rate based algorithm, its rate based properties are weak. The other limitations of the algorithm are that it does not allow task priorities to change within a job, requires individual task utilizations to be capped at at most 1/2, and restricts the number of processors into which a task can be migrated to at most 2. Another important restricted migration based algorithm with a much stronger notion of fairness has been presented by Kimbrel et al. in [15]. Given a set of n tasks to be scheduled on m identical processors (n = qm + r, q ∈ N, r ∈ I and 0 ≤ r < m), this algorithm asymptotically requires r(m − r) task migrations every n(q(d − 1) + 1) time slots, where d (drift) measures the deviation from perfect fairness and is defined as the maximum difference between the number of time slots of execution completed by any two tasks. This research thus provides a trade-off between the drift d and the number of migrations. The drawbacks of this algorithm stem from its limited scope of applicability in that it works only for a set of persistent (non-dynamic) equal priority tasks.
The partition oriented frame based proportional fair scheduling approach that we have proposed in this work (POFBFS) ensures at most m − 1 task migrations per frame and, in spite of being partition oriented, it does not place any additional restriction on individual task utilizations for guaranteeing schedulability. POFBFS also does not restrict the number of processors into which a migrating task may get partitioned and allows a task to be partitioned across all the m processors if required.

3. Terminology and definitions

3.1. Notations

• t: Time; represents the tth time slot.
• n: Total number of tasks.
• m: Number of processors used.


• T: The set of tasks. Symbolically, T = {T1, T2, T3, ..., Tn}, where Ti is the ith task.
• Ti^j: jth subtask of Ti.
• si: Starting time of Ti. This is equivalent to its arrival time.
• ei: Execution requirement of Ti (in number of time slots).
• pi: Period within which Ti's execution must complete to meet its deadline.
• rei: Currently remaining execution requirement of Ti.
• rpi: Currently remaining period of Ti; the number of time slots remaining within which Ti must execute so that its deadline is not violated.
• G: Frame size (in number of time slots). The size of a frame is a design parameter and is appropriately chosen.
• ctu: Summation of weights of all the currently active task instances in the system. Its value is updated whenever a new instance of a task starts, or an existing instance of a task finishes execution.
• Sys_Util: Summation of weights of the tasks currently executing in the system. Its value is updated only when a new task or application arrives into the system or an existing task departs. Unlike ctu, its value remains unaffected by the start or completion of existing task instances.
• Vi: Denotes the ith processor.
• Zi: The number of fixed tasks in processor Vi.
• ai: Number of processors into which a migrating task Ti gets partitioned.
• counti: The remaining unexecuted share (execution requirement) of task Ti in the current frame.
• ift: Intra-frame time; the number of time slots that have passed in the current frame.
• fst: Starting time of the current frame.

3.2. Definitions

(1) lag(Ti, t): The difference between the amount of time actually allocated to a task and the amount of time that would be allocated to it in an ideal system with a scheduling quantum approaching zero. Formally, the lag is defined as follows:

lag(Ti, t) = (ei/pi)·(t − si) − (ei − rei).  (1)

The under-allocation of Ti at time t is defined as max(0, lag(Ti, t)).

(2) Early-Release fairness (ERfairness): A schedule is said to be early-release fair (ERfair) iff: (∀Ti, t :: lag(Ti, t) < 1). That is, the under-allocation associated with each task must always be less than one time quantum.

(3) nafi: Next allotted frame for task Ti. nafi is calculated when Ti completes execution in a frame and has to be allotted a future frame in which it will execute next. It gives the number of frames for which Ti can skip execution while still avoiding under-allocation. Lemma 2 derives the formula for nafi.

nafi = ⌊rpi/(rei · G)⌋.  (2)

(4) shri: Next allotted share for task Ti. shri is calculated when Ti completes execution in a frame and determines the share of Ti in the next frame in which it will execute. This is given by the difference between the number of time slots of execution which Ti must complete by the end of its next allotted frame and the number of time slots of execution which Ti has already completed. The formula for shri as derived in Lemma 2 is given below.

shri = ⌈(ei/pi)((nafi + 2)G + (fst − si))⌉ − (ei − rei).  (3)

(5) wti: Weight of a task Ti inside a frame. It is given by the ratio of its share shri and the frame size G: wti = shri/G.

(6) τx: The spare capacity in processor Vx for a migrating task T1. It is defined as:

τx = G − ∑_{i=2}^{Zx+1} shri.  (4)

(7) wt_mei%k: Given a migrating task T1 that gets partitioned into a processors (whose indices are denoted as %1, %2, ..., %a) within a frame, wt_mei%k denotes the effective weight of each fixed task Ti while T1 is executing on processor V%k (%k being the index of the kth processor (1 ≤ k ≤ a) into which T1 gets partitioned). The formula for wt_mei%k has been derived in Theorem 4.

wt_mei%k = shri(G − shr1) / (G(G − τ%k)).  (5)

(8) wt_mpei%k: Similar to wt_mei%k. wt_mpei%k represents the effective weight of each fixed task Ti after the non-terminating task T1 completes executing in processor V%k (refer to Theorem 4).

wt_mpei%k = shri / (G − τ%k).  (6)

(9) pdik: Pseudo-deadline of the kth subtask of task Ti within a given frame. It denotes the time slot before which Ti must complete executing its kth subtask to remain ERfair. It is given by:

pdik = ⌈k/wti⌉.  (7)
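The quantities above translate directly into floor/ceiling arithmetic over discrete time slots. The following sketch is our own illustration of Eqs. (1)–(3) and (7), not code from the POFBFS implementation.

    from math import floor, ceil

    def lag(e, p, s, re, t):
        # Eq. (1): lag(Ti, t) = (ei/pi)*(t - si) - (ei - rei)
        return (e / p) * (t - s) - (e - re)

    def is_erfair(e, p, s, re, t):
        # Definition (2): the schedule is ERfair for Ti at time t iff lag < 1
        return lag(e, p, s, re, t) < 1

    def naf(rp, re, G):
        # Eq. (2): number of frames Ti may skip without risking under-allocation
        return floor(rp / (re * G))

    def shr(e, p, s, re, fst, naf_i, G):
        # Eq. (3): share of Ti in its next allotted frame
        return ceil((e / p) * ((naf_i + 2) * G + (fst - s))) - (e - re)

    def pseudo_deadline(k, wt):
        # Eq. (7): slot by which the kth subtask must finish to remain ERfair
        return ceil(k / wt)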

4. The POFBFS algorithm

The POFBFS actually works at two levels. The outer level consists of a global task allocator that partitions and allocates the tasks to the different processors at the beginning of each frame. Each of the individual processors then runs a separate ERfair scheduler at the inner level to execute the tasks allotted to it. We now describe the global task allocation mechanism with an illustrative example.
Global task allocation: Given a list of n tasks arranged in non-increasing order of their shares (the number of time slots allotted to each task within a frame) in a system of m processors, the global task allocation mechanism may be conceptualized by the following three steps (a code sketch of the first two steps appears after the step 3 example below):
1. Allocation of fixed tasks to individual processors: The first step partitions the task set into m disjoint subsets using the Worst-Fit Decreasing (WFD) bin packing algorithm such that the sum of task shares in each such set is less than the frame size G. If none of the processors has enough spare capacity to accommodate a task, it is put in a separate list called ΛMGR.

Example. Let us consider a sorted list of five tasks T1, T2, T3, T4, T5 to be scheduled within a frame of size G = 12 in a system of m = 3 processors. Let the task shares be as follows: shr1 = 10, shr2 = shr3 = 9, shr4 = shr5 = 4. The partitioning process then partitions the task set onto the three processors in WFD manner. The mapping of the task set to the set of processors V1, V2 and V3 as provided by step 1 is as follows: T1 → V1, T2 → V2, T3 → V3. The remaining capacities of V1, V2 and V3 after the above mapping process are 2, 3 and 3 respectively. Thus, tasks T4 and T5, each having a share value of 4, do not fit into the remaining capacities of any of the processors and hence get stored in the list ΛMGR.

2. Partitioning and allocation of migrating tasks: The tasks in list ΛMGR obtained at the end of the first step have to be partitioned and stored into the remaining capacities of more than one processor. If the first task in ΛMGR, say Tmgr, gets partitioned into a processors (2 ≤ a ≤ m), Tmgr completely fills up the remaining capacities of the first a − 1 processors among these a processors. The remaining capacity of the ath processor may be partially filled. The sum of the partial shares of Tmgr in the a processors equals its total share in the frame. The next task in ΛMGR is partitioned and stored starting from the ath or the (a + 1)th processor depending on whether the ath processor has any remaining spare capacity left.

Example. After step 1, tasks T4 and T5 form the set of candidate migrating tasks and have to be partitioned into 2 or more processors. In step 2, T4 gets partitioned with a partial share of 3 on V2 and 1 on V3, while T5 gets partitioned with a partial share of 2 on V3 and 2 on V1.

3. Calculation and assignment of priorities to fixed tasks in the presence of migrating tasks: To ensure that a migrating task Tmgr is able to execute at a rate shrmgr/G (this is necessary to ensure that Tmgr will be able to complete execution of its share shrmgr by the end of the frame without being required to be scheduled simultaneously on more than one processor) on any intermediate processor Vx into which Tmgr gets partitioned, the fixed tasks on Vx must execute with a lower weight (given by Eq. (5)) while task Tmgr is executing, so that the sum of the task weights on Vx does not exceed 1. After Tmgr completes executing on Vx, the fixed tasks start executing at a higher rate (given by Eq. (6)) so that they may complete execution of their shares by the end of the frame.

Example. In our example, processors V2 and V3 contain the non-terminating migrating tasks T4 and T5 respectively. Therefore, the fixed tasks T2 and T3 must execute at the lower rates 2/3 and 7/12 (given by Eq. (5)) instead of their actual rate 3/4 while T4 and T5 are executing on V2 and V3. After the non-terminating migrating tasks T4 and T5 finish executing on V2 and V3 respectively, the fixed tasks T2 and T3 execute at the higher rates 1 and 11/12 (given by Eq. (6)) respectively to complete their shares by the end of the frame. It is worth noting that the terminating migrating task T4 does not effect any weight revision of the fixed task T3 on V3.
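The two allocation phases above can be summarized in a few lines of code. The sketch below is our own simplified illustration (processor indices are 0-based and the helper name is a placeholder), not the authors' implementation; it reproduces the running example with G = 12, m = 3 and shares 10, 9, 9, 4, 4.

    def global_task_allocation(tasks, m, G):
        # tasks: list of (name, share) pairs; returns the fixed-task mapping and
        # the per-processor split of each migrating task.
        spare = [G] * m                       # remaining capacity of each processor
        fixed = [[] for _ in range(m)]
        lambda_mgr = []                       # tasks that fit on no single processor

        # Phase 1: worst-fit decreasing -- heaviest task first, emptiest processor first.
        for name, share in sorted(tasks, key=lambda t: t[1], reverse=True):
            p = max(range(m), key=lambda i: spare[i])
            if share <= spare[p]:
                fixed[p].append((name, share))
                spare[p] -= share
            else:
                lambda_mgr.append((name, share))

        # Phase 2: split each migrating task over the processors with the largest
        # spare capacities, completely filling one before moving to the next.
        order = sorted(range(m), key=lambda i: spare[i], reverse=True)
        migrating, idx = {}, 0
        for name, share in lambda_mgr:
            parts = {}
            while share > 0 and idx < m:
                portion = min(share, spare[order[idx]])
                if portion > 0:
                    parts[order[idx]] = portion
                    spare[order[idx]] -= portion
                    share -= portion
                if spare[order[idx]] == 0:
                    idx += 1                  # this processor is now full
            migrating[name] = parts
        return fixed, migrating

    fixed, migrating = global_task_allocation(
        [("T1", 10), ("T2", 9), ("T3", 9), ("T4", 4), ("T5", 4)], m=3, G=12)
    # fixed:     T1 -> V1, T2 -> V2, T3 -> V3 (indices 0, 1, 2)
    # migrating: T4 split as 3 on V2 and 1 on V3; T5 split as 2 on V3 and 2 on V1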

ERfair based scheduling in individual processors: Within the frame, the shares in each individual processor are executed using a separate ERfair scheduler. At the end of the frame, the tasks in each processor are rescheduled for execution in an appropriate future frame (using Eq. (2)) with a proper share (calculated using Eq. (3)) so that ERfairness of the system is maintained at frame boundaries. We now present an example to show how a processor allocates future frames and shares to tasks at the end of a frame.

Example. Let us suppose that at the end of a frame, a processor contains 4 tasks T1, T2, T3, T4 having weights 3/5, 1/5, 3/25, 2/25. Let the remaining execution time required by each task be 18 time slots. So, re1 = re2 = re3 = re4 = 18. Hence, rp1 = 30, rp2 = 90, rp3 = 150, rp4 = 225. Let the frame size G be 10. The naf and shr values of the tasks T1, T2, T3 and T4 are 0, 0, 0, 1 and 6, 2, 2, 1 respectively. Therefore, tasks T1, T2 and T3 get scheduled to execute in the very next frame while T4 gets scheduled in the frame after next.

4.1. Detailed algorithm

4.1.1. Data structures
The algorithm primarily uses two data structures, namely, an array FA of arrays FL of linked lists and the ready heap RHi of tasks in each processor Vi. The array of arrays, FA, manages all the runnable tasks. Each array FL in FA corresponds to a frame. Each linked list FLi forms the bucket of tasks with share value G − i. The ready heap is arranged in non-decreasing order of task pseudo_deadlines pdi. The nodes corresponding to each task in RHi contain information including rei, rpi, wti, counti, wt_mei, wt_mpei, pdi, etc.

4.1.2. Size of array FA
The size FSZ of array FA is determined by the maximum number of frames that may ever be required to be accessed simultaneously. This number is obtained from the lower bound (1/k) of the weights of tasks in the system. FSZ is defined as: FSZ = ⌈k/G⌉ + 1. FSZ thus defines the sliding window of the maximum number of frames that may be accessed simultaneously. To maintain this sliding window, FA has been implemented as a circular array. Fig. 1 shows this data structure.

Fig. 1. The principal data structure: FA forms the array of arrays FL of linked lists. Each array FL in FA corresponds to a frame. Each linked list FLi forms the bucket of tasks with share value G − i.
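A compact way to picture FA is as a circular array of FSZ frames, each frame holding G share-indexed buckets. The following sketch is our own (the class and method names are assumed), not the paper's implementation.

    from collections import deque
    from math import ceil

    class FrameArray:
        def __init__(self, G, k):
            # k is the inverse of the minimum task weight (all weights >= 1/k),
            # so at most FSZ = ceil(k/G) + 1 frames can be live simultaneously.
            self.G = G
            self.FSZ = ceil(k / G) + 1
            # FA[f][i] is the bucket (linked list) of tasks with share G - i in frame f.
            self.FA = [[deque() for _ in range(G)] for _ in range(self.FSZ)]
            self.current = 0                  # index of the frame now executing

        def insert(self, task, naf, share):
            # Queue the task in the frame it will next execute in, bucketed by share.
            frame = (self.current + naf + 1) % self.FSZ
            self.FA[frame][self.G - share].append(task)

        def advance(self):
            # Move the sliding window forward at a frame boundary.
            self.current = (self.current + 1) % self.FSZ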

4.1.3. The algorithm pseudo-code
The POFBFS algorithm consists of three functions. The main function Algorithm POFBFS carries out the overall scheduling and calls two other functions, namely Global_Task_Allocator(Li) and Schedule(Vi), which are called at the beginning of each frame to partition and allocate tasks to the processors and to schedule the allotted tasks within each processor respectively.

Algorithm 1 Algorithm POFBFS
1: for each active task Ti in T do {Initialize (FA)}
2:   rei ← ei; rpi ← pi.
3:   Calculate nafi. {Determine execution frame of Ti; using Eq. (2)}
4:   shri ← (ei/pi)(nafi + 1)G; wti ← shri/G; counti ← shri.
5:   Create a new list node αi for Ti.
6:   Insert αi at the tail of the queue of frame FA[nafi + 1].
7: while true do
8:   Select the next non-empty frame FAi.
9:   if all frames are empty then
10:    Exit.
11:  Form sorted list Li of tasks in FAi.
12:  Call Global_Task_Allocator(Li).
13:  for each processor Vk from 1 to m in parallel do
14:    Call Schedule(Vk).

4.1.4. Handling dynamic task arrival and departure
In a dynamic task system, a task can arrive or depart at any time. Handling dynamic task departure is easy. The task node is removed from the ready heap RH of its containing processor and its weight is decremented both from ctu and Sys_Util. Similarly, when a new task, say Ti, arrives within a frame, its weight is first added to the current values of both ctu and Sys_Util. Then its share shri for the remaining part of the frame is calculated. If shri ≥ 1 and can be fully accommodated within the processor, say Vk, having the highest spare capacity sck, Ti is allocated processor Vk with share value shri. Otherwise, if shri > sck, Ti is allocated with share value sck. The remaining share (shri − sck) is adjusted in future frames. This has been done to avoid the complexities of handling a new migrating task inside a frame. If shri < 1, Ti is allocated in an appropriate future frame with a proper share.

Algorithm 2 Function Global_Task_Allocator(Li)
1: Let the indices of the m processors V1, V2, ..., Vm be arranged in a list P = pc1, pc2, ..., pcm, such that at any time the list remains sorted in terms of processor weights, pc1 holding the index of the lightest processor and pcm holding the index of the heaviest processor.
2: for each task Tj in Li, starting with the heaviest do {Allocate fixed tasks.}
3:   Insert Tj in the ready heap RHpc1 of the processor Vpc1.
4:   if Vpc1 does not have enough spare capacity to hold Tj then
5:     Insert Tj in list ΛMGR.
6:   After insertion, shift Vpc1 to its proper place in P such that it remains sorted.
7: Let Vpc1, ..., Vpcm have spare capacities scpc1, ..., scpcm. So, scpc1 > ... > scpcm.
8: for each task Tj in ΛMGR, starting with the heaviest do {Allocate migrating tasks.}
9:   tshr ← shrj. {shrj is the share of task Tj for the next frame.}
10:  if tshr > 0 and there are more processors in the list P then
11:    Extract the next processor Vpck from list P.
12:    TPj ← TPj ∪ Vpck. {TPj: list of processors into which Tj gets partitioned.}
13:    for all fixed tasks Tx in Vpck do
14:      Calculate wt_mex pck {using Eq. (5)} and wt_mpex pck {using Eq. (6)}.
15:    if |TPj| = 1 then
16:      migrpck ← 1. {Set the migr flag of Vpck to indicate that the migrating task Tj will execute in Vpck from the start of the next frame.}
17:      for all fixed tasks Tx in RHpck do wtx ← wt_mex.
18:    Insert Tj in the ready heap RHpck.
19:    if tshr > sck then
20:      tshr ← tshr − sck; sck ← 0; goto Step 10.
21:    else
22:      sck ← sck − tshr; tshr ← 0.
23:      Delete the first processor from list TPj and goto Step 8.

4.1.5. Low overhead sorting
POFBFS uses the worst-fit decreasing (WFD) strategy to partition the task set. This requires the tasks to remain sorted (in non-increasing order of share values) when a frame starts. As the share value of a task can only range between 1 and G, we used a counting sort technique to order the tasks in O(G) time. This is O(n) since we only consider values of G which are proportional to the task set size n.
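Because shares are integers in 1..G, bucketing by share value yields the required order directly. The snippet below is our own illustration of this counting sort, not the authors' code.

    def sort_by_share_desc(tasks, G):
        # tasks: iterable of (name, share) pairs with 1 <= share <= G.
        # Returns the tasks in non-increasing order of share in O(n + G) time.
        buckets = [[] for _ in range(G + 1)]
        for name, share in tasks:
            buckets[share].append((name, share))
        ordered = []
        for share in range(G, 0, -1):        # largest shares first
            ordered.extend(buckets[share])
        return ordered

    # With G = 12 the example task set of Section 4 sorts as T1, T2, T3, T4, T5.
    print(sort_by_share_desc([("T4", 4), ("T1", 10), ("T5", 4), ("T2", 9), ("T3", 9)], 12))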

Algorithm 3 Function Schedule(Vi)
1: while RHi is not empty do
2:   Extract from RHi the task Tk having the nearest pseudo_deadline pdk. {Ties are broken arbitrarily}
3:   Decrement countk; calculate pdk for the next subtask of Tk. {using Eq. (7)}
4:   if countk = 0 then {The share of Tk is exhausted}
5:     if |TPk| > 0 then {Tk is a migrating task}
6:       migrk ← prev_migri ← 0.
7:       Find the next processor Vnxt from TPk and delete this processor from TPk.
8:       migrnxt ← 1.
9:       for all tasks Tx in RHi do
10:        wtx ← wt_mpex.
11:    else
12:      Calculate nafk {using Eq. (2)}; calculate shrk {using Eq. (3)}.
13:      Insert Tk at the tail of FL[shrk] in the frame FA[i + nafk].
14:  else
15:    Insert Tk into heap RHi.
16:  if prev_migri = 0 ∧ migri = 1 then {A new migrating task Tmgr has arrived}
17:    if |TPmgr| > 0 then {Tmgr is a non-terminating migrating task}
18:      for all tasks Tx in RHi do
19:        wtx ← wt_mex.
20:    Insert Tmgr in the ready heap RHi.
21:  prev_migri ← migri.

4.1.6. The work conserving nature of POFBFS
POFBFS is a work conserving algorithm. So, no processor should idle at any time unless the number of runnable tasks becomes less than m, the number of processors. In the algorithm mentioned above, the principle of work conservation may be violated in two situations, occurring either at frame boundaries or inside frames. We now describe these situations and present the mechanisms that have been employed to handle them.
Situation 1: There are light runnable tasks scheduled for execution in future frames but the next frame has large spare capacities. To handle this situation, the light tasks that have not been allocated in the next frame are stitched together in a linked list ltl and maintained in non-increasing order of their weights. Whenever the next frame has spare capacity and ltl is not empty, tasks are extracted from ltl and scheduled for execution in the next frame with a share value of 1.
Situation 2: All the tasks that have been scheduled to run in a frame on a given processor have finished executing their shares before the completion of the frame. To avoid such a situation, we avoid scheduling tasks in future frames just after they complete execution of their shares within a frame. Instead, they are inserted into a separate heap of completed tasks CH. Whenever the ready heap RH becomes empty, tasks from CH are executed in ERfair fashion either until the end of the frame, or until a new task arrives in RH. At the end of the frame, all the tasks are scheduled in appropriate future frames.
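The Situation 2 fallback amounts to a two-level task pick at each slot. The sketch below is our own simplification (per-processor state only, heaps assumed to hold (pseudo-deadline, task) pairs), not the paper's code.

    import heapq

    def pick_next_task(RH, CH):
        # RH and CH are min-heaps of (pseudo_deadline, task) tuples.
        # Prefer the ready heap; fall back to the completed-task heap CH so the
        # processor never idles while runnable work remains in the frame.
        if RH:
            return heapq.heappop(RH)
        if CH:
            return heapq.heappop(CH)
        return None    # nothing runnable on this processor in this frame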

5. Analysis of the algorithm

Lemma 1. Given a set of tasks T = {T1, T2, ..., TZx+1} with corresponding shares Shr = {shr1, shr2, ..., shrZx+1} to be scheduled within a frame of size G on a processor Vx where T1 is the migrating task not terminating in Vx, the length of the interval (NTSx) by which T1 finishes execution on Vx is given by: NTSx = (G/shr1)·τx.

Proof. Existence of the migrating task T1 implies that ∑_{i=1}^{Zx+1} shri > G.
The fraction of T1's share to be executed on Vx is (G − ∑_{i=2}^{Zx+1} shri)/shr1.
So, the number of time slots NTSx by which T1 must complete executing this (G − ∑_{i=2}^{Zx+1} shri)/shr1 part of its share on Vx is:
NTSx = G·(G − ∑_{i=2}^{Zx+1} shri)/shr1 = (G/shr1)·τx. □

Theorem 1. If a task T1 gets partitioned into a processors (whose indices are denoted as %1, %2, ..., %a) within a frame, the sum of the maximum under-allocations (UT%k) suffered by the fixed tasks Ti on V%k (%k being the index of the kth processor (1 ≤ k ≤ a) into which T1 gets partitioned) is non-increasing (that is, ∀k (1 ≤ k ≤ a − 1) UT%k ≥ UT%k+1) and is given by:

UT%k = τ%k (1 − (∑_{x=1}^{k} τ%x)/shr1).  (8)

Proof. T1 executes in processor V%k only after it executes sequentially in V%1, V%2, ..., V%k−1. Using Lemma 1, the total time (TEk) elapsed from the start of the frame to the termination of T1 in V%k is given by: TEk = (G/shr1)·∑_{x=1}^{k} τ%x.
As T1's share on V%k is τ%k, all tasks in V%k would execute at their specified rates shri/G (2 ≤ i ≤ Z%k) if T1 were to execute at a rate τ%k/G throughout the frame interval. At this rate T1 would complete (τ%k · TEk)/G time slots of execution within the interval TEk. However, T1 actually executes at a faster rate and completes executing its whole share τ%k in this interval.
Hence, the sum of the maximum under-allocations suffered by the fixed tasks at the end of TEk time slots is given by: UT%k = τ%k − (τ%k · TEk)/G = τ%k [1 − (∑_{x=1}^{k} τ%x)/shr1].
We now prove that ∀k (1 ≤ k ≤ a − 1), UT%k ≥ UT%k+1. When partitioning a migrating task into processors, POFBFS considers the processors in non-increasing order of their remaining spare capacities. Again, because a migrating task completely fills up the remaining spare capacity (provided it has enough remaining share to do so) of a processor before proceeding to the next, the largest portion of its total share always gets allotted to the first processor, the second largest portion to the second processor, and so on. That is, τ%1 ≥ τ%2 ≥ ... ≥ τ%a. The rest of the proof is easily derivable using Eq. (8). □
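As a quick check (our own illustration, not from the paper, writing ϱk for the index %k and taking τϱk to be T1's portion on the kth processor), apply Eq. (8) to the running example of Section 4, where the migrating task T4 has total share shr1 = 4 and is split as 3 on V2 and 1 on V3:

    \begin{align*}
    UT_{\varrho_1} &= \tau_{\varrho_1}\left(1 - \frac{\tau_{\varrho_1}}{shr_1}\right) = 3\left(1 - \tfrac{3}{4}\right) = 0.75,\\
    UT_{\varrho_2} &= \tau_{\varrho_2}\left(1 - \frac{\tau_{\varrho_1} + \tau_{\varrho_2}}{shr_1}\right) = 1\left(1 - \tfrac{4}{4}\right) = 0,
    \end{align*}

which is non-increasing and stays below the shr1/4 = 1 bound of Theorem 2.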

Corollary 1. At any instant within a frame, each available processor contains 1 or more fixed tasks, 0 or 1 migrating task that terminates in it, and 0 or 1 migrating task that does not terminate in it.

Proof. From Theorem 1. □

Theorem 2. The sum of the maximum under-allocations of the fixed tasks within a frame in a given processor due to a migrating task is, in the worst case, 1/4th the total share of the migrating task.

Proof. From Theorem 1, we know that the fixed tasks in the first processor into which a migrating task gets partitioned suffer the maximum under-allocation, and it is given by: UT%1 = τ%1 [1 − τ%1/shr1].
To find the value of τ%1 for which UT%1 is maximized, we differentiate UT%1 with respect to τ%1 and equate it to zero: dUT%1/dτ%1 = 0, or 1 − 2τ%1/shr1 = 0, or τ%1 = shr1/2. Therefore, MAX(UT%1) = shr1/4. □

Theorem 3. Given a set of n tasks to be scheduled on a set of m processors within a frame of size G, the share value of a migrating task gets maximized when all tasks have equal priority and the system is fully loaded. The worst case value for the sum of maximum under-allocations of the fixed tasks within a frame in a given processor for this case is given by: MAX_UT = Gm/(4n).

Proof. As POFBFS considers tasks in non-increasing order of their share values (using the Worst-Fit Decreasing (WFD) bin packing heuristic) during the Global Task Allocation phase, the smaller the relative difference between the shares of the different tasks, the higher is the probability that a task with a relatively larger share becomes a migrating task.
Hence, a migrating task will have its maximum share value when all tasks have equal priority and the system is fully loaded. For a set of n equal priority tasks on a set of m fully loaded processors, the weight wti of each task Ti becomes: wti = m/n, and the share shri of each task Ti within a frame of size G becomes: shri = (m/n)·G. Hence, from Theorem 2, we get MAX_UT = Gm/(4n). □

Theorem 4. If a migrating task T1 gets partitioned into a processors (whose indices are denoted as %1, ..., %a) within a frame, then the effective weight of the fixed tasks Ti on processor V%k when T1 is executing on V%k (%k being the index of the kth processor (1 ≤ k ≤ a) into which T1 gets partitioned) is given by: wt_mei%k = shri(G − shr1)/(G(G − τ%k)).
The effective weight of each fixed task Ti after T1 has completed executing in V%k is given by: wt_mpei%k = shri/(G − τ%k).

Proof. When the migrating task T1 is not executing on V%k, the fixed tasks in it execute at higher rates and get over-allocated. At the instant when T1 starts executing on V%k, the total over-allocation becomes: OV%k = (τ%k/G)·∑_{x=1}^{k−1} NTS%x = (τ%k/shr1)·∑_{x=1}^{k−1} τ%x.
Therefore, the over-allocation of each fixed task Ti (2 ≤ i ≤ Z%k + 1) is OV%k · shri/(G − τ%k).
On the other hand, from Eq. (8), the under-allocation of each fixed task Ti at the instant when T1 completes execution on V%k is obtained as: UT%k · shri/(G − τ%k).
The number of time slots of execution that a fixed task Ti would complete by executing at its original rate shri/G during the period NTS%k is given by: (shri/G)·NTS%k = (shri/shr1)·τ%k. However, the actual number of time slots that Ti completes executing during this period is: (shri/shr1)·τ%k − (UT%k + OV%k)·shri/(G − τ%k) = shri·τ%k(G − shr1)/(shr1(G − τ%k)).
Therefore, the effective weight wt_mei%k of each fixed task Ti during the period NTS%k is given by: wt_mei%k = [shri·τ%k(G − shr1)/(shr1(G − τ%k))]/NTS%k = shri(G − shr1)/(G(G − τ%k)).
Hence, the first part of the theorem is proved. Now, we prove the second part.
The number of time slots of execution that a fixed task Ti would complete if it executed at a rate shri/G in the remaining period of the frame, G − ∑_{x=1}^{k} NTS%x, after T1 completes execution on V%k is: (shri/G)·(G − ∑_{x=1}^{k} NTS%x). However, Ti actually executes: (shri/G)·(G − ∑_{x=1}^{k} NTS%x) + UT%k·shri/(G − τ%k) = [shri/(G − τ%k)]·(G − ∑_{x=1}^{k} NTS%x).
Therefore, wt_mpei%k = shri/(G − τ%k). □

Lemma 2. Given a task Ti of weight ei/pi having remaining execution requirement rei time slots and remaining period rpi time slots at the time when it completes execution within a frame, Ti will not suffer under-allocation at the end of its next frame of execution, provided it executes next in the (nafi + 1)th frame after the current frame with a share shri and the system is not overloaded.

Proof. Ti will not be under-allocated after the execution of its next subtask if it gets scheduled at or before the next ⌊rpi/rei⌋ time slots. Now, given a frame of size G, Ti will avoid under-allocation after execution in its next frame if it gets scheduled within the next ⌊rpi/(rei·G)⌋ frames.
Now, we show that if Ti executes in the (nafi + 1)th frame after the current frame, its correct share should be shri.
Let us assume that Ti has completed execution of its share within a frame after ift time slots have passed within the frame. So, fst + ift − si time slots have elapsed since its arrival and G − ift time slots are left before the end of the frame.
If Ti executes next in the (nafi + 1)th frame, the number of time slots between the arrival of Ti and the (nafi + 1)th frame's completion is given by: (fst + ift − si) + (G − ift) + (nafi + 1)G = (nafi + 1)G + G + (fst − si) = (nafi + 2)G + (fst − si).
Hence, to avoid under-allocation after executing in the (nafi + 1)th frame, Ti must complete execution of ⌈(ei/pi)((nafi + 2)G + (fst − si))⌉ time slots. As Ti has already completed ei − rei time slots of execution, it needs to execute a share of shri = ⌈(ei/pi)((nafi + 2)G + (fst − si))⌉ − (ei − rei) time slots in the (nafi + 1)th frame to avoid under-allocation. □

Theorem 5. Algorithm POFBFS satisfies ERfairness at frame boundaries.

Proof (By Induction). At t = 0, all tasks have a lag of 0; the hypothesis is trivially true.
At each frame boundary, the naf and shr values for all tasks which executed in the previous frame are calculated, giving the appropriate frame and share values for the tasks such that they do not get under-allocated.
We assume the truth of the hypothesis after the ith frame, at t = iG. We have to establish the truth of the hypothesis at t = (i + 1)G. All tasks scheduled to execute in the (i + 1)th frame may have either come from the ith frame or from some earlier frame, according to the naf value that was calculated initially (if the task is executing for the first time) or after the exhaustion of its share in the frame where it last executed. Now by Lemma 2, no task (whether it executes in the (i + 1)th frame, or gets scheduled for execution in a later frame) executing with its corresponding share shr can get under-allocated at the (i + 1)th frame's completion. Thus, POFBFS is ERfair at frame boundaries. □

Theorem 6. POFBFS has an amortized scheduling complexity of O(max(m, lg n)).

Proof. Let us analyze the complexity of each step of algorithm POFBFS.

(1) The for loop in lines 1–6 initializes different scheduler parameters. Initialization takes O(n) time but is done only once at the beginning of scheduling. So, the scheduling complexity is not affected by this function.
(2) Selection of the next non-empty FL list before the start of each frame (line 8) can always be done in a constant number of steps as the size of FA is fixed.
(3) Sorting (line 11) takes O(n) time in the worst case. (We have used a counting sort technique. In most cases however, as the maximum share that a task may have is much less than the frame size G, the actual sorting overhead becomes very low.) This is done at the start of each frame. As each frame is of O(n) size, the effective overhead of sorting on the complexity is O(1).
(4) Line 12 calls function Global_Task_Allocator() at each frame transition. We now analyze the complexity of this function.
(a) The for loop in lines 2–6 is called for each task. Line 3 involves a heap insertion and line 6 contains an insertion into a sorted list, and thus each has a complexity of O(lg n). Lines 4 and 5 execute in constant time. Hence, the for loop has a complexity of O(n · lg n).
(b) The for loop in lines 8–23 allocates migrating tasks. As there can be at most m − 1 migrating tasks, this loop is executed at most O(m) times. The for loops in lines 13–14 and line 17 have an O(n) overhead in the worst case. Line 18, involving a heap insertion, has an O(lg n) complexity. The other lines execute in constant time. So, the for loop in lines 8–23 has an overall complexity of O(m · n).
(c) Thus, Global_Task_Allocator() suffers an overall overhead of O(n · max(m, lg n)). However, because this function is called only at frame boundaries, it has an amortized complexity of O(max(m, lg n)) per time slot.
(5) Lines 13–14 call function Schedule() in parallel in each processor. This function basically implements an ERfair scheduler and thus has a complexity of O(lg n) per time slot.
(6) Therefore, the overall scheduling complexity of POFBFS is O(max(m, lg n)). □

6. Experiments and results

We have experimentally evaluated both the migration overheads and the fairness of our algorithm and compared it against: 1. the General-ERfair algorithm, and 2. a modified version of the General-ERfair algorithm called Stringent_ERfair, which provides around a 2 to 3.5 times reduction in the number of migrations suffered compared to the General-ERfair algorithm (for a set of 25 to 100 tasks running on 2 to 8 processors), albeit at the cost of an increase in scheduling complexity. The evaluation methodology is based on simulation studies using an experimental framework which is described in the next subsection. We now provide a brief overview of the General-ERfair and Stringent_ERfair algorithms.
Given a set of m processors V1, V2, ..., Vm and n (≥ m) tasks, General-ERfair chooses the m most urgent tasks from a priority queue at each time slot and allocates processors to these tasks in the order in which they have been extracted from the priority queue (thus, the first task extracted from the queue is allocated V1, the second task is allocated V2, and so on). Thus, General-ERfair is completely oblivious of the processor on which a task executed the last time it was scheduled and hence incurs an unrestricted number of migrations.
Stringent_ERfair, on the other hand, remembers the processor where a task executed the last time it was scheduled. Like General-ERfair, it also chooses the m most urgent tasks at each time slot. For each of these m tasks, if the processor where it last executed is free, it allocates this processor to the task. Otherwise, it puts the task in a separate list L1. Once all the m tasks have been checked for the availability of the processor where they executed last, the tasks in L1 are allocated to the remaining free processors. Hence, only the tasks in L1 incur a migration.
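The per-slot allocation step of Stringent_ERfair can be sketched as follows (our own simplified rendering; processor indices are 0-based and last_proc is an assumed bookkeeping map, not part of the original description).

    def allocate_slot(urgent_tasks, last_proc, m):
        # urgent_tasks: the m most urgent tasks in priority order.
        # last_proc: dict mapping a task to the processor it ran on last (or None).
        # Returns a dict processor -> task for this time slot.
        assignment = {}
        L1 = []                                  # tasks whose old processor is taken
        for task in urgent_tasks:
            p = last_proc.get(task)
            if p is not None and p not in assignment:
                assignment[p] = task             # same processor, no migration
            else:
                L1.append(task)
        free = [p for p in range(m) if p not in assignment]
        for task, p in zip(L1, free):            # only these tasks incur a migration
            assignment[p] = task
            last_proc[task] = p
        return assignment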

6.1. Experimental setup

The experimentation framework used is as follows: the data sets consist of randomly generated hypothetical periodic tasks whose execution periods (pi) and weights (ei/pi) have been taken from normal distributions.
Given the total number of tasks to be generated (n) and the summation of weights of the n tasks (U), the task weights have been generated from a distribution with standard deviation (σ) = 0.1 and mean (µ) = U/n. The summation of weights of the tasks as generated through the above procedure is not constant. However, making the summation of weights constant helps in the evaluation and comparison of the algorithms. Therefore, the weights have been scaled uniformly to make the cumulative weight of each distribution constant and equal to U. All the task periods have also been generated from a normal distribution having σ = 3500 and µ = 4000. Now, different types of data sets have been generated by setting different values for the following parameters:
(1) Task set size n: sizes considered were 25, 50 and 100 tasks.
(2) Number of processors NP: Multiprocessor systems consisting of 2, 4 and 8 processors were considered.


Fig. 2. Migrations per time slot for POFBFS; 100% system load. (a) 100 tasks; 2, 4 and 8 processors. (b) 25, 50 and 100 tasks; 8 processors.

Fig. 3. Migration ratio k (General-ERfair:POFBFS = k:1); 100% system load. (a) 100 tasks; 2, 4 and 8 processors. (b) 25, 50 and 100 tasks; 8 processors.

Fig. 4. Migration ratio k (Stringent_ERfair:POFBFS = k:1); 100% system load. (a) 100 tasks; 2, 4 and 8 processors. (b) 25, 50 and 100 tasks; 8 processors.

(3) Workload: We have considered workloads varying between 40% and 100% system load.
(4) Frame size G: For each combination of the above parameters, measurements have been taken for various frame sizes in the range n to 10n. Each value is a multiple of the task set size n.
During experimentation, no slack has been provided between the periods of two consecutive instances of a task. This has been done to keep the total load on the system constant throughout the schedule. The schedule length has been taken to be 500000 time slots.
To evaluate POFBFS's performance under dynamic workloads, a slightly different experimental setup has been used. Here, we have considered periodic tasks in which the number of instances of each task has been generated from a Poisson distribution while the inter-arrival times have been generated from an exponential distribution. The detailed setup and the results are presented in Section 6.2.3.

6.2. Results

6.2.1. Migration measurements
We have measured the average number of inter-processor migrations suffered per time slot by the POFBFS, General-ERfair and Stringent_ERfair algorithms, running them on 100 different instances of each data set type. We have presented the number of migrations suffered per time slot by POFBFS and have also found the ratio k (called the migration ratio) of the number of migrations suffered by the General-ERfair and Stringent_ERfair algorithms with respect to POFBFS. Fig. 2 presents the plots of the number of migrations per time slot for POFBFS. Figs. 3 and 4 show the plots for the ratio of the number of migrations suffered respectively


Fig. 5. Migration ratio k for 100 tasks; frame size G = 500; 2, 4 and 8 processors; system workloads varying between 40% and 100%. (a) General-ERfair:POFBFS = k:1. (b) Stringent_ERfair:POFBFS = k:1.

Table 1
Fairness of POFBFS for 25, 50 and 100 tasks, system loads 90%, 95% and 100%, 2, 4 and 8 processors and frame sizes varying between n and 10n.

G     S (%)   n = 25                    n = 50                    n = 100
              m = 2   m = 4   m = 8     m = 2   m = 4   m = 8     m = 2   m = 4   m = 8
n     90      0.000   0.000   0.000     0.000   0.000   0.000     0.000   0.000   0.000
      95      0.000   0.000   0.000     0.000   0.000   0.000     0.000   0.000   0.000
      100     0.008   0.010   0.022     0.011   0.019   0.045     0.018   0.021   0.063
2n    90      0.000   0.000   0.001     0.000   0.000   0.001     0.000   0.000   0.001
      95      0.000   0.000   0.002     0.000   0.000   0.003     0.000   0.000   0.002
      100     0.009   0.016   0.051     0.017   0.031   0.078     0.017   0.049   0.100
5n    90      0.000   0.002   0.007     0.000   0.002   0.015     0.000   0.002   0.015
      95      0.000   0.004   0.015     0.000   0.005   0.019     0.000   0.005   0.032
      100     0.017   0.053   0.094     0.023   0.074   0.186     0.038   0.099   0.282
10n   90      0.002   0.009   0.013     0.002   0.009   0.023     0.002   0.011   0.035
      95      0.005   0.015   0.037     0.005   0.022   0.045     0.003   0.031   0.059
      100     0.043   0.095   0.154     0.080   0.162   0.277     0.111   0.299   0.372

n: Task set size; m: Number of processors; G: Frame size; S: Total system load; G = kn: Fairness when G is k times n.

by the General-ERfair and Stringent_ERfair algorithms with respect to the POFBFS algorithm for different numbers of tasks and processors on fully loaded systems (100% system load). In Fig. 5, we present the plots of migration ratios for 100 tasks and frame size G = 500 on 2, 4 and 8 processors with system workloads varying between 40% and 100%.

6.2.2. Fairness measurements
To measure the scheduling fairness of POFBFS, we have developed a measure called miss which is similar to the measure drift used by Kimbrel et al. in [15]. At any given time instant, while drift measures the maximum difference between the amount of execution completed by any two tasks, the miss of a particular task measures the difference between the amount of execution completed by it and the amount of execution that would be completed by it in a perfectly fair system. Although drift is a good fairness measure in the case of task systems consisting of persistent (all tasks have the same arrival time) equal priority (all tasks have the same execution rate requirement) tasks as considered in [15], it is clearly inapplicable to the task systems considered in this paper because the tasks here are dynamic and have different priorities. miss is defined in terms of the lag (Eq. (1)) of each task at each instant of time as follows:

miss = lag, if lag > 0; 0, otherwise.

Thus, if at a given time slot a task has lag = 3, it is considered to have suffered 3 misses at that time slot. We determine the miss values for each task at each quantum of time. Using these miss values, the average miss over the entire schedule length is found out. This is given by: avg_miss = (∑ miss)/(tslot · n). Here, tslot represents the total number of time slots in the schedule and n represents the task set size.

The value of avg_miss gives a measure of the number of misses per time slot per task. Thus, if the avg_miss value of a schedule is 0.016, it means that there will be 0.016 misses per time slot per task, or 16 misses every 1000 time slots. Table 1 summarizes the fairness results obtained for the POFBFS algorithm for system loads between 90% and 100%. Fig. 6 shows the fairness plots for system loads between 40% and 100% when the frame size is five times the number of tasks. Both the General-ERfair and Stringent_ERfair algorithms are perfectly fair and their fairness value is 0 in all cases.
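Computing the measure from a per-slot lag trace is straightforward; the sketch below is our own illustration of the two formulas above.

    def miss(lag_value):
        # miss is the positive part of the lag at a given (task, time slot) pair
        return lag_value if lag_value > 0 else 0.0

    def avg_miss(lag_trace, n, tslot):
        # lag_trace: iterable of lag values, one per (task, time slot) pair.
        # Returns misses per time slot per task; e.g. 0.016 = 16 misses per 1000 slots.
        return sum(miss(l) for l in lag_trace) / (tslot * n)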

6.2.3. Performance under dynamic workloads

6.2.3.1. Experimental setup. Given the number of processors (m), the total number of tasks to be generated (n) and the required average system workload percentage (WL%), the task weights and periods (ei/pi of each task Ti) have been similarly generated from normal distributions. However, here the mean of the distribution for generating task weights has been taken to be (µ) = m/n. The number of instances of each periodic task has been generated from a Poisson distribution having mean (µ) = 20. After completion of execution of a task, it departs from the system and re-arrives after an interval. For a given task Ti, the lengths of these intervals have been generated from an exponential distribution having mean (µ) = 20 · pi · (1 − (WL/100)).

6.2.3.2. Results. We have calculated the migration ratios and the fairness results for different numbers of tasks, processors, workloads and frame sizes. Fig. 7(a) shows the plots for the migration ratios of General-ERfair with respect to POFBFS for different numbers of tasks and varying frame sizes on fully loaded systems (100% system load) consisting of 8 processors.


Fig. 6. Fairness of POFBFS for frame size 5n (n = no. of tasks). (a) 100 tasks; 2, 4 and 8 processors. (b) 25, 50 and 100 tasks; 8 processors.

Fig. 7. Migration ratio k (General-ERfair:POFBFS = k:1) and fairness of POFBFS under dynamic workloads for 25, 50 and 100 tasks on 8 processors. (a) Migration ratio k; 100% system load; 8 processors. (b) Fairness of POFBFS; 40% to 100% system load; 2n and 5n frame sizes.

presents the fairness plots for different task set sizes, varying workloads and frame sizes 2n and 5n on 8 processors.

6.3. Discussion

From the results obtained in the previous subsection, we can make the following important observations and inferences. From Figs. 2(b), 3(b) and 4(b), it may be observed that both the number of migrations per time slot and the migration ratios are almost independent of the number of tasks. However, for any given number of tasks, the number of migrations reduces and the migration ratio k increases as the frame size increases. The curve for the migration ratio with respect to frame size is almost linear and its slope typically varies between 1.2V and 1.27V (V denotes the number of processors) for the migration ratio General-ERfair:POFBFS and between 0.35V and 0.55V for the migration ratio Stringent_ERfair:POFBFS. Thus, the migration ratio (k) is given by: k ≈ C · V · X, where C is a constant (C(General-ERfair:POFBFS) ≈ 1.24 and C(Stringent_ERfair:POFBFS) ≈ 0.45) and X denotes the frame size factor (frame size G = X · n; n = number of tasks). For example, given a set of n = 100 tasks scheduled on V = 8 processors, the actual experimental value of the migration ratio (General-ERfair:POFBFS) obtained at frame size G = 8n = 800 is 80.06, while the value obtained using the above expression is 79.36. The migration ratios also decrease slightly with increasing workloads (as seen from Fig. 5). This is due to the fact that the number of migrating tasks generally tends to increase as system load increases.
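The empirical fit above is easy to apply in code. The following sketch (ours, not part of the paper's implementation) simply evaluates k ≈ C · V · X with the fitted constants reported above; the constants come from the paper's experiments, not from this code.

```python
# A minimal sketch of the empirical migration-ratio fit k ≈ C · V · X.
# C ≈ 1.24 for General-ERfair:POFBFS and C ≈ 0.45 for
# Stringent_ERfair:POFBFS, as reported in the text.

def estimated_migration_ratio(C, V, frame_size, n_tasks):
    """Estimate k = C * V * X, where X = frame_size / n_tasks."""
    X = frame_size / n_tasks
    return C * V * X

# Example from the text: n = 100, V = 8, G = 800 -> k ≈ 79.36
# (the experimentally observed value was 80.06).
print(estimated_migration_ratio(1.24, 8, 800, 100))  # 79.36
```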

Fairness degrades with increase in frame size. Frame sizes in the range of n to 10n provide high fairness values with 3 to 100 times reduction in the number of migrations suffered with respect to General-ERfair and 1 to 33 times reduction with respect to Stringent_ERfair. The fairness plots in Fig. 6 also show that POFBFS provides high fairness over a wide range of workloads, excepting only when the system becomes almost fully loaded. Fairness degrades, though, with an increase in the number of processors. This may be attributed to the fact that the expected number of migrations per frame increases with the number of processors m (there can be a maximum of m − 1 migrations per frame) and a migrating task within a given processor degrades the fairness properties of the other tasks in it.

From the performance results for dynamic workloads, it may be observed that the values for the migration ratios obtained in Fig. 7(a) are very similar to the plots obtained for static workloads in Fig. 3(b). However, the fairness results (Fig. 7(b)) for dynamic workloads are poorer compared to their static counterparts (Fig. 6(b)). This can be attributed to POFBFS's strategy of not allocating a newly arriving task in the middle of a frame to more than one processor. This may sometimes cause transient fairness loss for such a task when its allocated share (for the rest of the frame after its arrival) does not fit in its allotted processor. This fairness loss also degrades the overall system fairness.

6.3.1. POFBFS vs. General-ERfair—the speedup

We have seen from Theorem 6 that POFBFS has a complexity of O(max(m, lg n)) per time slot, which is lower than the typical scheduling complexity of ERfair (O(m · lg n)). Moreover, being a global algorithm, ERfair incurs the extra overhead of communicating to each processor the task it will execute at each time slot. On the other hand, POFBFS, being a partition oriented scheduler, communicates to each processor the tasks it will run in the next frame only at frame boundaries, and this incurs an overhead of O(m + n) on a system of m processors and n tasks. For ERfair this overhead amounts to O(m · G) over an interval of G time slots (G being the frame size). From this discussion, it may be


concluded that the scheduling overhead of ERfair is at least as large as the scheduling overhead of POFBFS.

Therefore, the reduction in migration overhead obtained by using POFBFS, when translated to the corresponding reduction obtained in terms of actual time, directly gives a measure of the speedup provided by POFBFS over the ERfair algorithm. However, the actual time gain also depends heavily on the overhead for a single migration on a given system and the size of the time slot. Realistic values for the migration overhead may typically vary from lower than 1 µs in closely coupled multi-core systems to more than 100 µs in loosely coupled multiprocessor systems. The size of a time slot may vary from 500 µs to 5 ms [32].

As an example of the obtainable time gain, let us consider a system of n = 100 tasks being scheduled on m = 8 processors, with the overhead for a single migration being 0.01 ms, a time slot size of 1 ms and a frame size of 10n. The system is 100% loaded. From Figs. 2(a) and 3(a), it may be observed that such a system, when scheduled using POFBFS, suffers 0.07 migrations per time slot and the migration ratio k is about 100 with respect to General-ERfair. So, after a period of, say, 100 s, POFBFS will incur (0.07 × 100 × 1000 =) 7 × 10³ migrations while General-ERfair will incur about 7 × 10⁵ migrations. The time gain after 100 s is given by: (7 × 10⁵ − 7 × 10³) × 0.01 ms = 6.93 s (approx.).
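For reference, the time-gain estimate above can be reproduced with a short sketch (ours; the numbers are simply the example figures quoted in the text, not a general model of migration overheads):

```python
# A minimal sketch reproducing the time-gain estimate above.

def migration_time_gain(migrations_per_slot, ratio_k, slots_per_sec,
                        duration_s, migration_overhead_ms):
    """Time (in seconds) saved by POFBFS over General-ERfair due to
    avoided migrations, over the given duration."""
    pofbfs_migrations = migrations_per_slot * slots_per_sec * duration_s
    erfair_migrations = pofbfs_migrations * ratio_k
    saved_migrations = erfair_migrations - pofbfs_migrations
    return saved_migrations * migration_overhead_ms / 1000.0

# Example from the text: 0.07 migrations/slot, k = 100, 1 ms slots,
# a 100 s horizon and 0.01 ms per migration -> about 6.93 s gained.
print(migration_time_gain(0.07, 100, 1000, 100, 0.01))  # 6.93
```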

6.3.2. POFBFS vs. General-ERfair—real-time fairness accuracy

Analytical results obtained in Section 5 show that POFBFS satisfies ERfairness at frame boundaries (Theorem 5) and also that the worst case value of the sum of the maximum under-allocations of the fixed tasks within a frame in a fully loaded processor is upper bounded by Gm/(4n) (Theorem 3). This worst case is obtained when all tasks in the system have equal priorities. The burden of this under-allocation is shared equally by all the fixed tasks in a processor. Thus, assuming that a processor contains x fixed tasks within a frame, the maximum under-allocation of a fixed task is given by Gm/(4nx). It is noteworthy here that POFBFS allows at most one migrating task within a frame per processor at a given time, and this task is never allowed to get under-allocated.

From the above discussion, it is clear that any subtask in a POFBFS scheduled system may miss its pseudo-deadline (and hence, also its task deadline) by at most Gm/(4nx) time slots. Taking n = 25, G = 5n = 125, m = 4 and x = ⌊25/4⌋ − 1 = 5 as an example system, it may be observed that this system will allow pseudo-deadline misses of at most 1 time slot. However, for most practical purposes, when all tasks do not have the same priority and the workload is even as high as 95%, a POFBFS scheduled system never allows pseudo-deadline misses.

The experimental results obtained in Section 6.2.2 aptly reflect this analytical observation. However, one aspect which may be important to consider is the actual real-time fairness of a POFBFS scheduled system obtained due to its speedup over the perfectly fair General-ERfair algorithm. This is not typically revealed by the experimental results. Following the discussion in Section 6.3.1, if a POFBFS scheduled system executes (say) 7 time slots more per 100 time slots than its General-ERfair scheduled counterpart (due to the slightly lower scheduling complexity and the task migrations saved) for the same input task set, POFBFS may allow pseudo-deadline tardiness of up to 7 time slots per 100 time slots of execution of a task and still remain perfectly fair in terms of real time. However, while this is indeed promising, an exact quantification of the same is left as future work.
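For concreteness, the following sketch (ours) merely evaluates the Gm/(4nx) expression from Theorem 3 for the example system above; the bound itself is established in the paper, not by this code.

```python
# A minimal sketch recomputing the worst-case under-allocation bound
# Gm/(4nx) discussed above.
from math import floor

def max_underallocation(G, m, n, x):
    """Worst-case under-allocation (in time slots) of a fixed task."""
    return (G * m) / (4 * n * x)

# Example from the text: n = 25, G = 5n = 125, m = 4,
# x = floor(25/4) - 1 = 5 -> a bound of 1 time slot.
n, m = 25, 4
G = 5 * n
x = floor(n / m) - 1
print(max_underallocation(G, m, n, x))  # 1.0
```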

6.3.3. Choice of frame size G

From the results and discussion presented above, we observe that, for a given fairness threshold, it is possible to estimate (based on previously profiled experimental data or theoretical worst case bounds) the maximum frame size G that will allow POFBFS to provide the required fairness when the workload (defined by its expected average system load and the nature of its dynamicity, that is, how frequently new tasks enter and exit the system) is known in advance. For example, let us assume that we are given a task set of size 50 to be scheduled on a 95% loaded system of 8 processors. Let the prescribed average fairness requirement be at most 0.025 misses per time slot. From Table 1 it may be observed that a frame size of 250 (G = 5n) will be able to maintain this fairness requirement.

However, if the fairness bound is to be strictly maintained, two alternative approaches may be adopted. The first is to switch to completely global ERfair scheduling whenever the fairness degrades beyond a given threshold. The second approach would be to continue with the partition based POFBFS algorithm, but with a lower frame size to correct the fairness aberration. Therefore, the scheduler may check the system fairness at the end of each frame and switch to a lower frame size whenever the fairness threshold is violated. The new frame size may be based on previously profiled experimental results. Such a dynamic change in frame size will, however, incur an extra overhead of O(n) to regenerate the share values and the next execution frame for each task.
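The second approach lends itself to a simple frame-boundary check. The following is a minimal sketch (ours, with an assumed shrink factor and minimum frame size; the paper instead suggests deriving the new size from profiled data):

```python
# A minimal sketch of the second approach above: at each frame boundary,
# check the measured fairness and shrink the frame size when the
# threshold is violated. The shrink factor and the minimum frame size
# are assumed placeholders.

def adapt_frame_size(current_G, measured_avg_miss, miss_threshold,
                     n_tasks, shrink_factor=0.5):
    """Return the frame size to use for the next frame."""
    if measured_avg_miss <= miss_threshold:
        return current_G                     # fairness is acceptable
    # Fairness violated: reduce the frame size, but never below n
    # (frame sizes in the range n..10n were found to work well).
    return max(n_tasks, int(current_G * shrink_factor))
```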

7. Conclusions

In this paper, we presented a partition based proportional fair scheduling algorithm called POFBFS which restricts inter-processor task migrations using a periodic global synchronization/repartitioning strategy. The length of this period is called a frame. While partitioning helps POFBFS to drastically reduce the number of task migrations as compared to both the General and Stringent versions of the ERfair algorithm, global synchronization allows it to maintain high fairness accuracy (excepting the case when the system is almost fully loaded). Analytical results show that POFBFS can provide high proportional fairness accuracy and a bounded number of migrations while simultaneously maintaining the same order of complexity as ERfair. Experimental results reveal that POFBFS is able to achieve 3 to 100 times reduction in the number of migrations suffered with respect to the General-ERfair algorithm (for a set of 25 to 100 tasks running on 2 to 8 processors) while simultaneously maintaining high fairness accuracy. We have designed, implemented, and evaluated the POFBFS algorithm, and the simulation results are promising.

Acknowledgments

We thank the reviewers for their comments and suggestions. Arnab Sarkar was supported by the Microsoft Research India Ph.D. Fellowship Award.

References

[1] J. Anderson, V. Bud, U.C. Devi, An EDF-based scheduling algorithm for multiprocessor soft real-time systems, in: Proceedings of the 17th Euromicro Conference on Real-Time Systems, ECRTS'05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 199–208.

[2] J. Anderson, A. Srinivasan, Early-release fair scheduling, in: Proceedings of the 12th Euromicro Conference on Real-Time Systems, 2000.

[3] J. Anderson, A. Srinivasan, Mixed Pfair/ERfair scheduling of asynchronous periodic tasks, Journal of Computer and System Sciences 68 (1) (2004) 157–204.

[4] B. Andersson, J. Jonsson, The utilization bounds of partitioned and pfair static-priority scheduling on multiprocessors are 50%, in: Proceedings of the 15th Euromicro Conference on Real-Time Systems, 2003.

[5] S. Baruah, J. Carpenter, Multiprocessor fixed-priority scheduling with restricted interprocessor migrations, Journal of Embedded Computing 1 (2) (2005) 169–178.

[6] S. Baruah, N. Cohen, C. Plaxton, D. Varvel, Proportionate progress: a notion of fairness in resource allocation, Algorithmica 15 (6) (1996) 600–625.

[7] S. Baruah, N. Fisher, The partitioned multiprocessor scheduling of deadline-constrained sporadic task systems, IEEE Transactions on Computers 55 (7) (2006) 918–923.


[8] S. Baruah, J. Gehrke, C. Plaxton, Fast scheduling of periodic tasks on multiple resources, in: Proceedings of the 9th International Parallel Processing Symposium, 1995.

[9] A. Block, J. Anderson, Accuracy versus migration overhead in real-time multiprocessor reweighting algorithms, in: Proceedings of the 12th International Conference on Parallel and Distributed Systems, IEEE Computer Society, Washington, DC, USA, 2006, pp. 355–364.

[10] J. Carpenter, S. Funk, P. Holman, A. Srinivasan, J. Anderson, S. Baruah, A categorization of real-time multiprocessor scheduling problems and algorithms, URL: citeseer.ist.psu.edu/601206.html.

[11] S. Funk, J. Goossens, S. Baruah, On-line scheduling on uniform multiprocessors, in: Proceedings of the 22nd IEEE Real-Time Systems Symposium, 2001.

[12] S. Harizopoulos, A. Ailamaki, Affinity scheduling in staged server architectures, March 2002, URL: citeseer.ist.psu.edu/harizopoulos02affinity.html.

[13] K. Jeffay, S. Goddard, A theory of rate-based execution, in: Proceedings of the 20th IEEE Real-Time Systems Symposium, 1999.

[14] K. Jeffay, S. Goddard, Rate-based resource allocation models for embedded systems, Lecture Notes in Computer Science 2211 (2001) 204–222.

[15] T. Kimbrel, B. Schieber, M. Sviridenko, Minimizing migrations in fair multiprocessor scheduling of persistent tasks, Journal of Scheduling 9 (4) (2006) 365–379. URL: http://dx.doi.org/10.1007/s10951-006-7040-0.

[16] Y. Kwok, I. Ahmad, Dynamic critical-path scheduling: an effective technique for allocating task graphs onto multiprocessors, IEEE Transactions on Parallel and Distributed Systems 7 (5) (1996) 506–521.

[17] Y. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Computing Surveys 31 (4) (1999) 406–471.

[18] X. Liu, S. Goddard, Supporting dynamic QoS in Linux, in: Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS'04, IEEE Computer Society, Washington, DC, USA, 2004, pp. 246–254.

[19] X. Liu, S. Goddard, Scheduling legacy multimedia applications, Journal of Systems and Software 75 (3) (2005) 319–328.

[20] C.L. Liu, J.W. Layland, Scheduling algorithms for multiprogramming in a hard-real-time environment, Journal of the ACM 20 (1) (1973) 46–61.

[21] J. Lopez, M. Garcia, J. Diaz, D. Garcia, Worst-case utilization bound for EDF scheduling on real-time multiprocessor systems, in: Proceedings of the 12th Euromicro Conference on Real-Time Systems, 2000.

[22] J. Malkevitch, Bin packing and machine scheduling.

[23] M. Moir, S. Ramamurthy, Pfair scheduling of fixed and migrating periodic tasks on multiple resources, in: Proceedings of the 20th IEEE Real-Time Systems Symposium, 1999.

[24] D. Niz, R. Rajkumar, Partitioning bin-packing algorithms for distributed real-time systems, International Journal of Embedded Systems 2 (3–4) (2006) 196–208.

[25] R.G.A. Parekh, A generalized processor sharing approach to flow-control in integrated services networks: the single-node case, IEEE/ACM Transactions on Networking 1 (3) (1993) 344–357.

[26] C. Phillips, C. Stein, E. Torng, J. Wein, Optimal time-critical scheduling via resource augmentation, in: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.

[27] S. Ramabhadran, J. Pasquale, Stratified round robin: a low complexity packet scheduler with bandwidth fairness and bounded delay, in: ACM SIGCOMM, 2003.

[28] J. Regehr, M. Jones, J. Stankovic, Operating system support for multimedia: the programming model matters, Tech. Rep. MSR-TR-2000-89, September 2000.

[29] C. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. Lam, M. Rosenblum, Optimizing the migration of virtual computers, ACM SIGOPS Operating Systems Review 36 (SI) (2002) 377–390.

[30] A. Sarkar, P. Chakrabarti, R. Kumar, Frame-based proportional round-robin, IEEE Transactions on Computers 55 (9) (2006) 1121–1129.

[31] A. Srinivasan, J. Anderson, Fair scheduling of dynamic task systems on multiprocessors, in: Parallel and Distributed Real-Time Systems, Journal of Systems and Software 77 (1) (2005) 67–80 (special issue).

[32] A. Srinivasan, P. Holman, J. Anderson, The case for fair multiprocessor scheduling, in: Proceedings of the 11th International Workshop on Parallel and Distributed Real-Time Systems, Nice, France, 2003.

[33] I. Stoica, H. Abdel-Wahab, K. Jeffay, S. Baruah, J. Gehrke, C. Plaxton, A proportional share resource allocation algorithm for real-time, time-shared systems, in: Proc. of the 17th IEEE Real-Time Systems Symposium, 1996.

Arnab Sarkar received the B.Sc. degree in Computer Science in 2000 and the B.Tech. degree in Information Technology in 2003 from the University of Calcutta, Kolkata, India. He received the M.S. degree in Computer Science and Engineering from the Indian Institute of Technology (IIT), Kharagpur, India in 2006 and is currently pursuing his Ph.D. at the same institute. He received the National Doctoral Fellowship (NDF) from AICTE, Ministry of HRD, Govt. of India, in 2006 and the MSR India Ph.D. fellowship from Microsoft Research Lab India, in 2007. He is currently pursuing his research as a Microsoft Research Fellow. His current research interests include real-time scheduling, system software for embedded systems and computer architectures.

P.P. Chakrabarti received the B.Tech. and Ph.D. degrees in Computer Science and Engineering from the Indian Institute of Technology (IIT), Kharagpur, in 1985 and 1988, respectively. He joined the Department of Computer Science and Engineering, IIT, as a faculty member in 1988 and is currently a professor in the Computer Science and Engineering Department, where he currently holds the position of dean (Sponsored Research and Industrial Consultancy) and where he was the professor in charge of the state-of-the-art VLSI Design Laboratory. He has published more than 100 papers and collaborated with a number of world-class companies. His areas of interest include artificial intelligence, CAD for VLSI, and algorithm design. He received the President of India Gold Medal, the Swarnajayanti Fellowship, and the Shanti Swarup Bhatnagar Prize from the Government of India for his contributions.

Sujoy Ghose received the B.Tech. degree in Electronics and Electrical Communication Engineering from the Indian Institute of Technology, Kharagpur, in 1976, the M.S. degree from Rutgers University, Piscataway, NJ, and the Ph.D. degree in Computer Science and Engineering from the Indian Institute of Technology. He is currently a Professor in the Department of Computer Science and Engineering, Indian Institute of Technology. His research interests include design of algorithms, artificial intelligence, and computer networks.