Evolvable Multi-processor: A Novel MPSoC Architecture With Evolvable Task Decomposition and Scheduling



    Published in IET Computers & Digital Techniques

    Received on 21st September 2008

    Revised on 26th February 2009

    doi: 10.1049/iet-cdt.2008.0120

    ISSN 1751-8601

Evolvable multi-processor: a novel MPSoC architecture with evolvable task decomposition and scheduling

S. Vakili, S.M. Fakhraie, S. Mohammadi
Silicon Intelligence Lab, School of ECE, University of Tehran, Tehran, Iran
E-mail: [email protected]

Abstract: The multi-processor system-on-chip (MPSoC) approach is an emerging trend for designing high-performance computational systems. This trend faces some restrictive challenges in hardware and software development. This paper presents a novel MPSoC system which tries to overcome some of these major challenges using new architectural techniques. The main novelty of this system is the accomplishment of both task decomposition and scheduling at run-time, using hardware units. Hence, parallel programming or compile-time parallelisation is not required in this system, and thus it can directly and efficiently execute single-processor sequential programmes. The system utilises evolutionary algorithms to perform the decomposition and scheduling operations, and is therefore called the evolvable multi-processor (EvoMP) system. This approach finds an efficient scheme to distribute the different segments of the running application among the available computational resources. Such dynamic adaptability is also beneficial in achieving advantageous features like low-cost fault tolerance. This paper presents the operational and architectural details, improvements, constraints, and experimental results of the EvoMP.

    1 Introduction

Contemporary digital design methods can be classified into three major orientations: pure hardware, reconfigurable hardware and microprocessor-based designs. Flexibility, simplicity of development, and the short design time of microprocessor-based solutions have made them the most popular and widely applied design method. On the other hand, low performance is the main disadvantage of general-purpose microprocessors in comparison with the other approaches. The introduction of a large variety of techniques and architectures in the last two decades has led to great improvements in performance and a proportional growth in the hardware complexity of processors. However, these improvements seem to have saturated in recent years. The sequential essence of conventional processors and their software is one of the most restrictive constraints preventing parallel execution of code. Although some architectural techniques such as very long instruction word (VLIW) have tried to address this issue, they could not meet the increasing demands on processing power.

The multi-processor approach is one of the most remarkable trends in designing new high-performance computational systems [1]. The emerging MPSoC design field demonstrates the implied multi-processor orientation in embedded systems and system-on-chip (SoC) devices. MPSoC is a processor-centric solution, and therefore most of the desirable advantages of uniprocessors, such as short time-to-market, post-fabrication reusability, flexibility and programmability, are also achievable in MPSoC designs [2, 3]. However, moving from single-processor to multi-processor systems is accompanied by many challenges in hardware and software development. The most complicated design challenge in MPSoC systems is software development, due to the sequential nature of conventional programming models. Software developers have been trained for many decades to think about programs sequentially. However, multi-processor systems require concurrent software whose execution can be distributed among different processors. Approximately all existing software is developed using classical sequential

IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 2, pp. 143–156

    doi: 10.1049/iet-cdt.2008.0120 & The Institution of Engineering and Technology 2010

    www.ietdl.org


models [2]. Thus, in order to execute these programs in a multi-processor environment, they must first be converted to concurrent ones. In recent years, some research has focused on compile-time techniques, which aim to perform this conversion automatically. Nevertheless, programming with parallel models is still the most commonly used approach to achieve concurrent executable software. Parallel programming models are often supported by standard application programming interface libraries, such as MPI and OpenMP [4, 5]. But reprogramming all existing software for future MP systems requires a huge amount of investment. Furthermore, writing efficient programs using these libraries is much more complicated than classical sequential programming. There are two necessary activities for concurrent software generation: decomposition of the program into tasks, and scheduling of these tasks among the cooperating processors in the system. Both task decomposition and scheduling are NP-complete (non-deterministic polynomial time complete) problems and major issues for concurrent execution. Optimal decomposition of an application described in a serial manner into a set of concurrent tasks is a very difficult job, and there are still very few applications which can be decomposed automatically, despite many years of research in this field [2].

Another complicated challenge in such systems is task scheduling. All task scheduling mechanisms can be divided into static (programming-time or compile-time) and dynamic (run-time) categories [4]. The static scheduling approach can potentially find more optimal solutions than dynamic scheduling, because the programmer or compiler can see the entire application and can also compare different solutions, whereas a dynamic scheduler must make its decision instantaneously, according to the available resources and pending tasks. On the other hand, the number of computational resources must be constant and predetermined in static scheduling. This means that the approach results in non-scalable software, whereas dynamic scheduling systems do not face such constraints [6–9].

Synchronisation of processing elements is another important issue in MPSoC design. Data dependencies between different tasks necessitate inter-processor communication [2]. These communications must be managed by an appropriate controlling mechanism. In static scheduling systems, control and synchronisation information is embedded in the software. In dynamic scheduling systems, a dedicated scheduler unit (which can be implemented in hardware, middleware or the operating system) usually performs these activities. Debugging, security and the lack of a design methodology for on-chip interconnection networks are other challenges facing MPSoC designs [1–3].

The remarkable advantages of adaptive and dynamic MPSoC systems have motivated many researchers to work on novel architectures and techniques for the development of such systems in recent years [10–14].

This paper introduces a novel homogeneous MPSoC architecture which can perform all activities necessary for the parallel execution of a program dynamically, in hardware. In other words, this system accomplishes task decomposition and scheduling, and also addresses data dependency requirements, at run-time through hardware mechanisms. An evolutionary algorithm (EA) hardware core is exploited to perform both task decomposition and scheduling, and therefore this system is called the evolvable multi-processor (EvoMP) system. Hence, existing classical sequential software can be effectively partitioned and mapped onto different processors automatically. One of the main goals of these novelties is to find a solution that avoids the huge investment required for reprogramming existing software. Furthermore, all controlling and synchronisation operations are distributed among the processing elements. Run-time parallelisation gives this system adaptability and flexibility. Low-cost fault tolerance is another advantageous feature of EvoMP that benefits from its adaptability.

The presented version of the EvoMP uses a 2-D mesh topology and utilises a network-on-chip (NoC) for interconnections. The size of each dimension can simply be set by a configurable parameter. The EvoMP uses a shared data memory; accesses to this memory are also accomplished via the NoC. This paper presents the primary version of EvoMP, which uses a genetic algorithm (GA). For this purpose, a custom hardware core is designed and exploited for the GA computations. In [15–17], GA is also used for task scheduling, but at compile time and in static scheduling systems. The EvoMP system is designed and implemented in RT-level VHDL.

Subsequent sections of this paper are organised as follows. Section 2 describes the EvoMP system architecture and some of its major constituent units. The architecture of each processor is explained in Section 3; this section also clarifies the principles of operation of the entire system. The scope of the work and experimental results are presented in Section 4, and finally Section 5 contains the conclusions and future work.

2 EvoMP system architecture

The EvoMP utilises a GA core to perform both task decomposition and scheduling simultaneously at run time.

The genetic core generates an encoded bit string (chromosome) that contains the decomposition and scheduling information; that is, this bit string determines the processor which is in charge of executing each instruction in the programme. These data are received and used by all cooperating processors. The top view of the EvoMP system is shown in Fig. 1a.

Evolutionary strategies obviously require enough time to evolve. Therefore this system can be used efficiently for iterative programs, like DSP applications, that perform constant computations on different data. When this system


starts to execute such a program, the genetic core generates random data that result in a random decomposition and scheduling of instructions among the processors. When all processors reach the end of the iteration, the genetic core looks at a dedicated counter which counts clock cycles. At the end of an iteration, the output of this counter gives the number of clock cycles taken to execute the entire recent iteration. This value is used as the fitness value for the corresponding chromosome generated by the genetic core. Then the counter is reset, the genetic core generates the next chromosome, and the system starts execution of the next iteration with a new parallelisation scheme. After a few initial random chromosomes (the first population), the genetic core goes to the evolution state, in which new chromosomes are generated by recombination of the best solutions found in previous generations and random data. This process is repeated until the core finds an appropriate solution for task decomposition and scheduling. Thereafter, the genetic core goes from the evolution to the termination state, where the best found solution is used as the constant output of the genetic core. Hence, the EvoMP does not require prior information about the task decomposition and scheduling of the target program, but it needs a primary evolution time to find the best decomposition and scheduling solutions according to the number of available computational resources and the running program.
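The behaviour just described amounts to a three-state machine: initialise (random chromosomes), evolution, and termination. The following Python sketch captures the state transitions described in the text; the state and signal names are our own labels, not the authors' VHDL identifiers.

```python
def next_state(state, first_pop_done=False, converged=False,
               fault_detected=False):
    """Next-state function of the genetic core: random chromosomes
    until the first population has been evaluated, then evolution
    until an acceptable schedule is found, then a fixed schedule
    (re-entering evolution if a processor becomes faulty)."""
    if state == 'INITIALISE':
        return 'EVOLUTION' if first_pop_done else 'INITIALISE'
    if state == 'EVOLUTION':
        return 'TERMINATION' if converged else 'EVOLUTION'
    if state == 'TERMINATION':
        return 'EVOLUTION' if fault_detected else 'TERMINATION'
    raise ValueError(state)
```

The transition back from termination to evolution models the fault-tolerance behaviour described in Section 2.5.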

The most important remaining problem is data dependency. When the program is divided into tasks which must be executed on different processors, the data dependency requirements between these tasks must be met by appropriate inter-processor communications. Control of these communications is fully distributed in EvoMP; that is, each processor automatically detects and sends its own data to all processors that need them. The functionality of this system is tightly coupled with the following key point:

All processors have one dedicated copy of the program in their internal instruction memory.

Accordingly, all processors have enough information to recognise the processor on which each instruction must be executed. Hence, they can detect any dependency on their locally computed data, and no request-response scheme is required. The architectural details of the mechanism utilised to meet the dependency requirements are presented in Section 3.

2.1 Inter-processor communication scheme

    NoC is an advanced SoC interconnection architecture.Enhanced performance and scalability are the main

    Figure 1 EvoMP architecture

a Overview of EvoMP machine
b NoC switch architecture


advantages of NoC in comparison with previous communication architectures (e.g. shared buses, or segmented buses with bridges) [18–21]. The EvoMP system exploits a custom-designed NoC with a simple XY routing algorithm.
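XY routing is dimension-ordered: a packet first corrects its column offset, then its row offset, which makes the route deterministic and deadlock-free on a mesh. A minimal sketch (the port names and coordinate convention are our assumptions, not the paper's):

```python
def xy_route(cur, dst):
    """Dimension-ordered XY routing on a 2-D mesh.  `cur` and `dst`
    are (x, y) switch coordinates; returns the output port to take
    from the switch at `cur`, or 'LOCAL' once the packet arrives."""
    cx, cy = cur
    dx, dy = dst
    if dx != cx:                          # correct the X offset first
        return 'EAST' if dx > cx else 'WEST'
    if dy != cy:                          # then the Y offset
        return 'NORTH' if dy > cy else 'SOUTH'
    return 'LOCAL'                        # packet has arrived
```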

The architectural overview of the designed switch is shown in Fig. 1b. This architecture utilises the globally asynchronous, locally synchronous approach to prevent probable hazards caused by clock skew. The data output port of the overall system is connected to the memory management unit (MMU in Fig. 1a), and the output values are sent to this port through the NoC. The bit length of the flits (i.e. the bit length of the communication links between switches) is one of the configurable parameters of the EvoMP. As shown in Fig. 1b, a shared bus is used for communication between input and output ports in the NoC switch, in order to reduce the hardware area of the switches. The highest-priority input module that contains a new packet obtains control of the bus and holds it until all flits of the current packet have been sent to the destination output port. Simulations have confirmed that the throughput of this switch architecture properly meets the throughput requirements of the system.

2.2 Encoded decomposition and scheduling data format

Decomposition and scheduling data generated by the genetic core consist of scheduling words. Each word determines the processor which is in charge of executing a number of successive instructions in the program. A scheduling word consists of Proc_Addr and Instr_num fields. Proc_Addr indicates the target processor address, and Instr_num specifies the number of instructions which must be executed on it. The maximum number of instructions scheduled by one such word (i.e. the maximum number of instructions in an individual task) depends on the bit length of the Instr_num field. This bit length is also a configurable parameter of the EvoMP system. Fig. 2 illustrates sample scheduling data for the 2-tap finite impulse response (FIR) filter program and the corresponding parallelisation scheme.

    2.3 Memory organisation

All multi-processor systems can be divided into two categories: shared and distributed data memory systems. A comprehensive comparison between these two approaches can be found in [1, 4]. The EvoMP system utilises the shared memory scheme. Memory accesses are performed through the NoC, to help the scalability of the system. For this purpose, address 00 in the mesh (as shown in Fig. 1a) is dedicated to the data memory and the MMU, and no processing or computational circuit exists in this unit. As stated earlier, the mesh size (number of contributing processors) is configurable in this system. The only address valid in all mesh sizes is 00, and therefore this position is selected for the data memory. The MMU also has an internal instruction memory and a scheduling-data FIFO (first-in first-out memory). This unit reads the instruction memory in order to find Load and Store instructions. It then reads the data word addressed by a Load instruction from the data memory and sends it to the processor responsible for that instruction. For a Store instruction, the processor sends the data and the write address to the MMU in a packet. Only Store instruction packets are received by the MMU; it manages and then writes the received data to the memory. In the data management phase, read and write addresses must be compared in order to prevent data dependency hazards, including read-after-write, write-after-write and write-after-read.

If the size of the mesh increases, the traffic near the MMU NoC switch becomes a performance bottleneck. Our simulations show that this limiting issue appears only when the system is large enough (often at 4 × 4 or greater sizes, depending on the application). This is the main issue for the scalability of the current system, and we hope to eliminate it by designing a distributed-physical-memory version of the EvoMP in the near future.

    2.4 Genetic core architecture

EAs are population-based metaheuristic search methods inspired by biological evolution mechanisms in living organisms. Candidate solutions are usually encoded as a vector of numbers (commonly binary). Each population comprises a fixed number of candidates. The fitness function quantitatively evaluates the suitability of each candidate. The promising candidates are selected and kept for subsequent generations, and poor ones are weeded out. GA is the most popular EA and is the one used in the presented version of the EvoMP system. In GA, the candidate solutions are called chromosomes. The first generation of solutions is chosen randomly. After fitness evaluation, the elites are selected. In subsequent generations, the chromosomes are created through a process of selection and recombination.

    Figure 2 Sample of scheduling data (chromosome) for 2-tap FIR filter program and corresponding execution scheme


Recombination (crossover) operators merge the information contained within pairs of selected parents by placing random subsets of the information from both parents into their respective positions in a child. Due to the random factors involved in producing child chromosomes, the children may or may not have higher fitness values than their parents. In this way, the generations gradually move towards better regions of the search space [22].

As mentioned earlier, a hardware genetic core is designed and exploited to perform the decomposition and scheduling activities in the EvoMP. The genetic core hardware area must be considered an overhead in this system; therefore we have tried to design a low-complexity architecture for this core. Each chromosome consists of a number of decomposition and scheduling words that must be sent to the processors. Fig. 3a shows the internal architecture of the designed core. A chromosome memory is used to store all chromosomes of each population. When the system starts to work, the genetic core is in the initialise state, in which the output words are generated randomly using a dedicated linear feedback shift register (LFSR). These words are generated and distributed in successive clock cycles at the beginning of each new iteration. The Instr_num fields of these words are accumulated, and the result is compared with the program length stored in a register. When the accumulator value exceeds the program length, the chromosome is complete, because all instructions have been scheduled, and the core stops generating new words. Note that all output words are also stored in the chromosome memory. The core then waits for the

End_Loop signal. This signal is activated only when all contributing processors have reached the end of an iteration. The genetic core then looks at the Clock-Count counter, which shows the number of clock cycles spent executing the completed iteration. This value is used as the fitness value of the corresponding chromosome. Two storage components (called Best_Chrom in Fig. 3a) always store the fitness value and the start memory address of the two best chromosomes found (i.e. the elite count is constant and equal to two in this core). The Clock-Count is then reset to zero for the next iteration.

When the first population is completed, the genetic core goes to the evolution state, in which the new population is generated by the recombination operators in the crossover module. The internal architecture of this module is depicted in Fig. 3b. In the recombination state, new chromosomes can be produced in one of the following three ways:

1. by crossover between the two best chromosomes found (elites),

2. by crossover between one of the best chromosomes and random data generated by the LFSR,

3. as a pure random chromosome generated by the LFSR.

There are some configurable parameters in this core, including the population size (number of chromosomes in a population) and the number of chromosomes that must be generated by each of the above approaches in each population. These parameters affect the evolution speed. The evolution process continues until the termination condition is met.

The core state then changes to the termination state, in which the best chromosome obtained in the evolution phase is permanently used for decomposition and scheduling. Thus, the execution time of every iteration is constant in this state. The termination condition can be one of the following options:

1. the best achieved chromosomes have not changed for a predetermined number of populations,

Figure 3 Internal architecture of
a Genetic core
b Crossover unit


    2. a solution is found that satisfies predetermined criteria,

3. a fixed number of generations is reached.

The first option is exploited in the current version of the EvoMP. Note that the execution time of each iteration varies during the evolution phase. Thus, this phase is unsuitable for applications in which the response time is crucially bounded. On the other hand, more evolution time obviously leads to better solutions and better results.

The genetic core remains in the termination state as long as there is no faulty processor. When a fault is detected in a processor, no task will be assigned to it in future iterations; in fact, it is eliminated from future computation. Therefore the genetic core returns to the evolution phase to find an appropriate solution for the new situation. An online built-in self-test technique is used for fault detection in the processors.

There are sometimes invalid processor addresses in this system. For example, assume that a 2 × 3 mesh of processors is instantiated. At least three bits are needed to address these processors (one bit for rows and two bits for columns). But as there are only three columns, two addresses (i.e. 111 and 011) are invalid. Hence, these invalid addresses must not appear in the Proc_Addr field of the output words of the genetic core. The Convert_Address unit (shown in Fig. 3a) is used to map invalid addresses to valid ones. A simple and scalable logic is used for this address conversion. This unit also contains an address mapping table, which plays an important role in the fault tolerance scheme utilised in the EvoMP system.

    2.5 Fault tolerance scheme

Low-cost fault tolerance is one of the features of the EvoMP that benefit from its adaptability. Different fault tolerance techniques have been proposed in the literature for various types of MP systems. These mechanisms can be divided into two major categories. The first category consists of fault tolerance techniques for homogeneous reconfigurable hardware architectures and static-scheduling multi-processor systems, in which the number of operational processing elements (PEs) and their tasks are predetermined. Thus, fault tolerance can only be achieved with dedicated spare PEs; faulty PEs are replaced by spare ones, which take over their assigned tasks [23]. For example, [24] uses spare modules (molecules, which are simple reconfigurable hardware units) in each computational cell of the Embryonics project. Triple modular redundancy (TMR) and similar redundancy-based schemes are also used for fault tolerance in the bio-inspired POEtic tissue project [25].

The second category of these mechanisms is focused on dynamic task scheduling systems, in which the tasks are distributed among all operational resources, and whenever a PE becomes faulty, the scheduler stops assigning new tasks to it. A major issue in such systems is the existence of centralised scheduler and controlling units, because if a fault occurs in one of these components, the entire system fails [26, 27].

The EvoMP system does not use redundant spare hardware for fault tolerance; instead, it adapts itself to the available resources dynamically, that is, it always utilises all available processors in the system. All processors are of the same degree of importance in this system. This significant advantage plays an important role in the utilised fault tolerance scheme. As stated earlier, there is an address mapping table in the output stage of the genetic core that maps input addresses to the addresses determined in the table. When a processor becomes faulty, the contents of this table are changed so that the faulty processor address is replaced by a random valid one. Accordingly, no instruction will be assigned to the faulty processor in subsequent iterations. The genetic core also returns to the evolution state to find an appropriate solution for the new situation. In this way, the system adapts its decomposition and scheduling scheme to the available computational resources, paying a rather small time penalty for re-evolution. Thus, EvoMP is a gracefully degradable system, which can continue execution even with a single healthy processor.

The genetic core and the centralised memory are the only centralised controlling units in this system. These units are much smaller than the computational section, comprising the processors and the NoC; therefore, simple redundant hardware techniques, such as TMR, seem suitable to make them fault tolerant. Furthermore, note that the genetic core is important only in the evolution phase. Therefore, if it becomes faulty in the termination state, the system will continue to work without any change. Even if it becomes faulty before the evolution phase, making evolution impossible, the system will still execute the program, albeit inefficiently.

    3 Architecture of each processor

This section describes the internal architecture of each processor in the EvoMP system. The main feature of these processors that distinguishes them from conventional ones is their capability for automatic data dependency detection and transmission of the corresponding data.

As shown in Fig. 4, this architecture is a multi-functional-unit (multi-FU) design. A shared-bus scheme is used for data communication between the different FUs. The number and types of FUs can vary. Hence, a new instruction can easily be added to this architecture by designing the required hardware and exploiting it as a new FU; the communication scheme of this new FU must obviously be compatible with the others. Before studying the architecture, the dedicated machine code style of the EvoMP system must be considered.


    3.1 EvoMP machine code style

Run-time detection of data dependencies was the most complicated challenge in this work. It is achieved by a combination of a special machine code style designed for the EvoMP and some architectural techniques. In the EvoMP machine code style, each line of the program has a line number called its ID. When an instruction requires a register as a source operand, the line number of the most recent instruction which modified this register is used instead of the register number. Assume that the following three instructions are a segment of a sample program. The left-side numbers represent the line number (ID) of each instruction.

10. ADD R1, R2, R3 ; R3 = R1 + R2

11. AND R2, R6, R7 ; R7 = R2 & R6

12. SUB R7, R3, R4 ; R4 = R7 - R3

Accordingly, the R7 and R3 operands in the above code must be replaced by 11 and 10, respectively. Thus, the SUB instruction is converted to the following line of EvoMP machine code:

    12. SUB (11), (10), R4

The processor in charge of executing this instruction requests these IDs as operands. If they have also been computed on this processor, they will be found in the register file; otherwise, the corresponding processors detect the dependency and send them, along with their IDs, to this processor. The ID number is also stored in the register file; in the above example, 12 is saved in a dedicated position in register R4. The word length of the ID numbers must be large enough to identify all instructions of the program.

The only remaining problem is data dependency between successive iterations. This problem is solved by adding another field to the ID numbers. This field specifies the iteration number and acts just like an iteration counter. The bit length of this field depends on the maximum distance between dependent instructions. In our experimental applications, one bit is sufficient, because two dependent instructions are at most one iteration apart (i.e. they are in the same or in two successive iterations). Fig. 5a shows the assembly program of the 2-tap FIR filter. The data dependencies are illustrated by arrows in this figure; the only inter-iteration dependency is distinguished by a solid downward arrow. Fig. 5b shows the same program after applying the required changes (EvoMP code style). The ID of each instruction is equal to its address in the instruction memory, except for the initialisation section of the program, for which the IDs are specified in the header section of the code.

3.2 More detailed operational view of each processor

The internal architecture of each processor is represented in Fig. 4. The Fetch_Issue unit has access both to the instruction memory and to the decomposition and scheduling data generated by the genetic core. This unit can determine the processor

Figure 4 Internal block diagram of each EvoMP processor

Figure 5 2-tap FIR filter assembly code in
a Regular style
b EvoMP style


which is in charge of executing each instruction. Local instructions (those that must be executed on this processor) appear on the Instr bus to be received and executed by an FU. When the highest-priority non-busy FU observes an executable instruction on the Instr bus, it sends a signal through the shared bus to the other FUs and the Fetch_Issue unit to inform them that the current instruction has been received. The Fetch_Issue unit reads the next instruction when it receives this signal. A token-ring technique specifies which FU must receive the pending instruction when more than one non-busy FU exists. Both operand IDs on the Instr bus are checked in the register file module. The corresponding data value of each existing ID is put on the R1_Data or R2_Data bus, and the receiving FU stores this value as an operand. All of these operations are performed by combinational circuits. If an operand is not found in the register file, the FU receives the instruction, but the position of this operand remains empty and the FU does not start the computation until the operand is received through Extra_Bus. Two situations may cause an operand to be unavailable: (i) another processor possesses this operand and has not sent it yet, or (ii) another local FU is still computing it. In both cases, the required value will eventually appear on Extra_Bus and the pending FU grabs it immediately.
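The combinational operand check can be paraphrased behaviourally as follows; the dictionary-based register representation is only an illustration, not the actual VHDL structure.

```python
def lookup_operand(register_file, operand_id):
    """Combinational operand lookup: if some valid register module holds
    the requested ID, its value is forwarded (as on the R1_Data/R2_Data
    bus); otherwise the FU keeps the operand slot empty and waits for
    the value to arrive on Extra_Bus."""
    for reg in register_file:
        if reg["valid"] and reg["id"] == operand_id:
            return reg["value"]
    return None  # pending: the operand will arrive later on Extra_Bus

regs = [
    {"valid": True, "id": 12, "value": 7},
    {"valid": False, "id": 3, "value": 0},  # invalidated entry is ignored
]
found = lookup_operand(regs, 12)     # forwarded immediately
pending = lookup_operand(regs, 3)    # invalid entry: FU must wait
```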

This architecture also supports an in-order-issue, out-of-order-execution scheme: instructions appear on the Instr bus in the same order as in the program, but the execution times of different instructions may vary, so the result of an instruction may be computed before that of a prior instruction. All types of data dependencies are handled by appropriate hardware mechanisms in this architecture. The register file unit contains 15 register modules (R1–R15), each of which contains one register to store the ID and one register to store the value.

The Fetch_Issue unit exploits a dual-read-port instruction memory. The second output port is connected to the Invalidate_Instr bus (Fig. 4). This bus is used for invalidation of register contents and for detection of other processors' data dependencies on local register data. All instructions (local and non-local) in the program appear on this bus one by one. Register modules monitor the destination field and ID of the Invalidate_Instr bus. When the destination register is occupied and the ID stored in this register is smaller than the Invalidate_Instr bus ID field, the register contents are invalidated, because a posterior instruction that has this register as its destination has been met, so the prior value of the register is useless hereafter. Invalidate_Instr also carries the two operand IDs of the corresponding instruction. These IDs are used to detect other processors' dependencies on the local data: all valid register modules compare these IDs with their own, and on a match the corresponding value is put on the Send1_Data or Send2_Data bus and the NoC interface module sends it to the appropriate processor. Note that instructions on Invalidate_Instr are not going to be executed, so a single clock cycle is adequate to check each of them, whereas execution of local instructions appearing on the Instr bus may require many clock cycles. However, PC2, the address of the Invalidate_Instr bus instruction, must never exceed PC1 (the address of the Instr bus instruction).
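The invalidation rule reduces to a single comparison per register module. A behavioural sketch, with illustrative names rather than the paper's VHDL signal names:

```python
class RegisterModule:
    """One register module: a stored producer ID plus a data value."""

    def __init__(self, reg_id=0, value=0, valid=False):
        self.id, self.value, self.valid = reg_id, value, valid

    def on_invalidate_instr(self, instr_id, targets_this_reg):
        # A posterior instruction (larger ID) that writes this register
        # makes the currently held value useless hereafter.
        if self.valid and targets_this_reg and self.id < instr_id:
            self.valid = False

r4 = RegisterModule(reg_id=12, value=99, valid=True)
r4.on_invalidate_instr(instr_id=20, targets_this_reg=True)   # invalidated

r5 = RegisterModule(reg_id=25, value=7, valid=True)
r5.on_invalidate_instr(instr_id=20, targets_this_reg=True)   # kept: 25 > 20
```

The comparison direction matters: a smaller bus ID means the instruction on Invalidate_Instr is *prior* to the value's producer, so the register must be left intact.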

The Fetch_Issue module also contains two FIFO memory units to store the encoded scheduling data (shown in Fig. 2). The processor does not start to work until the first scheduling word is received by the Scheduling Data FIFO. The output word of this FIFO determines whether or not the specified instructions must be executed on this processor. If another processor is specified in this word, the Instr_num field is added to PC1 to bypass all of the specified instructions. Words read from the Scheduling Data FIFO are immediately stored in the Scheduled FIFO (Fig. 6). The output of this FIFO determines the address of the processor responsible for the current instruction on the Invalidate_Instr bus; the NoC interface module uses this value to recognise the destination address of the transmitted data.
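The fetch-side use of a scheduling word can be sketched as below. The `(processor, instr_num)` pair is a simplified stand-in for the encoded word format, which the paper does not spell out at this point.

```python
def advance_pc(pc, sched_word, my_address):
    """If the scheduling word assigns the next Instr_num instructions to
    another processor, bypass them by adding Instr_num to PC1; otherwise
    continue issuing locally from the current PC."""
    processor, instr_num = sched_word
    if processor != my_address:
        return pc + instr_num   # skip the whole non-local block
    return pc                   # block is local: issue from here

pc_skip = advance_pc(0, (2, 5), my_address=1)   # block belongs to processor 2
pc_keep = advance_pc(5, (1, 3), my_address=1)   # block is local
```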

A low-complexity reconfigurable computing unit is designed as a general FU that can perform different arithmetic and logical operations under different configurations. Fig. 7 shows the internal architecture of this FU. The R1 and R2 components in this figure store the data operands of the issued instruction. The multiply operation is realised with add and shift operations in this architecture. Table 1 lists the EvoMP instruction set supported by the presented FU architecture. Note that Load and Store are the only instructions that are not executed in FUs. There is also an immediate version of each instruction, in which the second operand is an immediate value.

    4 Experimental results

This section presents the target application domain and performance evaluation results of the EvoMP system. Four representative DSP programs were developed and used as experimental applications and executed on EvoMPs of different sizes. Three other decomposition and scheduling schemes were also implemented and evaluated on the EvoMP for comparison. The same applications were executed on a NIOS II soft core as well, to make the evaluation more comprehensive.

    4.1 Scope of work

The EvoMP system can be used efficiently for iterative applications, as described in [6, 28]. The only crucial requirement is that the main part of the application be executed iteratively a considerable number of times, because the system is efficient only when the number of iterations consumed by the evolution phase is negligible compared with the total number of similar iterations. This is especially the case for applications that perform an identical computation on different data samples. Furthermore, the EvoMP is not suitable for applications in which numerous forward jumps take place


(e.g. control applications), as such jumps affect the execution time and hence the fitness value, leading to problems in fitness evolution. Backward jumps (loops) are also better unrolled at compile time, to permit decomposition and scheduling to take place precisely. At first glance, it seems that any conditional jump may lead to unfairness in fitness evaluation: conditional jumps may be taken or not taken in different iterations, so the number of instructions executed per iteration is not necessarily equal. This is specifically the case in multimedia applications, in which the computations depend strongly on the input data. A primary solution is to consider the number of executed instructions in the fitness evaluation; the size of the accomplished computation must then also be measured and used in the fitness estimate in order to achieve fairness.
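One way to read the suggested correction is to divide the measured cycles by a measure of the work actually done, so that an iteration whose taken branch skipped instructions does not look artificially fast. This is only an illustration of the idea, not the paper's implemented metric:

```python
def fairness_adjusted_fitness(cycles, executed_instructions):
    """Cost per executed instruction: iterations that skipped work via a
    conditional jump are not rewarded for merely doing less of it."""
    return cycles / max(executed_instructions, 1)

# Two iterations of the same program: the second takes a branch and
# executes fewer instructions in fewer cycles, yet its per-instruction
# cost is the same, so neither iteration is unfairly favoured.
f_full = fairness_adjusted_fitness(cycles=100, executed_instructions=50)
f_branch = fairness_adjusted_fitness(cycles=44, executed_instructions=22)
```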

In the embedded systems area, the EvoMP can be used in applications where the same computation is performed on a stream of data inputs. Various encoders and decoders, signal processing applications in communication systems, encryption and decryption standards, and packet processing in network systems are example application domains of the EvoMP system. The configuration of the EvoMP (including the number of contributing processors) must be selected so as to meet the processing-power requirements of the target applications. Note that activities such as coding-style conversion and loop unrolling do not require high-level analysis and can be accomplished by simple compile-time algorithms or even by object-code modification.

4.2 Configuration of the experimental environment

The EvoMP system has some configurable parameters that affect the performance and implementation results. These parameters must be carefully set based on the running application; Table 2 lists them together with the configuration values used for our experimental applications. These values were selected according to the best simulation results obtained after testing different configurations. The mesh size is another configurable parameter; it is varied in our experiments and is therefore not listed in Table 2.

Figure 7 Internal architecture of the exploited configurable FU

Figure 6 Internal architecture of Fetch_Issue unit (general view)

Table 1 Instruction set of the EvoMP

Instruction          Instruction category
load/store           memory
MOV                  data movement
shift/rotate left    shift and rotate
shift/rotate right   shift and rotate
AND/OR/XOR/NOT       logical
ADD/SUB/MUL          arithmetic

Table 2 Configurable parameters of the EvoMP and their values in each experiment

Parameter       FIR-16  DCT-8  DCT-16  MATRIX-5×5  Description
Word_Len        16      16     16      16          processor word length
FU_num          1       1      1       1           number of instantiated FUs in each processor
Flit_Word_Len   16      16     16      16          bit-length of connection links between NoC switches
Pop_Size        8       16     16      16          number of chromosomes in each population
Cross_Rate1     1       1      2       2           number of chromosomes per generation produced by crossover between the best found chromosomes
Cross_Rate2     4       4      8       8           number of chromosomes per generation produced by crossover between random data and the best chromosome
Rand_Size       3       11     6       6           number of chromosomes per generation produced randomly

Figure 8 Evolution-phase best-chromosome fitness value (number of clock cycles required for execution of each iteration) in different EvoMP sizes for
a 16-tap FIR filter
b 8-point discrete cosine transform
c 16-point discrete cosine transform
d 5×5 matrix multiplication


Table 3 Fitness value of the final best chromosome (in clock cycles) and corresponding speed-up and evolution time for four decomposition and scheduling schemes using different numbers of processors

                                           FIR-16       DCT-8    DCT-16    MATRIX-5×5
number of instructions                     74           88       324       406
number of multiply instructions            16           32       128       125
NIOS II required number of clock cycles    510          810      3452      3838

1 processor (1×2), all three schemes
  fitness (clock cycles)                   350          671      2722      3181
  speed-up                                 1            1        1         1

2 processors (1×3)
  presented EvoMP
    fitness (clock cycles)                 214          403      1841      2344
    speed-up                               1.63         1.66     1.47      1.37
    evolution time (µs)                    27 342       42 807   74 582    198 384
  SDGS
    fitness (clock cycles)                 202          401      1812      2218
    speed-up                               1.73         1.67     1.50      1.43
    evolution time (µs)                    1967         29 315   84 365    65 119
  first free
    fitness (clock cycles)                 293          733      2529      2487
    speed-up                               1.19         0.91     1.08      1.27
  pure random
    fitness (clock cycles)                 306          656      2441      2655
    speed-up                               1.14         1.022    1.11      1.19

3 processors (2×2)
  presented EvoMP
    fitness (clock cycles)                 171          319      1460      1868
    speed-up                               2.04         2.10     1.86      1.70
    evolution time (µs)                    30 174       54 790   23 319    294 828
  SDGS
    fitness (clock cycles)                 161          306      1189      1817
    speed-up                               2.17         2.19     2.28      1.75
    evolution time (µs)                    10 739       52 477   536 565   10 092
  first free
    fitness (clock cycles)                 239          681      1933      2098
    speed-up                               1.46         0.98     1.40      1.51
  pure random
    fitness (clock cycles)                 291          589      2213      2492
    speed-up                               1.20         1.13     1.23      1.27

4 processors (2×3)
  presented EvoMP
    fitness (clock cycles)                 unevaluated  285      1213      1596
    speed-up                               n/a          2.33     2.25      1.99
    evolution time (µs)                    n/a          93 034   630 482   546 095
  SDGS
    fitness (clock cycles)                 n/a          256      1106      1575
    speed-up                               n/a          2.62     2.46      2.01
    evolution time (µs)                    n/a          41 023   111 118   178 219
  first free
    fitness (clock cycles)                 n/a          496      1587      1815
    speed-up                               n/a          1.35     1.71      1.75
  pure random
    fitness (clock cycles)                 n/a          500      1837      2176
    speed-up                               n/a          1.34     1.48      1.46


The following equation illustrates the relation between the genetic core parameters described in Table 2

Pop_Size = Cross_Rate1 + Cross_Rate2 + Rand_Size    (1)
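Relation (1) says that each new generation is assembled from three sources: crossover among the best chromosomes found so far, crossover of the best chromosome with random data, and purely random chromosomes. A minimal software sketch of that composition follows; the chromosome encoding and the single-point crossover operator are placeholders, not the hardware genetic core's actual operators.

```python
import random

def next_generation(best_pair, cross_rate1, cross_rate2, rand_size, chrom_len=8):
    """Assemble one generation whose size is
    Cross_Rate1 + Cross_Rate2 + Rand_Size, per relation (1)."""

    def crossover(a, b):
        cut = random.randrange(1, chrom_len)   # single-point crossover
        return a[:cut] + b[cut:]

    def random_chrom():
        return [random.randrange(2) for _ in range(chrom_len)]

    generation = []
    for _ in range(cross_rate1):               # best x best
        generation.append(crossover(best_pair[0], best_pair[1]))
    for _ in range(cross_rate2):               # best x random data
        generation.append(crossover(best_pair[0], random_chrom()))
    for _ in range(rand_size):                 # purely random
        generation.append(random_chrom())
    return generation

# FIR-16 configuration from Table 2: Pop_Size = 1 + 4 + 3 = 8
gen = next_generation(([0] * 8, [1] * 8), cross_rate1=1, cross_rate2=4, rand_size=3)
```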

    4.3 Simulation and synthesis results

Fig. 8 shows the fitness value (number of clock cycles required to execute one iteration) of the best chromosome found at each instant of the evolution phase, for different EvoMP sizes. Note that the execution time in the single-processor EvoMP is constant, as decomposition and scheduling are meaningless in such a system. These results confirm the applicability of the approach, because better results are achieved as the number of processors increases. This means that the expected resource utilisation is obtained without noticeable changes to the sequential code development process. Increasing the program length and the number of contributing processors also increases the required evolution time, obviously because of the larger decomposition and scheduling search space. Exploiting a more advanced, dedicated heuristic method for decomposition and scheduling (instead of the current pure genetic architecture) could reduce the required evolution time. As illustrated in Table 3, the final achieved speed-up gradually saturates as the number of processors in the system increases. The remarkable growth of the decomposition and scheduling search space, restrictive data dependencies in the program, and the increasing communication cost are the main reasons for this phenomenon.

Simultaneous exploration for an efficient decomposition and scheduling solution requires search in a very large space. As mentioned in previous sections, many dynamic task scheduling architectures and techniques have been introduced in the literature, but run-time task decomposition is novel to the EvoMP. Comparison between the presented version of the EvoMP and other approaches is necessary to prove its applicability. Hence, we have designed three other schedulers with a predetermined (static) task decomposition scheme to make the comparison feasible. The first scheduler uses a GA, the second utilises the classical First Free (FF) [9] algorithm for dynamic task scheduling, and the third is a pure random scheduler. All of them use a predetermined decomposition scheme (manually specified inside the program); the simplicity of the developed experimental programs allowed us to partition them into small tasks manually in an efficient way. The genetic core in the presented EvoMP architecture (studied in Section 2.4) is replaced by these new schedulers.

The static decomposition and genetic scheduling (SDGS) approach was previously introduced in the literature [29]; its genetic core performs only the task scheduling, with the task decomposition scheme statically determined. The FF approach is a simple and well-known scheduling scheme: it starts from address 01 and selects the first free node able to execute the first pending task in the job queue. This scheduler neither saves its decisions nor receives any feedback about the efficiency of its prior decisions [9]. In the pure random approach, the scheduling scheme differs in each iteration; we therefore use the mean value (number of clock cycles per iteration) over 1000 iterations as the result.

The simulation results of these three schedulers are also given in Table 3. The final results of both the proposed EvoMP and static decomposition and genetic scheduling (SDGS) are much better than those of the FF and pure random approaches. Furthermore, the results of SDGS are better than those of the proposed scheme (better fitness values achieved in less evolution time), obviously because of its much smaller search space. However, note that static decomposition necessitates programming-time or compile-time task decomposition, which equates to loss of the capability to execute sequential programs. A NIOS II processor was also used to execute the same applications; the number of clock cycles it requires to execute one iteration of each application is given in Table 3 to make the results more comprehensive. Note that we used a NIOS configuration with a hardware multiplier.

The EvoMP is completely implemented in RT-level VHDL. Table 4 gives the synthesis results of a 2×2-mesh system on a Xilinx Virtex-II XC2V3000 FPGA.

5 Conclusions and future work

This paper presented the EvoMP, a novel NoC-based MPSoC system with dynamic task decomposition and scheduling capability. Conventional sequential programs can be executed efficiently on this system. The EvoMP exploits a hardware genetic core for run-time task decomposition and scheduling. A special architecture is also designed for each processor to enable automatic detection and alleviation of data dependencies. The centralised memory is the main bottleneck for scalability of the EvoMP, while low-cost fault tolerance is a beneficial feature of the system. The EvoMP is suitable for applications that perform a single computation on a huge amount of data or on a data stream. The operational mechanism, architecture,

Table 4 Synthesis results of a 2×2-mesh EvoMP system on an XC2V3000 FPGA

                    NoC switch  Genetic core  MMU         Processor   Total system
area (total LUTs)   741 (2%)    1891 (6%)     3612 (12%)  4583 (15%)  12 877
max freq. (MHz)                                                       92.4


advantages and challenges of the system have been presented in this paper. The experimental results confirm the applicability of the EvoMP's novel ideas. Note that the final goal of the authors in this research was the presentation and demonstration of the applicability of novel ideas in designing MPSoC systems; these ideas can be utilised in future multiprocessor-based high-performance computing architectures.

The centralised physical memory causes some scalability issues. Designing a distributed-physical-memory version of the EvoMP is a useful piece of future work. Distributing the address space seems to be impossible, but techniques such as distributed shared memory, which keep the address space shared while distributing the physical memory [30], can be useful in this system. A pure genetic core is used in the presented version of the EvoMP; the authors believe that the performance of the system can be improved by the design and utilisation of a more dedicated heuristic algorithm. Utilisation of such techniques is another beneficial direction for future work.

    6 References

[1] JERRAYA A.A., WOLF W.: 'Multiprocessor systems-on-chips' (Morgan Kaufmann Publishers, 2005, 1st edn.)

[2] MARTIN G.: 'Overview of the MPSoC design challenge'. Proc. Design and Automation Conf., San Francisco, USA, July 2005, pp. 274-279

[3] WOLF W.: 'The future of multiprocessor systems-on-chips'. Proc. Int. Design Automation Conf., San Diego, USA, June 2004, pp. 681-685

[4] PARHAMI B.: 'Introduction to parallel processing: algorithms and architectures' (Kluwer Academic Press, 1999, 1st edn.)

[5] EL-REWINI H., ABD-EL-BARR M.: 'Message passing interface (MPI)', in 'Advanced computer architecture and parallel processing' (Wiley, 2005, 1st edn.), pp. 205-233

[6] PARHI K., MESSERSCHMITT D.: 'Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding', IEEE Trans. Comput., 1991, 40, pp. 178-195

[7] MANIMARAN G., MURTHY C.S.R.: 'An efficient dynamic scheduling algorithm for multiprocessor real-time systems', IEEE Trans. Parallel Distrib. Syst., 1998, 9, (3), pp. 312-319

[8] KHAN A.A., MCCREARY C.L., JONES M.S.: 'A comparison of multiprocessor scheduling heuristics'. Proc. Int. Conf. Parallel Processing, 1994, pp. 243-250

[9] CARVALHO E., CALAZANS N., MORAES F.: 'Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs'. Proc. Int. Rapid System Prototyping Workshop, 2007, pp. 34-40

[10] HUBNER M., PAULSSON K., BECKER J.: 'Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores'. Proc. Int. Symp. Parallel and Distributed Processing, Washington, DC, USA, 2005, p. 149.1

[11] GOHRINGER D., HUBNER M., SCHATZ V., BECKER J.: 'Runtime adaptive multi-processor system-on-chip: RAMPSoC'. Proc. Int. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7

[12] LIANG-THE L., CHIA-YING T., SHIEH-JIE H.: 'An adaptive scheduler for embedded multi-processor real-time systems'. Proc. IEEE TENCON Conf., October 2007, pp. 1-6

[13] MALANI P., MUKRE P., QIU Q., WU Q.: 'Adaptive scheduling and voltage scaling for multiprocessor real-time applications with non-deterministic workload'. Proc. Design, Automation and Test in Europe Conf., April 2007, pp. 652-657

[14] KLIMM A., BRAUN L., BECKER J.: 'An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores'. Proc. Parallel and Distributed Processing Symp., April 2008, pp. 1-7

[15] ZOMAYA A.Y., WARD C., MACEY B.: 'Genetic scheduling for parallel processor systems: comparative studies and performance issues', IEEE Trans. Parallel Distrib. Syst., 1999, 10, pp. 795-812

[16] YI-WEN Z., JIAN-GANG Y.: 'A genetic algorithm for tasks scheduling in parallel multiprocessor systems'. Proc. Int. Conf. Machine Learning and Cybernetics, November 2003, pp. 1785-1790

[17] HOU E., ANSARI N., REN H.: 'A genetic algorithm for multiprocessor scheduling', IEEE Trans. Parallel Distrib. Syst., 1994, 5, (2), pp. 113-120

[18] LEE S.J., LEE K., YOO H.J.: 'Analysis and implementation of practical, cost-effective networks on chips', IEEE Des. Test Comput., 2005, 22, (5), pp. 422-433

[19] BJERREGAARD T., MAHADEVAN S.: 'A survey of research and practices of network-on-chip', ACM Comput. Surv., 2006, 38, pp. 1-54

[20] FREEH V.W., BLETSCH T.K., RAWSON F.L.: 'Scaling and packing on a chip multiprocessor'. Proc. Parallel and Distributed Processing Symp., March 2007, pp. 1-8

[21] RUGGIERO M., GUERRI A., BERTOZZI D., POLETTI F., MILANO M.: 'Communication-aware allocation and scheduling framework for stream-oriented multi-processor systems-on-chip'. Proc. Design, Automation and Test in Europe Conf., March 2006, pp. 6-12

[22] MONTAZERI F., SALMANI-JELODAR M., FAKHRAIE S.N., FAKHRAIE S.M.: 'Evolutionary multiprocessor task scheduling'. Proc. Int. Symp. Parallel Computing in Electrical Engineering, 2006

[23] OBERMAISSER R., KRAUT H., SALLOUM C.: 'A transient-resilient system-on-a-chip architecture with support for on-chip and off-chip TMR'. Proc. Int. Dependable Computing Conf., 2008, pp. 123-134

[24] CANHAM R., TYRRELL A.: 'An embryonic array with improved efficiency and fault tolerance'. Proc. NASA/DoD Conf. Evolvable Hardware, July 2003, pp. 265-272

[25] BARKER W., HALLIDAY D.M., THOMA Y., ET AL.: 'Fault tolerance using dynamic reconfiguration on the POEtic Tissue', IEEE Trans. Evol. Comput., 2007, 11, (5), pp. 666-684

[26] MANIMARAN G., MURTHY C.S.R.: 'A fault-tolerant dynamic scheduling algorithm for multiprocessor real-time systems and its analysis', IEEE Trans. Parallel Distrib. Syst., 1998, 9, (11), pp. 1137-1152

[27] BEITOLLAHI H., DECONINCK G.: 'Fault-tolerant partitioning scheduling algorithms in real-time multi-processor systems'. Proc. Pacific Rim Symp. Dependable Computing, December 2006, pp. 296-304

[28] JAGADISH H.V., KAILATH T.: 'Multiprocessor implementation models for adaptive algorithms', IEEE Trans. Signal Process., 1996, 44, (9), pp. 2319-2331

[29] PAGE A.J., NAUGHTON T.J.: 'Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing'. Proc. Int. Symp. Parallel and Distributed Processing, April 2005, p. 189.1

[30] EL-REWINI H., ABD-EL-BARR M.: 'Introduction to advanced computer architecture and parallel processing', in ZOMAYA A.Y. (Ed.): 'Advanced computer architecture and parallel processing' (Wiley, 2005, 1st edn.), pp. 1-17
