Particle Swarm Optimization for Run-Time Task Decomposition and Scheduling in Evolvable MPSoC
Shervin Vakili, Sied Mehdi Fakhraie, Siamak Mohammadi, and Ali Ahmadi
International Conference on Computer Engineering and Technology 2009 (ICCET 2009)
January 24, 2009




Page 1

Particle Swarm Optimization for Run-Time Task Decomposition and Scheduling in Evolvable MPSoC

Shervin Vakili, Sied Mehdi Fakhraie, Siamak Mohammadi, and Ali Ahmadi

International Conference on Computer Engineering and Technology 2009 (ICCET 2009)

January 24, 2009

Page 2

Outline

► Why MPSoC?
► Introduction to EvoMP
► Processing Platform Hardware Architecture
► PSO Hardware Core
► Simulation and Synthesis Results

Page 3

Why MPSoC?

► An emerging trend for designing high-performance computing architectures

► Retains most of the desirable advantages of single-processor solutions, such as short time-to-market, post-fabrication reusability, flexibility, and programmability

► Moving toward a large number of simple processors on a chip

Page 4

MPSoC Development Challenges

► Programming models: MP systems require concurrent software. Two main solutions:

  - Software development using parallel models, e.g., OpenMP and MPI

► "Software developers have been well-trained by sixty years of computing history to think in terms of sequentially defined applications code" [2]

► Requires a huge investment to re-develop existing software

Page 5

MPSoC Development Challenges (2)

  - Automatic parallelization at compile-time

► Does not require reprogramming, but requires re-compilation
► Such a compiler must solve two complex problems:
  - Decomposition of the program into tasks
  - Scheduling of the tasks among cooperating processors

► Both task decomposition and scheduling are NP-complete problems

► G. Martin [2]: "Decomposition of an application described in a serial fashion into a set of concurrent or parallel tasks that can cooperate in an orderly and predictable way is one of the most difficult jobs imaginable and despite of forty or more years of intensive research in this area there are very few applications for which this can be done automatically".

Page 6

MPSoC Development Challenges (3)

► All MPSoCs can be divided into two categories:

  - Static scheduling
    ► Task scheduling is performed before run-time
    ► The number of contributing processors must be predetermined

  - Dynamic scheduling (e.g., current multi-core PC processors)
    ► A run-time scheduler (in hardware, middleware, or OS) is in charge of task scheduling
    ► Does not require prior information about the number of available processors (desirable for fault tolerance)

Page 7

Introduction to EvoMP

► An NoC-based homogeneous multiprocessor system with evolvable task decomposition and scheduling

► Features:
  - Distributed control and computing
  - Scalable
  - Does not need parallel programming
    ► One of the main difficulties in parallel processing
    ► Requires reprogramming all the developed (sequential) software

Page 8

Introduction to EvoMP (2)

► Features:
  - All computational units have one copy of the entire program
  - A hardware PSO core is exploited in the EvoMP architecture to generate a bit-string
    ► Specifies in which processor each instruction must be executed
  - Our first version of EvoMP used a genetic algorithm core [8]

Page 9

Introduction to EvoMP (3)

► Target applications: applications which perform a unique computation on a stream of data, e.g.:
  - Digital signal processing of video and audio signals
  - Different codec standards
  - Huge sensory data processing
  - Packet processing in network applications
  - …

Page 10

EvoMP Top View

► PSO core produces a bit-string (particle) which determines the location of execution of each instruction at the beginning of each iteration.

[Figure: 2x2 NoC mesh — switches SW00, SW01, SW10, SW11 connecting Cell-00, Cell-01, Cell-10, Cell-11 and the PSO core; every cell holds a copy of the program below.]

1- MOV R1, 0
2- MOV R2, 0
L1:        ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- JUMP L1

Particle: 01101010…11

Page 11

How Chromosome Codes the Scheduling Data

► Streaming applications have two main parts:
  - Initialization
  - Infinite (or semi-infinite) loop

;Initial
1- MOV R1, 0
2- MOV R2, 0
L1:        ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- PSO
10- JUMP L1

Page 12

How EvoMP Works

► The following process is repeated for each iteration:
  - At the beginning of each iteration, the PSO core generates and sends the bit-string (particle) to all processors
  - Each processor then executes this iteration of the program with the decomposition and scheduling scheme specified by this bit-string
  - An internal counter in the PSO core counts the number of clock cycles spent during execution of each iteration
  - When all processors have reached the end of the loop, the PSO core uses the output of this counter as the fitness value of the last generated particle
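The per-iteration flow above can be sketched in software. This is a minimal sketch with an invented cost model standing in for the hardware cycle counter; all names (`evaluate_particle`, `evolve`) and the random sampling in place of a real PSO update are our assumptions, not the paper's implementation:

```python
import random

def evaluate_particle(particle):
    # Hypothetical stand-in for the hardware cycle counter: one cycle
    # per instruction, plus a penalty whenever consecutive instructions
    # are mapped to different processors (an NoC transfer).
    cycles = len(particle)
    for a, b in zip(particle, particle[1:]):
        if a != b:
            cycles += 2
    return cycles

def evolve(num_processors=4, num_instructions=9, iterations=50, seed=1):
    rng = random.Random(seed)
    best_particle, best_fitness = None, float("inf")
    for _ in range(iterations):
        # One particle is broadcast per loop iteration; here we simply
        # sample mappings at random instead of doing a real PSO update.
        particle = [rng.randrange(num_processors)
                    for _ in range(num_instructions)]
        fitness = evaluate_particle(particle)  # cycles for this iteration
        if fitness < best_fitness:
            best_particle, best_fitness = particle, fitness
    return best_particle, best_fitness
```

The key point the sketch illustrates is that the fitness is simply the measured execution time of one loop iteration under the candidate mapping.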

Page 13

How EvoMP Works (2)

► The system has three main states:
  - Initialize:
    ► Only for the first population
    ► PSO core generates random particles
  - Evolution:
    ► PSO core produces the new population through particular computations using the best previously archived particles
    ► When the termination condition is met, the system goes to the final state
  - Final:
    ► The best particle achieved in the evolution stage is used as the constant output of the PSO core
    ► When one of the processors becomes faulty, the system returns to the evolution stage to perform re-evolution (beneficial for the fault-tolerance capability of EvoMP)

[State diagram: Initialize → Evolution; Evolution → Final on Terminate; Final → Evolution on Fault detected]
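The three states and their transitions can be captured in a few lines; this is only a sketch of the control flow, with state names taken from the slide and the function name invented:

```python
def next_state(state, terminated=False, fault=False):
    # Initialize runs only for the first population, then hands over to
    # Evolution; Evolution runs until the termination condition is met;
    # Final falls back to Evolution when a processor fault is detected.
    if state == "Initialize":
        return "Evolution"
    if state == "Evolution":
        return "Final" if terminated else "Evolution"
    if state == "Final":
        return "Evolution" if fault else "Final"
    raise ValueError(f"unknown state: {state}")
```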

Page 14

How Chromosome Codes the Scheduling Data (1)

► Each bit-string (particle) consists of some small words (sub-particles)

► Each sub-particle contains two fields:
  - A processor number
  - A limited number which specifies how many instructions must be executed in the processor named in the first field
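A decoder for this encoding might look like the following sketch. The 2-bit processor field and 3-bit count field match the 2x2-mesh example elsewhere in the deck, but the actual field widths are configuration parameters, and the function name is ours:

```python
def decode_particle(bits, proc_bits=2, count_bits=3):
    # Split the particle into fixed-width sub-particles and decode each
    # into (processor number, number of instructions to run there).
    word = proc_bits + count_bits
    schedule = []
    for i in range(0, len(bits) - len(bits) % word, word):
        sub = bits[i:i + word]
        proc = int(sub[:proc_bits], 2)     # first field: processor id
        count = int(sub[proc_bits:], 2)    # second field: instruction count
        schedule.append((proc, count))
    return schedule
```

For example, `decode_particle("1000101010")` yields `[(2, 1), (1, 2)]`: one instruction on processor 10, then two instructions on processor 01.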

Page 15

How Chromosome Codes the Scheduling Data (2)

► Assume that we have a 2x2 mesh

[Figure: an example particle split into sub-particles; each sub-particle holds a processor number (e.g. 10) and a number of instructions (e.g. 001), mapping consecutive instructions of the loop below to processors 00, 01, 10, and 11 of the mesh.]

;Initial
1- MOV R1, 0
2- MOV R2, 0
L1:        ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- GENETIC
10- JUMP L1

Page 16

Inter-Processor Data Dependencies

► Inter-processor data dependencies are detected in the source processor using architectural mechanisms
  - The source processor transmits the required data to the destination one(s) through the NoC
  - Does not require a request-send scheme

Page 17

Architecture of Each Processor

► The number of FUs is a configurable parameter
► Supports out-of-order execution
► The first free FU grabs the instruction from the Instr bus and sends a signal to Fetch_Issue to fetch the next instruction
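The first-free-FU issue rule can be illustrated with a small sketch (the function name and data representation are ours, not the RTL's):

```python
def issue(instruction, fu_slots):
    # The first free functional unit grabs the instruction; Fetch_Issue
    # can then fetch the next one. Returns the index of the FU that
    # accepted it, or None if all FUs are busy (fetch must stall).
    for i, occupant in enumerate(fu_slots):
        if not occupant:
            fu_slots[i] = instruction  # slot now holds the instruction
            return i
    return None
```

Because any free FU can take the next instruction regardless of program order, independent instructions naturally complete out of order.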

[Block diagram: each processor — Fetch_Issue, Register File, NoC Interface (NoC_In/NoC_Out), and FUs FU1, FU2, ...; buses and signals include Instr, R1_Data, R2_Data, Send1_Data, Send2_Data, Extra_Bus, Invalidate_Instr, Dest_occupied, Instr_Rcvd, Extra_bus_Req, Scheduling Data Word, and Destination Processor Address.]

Page 18

Particle Swarm Optimization Algorithm

► A stochastic population-based evolutionary algorithm

► Tries to find the optimum solution over the search space by sampling points and converging the swarm on the most promising regions
  - The number of these sampling points (called particles) is constant (the population size)
  - Each sampling point is a candidate solution
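For reference, the classic (continuous) PSO update that drives this convergence is sketched below. The hardware core used here implements a discrete variant [6], so this is only an illustration of the algorithm, not the core's exact arithmetic; parameter values are conventional defaults:

```python
import random

def pso_step(positions, velocities, pbest, gbest,
             w=0.7, c1=1.5, c2=1.5, rng=random.Random(0)):
    # Standard PSO update for every particle i and dimension j:
    #   v[i][j] = w*v[i][j] + c1*r1*(pbest[i][j] - x[i][j])
    #                       + c2*r2*(gbest[j]    - x[i][j])
    #   x[i][j] = x[i][j] + v[i][j]
    for i in range(len(positions)):
        velocities[i] = [
            w * vj
            + c1 * rng.random() * (pb - xj)
            + c2 * rng.random() * (gb - xj)
            for vj, xj, pb, gb in zip(velocities[i], positions[i],
                                      pbest[i], gbest)
        ]
        positions[i] = [xj + vj
                        for xj, vj in zip(positions[i], velocities[i])]
    return positions, velocities
```

Each particle is pulled toward its own best-known position (pbest) and the swarm's best (gbest), which is what concentrates sampling on promising regions.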

Page 19

PSO Core

[Figure: PSO core attached to the 2x2 NoC mesh (SW00, SW01, SW10, SW11; Cell-00, Cell-01, Cell-10, Cell-11); PSO core architecture taken from [6].]

Page 20

Evolution Phase Results: DCT-16

► Parameters:
  - Population size = 16
  - NoC connection width = 16

► 324 instructions
► 128 multiplications

► Execution results of the 16-point Discrete Cosine Transform on different-size EvoMPs

► Best fitness is the number of clock cycles required to execute one iteration using the best particle found so far.

[Plot: DCT-16 — best fitness (clock cycles, 0-3000) vs. time (0-120000 µs) for 1, 2, 3, and 5 processors.]

Page 21

Evolution Phase Results: MATRIX-5x5

► 406 instructions
► 125 multiplications

► Execution results of 5x5 matrix multiplication on different-size EvoMPs

[Plot: MAT-5x5 — best fitness (clock cycles, 0-3500) vs. time (0-300000 µs) for 1, 2, 3, and 5 processors.]

► Parameters:
  - Population size = 16
  - NoC connection width = 16

Page 22

Final Evolution Phase Results

► The following table shows the final results achieved in the evolution phase (and the corresponding evolution time) for both the genetic-based and PSO-based EvoMPs.

► These results show a small improvement in final results and convergence time in the PSO-based system.

(Fit = best fitness in clock cycles per iteration; Time = evolution time; the 1-processor result is identical for both schedulers.)

FIR-16:  74 instr., 16 multiplies, 240-bit particle
  1 Processor (both): Fit 350
  2 Processors: Genetic 214 / 23.7, PSO 211 / 12.3
  3 Processors: Genetic 171 / 30.1, PSO 174 / 14.3
  5 Processors: unevaluated

DCT-8:   88 instr., 32 multiplies, 280-bit particle
  1 Processor (both): Fit 671
  2 Processors: Genetic 403 / 93.0, PSO 393 / 6.2
  3 Processors: Genetic 319 / 99.8, PSO 308 / 21.8
  5 Processors: Genetic 285 / 138.1, PSO 203 / 15.6

DCT-16:  324 instr., 128 multiplies, 720-bit particle
  1 Processor (both): Fit 2722
  2 Processors: Genetic 1841 / 74.5, PSO 1831 / 41.7
  3 Processors: Genetic 1460 / 23.3, PSO 1439 / 45.3
  5 Processors: Genetic 1213 / 633.7, PSO 1191 / 98.3

MAT-5x5: 406 instr., 125 multiplies, 800-bit particle
  1 Processor (both): Fit 3181
  2 Processors: Genetic 2344 / 198.3, PSO 2312 / 86.3
  3 Processors: Genetic 1868 / 294.8, PSO 1821 / 148.3
  5 Processors: Genetic 1596 / 546.7, PSO 1518 / 240.9

Page 23

Synthesis Results

► The following table shows the synthesis results of both the PSO and genetic cores on a Virtex-II (XC2V3000) FPGA:

                          Population size   Sub-particle length   Total LUTs   Max. Freq. (MHz)
  PSO-based scheduler     16                4                     1864         92.6
  Genetic-based scheduler 16                -                     1642         68.4

Page 24

References

[1] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips, San Francisco: Morgan Kaufmann Publishers, 2005.
[2] G. Martin, "Overview of the MPSoC design challenge," Proc. Design and Automation Conf., July 2005, pp. 274-279.
[3] M. Hubner, K. Paulsson, and J. Becker, "Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores," Proc. Int. Symp. Parallel and Distributed Processing, 2005, pp. 149.1.
[4] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, "Runtime adaptive multiprocessor system-on-chip: RAMPSoC," Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008, pp. 1-7.
[5] A. Klimm, L. Braun, and J. Becker, "An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores," Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[6] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, "Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization," Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.
[7] A. J. Page and T. J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," Proc. Int. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1.
[8] S. Vakili, S. M. Fakhraie, and S. Mohammadi, "EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling," to appear in IET Computers & Digital Techniques.
[9] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs," Proc. Int. Rapid System Prototyping Workshop, 2007, pp. 34-40.

Page 25

Fetch_Issue Unit

► The PC1-Instr bus is used for executive instructions
► The PC2-Invalidate_Instr bus is used for data dependency detection

[Block diagram: Fetch_Issue unit — Instruction Memory addressed by PC1 and PC2, a Scheduling Data FIFO and Scheduled FIFO fed by Scheduling_Data_Word, a down counter loaded with Instr_num, comparators matching Proc_Addr against the address of this processor, and outputs Instr (Opcode, R1_ID, R2_ID), Send1_ID, Invalidate_Instr, and Destination Processor Address.]

Page 26

Example: Second-Order FIR Filter

MOV R1, 0
MOV R2, 0
L1:
MOV R1, Input
MUL R3, R1, Coe1
MUL R4, R2, Coe2
ADD R1, R3, R4
MOV Output, R1
MOV R1, R2
GENETIC
JUMP L1

[Animation: clock-by-clock trace of this program split across two cells (Cell1, Cell2); PC1 and PC2 advance in each cell, and one cell stalls waiting for R3 to arrive from the other over the NoC before executing the ADD.]

Total time: 18 Clock Cycles