

COSMOS: A System-Level Modelling and

Simulation Framework for Coprocessor-Coupled

Reconfigurable Systems

Kehuai Wu, Jan Madsen
Dept. of Informatics and Mathematical Modelling, Technical Univ. of Denmark

Email: {kw,jan}@imm.dtu.dk

Abstract- Dynamically reconfigurable systems demand complicated run-time management. Due to resource constraints and reconfiguration latencies, efficient reconfiguration strategies that can reduce the overhead cost of dynamic reconfiguration need to be studied. In this paper, we i) propose a reconfigurable task model which extends the classical real-time task model to support the additional states and latencies needed to capture dynamically reconfigurable behavior, ii) propose a coprocessor-coupled reconfigurable architecture which has hardware run-time support for task execution, task reallocation and resource management, and iii) present a SystemC based framework to model and simulate coprocessor-coupled reconfigurable systems. We illustrate how COSMOS may be used to capture the dynamic behavior of such systems and emphasize the need for capturing the system aspects of such systems in order to deal with future design challenges of dynamically reconfigurable systems.

I. INTRODUCTION

Future embedded systems will be based on platforms which allow the system to be extended and incrementally updated while running in the field. This will not only extend the lifetime of the system, but also allow the system to adapt to the physical environment as well as perform self-repair, hence increasing the reliability and robustness of the system. In order to facilitate this, the platforms need to be dynamically reconfigurable architectures. Although these platforms will be based on multiprocessor system-on-chips (MPSoC) and network-on-chip (NoC) architectures, the dynamic behavior of the hardware poses new challenges to tools and methodologies in order to ensure both efficient platform design and run-time platform usage.

Reconfigurable architectures fully exploit the tradeoff between chip area and hardware reusability. Instead of implementing a digital IP core with fixed logic, a programmable digital device is used. By switching the application running on the programmable device at run-time, the architecture should have the flexibility of software and the efficiency of hardware, thus enabling us to close the gap between software and hardware.

The biggest challenge in reconfigurable system design is to improve the rate of reconfiguration at run-time by reducing the reconfiguration overhead. The reconfiguration overhead comes from multiple sources, and without proper management, the flexibility of reconfiguration cannot justify the overhead cost. Many new technologies and designs for minimizing the reconfiguration overhead have been proposed. Logic granularity [4], [5], host coupling [3], resource management [7], [6] etc. have been studied in various contexts. These technologies substantially increase the practicality of reconfigurable systems, but also often lead to highly complicated system behavior. There exist several highly efficient architectures, but many of them have significant drawbacks in terms of programmability, flexibility, scalability or utilization rate.

Even though low-level technologies have drawn a lot of attention, the study of system-level behavior and compilation is still in its infancy. As high-level design decisions made early in the design process can have a high impact on the performance of reconfigurable systems, the evaluation of applications executing on a reconfigurable system in the early development stages is a new challenge which needs to be addressed.

For datapath-coupled architectures [11], [4], the reconfigurable unit (RU) is frequently designed as a special instruction-set functional unit or extended to a large-scale VLIW processor; thus the application can be efficiently evaluated with instruction-level simulation. However, coprocessor-coupled architectures, which are usually large-scale, need advanced run-time resource management and carefully designed architectures. Hence, to improve the system efficiency, we need to be able to model and analyze such architectures and the applications running on them.

In this paper we present COSMOS, a framework to model and simulate coprocessor-coupled reconfigurable systems. We propose a novel real-time task model which captures the additional characteristics needed to correctly handle dynamically reconfigurable systems. We also propose a general model of coprocessor-coupled reconfigurable systems. The task and architecture models are based on an existing MPSoC simulation model, ARTS [1], which has been extended to facilitate run-time resource management strategies. To the best of our knowledge, this is the first attempt to create a system-level modelling framework for dynamically reconfigurable systems which takes into account the reconfigurable architecture, the application running on the architecture, and the run-time task and resource management.

The rest of the paper is organized as follows. Section

II gives an overview of the new technologies employed in reconfigurable system design. Section III proposes the reconfigurable system task model which captures the real-time application execution characteristics. Section IV proposes a real-time system model of the coprocessor-coupled architecture with a focus on run-time resource management and task execution efficiency. Section V discusses how the task and architecture models have been implemented in SystemC, and a demonstrative simulation result is presented in Section VI. Section VII discusses future work and Section VIII concludes our study.

1-4244-1058-4/07/$25.00 © 2007 IEEE

Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on November 13, 2009 at 08:23 from IEEE Xplore. Restrictions apply.

II. BACKGROUND

During a reconfiguration, a reconfigurable architecture suffers from latencies due to context switching (configuration and intermediate data) of an RU. The severity of this latency is determined by several physical factors, e.g. the scale of the RU, the logic granularity, the configuration memory bandwidth, the rate of reconfiguration, or the buffering techniques of reconfiguration memory fetching. In the following we will give an overview of the related research areas that can reduce such latencies, and discuss how they affect system behavior.

One research trend assumes that the applications, or a collection of tasks, share the RU in time, as shown in Figure 2A. [8] proposed a multi-context FPGA that can significantly reduce the reconfiguration time, but the extra cost in chip area is hardly justifiable by the performance gain. A solution that can substantially reduce the area overhead is to increase the logic granularity of the RU to medium- or coarse-grained, as shown in Figure 1. Even if these higher-granularity architectures do not offer optimal solutions for applications that heavily exploit bit-level data manipulation, the concept of multiple contexts has proved feasible. Still, the number of contexts being cached on the RU is usually limited, and optimal utilization of the limited context resource at run-time is a difficult challenge for a multi-tasking system [9].
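The limited-context problem can be illustrated with a minimal sketch of a multi-context RU (the class name, context count and cycle costs below are our own illustrative assumptions, not taken from [8]): switching among on-chip cached configurations costs a single cycle, while a task whose configuration is not cached forces a long fetch from configuration memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical multi-context RU: NUM_CONTEXTS on-chip configuration
// planes; switching among cached planes takes one cycle, while loading
// a plane from configuration memory costs LOAD_CYCLES (both values are
// illustrative, not measurements).
struct MultiContextRU {
    static constexpr int NUM_CONTEXTS = 4;
    static constexpr int LOAD_CYCLES  = 1000;
    std::vector<int> cached;   // task ids whose configurations are cached
    int active = -1;           // index of the currently active context

    // Activate the configuration of `task`; returns the cycles spent.
    int activate(int task) {
        for (std::size_t i = 0; i < cached.size(); ++i) {
            if (cached[i] == task) { active = (int)i; return 1; }
        }
        // Not cached: use a free slot, or naively evict slot 0.
        if ((int)cached.size() < NUM_CONTEXTS) {
            cached.push_back(task);
            active = (int)cached.size() - 1;
        } else {
            cached[0] = task;
            active = 0;
        }
        return LOAD_CYCLES;
    }
};
```

With four contexts, a working set of up to four tasks switches in one cycle each; a fifth task forces a reload, which is exactly the run-time context utilization problem referred to above.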

[Figure: fine-grained vs. coarse-grained RU layouts; each comprises logic, intermediate data (memory elements) and one or more configuration contexts]

Fig. 1. The impact of logic granularity on the chip area of reconfigurable architectures.

Another type of reconfigurable architecture assumes that the RU is shared in space [10], [7], as shown in Figure 2B. The RU is partially reconfigurable and large-scale, thus several tasks can run on the same RU without conflicting with each other. Besides reconfiguration latency, this class of architectures leads to complicated inter-task communication and resource management. Since a task can be allocated at any free location on the RU during run-time, data traffic between tasks goes through non-deterministic paths, possibly requiring dynamic routing. For a large programmable array, the complexity of performing task placement and data routing at run-time can be very hard to handle. Also, it is clear that fragmentation is a common issue for this kind of design, thus task (context) reallocation and rerouting is consistently required for defragmentation. In summary, the behavior and efficiency of such a system can be very unpredictable, and understanding the system behavior in the early development stage is crucial.

[Figure: tasks T1-T5 mapped onto reconfigurable units]

A. RU shared in time B. RU shared in space

Fig. 2. Reconfigurable unit design

A third type of RU is a hybrid of the two former families. This type of architecture is one of the main focuses of our work, and it will be introduced in Section IV. The architecture is viewed as an array of networked multi-context RUs. Such a system also requires efficient dynamic resource management, but the routing problem is greatly simplified compared to space-shared architectures.

In general, we are facing the increasing complexity of the reconfigurable system's spatial and temporal behavior. New technologies that improve the system's efficiency also complicate the architecture, and the value of the tradeoff between performance and design complexity is not easily assessable. A system-level simulator is much needed for evaluating the performance of dynamically reconfigurable systems. Such a simulator should give the designer the opportunity to tune various design parameters and to study the consequences on system performance.

To build a system-level simulator, we need a thorough understanding of how to model the tasks which comprise the application. A task running on a reconfigurable architecture has a different execution behavior than the classical real-time task, thus the classical model does not fit our purpose. For the simulator design, we need i) a general and generic model of the RU which can represent various types of coprocessor-coupled RU designs, ii) a way to address the dynamic resource management issue of the RU, and iii) a parameterizable simulation, so that the consequences of changes in the physical design can be captured within the model.

The ARTS modelling framework captures the real-time behavior of heterogeneous multiprocessor systems, where each processor may run its own operating system. In our work, we adopt the underlying message-passing based mechanism of the ARTS model and some of its RTOS functionality. We extend it


further to support the modelling of dynamically reconfigurable systems. In particular, our model, unlike ARTS, supports task reconfiguration and reallocation during run-time, i.e. during simulation.

III. TASK MODEL

In the ARTS framework, an application is modelled as a set of task graphs, and each task is modelled as a finite state machine (FSM), as shown in Figure 3. The state transitions of a task are driven by the operating system control messages. Whether a task should run, or be preempted, depends on the resource allocation, scheduling and task dependencies. But for a reconfigurable system, such an FSM is not sufficient to capture the task execution scenario.

A. task graph B. task FSM

Fig. 3. ARTS task model

Firstly, to initialize the task execution, the initial configuration needs to be loaded from the configuration memory to the RU. Depending on the RU's granularity, size and memory interfacing, the timing cost of fetching the whole configuration can be a big overhead. To explicitly express this task execution phase, a new state init_config has been added to the task model, as shown in Figure 4.

Secondly, preemption is not simply a process of a task giving up an RU to other tasks, but is also a process of hardware context switching. Unlike software context switching, which mostly involves backing up special-purpose registers and bookkeeping the operating system management entries, hardware context switching needs to back up the configuration and all the intermediate data stored in all the memory elements. The timing cost of preemption can be extremely high if the context is stored in external memory, or as low as a single clock cycle if the RU has multi-context support. Architecture designers will experiment with various combinations of context storage designs in order to find an optimal strategy, thus the reconfiguration latency may vary. In our model, we added two delay states, reconfig_preempt and reconfig_run, to represent the timing cost of preemption.

Finally, the effect of tasks migrating among multiple RUs needs to be modelled. As shown in Figure 4, we add the reallocation state realloc at three places, marked with dashed circles, in order to emphasize that this single state has multiple entry points and exit points. It is modelled so because reallocation can happen anytime after a task leaves the idle state, and at different points in time the reallocation has a different effect on task execution. If the reallocation is started before a task has run for the first time, the task needs to be initialized on a different RU. In this case, the (partial) context of the reallocated task is moved to another RU, where it resumes its previous state to either continue initializing or wait for permission to run. If the task has run before, the reallocation must end with the task going to the preempt state. The reason for this setup is that the task does not know whether it can continue executing after the reallocation, since the resource status of the reallocated RU is unknown. It is safe for a task to preempt itself and request permission from the resource management unit to continue execution.

Fig. 4. Real-time reconfigurable task model
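The extended state machine can be sketched as follows. This is a minimal illustration of the states discussed above; the exact transition set of Figure 4 is richer, the event names are our own, and realloc is simplified to resume in the ready state rather than in the exact previous state.

```cpp
#include <cassert>

// States of the reconfigurable task model (after Figure 4); the delay
// states reconfig_run/reconfig_preempt model hardware context switching.
enum class State { Idle, InitConfig, Ready, ReconfigRun, Run,
                   ReconfigPreempt, Preempted, Realloc };

// Illustrative OS control messages driving the FSM (names assumed).
enum class Event { Start, ConfigDone, RunGranted, ReconfigDone,
                   Preempt, Reallocate, ReallocDone, Finish };

// `ranBefore` records whether the task has executed, which decides
// where a reallocation must resume.
State step(State s, Event e, bool& ranBefore) {
    switch (s) {
    case State::Idle:        if (e == Event::Start)      return State::InitConfig; break;
    case State::InitConfig:  if (e == Event::ConfigDone) return State::Ready;
                             if (e == Event::Reallocate) return State::Realloc; break;
    case State::Ready:       if (e == Event::RunGranted) return State::ReconfigRun;
                             if (e == Event::Reallocate) return State::Realloc; break;
    case State::ReconfigRun: if (e == Event::ReconfigDone) { ranBefore = true; return State::Run; } break;
    case State::Run:         if (e == Event::Preempt)    return State::ReconfigPreempt;
                             if (e == Event::Reallocate) return State::Realloc;
                             if (e == Event::Finish)     return State::Idle; break;
    case State::ReconfigPreempt: if (e == Event::ReconfigDone) return State::Preempted; break;
    case State::Preempted:   if (e == Event::RunGranted) return State::ReconfigRun;
                             if (e == Event::Reallocate) return State::Realloc; break;
    case State::Realloc:
        if (e == Event::ReallocDone)
            // A task that has run must preempt itself and re-request
            // permission; otherwise it resumes waiting/initializing.
            return ranBefore ? State::Preempted : State::Ready;
        break;
    }
    return s; // ignore events that do not apply in the current state
}
```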

IV. COPROCESSOR COUPLED ARCHITECTURE MODEL

As shown in Figure 2, architecture design is heading in two directions. Besides the aforementioned resource management issue, time-shared architectures also suffer from a scalability issue, since parallelizing a task to use the full RU gets harder as the RU's size increases. Similarly, the space-shared architecture's defragmentation gets harder as the RU scales up, and the rerouting becomes impossible to handle at run time. Unless the RU is partitioned and modularized, the space-shared architecture has too many practical issues to be realized.

To address the problems of both types of reconfigurable architecture, we propose a hybrid architecture model. As shown in Figure 5, our coprocessor consists of an array of homogeneous multi-context RUs connected by an on-chip network (NoC). Similar to the time-shared design, the computation resource of our architecture is still the context of


the RUs. By statically partitioning the applications into tasks, each of which is small enough to fit into one RU's context, or one resource, we can explore the application's parallelism at different levels and efficiently utilize the potential of the coprocessor. To ease the dynamic resource management, we assume that tasks never share one RU in space, even if several tasks could fit into one RU at the same time. This guarantees that each task can be reallocated without interfering with the execution of the other tasks.

Fig. 5. Hybrid coprocessor-coupled reconfigurable system

As an architecture design, our proposal has several advantages compared to many previous designs. Our coprocessor is scaled up by increasing the number of RUs, and thus has more flexibility to efficiently support tasks of various complexity. With the support of the NoC, the rerouting problem can be solved on the fly while the tasks are communicating. Since the coprocessor is modularized, defragmentation is not as crucial an issue as in previous space-shared designs. When combined with our resource management strategy, which will be discussed later, our coprocessor is highly scalable.

As a model, our model can easily be used for design space exploration of both time-shared and space-shared architectures. To model a time-shared architecture, by assuming the number of RUs to be one, our model imitates a multi-context architecture. To model a space-shared architecture, by assuming all the RUs to be single-context, our model can be viewed as a modularized space-shared RU. By employing a NoC and assuming that any task can be allocated to a randomly selected RU, the dynamic placement and routing issues of space-shared architectures become much easier to address. However, the homogeneity of the RUs adds an extra compile/synthesis resource constraint.

Resource management is still a problem for our architecture, since the run-time system needs to manage tasks in both space and time. For a small-scale coprocessor, the CPU/operating system can be used to manage the resources. But if the system reaches a certain scale, it is foreseeable that taking a snapshot of the whole coprocessor's resource distribution, evaluating it and allocating/reallocating tasks using the CPU can become a performance bottleneck. Here we introduce our alternative to address this issue.

Fig. 6. Hierarchical organization of reconfigurable units

As shown in Figure 6, some nodes in the coprocessor are selected as Coordinator nodes (C-nodes) or Master nodes (M-nodes), and the rest are Slave nodes (S-nodes). By structuring the whole design into hierarchies, the resource management is distributed among the different roles each type of node plays.

C-nodes are the resource management nodes. Each time an application is started by the CPU, all C-nodes send a message to the lower hierarchy for a resource check. The M-nodes then collect the weighted resource distribution status from the S-nodes and pass it to the C-nodes. The C-nodes, all of which run the same decision-making protocol, then select a resource-optimal M-node to initialize and synchronize the application's execution.

M-nodes are the task execution management nodes. After the C-nodes assign an application to an M-node, the M-node reallocates the currently running tasks to free up some resources if the new application has a higher priority. The M-node then initializes the new application's tasks on free resources and starts its execution. During execution, depending on the task dependencies and priorities, the M-node can reallocate the tasks or preempt the task execution. C-nodes and M-nodes form a cluster. An M-node is only controlled by the C-nodes in its cluster, thus any message received from other C-nodes will be ignored.

S-nodes are the computation units. When a task is allocated to an S-node, the task can be blocked or selected for execution, depending on its priority or deadline. The node keeps track of how many resources are currently in use and how many are still available. A multi-context S-node is not bound to one specific M-node. As long as it gives optimal results, contexts on an S-node can be shared among all M-nodes.

The tasks of a certain application are distributed on the

S-nodes near one M-node selected by the C-nodes. The higher priority an application has, the more effort the M-nodes will put into clustering its tasks, in order to lower the communication cost. Lower-priority applications' clusters can be disrupted by the M-nodes when a new application with a higher priority is started. Careful placement of clusters can help achieve overall system optimality, and is thus crucial in our approach.

Our hierarchical design represents our general resource

management strategy, but we do not enforce a physical binding between the function of a resource/task management unit and an RU, except for the S-nodes. For instance, when the coprocessor is small, the functions of the C-node and M-nodes


can be realized by the operating system running on the CPU, or be combined into one physical RU. This gives us the freedom to model the architecture at various scales. In our experiment, priority is based on the overall communication demand of an application, but we do not constrain how task priority is defined or what allocation strategy is used. Different designers may have different preferences for specific parts of the system, and we leave these open for experimentation.

Even though it is not the focus of our work, the latency of

off-coprocessor data communication can easily be modelled in our framework. Data IO ports connected to the main memory can be modelled as S-nodes with no context limitation, and off-coprocessor data communication can be modelled as special tasks that can only be allocated on the S-nodes that imitate the coprocessor's IO ports. Given a set of coordinates for the ports, preferably on the boundary of the coprocessor, the off-coprocessor communication latency can change when task reallocation occurs, depending on the distance between the IO port and the tasks that need access to the main memory.

We will not go into the details of the NoC model design in this work, since it has been addressed by the ARTS model described in previous work [2]. In this paper, we simply assume that if two communicating tasks are k hops away on the coprocessor, the communication latency is kT, where T is the single-hop base communication latency between those tasks, decided through static analysis. The overall communication latency of an application is greatly affected by the allocation strategy, which is one of the most interesting issues to be addressed by our model.

When a task is reallocated, the context of the task is transferred from one RU to another. This process results in a burst of data transfer on the NoC in a short period of time. Compared to context transfers, inter-task communication happens much more frequently, and its data is often delivered in smaller packets. These two types of data transmission have very different requirements on the NoC design, thus we separate them onto two NoCs. The reallocation NoC is assumed to be able to establish preset paths that guarantee finishing the reallocation in a short period of time, thus the physical distance between the context transfer's source and destination should not play a significant role in the overall reallocation latency. In our model, we assume that any reallocation takes a constant period of time, and that several reallocations can take place concurrently without blocking each other. Physically, the configuration data communication could share the inter-task communication NoC, but to demonstrate the concept, we choose not to do so at the moment.
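The kT latency model above can be sketched in a few lines. The hop count here is computed as the Manhattan distance on a mesh, which is our own illustrative assumption; the paper only requires some hop metric between the RUs hosting the two tasks.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Placement of tasks on the coprocessor grid (one (x, y) per task).
struct Pos { int x, y; };

// A communicating task pair with its single-hop base latency T,
// decided through static analysis (values here are illustrative).
struct Edge { int src, dst, baseLatency; };

// Hop distance k, assuming XY (Manhattan) routing on a mesh NoC.
int hops(Pos a, Pos b) { return std::abs(a.x - b.x) + std::abs(a.y - b.y); }

// Total application communication latency: sum of k * T over all edges.
int appCommLatency(const std::vector<Pos>& place, const std::vector<Edge>& edges) {
    int total = 0;
    for (const Edge& e : edges)
        total += hops(place[e.src], place[e.dst]) * e.baseLatency;
    return total;
}
```

For a two-task placement {(0,0), (2,1)} with one edge of base latency T = 5, the model yields 3 hops x 5 = 15; moving the second task to an adjacent RU reduces this to 5, which is precisely why the M-nodes try to cluster an application's tasks.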

A final note about our architecture: compared to previous work, our approach relies more on static analysis. We assume that all the RUs are homogeneous, which implies that when an application is partitioned into a task graph, each task has to be able to optimally utilize the computation power of the RU. It is a challenge to perform DSE with several design constraints imposed by the RU design, especially if high-level synthesis is used.

V. SYSTEMC SIMULATION MODEL

The general structure of our SystemC model is shown in Figure 7. The various types of modules are organized as described in Figure 6 and connected with communication links defined in the SystemC master-slave communication library. The links drawn as solid lines are used to convey resource allocation control messages, while the links drawn as dashed lines are for task execution control message passing. The critical design issue of our model is to support task allocation, task execution and task reallocation.

[Figure: CPU, C-node, M-node and S-node modules connected by links L1-L5; solid lines: resource management message paths; dashed lines: task execution message paths]

Fig. 7. COSMOS model structure

A. Task allocation

When the CPU requests to execute an application, it sends out a message that includes the header description of the application to all the C-nodes through link L1. The information contained in the header is the application's allocation priority, default distribution requirement, distribution matrix and application size. Applications with a higher allocation priority can force tasks with a lower allocation priority to be reallocated and give up resources. The default distribution requirement is an integer that specifies the number of S-nodes needed for optimal allocation of an application. For instance, the task graph in Figure 3A will be optimally executed if it is allocated on 2 S-nodes, due to its task-level parallelism. The distribution matrix specifies how the tasks are divided into groups, each of which should be allocated on the same S-node. For example, the task graph in Figure 3A can have a distribution matrix = [[T1, T3, T4], [T2, T5]]. This indicates that, in order to optimally utilize the task-level parallelism and minimize the communication cost, the M-node should attempt to allocate tasks 1, 3 and 4 on one S-node, and allocate tasks 2 and


5 on the other one. Application size simply stands for howmany tasks the application has been partitioned into.Upon receiving the message that requests C-nodes to start

up an application, each C-node will further send requestthrough link L2 to the M-nodes in their clusters to exam theresource distribution. M-nodes send the request further downto S-nodes through link L4, and each S-node reports how manyfree context it has to M-nodes through link L5.
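To make the header description concrete, the following C++ sketch shows the four pieces of information it carries, using application 1's running example from Figure 3A. The record and field names are ours, chosen for illustration; the paper only names the four items.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the application header broadcast over link L1.
struct AppHeader {
    int alloc_priority;        // higher priority can displace running tasks
    int default_distribution;  // number of S-nodes for optimal allocation
    std::vector<std::vector<std::string>> distribution_matrix; // task groups
    int app_size;              // number of tasks after partitioning
};

// The Figure 3A example: tasks T1, T3, T4 share one S-node, T2 and T5 another.
inline AppHeader figure3a_header() {
    return AppHeader{1, 2, {{"T1", "T3", "T4"}, {"T2", "T5"}}, 5};
}
```

Each inner list of the distribution matrix is one group of tasks that the M-node should try to keep on a single S-node.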

At this point, each M-node has an updated resource distribution map of the whole coprocessor, and needs to evaluate whether it is resource-abundant. An application is optimally allocated if its tasks are nested into a cluster, so having clustered free resources around an M-node eases the allocation for that M-node. Depending on the resource distribution around a specific M-node, the resources are weighed for that M-node. Another factor that influences the allocation is the reallocation potential of each M-node. If there are many high-priority applications nested around and controlled by an M-node, reallocation will be difficult to perform for that M-node. Thus, we sum up the priorities of all the running tasks controlled by each M-node, and use this sum to downgrade the overall resource count. To summarize, each M-node weighs its resource distribution map and sums up all the weighed resources to get an overall weighed resource count; this number is then divided by the priority sum of all the running tasks.
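The evaluation above reduces to a short formula: scale each S-node's reported free contexts by a weight, sum, and divide by the priority sum of the running tasks the M-node controls. The sketch below is only an illustration consistent with this description; the concrete weighting function is not fixed by the text.

```cpp
#include <cassert>
#include <vector>

// One S-node as seen by an M-node during resource evaluation.
struct SNodeView {
    int free_contexts;   // reported over link L5
    double weight;       // e.g. larger for S-nodes clustered near this M-node
};

// Overall weighed resource count, downgraded by the priority sum of the
// running tasks this M-node controls (illustrative formulation only).
inline double weighed_resource_count(const std::vector<SNodeView>& map,
                                     int running_priority_sum) {
    double total = 0.0;
    for (const SNodeView& s : map)
        total += s.weight * s.free_contexts;
    // Guard against division by zero when no running tasks are controlled.
    return total / (running_priority_sum > 0 ? running_priority_sum : 1);
}
```

An M-node with clustered free contexts and few high-priority tenants thus ends up with the largest count, matching the intuition in the text.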

After the M-nodes calculate their final resource counts, each number is sent to the C-nodes through link L3. The C-nodes then decide which M-node has the highest amount of resources available for the application. Together with the application priority and the distribution matrix, the decision is then passed through link L2 to the selected M-node for setting up the task execution.

The selected M-node first attempts to reallocate some running tasks, whose priorities are lower than that of the new application, to free up S-nodes until there are enough free clustered contexts to allocate the newly-started application. Then the new application's tasks are allocated to the free S-nodes under the guidance of the distribution matrix. If the distribution matrix cannot be strictly followed due to resource availability, spanning several more S-nodes is allowed. After each task is allocated onto an S-node, the M-node sends a "start" message to all these tasks through link L4, the corresponding S-node and link LT3 to signal the task execution. Finally, some bookkeeping is done in the M-nodes and S-nodes.
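The placement step can be sketched as a greedy loop over the distribution matrix: each group goes onto the next S-node with free contexts, spilling onto further S-nodes when a group does not fit, as the text allows. This is a deliberately simplified illustration (no reallocation, S-nodes pre-sorted by preference), not the COSMOS implementation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Greedy sketch: place each task group from the distribution matrix onto the
// next S-node that still has free contexts; spill when a group overflows.
inline std::map<std::string, int> allocate(
        const std::vector<std::vector<std::string>>& groups,
        std::vector<int> free_contexts) {            // per S-node, by index
    std::map<std::string, int> placement;            // task -> S-node index
    std::size_t node = 0;
    for (const auto& group : groups) {
        for (const auto& task : group) {
            while (node < free_contexts.size() && free_contexts[node] == 0)
                ++node;                              // spill to next S-node
            if (node == free_contexts.size()) return placement; // out of room
            --free_contexts[node];
            placement[task] = static_cast<int>(node);
        }
        if (node + 1 < free_contexts.size()) ++node; // next group, next node
    }
    return placement;
}
```

With application 1's matrix and two dual-context S-nodes, the sketch keeps T0_0/T0_1 together on one node and T0_2/T0_3 on the other, as the scenario in Section VI describes.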

B. Task execution control

The task execution in COSMOS is essentially the same as in the ARTS multiprocessor model. Task execution is controlled through the synchronizer in the M-node and the scheduler in the S-node, as shown in Figure 7. In COSMOS, we adopt the well-understood direct synchronization (DS) protocol and earliest-deadline-first (EDF) scheduling for our initial experimentation.

A task interacts with the run-time system in a similar way as in the ARTS task model. When a task is ready for execution, it sends a "ready" message to the synchronizer through link LT1. When the synchronizer and the scheduler permit the task execution to start, the task receives a "run" message through link LT3 and starts executing. While the task is in the "run" state, it can be preempted by the scheduler at any point in time. Similarly, when the task is in the "preempted" state, the scheduler can issue a "resume" message to let the task continue executing.

The synchronizer acts as a task dependency filter. Its purpose is to block the execution of those tasks that have unresolved dependencies on their preceding tasks. When a task is initialized and has requested initial execution through link LT1, the synchronizer will immediately block the task if there are unresolved data dependencies. Every time a task finishes its execution, the synchronizer checks whether any blocked task's dependencies are now completely resolved so that it is ready for execution. The M-node is selected to perform the synchronization since it has control of the whole application.

When a task is released by the synchronizer, a message is sent from the synchronizer to the scheduler through link LT2. The EDF scheduler then decides whether the task should start execution on the S-node immediately or be blocked until the currently running task has finished, depending on whose deadline arrives earlier. If the task released by the synchronizer has a tighter deadline than the currently running task, the currently running task is preempted and blocked in a task list. Once the running task has finished its execution, the blocked task with the earliest deadline is selected for execution.
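The scheduler's two decisions, whether to preempt on release and which blocked task to run next, are plain EDF. The following sketch is our own simplified rendering (it ignores the reconfiguration states and message plumbing):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Task { int id; long deadline; };

// EDF decision on task release: preempt the running task only if the newly
// released task's deadline is strictly earlier.
inline bool should_preempt(const Task& running, const Task& released) {
    return released.deadline < running.deadline;
}

// When the running task finishes, pick the blocked task with the earliest
// deadline and remove it from the blocked list (list must be non-empty).
inline Task pick_next(std::vector<Task>& blocked) {
    auto it = std::min_element(blocked.begin(), blocked.end(),
        [](const Task& a, const Task& b) { return a.deadline < b.deadline; });
    Task next = *it;
    blocked.erase(it);
    return next;
}
```

Using a strict comparison in `should_preempt` means a task released with an equal deadline does not displace the running task, which avoids needless preemption overhead.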

C. Task reallocation

As mentioned before, task reallocation can occur at any time between the start of the task's initialization and the end of its execution. The reallocation basically involves putting the task into the reallocation state for a period of time and updating the task model's information about the S-node onto which it is reallocated. The reallocation of a task goes through several different scenarios, depending on the point in time at which the reallocation is initiated.

The first possibility is the case where a task is reallocated during initialization. In this case, the task has not yet been blocked by either the synchronizer or the scheduler, and the task goes back to initialization right after the reallocation is finished.

The second possible case is the situation where a task is reallocated while it is in the ready state. In this case, the task might be blocked by either the scheduler or the synchronizer. When the task goes into the reallocation state, the synchronizer or scheduler that blocks the reallocated task needs to clean up the record of blocking. When the task finishes the reallocation, it sends the "ready" message to the synchronizer again to be processed and goes into the "ready" state again. It is worth noting that, if a task is blocked by the synchronizer when the reallocation starts, it is not necessarily still blocked by the synchronizer when the reallocation is finished, since the task dependency can be resolved during the reallocation process.


The last possibility is the case where a task has been partially executed before the reallocation. The task can then only be blocked by the scheduler, or not be blocked at all. If the task is blocked by a scheduler, the scheduler also needs to clean up the record of the task being blocked. After the task is reallocated, it goes into the "preempted" state and sends a "ready" message to the synchronizer, which directly passes the message on to the scheduler of the S-node where the task has been reallocated, since the task dependencies were already resolved.
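The three scenarios above collapse into one transition rule: the state a task resumes in after "realloc" is determined by the state it was taken from. A minimal sketch, using the state names from the waveform enumeration of Figure 9A (the transition function itself is our condensation, not COSMOS code):

```cpp
#include <cassert>

// Task states as enumerated in Figure 9A's waveform legend.
enum class State { InitConfig, Ready, Run, Preempted, Realloc };

// State a task resumes in once its reallocation finishes:
//  - reallocated during initialization -> back to initialization;
//  - reallocated while ready (possibly blocked) -> "ready" again, after the
//    blocking synchronizer/scheduler record is cleaned up;
//  - reallocated after partial execution -> "preempted", with the fresh
//    "ready" message forwarded straight to the scheduler.
inline State after_realloc(State before) {
    switch (before) {
        case State::InitConfig: return State::InitConfig;
        case State::Ready:      return State::Ready;
        case State::Run:
        case State::Preempted:  return State::Preempted;
        default:                return before;  // realloc does not nest
    }
}
```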

D. NoC model and communication tasks

The ARTS framework explicitly models the communication latency between tasks if the tasks are allocated on different processing elements. As shown in Figure 8, communication between tasks is treated in two different ways. A local communication inside a processor, e.g. the dependency between T1 and T3, is assumed to have no timing cost, and the dependency is implicitly resolved by the local synchronizer. But communication between tasks allocated on different processors is transformed into communication tasks with explicit execution times, e.g. task c1_2. A NoC scheduler is used, as shown in Figures 8 and 7, to handle the communication latency and the NoC scheduling strategy. The communication latencies are decided before simulation time.

[Figure 8 illustrates the ARTS communication task: a dependency between tasks on the same CPU stays local, while a dependency between CPU1 and CPU2 becomes an explicit communication task scheduled on the NoC, with each CPU holding its own synchronizer and scheduler.]

Fig. 8. ARTS communication task

In the COSMOS model, since tasks can be reallocated at simulation time, any task dependency can become a communication task. Furthermore, the communicating source and destination are not fixed on the RU array, so both the physical system and the model must have a varying communication latency. In our model, we assume all task dependencies to be communication tasks, and each communication task has a base latency. Each time a task is reallocated, the communication tasks linked to the reallocated task update their communication source or destination coordinates, depending on how the task is linked to the communication task. If a communication task's source and destination are allocated on the same S-node, the communication finishes in one simulation clock cycle, which is negligible. If the source and destination of a communication task are not allocated on the same S-node, the communication latency is the product of the base communication task latency and the number of hops between the source and destination S-nodes.
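The varying latency thus reduces to one expression: base latency times hop count, with same-node communication collapsing to a single cycle. A sketch, assuming Manhattan (XY) hop counting on the RU array, which the paper does not fix explicitly:

```cpp
#include <cassert>
#include <cstdlib>

struct Coord { int x, y; };

// Manhattan hop count between two S-nodes on the RU array (our assumption;
// the text only speaks of "the number of hops").
inline int hops(Coord a, Coord b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

// Latency of a communication task after reallocation has updated its
// endpoint coordinates: one cycle on the same S-node, otherwise the base
// latency multiplied by the hop count.
inline int comm_latency(Coord src, Coord dst, int base_latency) {
    int h = hops(src, dst);
    return h == 0 ? 1 : base_latency * h;
}
```

With the two-cycle single-hop base latency used in Section VI, a dependency spanning a diagonal neighbour (two hops) would cost four cycles, while a co-located one costs a single cycle.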

VI. SIMULATION RESULTS

To demonstrate the function of the model, we set up the architecture and application as shown in Figure 9. The architecture is a 3x3 RU array with one C-node, one M-node and seven S-nodes, each of which supports dual contexts. Applications 1, 2 and 3, whose task graphs are shown in Figure 9C, start their execution at t=T1, T2 and T3, respectively, as shown in Figure 9A. Application 1 is assigned a slightly earlier deadline than the other two applications for demonstrative purposes. We assume that all communication tasks have a single-hop latency of two clock cycles, and that the NoC scheduler can only handle one communication message at a time. The latencies for task initial configuration and task reallocation are assumed to be 5 cycles. The latencies of a task staying in the reconfig_preempt state and the reconfig_run state are assumed to be 3 cycles. All the numbers presented here, including the size of the architecture and the various timing figures, are for demonstration purposes only and serve to help readers understand the function of the model. COSMOS is a flexible model, and there are no constraints on how these numbers can be decided.

An optimal system's reallocation strategy should minimize the occurrence of task reallocation while keeping the overall communication overhead small. But for our experiment, in order to demonstrate the scenario of task reallocation with a simple setting, we select a reallocation strategy that is far from optimal. We define the M-node to be the only cluster center for all three applications. When each application is initialized, its tasks are allocated as close to the M-node as possible, causing lower-priority tasks to be reallocated onto S-nodes farther away from the M-node. This is achieved by weighing the resources by the distance between the S-node and the M-node during resource evaluation, selecting the most resource-optimal S-node for allocation and the second-most resource-optimal S-node for reallocation.

At t=14, the CPU requests to start application 1. The M-node first checks the application's distribution matrix for allocation guidance. According to application 1's distribution matrix, which suggests that tasks T0_0 and T0_1 should be allocated on the same RU, the M-node initializes both tasks on S-node (0,2). Tasks T0_2 and T0_3 are both allocated on S-node (1,1) for the same reason. After the tasks finish their initialization and get ready for execution, only task T0_0 goes into the "run" state, since it is the only task without unresolved dependencies. All the other tasks are blocked by the synchronizer for the time being.

At t=22, application 2 is initialized. Since application 2 has the same priority as application 1, it does not cause any task reallocation. At t=30, application 3 starts its initialization. Since this application has a higher allocation priority, previously allocated tasks have to be reallocated onto more remote S-nodes. As shown in Figure 9B, tasks T0_0, T0_1 and T0_2 are replaced by tasks T2_0, T2_1 and T2_2, respectively. From the waveform, we can see that the reallocated tasks enter and exit the "realloc" state at the same time. Since task T0_0 is running when being reallocated, after it finishes the reallocation, it

[Figure 9 spans three panels. Panel A shows the simulation waveform, with states enumerated as 0=idle, 1=init_config, 2=ready, 3=run, 4=reconfig_preempt, 5=reconfig_run, 6=preempted, 7=realloc. Panel B shows the task placement on the 3x3 RU array after each application start. Panel C shows the application task graphs: application 1 with allocation priority 1 and distribution matrix [[T0_0, T0_1], [T0_2, T0_3]]; application 2 with allocation priority 1 and distribution matrix [[T1_0, T1_1], [T1_2]]; application 3 with allocation priority 2 and distribution matrix [[T2_0, T2_1], [T2_2]].]

Fig. 9. Demonstrative simulation

goes into the "preempted" state and waits for the synchronizer and scheduler to start it again, as shown in Figure 4. The other two reallocated tasks go back to the "ready" state and wait for their dependencies to be resolved.

After the reallocation, communication tasks c0_0_1 and c0_2_3 become non-local, and communication task c0_0_2's latency is increased by one hop. Communication task c2_0_1, which is made local by the distribution matrix and the reallocation, costs only one clock cycle to finish.

At t=98, task T2_2 goes through a few state changes, which are caused by task T0_3. As shown in Figure 9B, these two tasks are allocated on the same S-node. At t=86, task T2_2's dependency is resolved, and the synchronizer starts its execution. When the simulation time reaches 98, task T0_3's dependency is also resolved, and the scheduler decides that T0_3 should start its execution since it has an earlier deadline. Task T2_2 goes through a long preemption phase and returns to the "run" state after task T0_3 finishes its execution.

VII. FUTURE WORK

Our current work concentrates on the real-time behavior of the tasks being executed on the coprocessor, but the timing characteristics of the management strategy have not been thoroughly addressed. Take reallocation as an example: the decision of which task should be reallocated to which S-node is an important one. If the M-node can thoroughly analyze the current resource distribution status, the reallocation strategy will give better results. However, the decision is made by the M-node at run-time, so the latency of making the decision becomes an overhead on the task execution as well. There are trade-offs between reallocation optimality and reallocation latency, and these trade-offs are not well understood at the moment. In our model, we currently assume that all the resource management algorithms execute in zero time, which can be too optimistic for a large system. In the future, improvements can be made on this aspect of our model.

Another major issue is the methodology design. In our work, we proposed to execute tasks on a homogeneous RU array, which requires area/resource-constrained partitioning and synthesis tools to support the architecture. Not only does the task-level parallelism need to be analyzed when designing the task graph, but in order to optimally utilize RUs of the given size and resources, we also have to investigate the loop-level parallelism. Furthermore, task partitioning has a direct impact on the inter-task communication latency, which is a crucial factor in task execution efficiency. In the future we will study benchmarks and investigate how challenging it is to partition an application into equal-sized hardware tasks in order to optimally utilize our architectures.

From the application allocation/execution scenario, we can identify plenty of issues to be addressed in the future. The strategies for allocation, scheduling, reallocation etc. are open for further study, and the architecture design issues in terms of the NoC and the C/M-node design are still not thoroughly addressed. We are currently looking into programmable soft processors for the C/M-node implementation, which would give the possibility of reconfiguring a physical S-node into a C-node or an M-node, to achieve self-repair.

VIII. CONCLUSION

We presented a general real-time execution model for tasks that are executed on a reconfigurable architecture. We then proposed a general system-level real-time simulation model of coprocessor-coupled reconfigurable architectures. Our simulation framework is highly scalable, and can be extended to support various run-time management algorithms and communication strategies. Through our simulation, we demonstrated how our simulator can be used for studying the system-level design, and pinpointed which architecture design issues can impact the application execution performance.

Reconfigurable systems are usually highly complicated to analyze at an early development stage, and simulation tool support is crucial in order to understand the interplay between the architecture, the application and the run-time management system. Our future work will focus on the run-time management system development and the design space/platform space exploration of reconfigurable systems with our framework.

REFERENCES

[1] J. Madsen, K. Virk and M. J. Gonzalez, "A SystemC-based Abstract Real-Time Operating System Model for Multiprocessor System-on-Chips," in Multiprocessor System-on-Chips, Morgan Kaufmann, 2004, pp. 283-311.

[2] J. Madsen, S. Mahadevan and K. Virk, "Network-Centric System-Level Model for Multiprocessor System-on-Chip Simulation," in Interconnect-Centric Design for Advanced SoC and NoC, Springer, 2004.

[3] K. Compton and S. Hauck, "Reconfigurable computing: A survey of systems and software," ACM Computing Surveys, pp. 171-210, 2002.

[4] B. Mei, S. Vernalde, D. Verkest, H. De Man and R. Lauwereins, "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix," International Conference on Field-Programmable Technology, Hong Kong, December 2002, pp. 166-173.

[5] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin and B. Hutchings, "A reconfigurable arithmetic array for multimedia applications," ACM/SIGDA International Symposium on FPGAs, 1999, pp. 135-143.

[6] A. Sudarsanam, M. Srinivasan and S. Panchanathan, "Resource Estimation and Task Scheduling for Multithreaded Reconfigurable Architectures," Proceedings of the Tenth International Conference on Parallel and Distributed Systems (ICPADS'04).

[7] C. Steiger, H. Walder and M. Platzner, "Operating Systems for Reconfigurable Embedded Platforms: Online Scheduling of Real-Time Tasks," IEEE Transactions on Computers, vol. 53, no. 11, November 2004.

[8] S. Trimberger, D. Carberry, A. Johnson and J. Wong, "A Time-Multiplexed FPGA," Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, 1997, pp. 22-28.

[9] H. Liu and D. F. Wong, "A Graph Theoretic Optimal Algorithm for Schedule Compression in Time-Multiplexed FPGA Partitioning," IEEE/ACM International Conference on Computer-Aided Design, 1999, pp. 400-405.

[10] Xilinx, "Two Flows for Partial Reconfiguration: Module Based or Difference Based," XAPP290, v1.2, September 9, 2004. www.xilinx.com

[11] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo and R. Guerrieri, "A VLIW processor with reconfigurable instruction set for embedded applications," IEEE Journal of Solid-State Circuits, vol. 38, no. 11, November 2003, pp. 1876-1886.
