SDG2KPN: System Dependency Graph to Function-level KPN … · 2014-05-11

SDG2KPN: System Dependency Graph to Function-level KPN generation of Legacy Code for MPSoCs

Jude Angelo Ambrose, Jorgen Peddersen, Sri Parameswaran
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
e-mail: {ajangelo, jorgenp, sridevan}@cse.unsw.edu.au

Alvin Labios, Yusuke Yachide
Canon Information Systems Research Australia (CiSRA), Australia, NSW 2052
e-mail: {Alvin.Labios,Yusuke.Yachide}@cisra.canon.com.au

ABSTRACT

The Multiprocessor System-on-Chip (MPSoC) paradigm as a viable implementation platform for parallel processing has expanded to encompass embedded devices. The ability to execute code in parallel gives MPSoCs the potential to achieve high performance with low power consumption. In order for sequential legacy code to take advantage of the MPSoC design paradigm, it must first be partitioned into data flow graphs (such as Kahn Process Networks, or KPNs) to ensure the data elements can be correctly passed between the separate processing elements that operate on them. Existing techniques are inadequate for use in complex legacy code. This paper proposes SDG2KPN, a System Dependency Graph to KPN conversion methodology targeting the conversion of legacy code. By creating KPNs at the granularity of the function/procedure level, SDG2KPN is the first of its kind to support shared and global variables as well as many more program patterns and application types. We also provide a design flow which allows the creation of MPSoC systems utilizing the produced KPNs. We demonstrate the applicability of our approach by retargeting several sequential applications to the Tensilica MPSoC framework. Our system parallelized AES, an application of 950 lines, in 4.8 seconds, while H.264, of 57896 lines, took 164.9 seconds to parallelize.

1. INTRODUCTION

Parallel processing has been touted as a method to accommodate the demand for functional complexity by using increases in the availability of parallel resources. To exploit parallelism, the Multiprocessor System-on-Chip (MPSoC) paradigm has emerged as a viable implementation platform within embedded systems. In these MPSoC systems, multiple processors share the workload of an application by executing parts of the application (i.e., tasks) on separate processors to achieve better performance than executing the application on a single processor [5, 6, 7]. Applications designed for execution on an MPSoC are typically created with the tasks of each processor predetermined to communicate data between tasks executing on other processors. Thus, the data to be transmitted is known when the code is written [15].

Data communication between tasks is rarely considered when writing sequential code intended to execute on a single processor. Hence, code written in this manner (i.e., legacy code) requires analysis and/or a manual redesign before it can be split to execute on multiple processors, to determine which data must be transferred and when [4]. This analysis typically involves conversion to a data flow graph (DFG) [4, 15], where the application is partitioned into communicating tasks/functions. Tasks from these DFGs are mapped to multiple processors in the MPSoCs to improve performance, by exploiting parallelism [19], pipelining [6] and time multiplexing of tasks from different applications [15]. The Kahn Process Network (KPN) is considered a suitable and general model for a DFG [4, 19], consisting of concurrent processes/tasks/nodes that are able to communicate data. When modifying existing code to utilize parallelism, a designer is required to rewrite code to suit the system architecture and communication patterns, or they require an in-depth understanding of all parts of the original code and the interaction between its various components to construct the KPN of the application. This paper provides techniques to automate this process of creating a KPN from legacy code, making it easier and faster for a designer to modify existing code to make use of newer processing environments.

Previous methods for KPN generation from legacy code leave several key challenges unaddressed: 1) techniques [9, 13, 18] and tools (such as COMPAAN [9] and KPNGen [13]) exist to generate a KPN only from a restricted piece of code; these techniques support static affine nested loops and require manual modifications to the legacy code [4, 1], a manual analysis that is tedious and time consuming; 2) there was no method to efficiently detect variable dependencies across functions; 3) existing register- and instruction-level techniques to create dependence graphs [14] are restricted to specific hardware platforms, making them less portable; and 4) the state-of-the-art tools require significant manual modifications to the legacy code (i.e., a talented designer must rewrite or heavily modify most of the code for it to work). Hence, reducing the amount of manual intervention will enable more legacy applications to be ported to MPSoCs.

For the first time, we propose the use of the System Dependency Graph (SDG) in order to detect dependencies between functions and/or tasks for KPN generation of legacy code. The SDG is a directed graph which represents inter-function and sub-function dependencies, at program point granularity, of legacy sequential code [20, 10]. SDG2KPN allows analysis of shared variables, unlike previous approaches. SDG2KPN performs automated analysis of the code and detects the usage of global variables between different functions. Finding variable usage between functions allows existing code to be implemented within a multi-processing environment in a fraction of the time of existing approaches, as the user only needs to identify potentially important functions and variables within the code rather than studying the interoperation of the functions and determining data flow manually, both of which would require the user to read and understand most or all of the legacy code. For simplicity, this paper focuses only on data dependencies between functions. The process can be extended to be applied at the granularity of basic blocks, although the extensions to achieve this are outside the scope of this paper.

Figure 1 depicts an example of our KPN generation methodology. The legacy code, shown in Figure 1(a), is received as an input.

978-1-4799-2816-3/14/$31.00 ©2014 IEEE 267

4S-1

[Figure 1 graphic: (a) legacy code; (b) System Dependency Graph; (c) output partitioned code; (d) MPSoC platform]

Figure 1: An Example KPN Generation Methodology

An intermediate representation of this code is produced in the form of a System Dependency Graph (SDG), as shown in Figure 1(b). As shown in the example SDG, the main function (referred to as the parent function) is decomposed into program points/vertices, representing all the statements of the code and their control and data dependencies. The goal of our SDG2KPN approach is to create a KPN for the parent function (i.e., main) by partitioning the add() function call and the sub() function call and identifying the variable dependencies between these function calls. The functions envisaged for partitioning are referred to as candidate functions. The add() and sub() candidate function calls (i.e., call-sites) are converted to KPN processes, main_add and main_sub, which transfer data within variables h and e, as shown in Figure 1(c). As shown in the example, the sub() call-site depends on the add() call-site for variables e and h. The variable e is updated by the add function and read by the sub function, whereas the variable h is passed by reference, modified by add() and passed to sub() for reading. Finding dependencies of shared global variables, such as h, across candidate functions is a challenge which has not been addressed before, with previous techniques only looking at explicitly defined data flow specified in the function's parameters and return value. We provide an automated rule-based SDG analysis to detect such shared variable dependencies (including all the other variable dependency types). Such dependencies necessitate a communication link (such as FIFOs, DMAs, shared memory, etc.) for transferring these variables from add() to sub(). Once a KPN is generated, we propose a design flow to generate partitioned code to execute in an MPSoC platform (as shown in Figure 1(d)). For example, the generated code shown in Figure 1(c) contains two main functions, main_add() and main_sub(), which will be executed on two different processors. Variables are transferred from main_add() using send commands, which write the contents of the variables to the FIFO, and are received at main_sub() using receive commands, which copy the contents into appropriate variables. Figure 2 depicts an example streaming application (i.e., MPEG-2 encoding) where several candidate functions are generated into KPN processes (shown right). The initialization functions/code, i.e., those that appear outside of the candidate functions, are either mapped to the first processor or distributed to their respective processes of the KPN. The processes of the generated KPN are mapped to the processors in the MPSoC platform for parallel execution.
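The partitioning described above can be sketched in C as follows. This is a minimal illustration and not the code of Figure 1(c): the bodies of add() and sub(), the helper names fifo_send() and fifo_receive(), and the in-memory buffer standing in for the hardware FIFO are all assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the FIFO between the two
 * processes; a real MPSoC would use a platform-specific channel. */
enum { FIFO_BYTES = 64 };
static unsigned char fifo[FIFO_BYTES];
static size_t fifo_head, fifo_tail;   /* read and write offsets */

/* send: write the contents of a variable to the FIFO. */
static void fifo_send(const void *var, size_t len)
{
    assert(fifo_tail + len <= FIFO_BYTES);
    memcpy(fifo + fifo_tail, var, len);
    fifo_tail += len;
}

/* receive: copy the next bytes of the FIFO into a variable. */
static void fifo_receive(void *var, size_t len)
{
    assert(fifo_head + len <= fifo_tail);
    memcpy(var, fifo + fifo_head, len);
    fifo_head += len;
}

/* Assumed candidate functions (bodies are illustrative only):
 * add() returns e and modifies h through a pointer. */
static int add(int a, int b, int *h) { *h += 1; return a + b; }
static int sub(int e, int b, const int *h) { return e - b - *h; }

/* KPN process 1: runs add() and forwards the variables sub() needs. */
void main_add(void)
{
    int a = 20, b = 5, h = 0;
    int e = add(a, b, &h);
    fifo_send(&e, sizeof e);
    fifo_send(&h, sizeof h);
}

/* KPN process 2: receives e and h, then runs sub(). */
int main_sub(void)
{
    int b = 5, e, h;
    fifo_receive(&e, sizeof e);
    fifo_receive(&h, sizeof h);
    return sub(e, b, &h);
}
```

Here the two processes are ordinary functions called one after the other; on the target platform each would be the entry point of a separate processor, and the order of the send and receive calls is what keeps e and h consistent between them.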

The rest of the paper is organized as follows: Section 2 and Section 3 detail the related work and methodology, respectively. The results are presented in Section 4. Conclusions are provided in Section 5.

2. RELATED WORK

Several techniques and tools exist in the literature for converting sequential code to partitioned concurrent code for MPSoCs.

Figure 2: A Streaming Application Template

KPNGen [13] from the Daedalus framework allows creation of KPNs, where the user has to explicitly modify the function boundaries (parameters) to create the dependency links between the KPN processes. This leaves much of the work of determining which variables are used to the user. The passing of data via non-explicit techniques, such as global variables, is not handled by this approach.

COMPAAN [9] automatically creates a KPN process for each function of an Affine Nested Loop Program (ANLP) written in MATLAB syntax. Any legacy code written in other languages (such as C) has to be manually converted to ANLP programs in MATLAB. An array data flow analysis is performed in COMPAAN [9] at the statement level, and therefore the task boundaries cannot reside across nonlinear or data-dependent conditions [2].

The authors in [4] outline a sequential-code-to-KPN transformation methodology, utilizing the call sequence graph and the control flow graph of sequential C code. Both these graphs were proposed for the analysis to find data dependencies between functions and assumed to have been created from a modified Abstract Syntax Tree (AST). No details were provided for the analysis methodology for task partitioning in [4], and it is explicitly stated that this is a suggested approach which has not been implemented. The authors in [4] further suggest that they will modify the AST after task partitioning, which is quite a complicated and challenging endeavor.

Compared to these previous approaches, our methodology considers shared variables (such as globals and pointers) to detect data dependencies between functions, while requiring much less user input, and does not require the legacy code to be rewritten or significantly modified. Our approach can be ported to any hardware platform and ISA, since it is performed at the code level. We do not require the code to be in a static affine nested loop form, as required by [9], and we allow both pipelining and parallelization, in contrast to [14]. The closest approach to our methodology is [4], which proposes utilizing the AST, call sequence graph and control flow graph to create a KPN. The KPN is proposed to be generated from a modified AST. The AST is much finer-grained than the SDG, and hence includes a significantly larger amount of unnecessary information (i.e., many more nodes in the graph) which the analysis has to go through (if implemented). Hence our SDG-based approach is less complicated, and we demonstrate the feasibility and practicality of using the SDG in this paper.

2.1 Contributions

Some of our novel contributions are as follows:

• An SDG-based function-level KPN generation methodology is proposed. This methodology is platform agnostic and supports shared variables.

• A rule-based traversal of the SDG is proposed to detect dependencies between functions. Such a traversal optimally detects variable dependencies.

• A complete design flow is demonstrated to generate application-specific MPSoC systems.

3. SDG2KPN METHODOLOGY

Figure 3 depicts the SDG2KPN methodology, illustrating the integral components. The inputs are the legacy code and the application input data. An application-specific MPSoC system is the final output. The Networker (detailed in Section 3.2) is the component which enables the novel contributions of this paper. The remaining portions of the methodology, such as the Abstractor, Mapper, Target Generator, Simulator, Annotator, Balance Checker, Graph Optimizer and Tuner, are developed based on existing state-of-the-art approaches.

The Abstractor (detailed in Section 3.1) receives the legacy code and creates a System Dependency Graph (SDG). An initial KPN is created by the Networker by analyzing the SDG. Each process/node in the initial KPN will correspond to a candidate function. The SDG is analyzed to determine the dependencies between candidate functions, by checking which variables are read or used by other candidate functions. Once dependencies are detected between two KPN processes, a corresponding communication link is established between them, to transfer the variables. The initial KPN is then passed to the Mapper (detailed in Section 3.3), which maps the KPN to processors. An MPSoC system for the analyzed legacy application is generated by the Target Generator (detailed in Section 3.4). MPSoC simulations are carried out on the generated MPSoC system with exemplary application input data (for example, image input files for JPEG).

Figure 3: The SDG2KPN Flow

The Simulator (detailed in Section 3.5), performing cycle-accurate simulation of the MPSoC system, reveals the total time consumed by each process, which is then extracted by the Annotator (detailed in Section 3.6) to form the Load Annotated KPN. The Balance Checker (detailed in Section 3.7) evaluates the load and considers any possible optimizations that can be made to the KPN to improve performance on the target architecture (such as load balancing for pipelined systems). Such optimizations are performed by the Graph Optimizer (detailed in Section 3.8), which then generates a modified KPN. This is an iterative step, which terminates once no further performance improvement can be achieved. Once the Balance Checker has verified that the KPN has no further identifiable improvements, the Tuner (detailed in Section 3.9) executes a fine-grained adjustment to the hardware resources (e.g., processors) to further improve the performance of the MPSoC system, if possible. The final MPSoC system is output by the Tuner.

3.1 Abstractor

The Abstractor creates the SDG in three steps by: 1) creating an Abstract Syntax Tree (AST)¹ of the sequential legacy code; 2) creating the Program Dependency Graph (PDG)²; and then 3) creating the SDG by examining the dependencies across procedures/functions.

Our SDG2KPN methodology utilizes the commercial tool CodeSurfer [3] as an Abstractor to produce the SDG. As shown in the example in Figure 1, the SDG contains vertices for each program point of the code. Each vertex has a type and is attached to a variable. For example, the a=20 vertex is of type expression and the variable attached is a. The main procedure has two candidate function call-sites, add() and sub(). A parameter of a candidate function call-site is referred to as an actual-in and the return variable is referred to as an actual-out. For example, variable a is an actual-in of call-site add() whereas variable $res is an actual-out of the add() call-site. Any pointer variable or global variable will contain a global vertex: global-actual-in for variables being read and global-actual-out for variables being written using that call-site. For example, the variable &h being passed by reference creates a vertex global-actual-in for h. The arguments of the actual function are referred to as formal parameters (not shown in Figure 1). Inter- and intra-procedural edges are identifiable in the SDG (not shown). Readers are referred to [3] for more details about the SDG and its elements.
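To make the vertex vocabulary concrete, the following C sketch models a handful of SDG vertices for the add() call-site described above. The type names, the EXAMPLE table and count_inputs() are hypothetical and chosen for illustration; they do not reflect CodeSurfer's actual data model or API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Vertex kinds named in the text (the enum itself is an assumption). */
typedef enum {
    V_EXPRESSION, V_CALL_SITE,
    V_ACTUAL_IN, V_ACTUAL_OUT,
    V_GLOBAL_ACTUAL_IN, V_GLOBAL_ACTUAL_OUT
} VertexKind;

typedef struct {
    VertexKind kind;
    const char *variable;   /* variable attached to the vertex, if any */
    const char *call_site;  /* owning call-site, or NULL */
} Vertex;

/* Vertices for the add() call-site, following the prose description. */
static const Vertex EXAMPLE[] = {
    { V_EXPRESSION,       "a",    NULL  },  /* the a=20 vertex      */
    { V_CALL_SITE,        NULL,   "add" },
    { V_ACTUAL_IN,        "a",    "add" },  /* parameter            */
    { V_GLOBAL_ACTUAL_IN, "h",    "add" },  /* &h passed by pointer */
    { V_ACTUAL_OUT,       "$res", "add" },  /* return value         */
};
enum { EXAMPLE_COUNT = sizeof EXAMPLE / sizeof EXAMPLE[0] };

/* Count the actual-in and global-actual-in vertices of one call-site. */
size_t count_inputs(const Vertex *v, size_t n, const char *call_site)
{
    size_t count = 0;
    assert(v != NULL && call_site != NULL);
    for (size_t i = 0; i < n; i++) {
        int is_input = v[i].kind == V_ACTUAL_IN ||
                       v[i].kind == V_GLOBAL_ACTUAL_IN;
        if (is_input && v[i].call_site != NULL &&
            strcmp(v[i].call_site, call_site) == 0)
            count++;
    }
    return count;
}
```

Calling count_inputs(EXAMPLE, EXAMPLE_COUNT, "add") on this table finds two input vertices, a and h.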

3.2 Networker

Figure 4 depicts the component flow of the Networker. The user specification lists all the candidate function names and their line numbers from the legacy code. This allows the user to choose or override which functions are converted to candidate functions. Candidate functions can also be chosen automatically if there is no user specification. The Function Mapper component extracts these functions (which are either user specified or automatically chosen) from the SDG.

The Traverser defines the characteristics of the communication structures for the initial KPN. A Variable Dependency Graph (VDG) is generated to indicate the candidate functions and their variable dependencies on other candidate functions. A variable dependency exists when a variable set by one candidate function is used in another candidate function. Such a variable may be a global variable, passed to a candidate function as a parameter of the function call, or returned to the calling function at completion of the candidate function. Figure 5 provides an example VDG, where the sub() call-site's variables e, c and h depend on the add() call-site. These dependent variables form a communication link in the KPN, linking the add and sub candidate functions. This data may be communicated by several techniques, such as FIFOs or shared memory. In this example, FIFOs have been chosen for communication. Thus, the communication path is labelled FIFO1 in Figure 5. The Traverser creates the VDG by analyzing the SDG for all the variable dependencies across candidate functions. The examination of the SDG involves the use of Traversal Rules, which describe methods to determine the variables shared between two candidate functions that have a dependency between them. For example, the dependency of returned variable e from the add candidate function in Figure 5 to the sub candidate function is detected using a traversal rule. Section 3.2.1 details the Traversal Rules and discusses further examples.

¹ An AST is a graph representation of the syntactic structure of the code [4].
² A PDG is a graphical representation of a logical order of execution of statements within each function.

Figure 4: The Networker Flow

[Figure 5 graphic: source code calling add(), sub() and verify(); its Variable Dependency Graph (VDG); and the resulting Kahn Process Network (KPN) with channels FIFO1, FIFO2 and FIFO3]

Figure 5: An Example for VDG to KPN conversion

The Binder receives the VDG and creates a corresponding KPN process for every candidate function in the VDG. Communication channels (FIFOs in the example) are created between candidate functions to establish communication for dependent variables, as denoted in the VDG. Where maximum communication size is important, such as when using FIFOs, these communication channels are given finite size and depth dimensions calculated to be at least as large as the total data transferred via that channel. For example, FIFO1, FIFO2 and FIFO3 of the KPN in Figure 5 are of size 12, 4 and 4 bytes respectively.
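The sizing rule can be illustrated with a short C sketch that sums the sizes of the variables bound to one channel. The DepVar type and fifo_min_bytes() are hypothetical names, and the 4-byte variable size is an assumption that happens to agree with the 12 bytes reported for FIFO1, which carries e, c and h.

```c
#include <assert.h>
#include <stddef.h>

/* One dependent variable carried by a channel (hypothetical type). */
typedef struct {
    const char *name;
    size_t bytes;     /* storage size of the variable */
} DepVar;

/* Lower bound on the channel size: the total data transferred. */
size_t fifo_min_bytes(const DepVar *vars, size_t count)
{
    size_t total = 0;
    assert(vars != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += vars[i].bytes;
    return total;
}
```

With three 4-byte entries for e, c and h, fifo_min_bytes() yields 12, and a channel carrying a single 4-byte variable yields 4.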

3.2.1 Traversal Rules for Dependencies

The SDG is analyzed using traversal rules to extract variable dependencies between candidate functions. Finding these dependencies allows the contributions of the paper to be realised.

As shown in the SDG in Figure 1, the actual-in of the sub() call-site has an intra-predecessor dependency on the expression “e=”. This expression further depends on the actual-out $res, which is the actual-out of another candidate function call-site, add(). Such a dependency traversal identifies that variable e in the sub candidate function depends on the add candidate function.

To implement this example traversal to find the dependency (and, as a result, the VDG), the Traverser executes three nested loops. The outermost loop iterates through the candidate functions. The next loop iterates through each input variable of each candidate function. Detection of variable dependency using rule traversal is performed in the innermost loop. Table 1 summarizes all the major rules to create the VDG. Further rules can be added to consider complicated program patterns and dependencies.

Rule 1 is applied when the outermost loop iterates through each candidate function and finds multiple call-sites of the same candidate function. This step determines the first candidate function and applies traversal Rule 2. The second step of the traversal process iterates through each actual-in and global-actual-in variable (also referred to as input variables) of a candidate function call-site. These input variables are analyzed using the traversal rules to determine which candidate functions and variables a particular input variable depends on. As shown in Table 1, non-global input variables are analyzed using Rules 3, 6 and 7, whereas the global input variables are analyzed using Rules 4, 5, 7 and 8.
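The three nested loops can be sketched as below. This is an assumed skeleton rather than the Traverser's actual implementation: the types, the RuleCheck callback and demo_check() are hypothetical, and only the split of rules between non-global and global input variables is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const char *name; int is_global; } InputVar;

typedef struct {
    const char *name;
    const InputVar *inputs;
    size_t n_inputs;
} Candidate;

/* Rules tried for each kind of input variable, as listed in the text. */
static const int NON_GLOBAL_RULES[] = { 3, 6, 7 };
static const int GLOBAL_RULES[]     = { 4, 5, 7, 8 };

/* Hypothetical hook: nonzero if `rule` matches input `in` of `cf`. */
typedef int (*RuleCheck)(const Candidate *cf, const InputVar *in, int rule);

/* Returns the number of rule matches found across all candidates. */
size_t traverse(const Candidate *cfs, size_t n_cfs, RuleCheck check)
{
    size_t matches = 0;
    assert(check != NULL);
    for (size_t c = 0; c < n_cfs; c++) {                  /* candidates */
        for (size_t v = 0; v < cfs[c].n_inputs; v++) {    /* inputs     */
            const InputVar *in = &cfs[c].inputs[v];
            const int *rules = NON_GLOBAL_RULES;
            size_t n_rules = sizeof NON_GLOBAL_RULES / sizeof NON_GLOBAL_RULES[0];
            if (in->is_global) {
                rules = GLOBAL_RULES;
                n_rules = sizeof GLOBAL_RULES / sizeof GLOBAL_RULES[0];
            }
            for (size_t r = 0; r < n_rules; r++)          /* rules      */
                if (check(&cfs[c], in, rules[r]))
                    matches++;  /* a VDG edge would be recorded here */
        }
    }
    return matches;
}

/* Illustration only: pretends Rule 3 matches every non-global input. */
int demo_check(const Candidate *cf, const InputVar *in, int rule)
{
    (void)cf;
    return rule == 3 && !in->is_global;
}
```

A real checker would walk the SDG from the input vertex; the callback keeps that detail out of the loop structure so that further rules can be added without changing it.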

[Figure 6 graphic: SDG of main(), which declares int a, b, e and contains the call-sites verify(), e = add(a, b) and sub(e, b); the Rule 3 path runs from the actual-in e of the sub() call-site through the expression e = add() to the actual-out $result of the add() call-site]

Figure 6: Example Rule 3 in Traverser

No. | Rule Name | Description
1 | Call-sites & KPN processes | Multiple call-sites per candidate function; create a unique KPN process per call-site
2 | First candidate | Mapped to the first KPN process
3 | Output of a call-site | Actual-in links to an intra-predecessor; the intra-predecessor is an actual-out
4 | Global-actual-in to global-actual-out | Global-actual-in links to global-actual-out; global-actual-out of a predecessor; traverse via intra-predecessor vertices
5 | Global-actual-in but not to global-actual-out | No function modifies the global variable; dependency link from parent
6 | Actual variables modified using pointers | Variables passed as pointers; modified by the predecessors
7 | C structures | Dependency in complicated types; entire type is considered as modified
8 | Global variables only within function | Static variables also fall into this rule; no link between global actuals

Table 1: Traversal Rules Summary

Figure 6 depicts the SDG and its traversal path for Rule 3. Variable e is returned by the add function and passed as a parameter to the sub function, revealing a dependency which is captured by Rule 3. As mentioned above, the traversal first iterates over every candidate function call-site (in this case, verify(), add() and sub()), and then over the input variables of each candidate function's call-site (such as e and b in sub()). The third step is to traverse the SDG from an input variable to determine whether any rule matches, which in turn detects the dependency links. The SDG shown includes an actual-in vertex for e at the sub call-site which has an intra-predecessor (i.e., within the main procedure) link to an expression vertex (i.e., e=add()). An intra-predecessor link exists from the expression to an actual-out, which is labeled as $result. This actual-out belongs to the add candidate function call-site. Such a traversal path with these vertex types matches Rule 3, creating a link in the VDG between the add function and the sub function with variable e.
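The Rule 3 path can be sketched in C on a deliberately simplified graph. The sketch assumes that each vertex has at most one intra-procedural predecessor, which a real SDG does not guarantee, and the Vertex type and rule3_source() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EXPRESSION, ACTUAL_IN, ACTUAL_OUT } Kind;

typedef struct Vertex {
    Kind kind;
    const char *variable;
    const char *call_site;            /* NULL for plain expressions   */
    const struct Vertex *intra_pred;  /* single predecessor (assumed) */
} Vertex;

/* Rule 3: starting at an actual-in, follow intra-predecessor links
 * through expression vertices. If the walk ends at an actual-out,
 * return the call-site that owns it; otherwise return NULL. */
const char *rule3_source(const Vertex *actual_in)
{
    assert(actual_in != NULL && actual_in->kind == ACTUAL_IN);
    const Vertex *v = actual_in->intra_pred;
    /* Skip assignment expressions such as e = add(); the hop limit
     * guards against malformed cyclic input. */
    for (int hops = 0; v != NULL && v->kind == EXPRESSION && hops < 64; hops++)
        v = v->intra_pred;
    if (v != NULL && v->kind == ACTUAL_OUT)
        return v->call_site;
    return NULL;
}
```

For the graph of Figure 6, the actual-in e of sub() leads to the expression e=add() and then to the actual-out $result, so the function reports add as the source, while the actual-in b, which has no such path, yields NULL.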


[Figure 7 graphic: source code in which main() repeatedly calls verify(&a), add(a) and sub(&a, &f), with globals b, d, e and f, together with the SDG paths matched by Rule 4 (global 'd'), Rule 6 (pointer 'a') and Rule 8]

Figure 7: Examples for Rule Traversal

Figure 7 depicts a detailed example of Rules 4, 6 and 8 (i.e., traversing the SDG for dependencies on global variables and pointers). The sub function shown in Figure 7 modifies the global variable d, which creates a global-actual-out vertex in the SDG. As can be seen, the verify function reads the variable d modified by the sub function, creating a global-actual-in for variable d in the SDG. When traversing from this global-actual-in vertex of d in an intra-predecessor fashion (i.e., within the procedure), the existence of a direct link to a global-actual-out reveals a dependency link, captured by Rule 4. The global variables e and b are captured with the same rule to show a link between the sub and add functions (dependency links indicated in sky blue for variable e and blue for b in Figure 7).
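The distinction between Rule 4 and Rule 5 can be approximated without a full SDG, using per-call-site sets of global variables written. In the sketch below, the CallSite type, the one-character variable names and global_writer() are assumptions, and the call-sites are treated as a cycle because they sit inside a loop in Figure 7.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One call-site with the globals it writes, one character per
 * variable name (a compact encoding chosen for this sketch). */
typedef struct {
    const char *name;
    const char *writes;
} CallSite;

/* Find the nearest preceding call-site (wrapping around the loop
 * body) that writes global `var`. A non-NULL result corresponds to a
 * Rule 4 link; NULL means no candidate writes the variable, which is
 * the Rule 5 case where the dependency comes from the parent. */
const char *global_writer(const CallSite *sites, size_t n,
                          size_t reader, char var)
{
    assert(sites != NULL && reader < n && var != '\0');
    for (size_t step = 1; step < n; step++) {
        size_t j = (reader + n - step) % n;   /* predecessor index */
        if (strchr(sites[j].writes, var) != NULL)
            return sites[j].name;
    }
    return NULL;
}
```

With call-sites verify (writes d), add (writes e and b) and sub (writes d and e) in that order, a read of d at verify resolves to sub and a read of b at sub resolves to add, matching the links described above.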

Once the rules are analyzed and the dependencies detected, the Networker generates the initial KPN (using the Binder, as shown in Figure 4) by combining the information gathered about the KPN processes and FIFOs from the SDG, the VDG and the functions of the code. Figure 5 demonstrates an example of VDG to KPN conversion, where the dependency variables are combined to form FIFO links.

3.3 Mapper

The Mapper creates a resource map based on the current KPN. The resource map specifies the assignment of each node of the current KPN to one processor among the processors in the target MPSoC system.

3.4 Target Generator

The Target Generator uses the resource map and the current KPN to create platform-specific target code. A set of individual build files (such as Makefiles) and programs, one per KPN process, is generated. The code for each processor is created by combining the code of the candidate function in the legacy sequential code that corresponds to the KPN process with the appropriate FIFO read and write commands. A FIFO read command is used by an individual program to take the contents of a variable from the FIFO, while a FIFO write command is used to place the contents of a variable into a FIFO. Both these FIFO commands enable communication of variables between programs which are executed on individual processors.
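A bounded FIFO with read and write commands of this kind might look like the following C sketch. It is a generic byte ring buffer, not the Tensilica FIFO interface; the Fifo type, the capacity and the return-code convention are assumptions. A generated process would typically wait and retry when a call reports that it cannot proceed.

```c
#include <assert.h>
#include <stddef.h>

enum { FIFO_CAPACITY = 16 };  /* bytes; illustrative value */

typedef struct {
    unsigned char data[FIFO_CAPACITY];
    size_t head;    /* index of the next byte to read */
    size_t count;   /* number of bytes currently stored */
} Fifo;

/* FIFO write: place the contents of a variable into the FIFO.
 * Returns 1 on success, 0 if there is not enough free space. */
int fifo_write(Fifo *f, const void *var, size_t len)
{
    assert(f != NULL && var != NULL);
    if (len > FIFO_CAPACITY - f->count)
        return 0;
    const unsigned char *src = var;
    for (size_t i = 0; i < len; i++)
        f->data[(f->head + f->count + i) % FIFO_CAPACITY] = src[i];
    f->count += len;
    return 1;
}

/* FIFO read: take the contents of a variable from the FIFO.
 * Returns 1 on success, 0 if fewer than len bytes are available. */
int fifo_read(Fifo *f, void *var, size_t len)
{
    assert(f != NULL && var != NULL);
    if (len > f->count)
        return 0;
    unsigned char *dst = var;
    for (size_t i = 0; i < len; i++)
        dst[i] = f->data[(f->head + i) % FIFO_CAPACITY];
    f->head = (f->head + len) % FIFO_CAPACITY;
    f->count -= len;
    return 1;
}
```

Copying whole variables byte by byte keeps the sketch independent of the variable's type, which matters when Rule 7 forces an entire C structure through the channel.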

3.5 Simulator

The Simulator configures a simulation platform, instantiating processors and memory blocks for FIFO communication. Individual executables are created for the target code to run on each processor. The executables dictate the operation of a processor during the simulation. Application-specific input data, if required, is received by the simulator (stored in a predetermined block of memory).

3.6 Annotator

The Annotator takes the information regarding the load values (i.e., the results of the Simulator) of individual processors and creates a normalized representation as an estimate of the load performed by the corresponding KPN processes. It creates the load annotated KPN by appending the normalized load values to the description of each KPN process.
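One plausible normalization, shown here as a C sketch, divides each process's cycle count by that of the busiest process. The text does not specify the normalization actually used, so both the scheme and the name normalize_loads() are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Scale per-process cycle counts so that the busiest process has a
 * load of 1.0. If every count is zero, all loads are set to 0.0. */
void normalize_loads(const unsigned long long *cycles, double *load, size_t n)
{
    unsigned long long busiest = 0;
    assert((cycles != NULL && load != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cycles[i] > busiest)
            busiest = cycles[i];
    for (size_t i = 0; i < n; i++)
        load[i] = busiest ? (double)cycles[i] / (double)busiest : 0.0;
}
```

For cycle counts of 200, 100 and 50, this gives loads of 1.0, 0.5 and 0.25, which could then be appended to the description of each KPN process.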

3.7 Balance Checker

The Balance Checker receives and evaluates the annotated KPN to determine whether or not the load distribution in the annotated KPN is sufficiently balanced. A KPN with a balanced load represents an efficient implementation of parallel pipelined code for an MPSoC system. This means that the partitioned application code enables maximum use of all respective processors, thereby minimizing the idle time.
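A simple balance test over normalized loads could be written as below. The criterion (spread between the largest and smallest load within a tolerance) and the name is_balanced() are assumptions for illustration; the text does not define what "sufficiently balanced" means numerically.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the gap between the most and least loaded process
 * does not exceed `tolerance`, and 0 otherwise. */
int is_balanced(const double *load, size_t n, double tolerance)
{
    assert(load != NULL && n > 0 && tolerance >= 0.0);
    double lo = load[0], hi = load[0];
    for (size_t i = 1; i < n; i++) {
        if (load[i] < lo) lo = load[i];
        if (load[i] > hi) hi = load[i];
    }
    return hi - lo <= tolerance;
}
```

With a tolerance of 0.2, loads of 1.0, 0.9 and 0.95 count as balanced, whereas loads of 1.0 and 0.25 do not and would be handed to the Graph Optimizer.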

3.8 Graph Optimizer

The Graph Optimizer creates a new optimized KPN from the current KPN. Graph optimization algorithms (such as the ones proposed in [12, 17]) are applied to either split or merge KPN processes after the evaluation of the load values. Design constraints, such as the number of processors in the MPSoC system, communication cost based on FIFO dimensions, and performance and power budgets, are manually entered by the user via an input file.

3.9 Tuner

When the Tuner receives the final KPN, a fine-grained adjustment/optimization is made to the hardware resources of the MPSoC system. For example, if a KPN process has a relatively small load, the cache resources of the corresponding processor can be reduced in size to match the required load. Techniques presented in [6, 16] can be utilized for the Tuner. We utilize the automated tuning of code using custom instructions [6] from Tensilica.

4. RESULTS

We performed two separate experiments for our SDG2KPN approach: 1) evaluation of MPSoC system generation using rule analysis, from legacy code to XTMP platform (for the Tensilica setup) generation; and 2) evaluation of the generated MPSoC platform.

Table 2 depicts the SDG2KPN details for KPN generation and rule evaluation. Column one indicates the application benchmarks tested, whereas column two reveals the number of candidate functions (CFs) considered per application. Columns three through ten report the number of rules captured in each application. Column eleven presents the generation time (in seconds), i.e., the time taken from reading the legacy code to the end of XTMP MPSoC platform generation using SDG2KPN. The lines of code (LOC) of each application benchmark (the legacy version) are reported in the twelfth column.

The main function in each application (freely available applications were used) was either used as is, if it already contained functions with substantial load, or converted into multiple functions by simply combining certain segments of the code. AES, the MPEG-2 encoder, MJPEG and the H.264 encoder were used as is, whereas the code segment related to each step in the ADPCM encoder application was formed into a function.

A significant number of instances are captured for Rules 4 and 5 on the tested benchmarks. The H.264.enc application revealed 385 instances of Rule 4, showing that the application exchanges a significant amount of data via global variables. H.264.enc further captures 621 instances of Rule 5, which indicates that many variables are defined as global but are not used to exchange data across candidate functions. These variables were reported to the user by SDG2KPN for manual modification (moving the variable declaration and assignment locally to the relevant candidate function). This can substantially improve performance, since unnecessary data is no longer communicated. None of the tested applications exhibits Rule 3, even though passing a returned variable is a widely used program pattern in other applications. Rule 1 has a one-to-one correspondence with the number of candidate functions (i.e., each candidate function has its own unique call-site). H.264 has the highest count for Rule 7. Rule 7 forces the entire structure to be sent via FIFOs instead of passing only a dependent element of the structure. Rule 2 constrains the first candidate function to the first processor, hence no predecessor dependencies are analyzed (only one occurrence per application).

As shown, MPEG-2.enc has the highest number of candidate functions considered (i.e., 10). H.264 took the longest to generate, approximately 164.9 seconds, due to the large amount of compilation time required. The next longest is the MPEG-2 encoder, at 21.4 seconds. The actual SDG2KPN time is much smaller than these values, since the generation time also includes the compilation time and the CodeSurfer SDG generation time (which are the major contributors to the overall generation time).

| Apps. | CFs | Rule 1 | Rule 2 | Rule 3 | Rule 4 | Rule 5 | Rule 6 | Rule 7 | Rule 8 | Gen. Time (sec) | LOC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AES | 2 | 2 | 1 | 0 | 1 | 3 | 0 | 1 | 8 | 4.8 | 950 |
| MPEG-2 (enc) | 10 | 10 | 1 | 0 | 21 | 56 | 1 | 7 | 1 | 21.4 | 8k |
| MJPEG | 6 | 6 | 1 | 0 | 8 | 55 | 0 | 3 | 2 | 8.0 | 2k |
| H.264 (enc) | 9 | 9 | 1 | 0 | 385 | 621 | 0 | 47 | 1 | 164.9 | 58k |
| ADPCM (enc) | 6 | 6 | 1 | 0 | 11 | 2 | 0 | 0 | 7 | 4.3 | 285 |

Table 2: KPN Generation and Rule Evaluation³

³ No Balancing or Graph Optimization enforced, but one-to-one Mapping and Tuning applied.

Our Tensilica MPSoC system used homogeneous processors, each containing 32 kB instruction and data caches that are 4-way set associative with a 32-byte line size. A 65 nm technology was used with a 1 GHz clock frequency. Table 3 reports the results of the XTMP simulations of the generated MPSoC platform. Column one shows the application benchmarks tested (MPEG-2 and H.264 were omitted due to their long XTMP simulation times). Column two indicates the number of processors used. Columns three, four and five report the latency, power and energy of the generated MPSoC system, respectively. Columns six, seven and eight report the latency, power and energy of the application executing on a single Xtensa processor, respectively (the same Xtensa processor as in the MPSoC execution is used).

The MJPEG application is generated at both frame level (F) and macroblock level (MB), by arranging the iterative loop boundary and the FIFO communication. Speedups of 3.4%, 16.8% and 5.15% are achieved for the XTMP executions of AES, MJPEG (Frames) and MJPEG (Macroblock), respectively, compared to the single-processor execution. ADPCM.enc gained no speedup and instead slowed down, due to feedback from successive processes of the KPN to their predecessors. The code was not further partitioned to create more parallelism and improve speedup, since that was not the focus of this paper. The total power and energy increased across all the benchmarks because of the larger number of processors in the MPSoC.
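The quoted speedups appear consistent with the relative reduction in cycle count between the single-processor and KPN latencies in Table 3. The following sketch reproduces that arithmetic; the cycle counts are copied from Table 3, while the function name and the choice of formula, (single - KPN) / single, are assumptions made for illustration.

```python
# Latencies in cycles, copied from Table 3 as (KPN, single processor).
LATENCY_CYCLES = {
    "AES": (130_631_568, 135_289_982),
    "MJPEG (F)": (445_589_604, 535_890_526),
    "MJPEG (MB)": (508_260_161, 535_890_526),
    "ADPCM.enc": (223_964_173, 113_307_334),
}


def latency_reduction_percent(kpn_cycles: int, single_cycles: int) -> float:
    """Relative latency reduction of the KPN run versus the single-processor run.

    A negative value means the KPN version is slower.
    """
    return 100.0 * (single_cycles - kpn_cycles) / single_cycles


reductions = {
    app: latency_reduction_percent(kpn, single)
    for app, (kpn, single) in LATENCY_CYCLES.items()
}

for app, pct in reductions.items():
    print(f"{app}: {pct:.2f}%")
```

Under this formula the first three benchmarks come out positive and ADPCM.enc comes out negative, matching the slowdown described in the text.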

For a fair comparison with the SDG2KPN generation time, we manually created KPNs for several benchmarks, such as MPEG-2 and MJPEG. Manual generation for MPEG-2 took a few weeks due to the complexity of identifying global variables and their dependencies, whereas manual generation for MJPEG took a few days. SDG2KPN took only a few minutes (worst case) to generate the code, plus a few hours for manual modifications.

Table 4 depicts a feature comparison between our SDG2KPN tool and other state-of-the-art tools.

5. CONCLUSION
We propose an SDG to KPN conversion methodology using a rule-based traversal of the SDG. All the variable constructs of the legacy code, including shared variables such as globals and pointers, are supported, which was hitherto not possible. We developed our technique into a complete design flow to demonstrate the practicality of the solution. The entire MPSoC platform was generated in a few seconds for a simple AES application and in a few minutes for a much more complicated H.264 encoder application. Our approach requires user modifications and inputs which are far less complicated than those of the state-of-the-art solutions.


| Apps. | No. of Processors (KPN) | Latency KPN (cycles) | Power KPN (mW) | Energy KPN (uJ) | Latency single (cycles) | Power single (mW) | Energy single (uJ) |
|---|---|---|---|---|---|---|---|
| AES | 2 | 130,631,568 | 180.54 | 23,573.59 | 135,289,982 | 123.34 | 16,687.53 |
| MJPEG (F) | 6 | 445,589,604 | 494.39 | 119,064.79 | 535,890,526 | 113.55 | 60,851.76 |
| MJPEG (MB) | 6 | 508,260,161 | 367.02 | 183,617.70 | 535,890,526 | 113.55 | 60,851.76 |
| ADPCM.enc | 6 | 223,964,173 | 579.19 | 129,718.42 | 113,307,334 | 131.25 | 14,872.71 |

Table 3: Application KPNs and MPSoC Executions

| Features | SDG2KPN | COMPAAN [9] | KPNGen [13] | FP-MAP [8] | SPRINT [2] | DSWP [14] | Harmonic [11] | [4] |
|---|---|---|---|---|---|---|---|---|
| Shared variable support | Yes | No | No | No | No | No | No | No |
| Code rewrite/modification | tool assisted minimal copying | convert to ANLP | manual | convert to nested loop | convert to single-entry single-exit tasks | convert to loop | convert to tasks | not implemented |
| Analysis granularity | SDG | code statement | code statement | code statement | CFG | loop threads | unknown | AST |
| Use KPN | Yes | Yes | Yes | No | Yes | No | Yes | Yes |
| Generates MPSoC platform | Yes | No | No | Yes (pipeline) | No (concurrent SystemC) | No (threads) | No (hardware units) | No (not implemented) |

Table 4: Tools and Support

6. REFERENCES
[1] J. Ceng, J. Castrillon, W. Sheng, H. Scharwachter, R. Leupers, G. Ascheid, H. Meyr, T. Isshiki, and H. Kunieda. MAPS: An integrated framework for MPSoC application parallelization. In DAC, pages 754–759, 2008.

[2] J. Cockx, K. Denolf, B. Vanhoof, and R. Stahl. SPRINT: a tool to generate concurrent transaction-level models from sequential code. EURASIP J. Appl. Signal Process., pages 213–213, 2007.

[3] CodeSurfer. http://www.grammatech.com/products/codesurfer/overview.html.

[4] V. K. Danish Ather, Raghuraj Singh. Transformation of sequential program to kpn - an overview. International Journal of Computer Applications, 2012.

[5] A. Hansson, K. Goossens, M. Bekooij, and J. Huisken. Compsoc: A template for composable and predictable multi-processor system on chips. ACM Trans. Des. Autom. Electron. Syst., 2009.

[6] H. Javaid, A. Ignjatovic, and S. Parameswaran. Rapid design space exploration of application specific heterogeneous pipelined multiprocessor systems. IEEE Trans. in CAD, pages 1777–1789, 2010.

[7] H. Javaid, A. Janapsatya, M. S. Haque, and S. Parameswaran. Rapid runtime estimation methods for pipelined MPSoCs. In DATE, pages 363–368, 2010.

[8] I. Karkowski and H. Corporaal. FP-map-an approach to the functional pipelining of embedded programs. In HiPC, pages 415–420, 1997.

[9] B. Kienhuis, E. Rijpkema, and E. Deprettere. Compaan: deriving process networks from matlab for embedded signal processing architectures. In CODES, pages 13–17, 2000.

[10] D. Liang and M. Harrold. Slicing objects using system dependence graphs. In Workshop on Source Code Analysis and Manipulation, pages 358–367, 1998.

[11] W. Luk, J. Coutinho, T. Todman, Y. Lam, W. Osborne, K. Susanto, Q. Liu, and W. Wong. A high-level compilation toolchain for heterogeneous systems. In SOC, pages 9–18, 2009.

[12] S. Meijer, H. Nikolov, and T. Stefanov. Combining process splitting and merging transformations for Polyhedral Process Networks. In ESTIMedia, pages 97–106, 2010.

[13] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: toward composable multimedia MP-SoC design. In DAC, pages 574–579, 2008.

[14] G. Ottoni, R. Rangan, A. Stoler, M. J. Bridges, and D. I. August. From Sequential Programs to Concurrent Threads. IEEE Comput. Archit. Lett., pages 2–9, 2006.

[15] S. Stuijk, M. Geilen, and T. Basten. SDF3: SDF For Free. In ACSD, pages 276–278, 2006.

[16] F. Sun, S. Ravi, A. Raghunathan, and N. Jha. Custom-instruction synthesis for extensible-processor platforms. IEEE Trans. on CAD, pages 216–228, 2004.

[17] H. Toivonen, F. Zhou, A. Hartikainen, and A. Hinkka. Compression of weighted graphs. In KDD, pages 965–973, 2011.

[18] A. Turjan, B. Kienhuis, and E. Deprettere. Translating affine nested-loop programs to process networks. In CASES, pages 220–229, 2004.

[19] I. Viskic and D. Gajski. Modeling kahn process networks on mpsoc platforms. Technical Report CECS-08-08, Center for Embedded Computer Systems, University of California, Irvine, July 2008.

[20] N. Walkinshaw, M. Roper, and M. Wood. The Java system dependence graph. In Workshop on Source Code Analysis and Manipulation, pages 55–64, 2003.
