

A Complete Compiler Approach to Auto-Parallelizing C Programs for Multi-DSP Systems

Björn Franke and Michael F.P. O'Boyle

Abstract—Auto-parallelizing compilers for embedded applications have been unsuccessful due to the widespread use of pointer arithmetic and the complex memory model of multiple-address space digital signal processors (DSPs). This paper develops, for the first time, a complete auto-parallelization approach, which overcomes these issues. It first combines a pointer conversion technique with a new modulo elimination transformation for program recovery, enabling later parallelization stages. Next, it integrates a novel data transformation technique that exposes the processor location of partitioned data. When this is combined with a new address resolution mechanism, it generates efficient programs that run on multiple address spaces without using message passing. Furthermore, as DSPs do not possess any data cache structure, an optimization is presented which transforms the program to exploit both remote data locality and local memory bandwidth. This parallelization approach is applied to the DSPstone and UTDSP benchmark suites, giving an average speedup of 3.78 on four Analog Devices TigerSHARC TS-101 processors.

Index Terms—Parallel processors, interprocessor communications, real-time and embedded systems, signal processing systems, measurement, evaluation, modeling, simulation of multiple-processor systems, conversion from sequential to parallel forms, restructuring, reverse engineering, and reengineering, performance measures, compilers, arrays.

1 INTRODUCTION

MULTIPROCESSOR DSPs provide a cost-effective solution to embedded applications requiring high performance. Although there are sophisticated optimizing compilers and techniques targeted at single DSPs [1], there are no successful parallelizing compilers. The reason is simple: the task is complex. It requires the combination of a number of techniques to overcome the particular problems encountered when compiling for DSPs, namely, the programming idiom used and the challenging multiple-address space architecture.

Applications are written in C and make extensive use of pointer arithmetic [3]. This alone will prevent most auto-parallelizing compilers from attempting parallelization. The use of modulo addressing prevents standard data dependence analysis and will also cause parallelization failure. This article describes two program recovery techniques that will translate restricted pointer arithmetic and modulo addresses into a form suitable for optimization.

Multiprocessor DSPs have a multiple-address memory model, which is globally addressable, similar to the Cray T3D/E [4]. This reduces the hardware cost of supporting a single-address space, eliminating the need for hardware consistency engines, but places pressure on the compiler either to generate message-passing code or to find some other means to ensure correct execution. This paper describes a mapping and address resolution technique that allows remote data to be accessed without the need for message-passing. It achieves this by developing a baseline mechanism similar to that used in generating single-address space code whilst allowing further optimizations to exploit the multiple-address architecture.

As there is no cache structure, the compiler can neither rely on caches to exploit temporal reuse of remote data nor on large cache line sizes to exploit spatial locality. Instead, multiple-address space machines rely on effective use of Direct Memory Access (DMA) transfers [3]. This article describes multiple-address space specific locality optimizations that improve upon our baseline approach. This is achieved by determining the location of data and transforming the program to exploit locality in DMA transfers of remote data. It also exploits the increased bandwidth that is typically available to data that is guaranteed to be on-chip. These location-specific optimizations are not required for program correctness (as in the case of message-passing machines [4], [5], [6]) but allow a safe, incremental approach to improving program performance.

In this paper, new techniques are developed and combined with previous work in a manner that allows, for the first time, efficient mapping of standard DSP benchmarks written in C to multiple-address space embedded systems.

This paper is organized as follows: Section 2 provides a motivating example and is followed by four sections on notation, program recovery, data parallelization, and locality optimizations. Section 7 provides an overall algorithm and an extensive evaluation of our approach on the DSPstone [2] and UTDSP [7] benchmark suites, which contain kernels and full-scale embedded applications. This is followed by a review of the extensive related work and some concluding remarks.


The authors are with the University of Edinburgh, Institute for Computing Systems Architecture (ICSA), James Clerk Maxwell Building, Mayfield Road, Edinburgh EH9 3JZ, United Kingdom. E-mail: {bfranke, mob}@inf.ed.ac.uk.




2 MOTIVATION AND EXAMPLES

Auto-parallelizing compilers that take as input sequential code and produce parallel code as output have been studied in the scientific computing domain for many years. In the embedded domain, multiprocessor DSPs are a more recent compilation target. At first glance, DSP applications seem ideal candidates for auto-parallelization; many of them have static control-flow and linear accesses to matrices and vectors. However, auto-parallelizing compilers have not been developed due to the widespread practice of using postincrement pointer accesses [2]. Furthermore, multiprocessor DSPs typically have distributed address spaces, removing the need for expensive memory coherency hardware. This saving at the hardware level greatly increases the complexity of the compiler's task.

To illustrate the main points of this paper, two examples are presented, allowing a certain separation of concerns. The first example demonstrates how program recovery can be used to aid later stages of parallelization. The second example demonstrates how a program containing remote accesses is transformed incrementally to ensure correctness and, wherever possible, exploit the multiple-address space memory model.

2.1 Program Recovery Example

The code in Fig. 1a is typical of C programs written for DSP processors and contains fragments from the DSPstone and UTDSP benchmark suites. The use of postincrement pointer traversal is a well-known idiom [2]. Although the underlying algorithm of the first loop nest is linear matrix algebra, this current form will prevent optimizing compilers from performing aggressive optimization and attempts at parallelization. Circular buffer access is a frequently occurring idiom in DSP programs and is typically represented as a modulo expression in one or more of the array subscripts, as can be seen in the second loop nest. Such nonlinear expressions will again defeat most data dependence techniques and prevent further optimization and parallelization.

In our program recovery scheme, the pointers are first replaced with array references based on the loop iterator, and the modulo accesses are removed by applying a suitable strip-mining transformation to give the new code in Fig. 1b. Removing the pointer arithmetic and repeated strip-mining gives the code in Fig. 1c. The new form is now suitable for parallelization and, although the new code contains linear array subscripts, these are easily optimized by code hoisting and strength reduction in standard native compilers.
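To make the recovery concrete, the following sketch shows the same two transformations on a small, hypothetical kernel; the variable names and bounds are illustrative and are not taken from the benchmarks (the authors' actual example is Fig. 1).

```c
/* Hypothetical before/after sketch of program recovery; names and bounds
 * are illustrative, not taken from DSPstone/UTDSP. */
#define N 32

/* Before: postincrement pointer traversal and a circular-buffer (modulo)
 * subscript, the two idioms discussed above. */
void before(int *a, const int *b, const int *c)
{
    const int *p = b, *q = c;
    for (int i = 0; i < N; i++)
        a[i] = *p++ + *q++;          /* pointer-based array traversal   */
    for (int i = 0; i < N; i++)
        a[i] += b[i % 8];            /* modulo (circular buffer) access */
}

/* After array recovery and strip-mining: explicit affine subscripts,
 * no pointer arithmetic and no modulo operations. */
void after(int a[N], const int b[N], const int c[N])
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];          /* recovered affine array accesses */
    for (int i1 = 0; i1 < N / 8; i1++)      /* outer strip              */
        for (int i2 = 0; i2 < 8; i2++)      /* inner strip: 0..7        */
            a[8 * i1 + i2] += b[i2];        /* i % 8 rewritten as i2    */
}
```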

2.2 Memory Model

Our approach exploits the fact that, although multiprocessor DSP machines typically have multiple address spaces, part of each processor's memory space is visible from other processors, unlike pure message-passing machines. Each processor has its internal address space, which is purely local and not reflected on the external bus. In addition, part of each processor's memory forms a global address space where each processor is assigned a certain range of addresses (see Fig. 2, where the shaded region denotes the portion of global addresses physically resident on that processor). This global address space is used for bus-based accesses to remote data. Thus, each processor has its own private internal memory address range and can refer to global addresses on remote processors.

However, unlike single address space machines, there is no global memory allocation. Each processor may only allocate data either to its internal memory or to that part of its address space visible to other processors. Thus, in order to refer to remote data, a processor must know both the identity of the remote processor and the location in memory of the required data value.

Fig. 1. Example showing the substitution of linear pointer-based array traversals with explicit array references (Loop 1) and the elimination of modulo array indexing (Loop 2) by repeated strip-mining. (a) Original code, (b) array recovery and strip-mine, and (c) remove pointers and strip-mine.

To further complicate matters, the internal memory address space of a processor and the address space globally visible by other processors actually refer to the same physical memory. This is illustrated in Fig. 2, where the internal memory addresses and globally visible addresses correspond to the same locations. Local data can therefore be accessed both directly, using its local address, or via the globally visible multiprocessor address space. Accesses to local data via the local address space are, however, faster than accesses to the same data via the shared multiprocessor address space, as no bus transactions are required.

We have developed a novel technique, which combines single-address space parallelization approaches with a new address resolution mechanism that, for linear accesses, determines at compile time the processor and memory location of all data items. Nonlinear accesses are resolved at runtime by means of a simple descriptor data structure. This allows us to take advantage of the higher bandwidth to and the lower latency of local memory whilst maintaining correctness of the overall parallelization scheme.

2.3 Parallelization Example

Once we have recovered the linear access structure of a program, we must then partition and map it efficiently onto the multiple-address space memory model of the machine.

Consider the first array-recovered loop nest in Fig. 1c. It is a simple outer vector product and is used to illustrate the techniques needed for multiple-address space parallelization. Assuming four processors, single address space compilers [8] would simply partition the iteration space across the four processors. This is shown in Fig. 3a. Each processor has an identical copy of this program and makes a runtime call to determine its processor ID. It then evaluates a subset of the original iteration space and refers to globally allocated arrays.

To access remote data in multiple address space machines, however, the processor location of partitioned data needs to be explicitly available. Our scheme achieves this by data strip-mining [9] each array to form a two-dimensional array whose inner index corresponds with the four processors. Fig. 3b shows the program after partitioning by data strip-mining and applying a suitable, automatically generated loop recovery [9] transformation. Assuming the z loop is parallelized, array a is now partitioned such that a[0][0...7] is written on processor 0, a[1][0...7] is written on processor 1, etc. Similarly for arrays b and c. For a multiple-address space machine, we now need to generate a separate program for each processor, i.e., explicitly enumerate the processor ID loop, z. The partitioned code for processor 0 (as specified by z) is shown in Fig. 3c. The code for processors 1, 2, and 3 is identical except for #define z 1, 2, or 3. Multiple address space machines require remote, globally-accessible data to have a name distinct from local data (otherwise, it is assumed to be a private copy). Thus, each of the globally-accessible subarrays is renamed as follows: a[0][0...7] becomes a0[0...7], a[1][0...7] becomes a1[0...7], etc. Similarly for arrays b and c. On processor 0, a0 is declared as a variable residing on that processor, while a1, a2, and a3 are declared extern (see Fig. 3d). For processor 1, a1 is declared local and the remaining arrays are declared extern.

To access both local and remote data, a local pointer array is set up on each processor. This simple descriptor data structure is an array containing four pointer elements, which are assigned to the start addresses of the local arrays on the four processors. We use the original name of the array a[][] as the pointer array *a[] and then initialize the pointer array to point to the four distributed arrays: int *a[4] = {a0, a1, a2, a3} (see Fig. 3d). Using the original name means that we have exactly the same array access form in all uses of the array as in Fig. 3c. This has been achieved by using the property that multidimensional arrays in C are arrays of arrays and that higher-dimensional arrays are defined as containing an array of pointers to subarrays (see Section 6.5.2.1 of the ANSI C standard, paragraphs 3 and 4). From a code generation point of view, this greatly simplifies implementation and avoids complex and difficult-to-automate message passing.
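A minimal sketch of what these declarations might look like in the program generated for processor 0, assuming four processors and the eight-element partitions of the running example (the surrounding boilerplate is illustrative; the authors' generated code is shown in Fig. 3d):

```c
/* Sketch of the generated declarations for processor 0 (assuming four
 * processors and 8-element partitions of a 32-element array). */
#define z 0                 /* this processor's ID                        */

int a0[8];                  /* local partition, resident on processor 0   */
extern int a1[8];           /* partitions resident on processors 1-3;     */
extern int a2[8];           /* their addresses are resolved by the linker */
extern int a3[8];

/* Descriptor: reuse the original array name as a pointer array, so every
 * reference a[p][i] keeps the same form whether p is local or remote.    */
int *a[4] = { a0, a1, a2, a3 };
```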

While this program will execute correctly and provides a baseline means of generating code for any references, remote or otherwise, on a multiple-address space machine, it introduces runtime overhead and does not exploit locality. Each array reference requires a pointer look-up and, as the native compiler does not know the eventual location of the data, it must schedule load/stores that will fit on an external interconnect network or bus. As bandwidth to on-chip SRAM is greater, this will result in underutilization of available bandwidth. It is straightforward to identify local references and to replace the indirect pointer array access with the local array name by examining the value of the partitioned indices to see if it equals the local processor ID, z.

Data references that are sometimes local and sometimes remote can be isolated by index splitting the program section and replacing the local references with local names. This is shown in Fig. 4a. The access to c in the second loop is exclusively local (j1 = z) and, thus, the reference is replaced by the local name c0. The other, remote references to c in the first and last loop still access the data via the c descriptor pointer array.

Fig. 2. Logical memory organization and aliasing of internal and multiprocessor addresses for a four processor system.

Element-wise remote access is expensive and, therefore, group access to remote data, via DMA transfer, is an effective method to reduce start-up overhead. In our scheme, remote data elements are transferred into a local temporary storage area. This is achieved by inserting load loops for all remote references, as shown in Fig. 4b. As space is constrained, a check is made before load loop insertion to guarantee that SRAM space is available.

The transfers are performed in such a way as to exploit temporal and spatial locality and map potentially distinct multidimensional array references, occurring throughout the program, into a single-dimension temporary area, which is reused. This is shown in Fig. 4c.

Thus, we have a baseline method that exposes the processor ID of partitioned data and generates correct code for multiple-address space architectures using an address resolution scheme. When the location of data can be statically determined, local memory bandwidth and DMA transfers from remote memory can be exploited.

3 NOTATION

Before describing the partitioning and mapping approach, we briefly describe the notation used. The loop iterators can be represented by a column vector $J = [j_1, j_2, \ldots, j_M]^T$, where $M$ is the number of enclosing loops. Note that the loops do not need to be perfectly nested and may occur arbitrarily in the program. The loop ranges are described by a system of inequalities defining the polyhedron or iteration space $BJ \le b$. The data storage of an array can also be viewed as a polyhedron. We introduce array indices $I = [i_1, i_2, \ldots, i_N]^T$ to describe the array index space. This space is given by the polyhedron $AI \le a$. We assume that the subscripts in a reference to an array can be written as $UJ + u$, where $U$ is an integer matrix and $u$ is a vector, forming an access matrix/vector pair $(U, u)$.

As an example of this notation, the loop bounds of the first loop nest in Fig. 1c are represented by

$$
\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} j_1 \\ j_2 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ 0 \\ 31 \\ 31 \end{bmatrix},
\qquad (1)
$$

and the array declaration a[32] in Fig. 1c is represented in a similar manner:

$$
\begin{bmatrix} -1 \\ 1 \end{bmatrix} [\, i_1 \,] \le \begin{bmatrix} 0 \\ 31 \end{bmatrix},
\qquad (2)
$$

i.e., the index $i_1$ ranges over $0 \le i_1 \le 31$. The subscript of a, a[i], is simply

$$
\begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} j_1 \\ j_2 \end{bmatrix} + [\, 0 \,].
\qquad (3)
$$

When discussing larger program structures, we introduce the notion of computation sets, where $Q = (BJ \le b, (s_i \mid Q_i))$ is a computation set consisting of the loop bounds $BJ \le b$ and either enclosed statements $(s_1, \ldots, s_n)$ or further loop nests $(Q_1, \ldots, Q_n)$.

Fig. 3. Example contrasting (a) the standard approach to shared-memory parallelization and (b) the novel scheme including data transformation, (c) creation of one program per processor, and (d) address resolution applied to the first loop in Fig. 1a.

4 PROGRAM RECOVERY

This section describes two program recovery techniques to aid later parallelization.

4.1 Array Recovery

Array recovery consists of two main stages. The first stage determines whether the program is in a form amenable to conversion and consists of a number of checks. The second stage gathers information on arrays and pointer initialization, pointer increments, and the properties of loop nests. This information is used to replace pointer accesses by the corresponding explicit array accesses and to remove pointer arithmetic completely. For more details, see [10].

4.1.1 Pointer Assignments and Arithmetic

Pointer assignment and arithmetic are restricted in our analysis. Pointers may be initialized to an array element whose subscript is an affine expression of the enclosing iterators and whose base type is scalar. Simple pointer arithmetic and assignment are also allowed.
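A brief sketch of pointer usage that passes and fails these checks (variable names are illustrative):

```c
/* Illustrative examples of the pointer restrictions (names are made up). */
int x[32], y[32];

void accepted(void)
{
    int *p = &x[0];                  /* initialized to an affine element  */
    for (int i = 0; i < 32; i++)
        y[i] = *p++;                 /* simple postincrement arithmetic   */
}

void not_accepted(void)
{
    for (int i = 0; i < 8; i++) {
        int *p = &x[i * i];          /* subscript not affine in the       */
        y[i] = *p;                   /* iterator: rejected by the checks  */
    }
}
```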


Fig. 4. Example showing incremental locality optimization ((a) isolation of local and remote array accesses, (b) introduction of separate loops loading remote data, and (c) substitution of element-wise remote accesses with block-wise DMA transfers) applied to the code in Fig. 3c. (a) Isolate local/remote references, (b) introduce load loops, and (c) DMA remote access.


4.1.2 Pointers to Other Pointers

Pointers to pointers are prohibited in our scheme. An assignment to a dereferenced pointer may have side effects on the relation between other pointers and arrays that are difficult to identify; fortunately, such constructs are rarely found in DSP programs.

4.1.3 Dataflow Information

Once the program has been checked, the second stage of the algorithm gathers information on pointer usage before pointer conversion. Pointer initializations and updates are captured in a system of dataflow equations, which is solved by an efficient one-pass algorithm [10].

4.1.4 Pointer Conversion

The index expressions of the array accesses are now constructed from the dataflow information (see the first loop in Fig. 1b). In a separate pass, pointer accesses and arithmetic are replaced and removed, as shown in the first loop of Fig. 1c.

4.2 Modulo Removal

Modulo addressing is a frequently occurring idiom in DSP programs. We develop a new technique to remove modulo addressing by transforming the program into an equivalent linear form, if one exists. This is achieved by using the rank-modifying transformation framework [9] that manipulates extended linear expressions including mods and divs. In [13], this was mainly used to reason about reshaped arrays; here, we use it as a program recovery technique.

We restrict our attention to simple modulo expressions of the form $(a_t \cdot j_t)\,\%\,c_t$, where $j_t$ is an iterator, $a_t, c_t$ are constants, and $t$ is the reference containing the modulo expression. More complex references are highly unlikely, but may be addressed by extending the approach below to include skewing.

Let $l$ be the least common multiple of the $c_t$. In the second loop in Fig. 1a, we have $c_1 = 8$, $c_2 = 4$ from the accesses to g and h and, hence, $l = 8$.

We then apply a loop strip-mining transformation $S$ based on $l$ to the loop nest. The particular formulation we use is based on rank-modifying transformations [9], which have the benefit of being well-formulated and can fit into a general, linear algebraic transformation framework. As $l = 8$,

$$
S = \begin{bmatrix} 1 & 0 \\ 0 & s_8 \end{bmatrix}
  = \begin{bmatrix} 1 & 0 \\ 0 & (\cdot)/8 \\ 0 & (\cdot)\%8 \end{bmatrix},
\qquad (4)
$$

where $s_8 = [\,(\cdot)/8 \quad (\cdot)\%8\,]^T$. When applied to the iterator, $J' = SJ$, we have the new iteration space:

$$
\left( B'J' \le b' \right)
  = \left( \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix} B S^{\dagger} S J
      \le \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix} b \right),
\qquad (5)
$$

where $S^{\dagger}$ is the pseudoinverse [9] of $S$, in this case:

$$
S^{\dagger} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 8 & 1 \end{bmatrix}.
\qquad (6)
$$

When applied to the second loop nest shown in (1), we have the new iteration space:

$$
\begin{bmatrix}
-1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} j_1 \\ j_2 \\ j_3 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 31 \\ 3 \\ 7 \end{bmatrix},
\qquad (7)
$$

i.e., the second loop nest shown in Fig. 1b. The new array accesses are found by $U' = US^{\dagger}$, giving the new access shown in the second loop of Fig. 1b. This process is repeated until no modulo operations remain. The one modulo expression in the array g subscript remaining in the second loop of Fig. 1b is removed by applying a further strip-mining transformation to give the code in Fig. 1c.
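The sketch below applies the same strip-mining idea to a generic circular-buffer loop; the buffer names and sizes are illustrative rather than taken from Fig. 1.

```c
/* Modulo removal by strip-mining (illustrative names and sizes). */
#define N 32

/* Before: circular-buffer addressing via a modulo subscript. */
void circ_before(int out[N], const int g[8])
{
    for (int j = 0; j < N; j++)
        out[j] = g[j % 8];
}

/* After: j is strip-mined by l = 8 into (j2, j3), so j = 8*j2 + j3 and
 * j % 8 becomes simply j3, a linear subscript. */
void circ_after(int out[N], const int g[8])
{
    for (int j2 = 0; j2 < N / 8; j2++)
        for (int j3 = 0; j3 < 8; j3++)
            out[8 * j2 + j3] = g[j3];
}
```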

5 DATA PARALLELISM

This section briefly describes the baseline parallelization approach of our scheme.

5.1 Partitioning

We attempt to partition the data along those aligned dimensions of the array that may be evaluated in parallel and minimize communication. More sophisticated approaches are available [11], [12], but are beyond the scope of this article. Partitioning based on alignment [4], [13], [14], [15] tries to maximize the rows that are equal in a subscript matrix.

The most aligned row, MaxRow, determines the index to partition along. We construct a partition matrix $P$ defined by:

$$
P_i = \begin{cases} e_i^T & i = \mathrm{MaxRow} \\ 0 & \text{otherwise,} \end{cases}
\qquad (8)
$$

where $e_i^T$ is the $i$th row of the identity matrix $Id$. We also construct a sequential matrix $S$ containing those indices not partitioned such that $P + S = Id$. In the example in Fig. 3, there is only one index to partition along. Therefore, $P = [1]$ and $S = [0]$.

5.2 Mapping

Once the array indices to partition have been decided, we data strip-mine the indices $I$ based on the partition matrix $P$ and strip-mine matrix $S$ to give the new array indices $I'$. The data strip-mining matrix $S$ is defined:

$$
S = \begin{bmatrix} Id_{k-1} & 0 & 0 \\ 0 & s_p & 0 \\ 0 & 0 & Id_{N-k} \end{bmatrix},
\qquad (9)
$$

where $s_p = [\,(\cdot)\%p \quad (\cdot)/p\,]^T$ and $p$ is the number of processors. Within embedded systems, it is realistic to assume that this number is constant and already known at compile time. In our example in Fig. 3, $p = 4$. For further details of the transformation framework, see [9]. Here, the transformation is used to expose the processor ID of all array references and is critical to our scheme. To show this, let $T$ be the mapping transformation $T = PS + S$. Thus, the partitioned indices are strip-mined and the sequential indices are left alone. In our example, $T = [1]S + [0] = S$ and the new array indices, $I'$, are given by $I' = TI$. The new array bounds, $A'I' \le a'$, are then found using



$$
\begin{bmatrix} T & O \\ O & T \end{bmatrix} A T^{-1} I'
\le
\begin{bmatrix} T & O \\ O & T \end{bmatrix} a,
\qquad (10)
$$

which transforms the array bounds in (2) to:

$$
\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} i_1 \\ i_2 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ 0 \\ 3 \\ 7 \end{bmatrix},
\qquad (11)
$$

i.e., a[4][8]. The new array subscripts are also found: $U' = TU$. In general, without any further loop transformations, this will introduce mods and divs into the array accesses. In our example in Fig. 3, we would have a[i%4][i/4] += b[i%4][i/4] + c[j%4][j/4].

However, this framework [9] always generates a suitable recovery loop transformation, in this case, the same transformation $T$. Applying $T$ to the enclosing loop iterators, we have $J' = TJ$ and, updating the access matrices, we have, for array a:

$$
U''J = TUT^{-1}J = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} j_1 \\ j_2 \end{bmatrix},
\qquad (12)
$$

i.e., a[z][i]. The resulting code is shown in Fig. 3b, where we have exposed the processor ID of each reference without any expensive subscript expressions.
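In source terms, the combined effect of (9)-(12) on the running example can be sketched as follows (the loop body and bounds are illustrative; the authors' actual code is in Fig. 3b):

```c
/* Sketch of data strip-mining plus loop recovery for p = 4 processors.
 * Original: a[i] += b[i] + c[j] over 0 <= i, j < 32 with 1-D arrays.
 * The outer index z of each strip-mined array is the exposed processor ID. */
#define P 4

int a[P][8], b[P][8], c[P][8];

void strip_mined(void)
{
    for (int z = 0; z < P; z++)              /* exposed processor ID loop  */
        for (int i = 0; i < 8; i++)
            for (int zz = 0; zz < P; zz++)   /* strip-mined j loop         */
                for (int j = 0; j < 8; j++)
                    a[z][i] += b[z][i] + c[zz][j];
}
```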

5.3 Address Resolution

Once the array is partitioned across several processors, each local partition has to be given a new, distinct name as there is no single address space supported. We therefore introduce, for each local subarray, a new name equal to the old name followed by the processor identity number. We wish one of these arrays to reside and be allocated on the local processor and the remainder to be declared extern, allowing remote references, the addresses of which will be resolved by the linker.

The complete algorithm is given in Fig. 5a, where the function insert inserts the declarations. Fig. 3d shows the declarations inserted for the arrays a, b, and c. The only further change is to the type declaration whenever the arrays are passed into a function. The type declaration is changed from int[][] to int *[] and this must be propagated interprocedurally. Once this has been applied, no further transformation or code modification is required.

5.4 Synchronization

Once the program and data have been partitioned across the processors, we must guarantee that the program executes correctly. Synchronization primitives must be inserted so that the correct data values are communicated and no race conditions occur. In message-passing machines, this can be achieved by synchronized communication. However, as we have an unsynchronized read/write model of memory access, it is the compiler's responsibility to insert synchronization where appropriate.

Synchronization need only be enforced when there exists a cross-processor data dependence [16]. If the source and sink of a dependence are on the same processor, there is no need for synchronization, as normal von Neumann ordering of instructions ensures that such dependences are honored. However, determining the location of the source and sink of each data dependence is generally nontrivial. Previous work [20] required array section analysis to determine the location of each source and sink and relied on the properties of owner-computes scheduling and data alignment to reduce the complexity of the task. However, we can take advantage of the mapping approach described above, which makes the location of each array access explicit. If two references involved in a dependence have the same processor ID after partitioning, then the dependence is removed from further consideration. All remaining dependences are considered as being cross-processor and requiring synchronization.

Determining the equality of processor IDs is trivial, as their values are either reduced to constants after macro expansion of #define z and constant folding, or are general functions. We restrict our attention to constant equality, as determining function equality is, in general, undecidable.

determining function equality is, in general, undecidable.Let ðUsource; uusourceÞ and ðUsink; uusinkÞ be the access matrix/

vector pair of the source and sink of a data dependence.

Then, for each partitioned index x, corresponding to the

nonzero row entries of the partitioning matrix P, if

Usourcex ¼ Usink

x ¼ 0 ^ uusourcex ¼ uusink

x ; ð13Þ

then we can remove this dependence from further

consideration.This analysis can be expanded for nearest-neighbor

synchronization, i.e.,

Usourcex ¼ Usink

x ¼ 0 ^ uusourcex � uusink

x ¼ c; ð14Þ

where c is a constant allowing post/wait synchronization.However, for the purposes of this paper, we only considerbarrier synchronization.
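As an illustration, the test in (13) could be implemented along the following lines, assuming the access pairs are held as plain integer matrices and vectors (these data structures are simplified stand-ins, not the actual SUIF representation):

```c
/* Illustrative check of condition (13): a dependence whose source and sink
 * have zero access-matrix rows and equal offsets in every partitioned index
 * stays on one processor and needs no synchronization. */
#include <stdbool.h>

#define MAX_ITERS 8

typedef struct {
    int U[MAX_ITERS];   /* row x of the access matrix for this reference   */
    int u;              /* component x of the offset vector                */
    int iters;          /* number of enclosing loop iterators              */
} PartitionedIndex;

static bool row_is_zero(const int *row, int m)
{
    for (int k = 0; k < m; k++)
        if (row[k] != 0)
            return false;
    return true;
}

/* Returns true if the dependence between source and sink can be dropped. */
bool same_processor(const PartitionedIndex *src, const PartitionedIndex *snk)
{
    return row_is_zero(src->U, src->iters) &&
           row_is_zero(snk->U, snk->iters) &&
           src->u == snk->u;
}
```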

Once these local dependences have been removed, synchronization must be inserted so that all the cross-processor dependences are satisfied. We use the algorithm described in [17], which has provably minimal insertion of barrier synchronizations in basic blocks, perfect loops, and certain classes of imperfect loop nests.

Fig. 5. Algorithms for address resolution and overall parallelization. (a) Address resolution algorithm and (b) overall parallelization algorithm.

6 LOCALITY ANALYSIS

Once partitioning and address resolution have been applied to the program, it will execute correctly on a multiprocessor. The source code generated looks very similar to that generated for a single-address space machine [8], with the addition of a local pointer array for each distributed array. This straightforward code, however, introduces runtime overhead and does not exploit locality.

6.1 Exploiting Local Accesses

Embedded processors are almost entirely statically scheduled (i.e., not superscalar) and must make the conservative assumption that the final location of each element accessed is remote unless proven otherwise. Bandwidth to on-chip SRAM is greater than via the external bus, hence local bandwidth will be underutilized for local accesses. In general, determining whether an array reference is entirely local throughout the runtime of a program is nontrivial. However, as our partitioning scheme explicitly incorporates the processor ID in the array reference, we simply check to see if it equals the processor ID, z.

This is a simple syntactic transformation. Given an array access $UJ + u$, a pointer array name $X$, and the syntactic concatenation operator $:$, we have

$$
X[UJ + u] \;\mapsto\; X\!:\!u_1[\, U_{2,\ldots,N}J + u_{2,\ldots,N} \,].
$$

Applying this to the example in Fig. 3c, we have the code in Fig. 3d, where all accesses to a0, b0, and c0 can be statically determined as local by the native compiler.
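In source terms, the substitution amounts to the following sketch on processor 0 (the generated code in Fig. 3d is the authors' actual output):

```c
/* Sketch of the local-access substitution on processor 0 (z == 0). */
#define z 0
extern int *a[4];      /* descriptor pointer array (baseline access path)  */
extern int a0[8];      /* local partition, resident on this processor      */

void example(void)
{
    int via_descriptor = a[z][3];  /* baseline: indirect pointer look-up   */
    int direct_local   = a0[3];    /* after substitution: provably local,  */
                                   /* so full on-chip bandwidth is usable  */
    (void)via_descriptor;
    (void)direct_local;
}
```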

6.2 Locality in Remote Accesses

The memory hierarchy of a single processor is relatively straightforward, as there are no caches. Exploiting locality is largely achieved by ensuring data fits in the on-chip SRAM and by effective register allocation.

The lack of a cache structure has a significant impact on remote accesses in multi-DSPs, however. Repeated reference to a remote data item will incur multiple remote accesses unless the compiler explicitly restructures the program to exploit the available temporal locality.

Our approach is to determine those elements likely to be remote and to transfer the remote data to a local temporary, which is then used in the computation. The remote data transfer code is transformed to exploit temporal and spatial locality when using the underlying DMA engine.

6.2.1 Index Splitting

We first separate local and remote references by splitting the iteration space into regions where data is either local or remote. As the processor ID is explicit in our framework, we do not need array section analysis to perform this.

For each remote reference, the original loop is partitioned into $n$ separate loop nests using index set splitting:

$$
Q(AJ \le b, Q_1) \;\mapsto\; Q_i(AJ \le b \wedge C_i, Q_1), \quad \forall i \in 1, \ldots, n,
\qquad (15)
$$

where $n = 2d + 1$ and $d$ is the number of dimensions partitioned. In the example in Fig. 3d, we partition on just one dimension, hence $n = 3$, and we have the following constraints:

$$
C_0 : 0 \le j_1 \le z - 1 \qquad (16)
$$
$$
C_1 : j_1 = z \qquad (17)
$$
$$
C_2 : z + 1 \le j_1 \le 3. \qquad (18)
$$

This gives the program in Fig. 4a.
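A sketch of the resulting loop structure on processor z, with the loop bodies elided (the authors' actual code is Fig. 4a):

```c
/* Index-set splitting of the j1 loop on processor z; bodies are elided.
 * The three loops correspond to constraints (16), (17), and (18). */
void split(int z)
{
    for (int j1 = 0; j1 <= z - 1; j1++) {   /* C0: remote partitions < z  */
        /* remote accesses go through the descriptor pointer array */
    }

    /* C1: j1 == z, the purely local region; references here are replaced
     * by the local array name (e.g., c0 instead of c[j1]). */

    for (int j1 = z + 1; j1 <= 3; j1++) {   /* C2: remote partitions > z  */
        /* remote accesses go through the descriptor pointer array */
    }
}
```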

6.2.2 Remote Data Buffer Size

Before any remote access optimization can take place, there must be sufficient storage. Let $s$ be the storage available for remote accesses. We simply check that the remote data fits, i.e., $\|UJ + u\| \le s$.

The Omega Calculator [18] can determine this value using enumerated Presburger formulae. If this condition is not met, we currently abandon further optimization. Strip-mining, array contraction, and loop fusion can reduce the temporary size, but are not explored further here.

6.2.3 Load Loops

Load loops are introduced to exploit locality of remote accesses. A temporary, with the same subscripts as the remote reference, is introduced, which is then followed by loop distribution. This is always legal, as there are no dependence cycles between statements.

The transformation is of the form $Q \mapsto (Q_1, \ldots, Q_K)$. In other words, a single loop nest $Q$ is distributed so that there are now $K$ loop nests, $K - 1$ of which are load loops. In our example in Fig. 4b, there is only one remote array, hence $K = 2$.
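A sketch of load-loop introduction for a single remote reference; the names and extents are illustrative and the authors' actual code is in Fig. 4b.

```c
/* Loop distribution introduces a load loop that copies the remote data into
 * a local temporary before the compute loop uses it (illustrative names). */
extern int *c[4];                       /* descriptor pointer array        */

void with_load_loop(int a0[8], const int b0[8], int r /* remote ID */)
{
    int temp[8];

    for (int j2 = 0; j2 < 8; j2++)      /* load loop (the K - 1 = 1 case)  */
        temp[j2] = c[r][j2];            /* element-wise remote read        */

    for (int i = 0; i < 8; i++)         /* original compute loop, now      */
        for (int j2 = 0; j2 < 8; j2++)  /* reading the local temporary     */
            a0[i] += b0[i] + temp[j2];
}
```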

6.2.4 Transform Load Loops to Exploit Locality

Temporal locality in the load loops corresponds to an invariant access iterator or the null space of the access matrix, i.e., $N(U)$. There always exists a transformation $T$, found by reducing $U$ to Smith normal form, that transforms the iteration space such that the invariant iterator(s) are innermost and can be removed by Fourier-Motzkin elimination. The i loops of both load loops in Fig. 4b are invariant and are removed as shown in Fig. 4c.

6.2.5 Stride

In order to allow large amounts of remote data to be accessed in one go, rather than a separate access per array element, the data must be accessed in stride-1 order. This can be achieved by a simple loop transformation $T$ which guarantees stride-1 access. The transformation is found to be $T = U$. In our example, Fig. 4b, $T$ is the identity transformation, as the accesses in the load loop are already in stride-1 order.

6.2.6 Linearize

Distinct remote references may be declared as having varying dimensions, yet the data storage area we set aside for remote accesses is fixed and one-dimensional. Therefore, the temporary array must be linearized throughout the program and all references updated accordingly: $U'_t = LU_t$. In our case, $L = [\,8 \quad 1\,]$, which transforms the array accesses from temp[j1][j2] to temp[8*j1+j2] in Fig. 4b.

6.2.7 Convert to DMA Form

The format of a DMA transfer requires the start address of the remote data and the local memory location where it is to be stored, plus the amount of data to be stored. This is achieved by effectively vectorizing the inner loop, removing it from the loop body and placing it within the DMA call. The start address of the remote element is given by the lower bound of the innermost loop and the size is equal to its range. Thus, we transform the remote array access as follows: $U_M = 0$, $u_M = \min(J_m)$.

The temporary array access is similarly transformed and the innermost loop $J_m$ is eliminated by Fourier-Motzkin elimination. Finally, we replace the assignment statement by a generic DMA transfer call get(&tempref, &remoteref, size) to give the final code in Fig. 4c.
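A sketch of the final DMA form; get() stands for the generic transfer primitive named above, and its prototype here is a hypothetical stand-in rather than the actual TigerSHARC DMA interface (the authors' code is Fig. 4c).

```c
/* Sketch of the DMA form: one block transfer replaces the element-wise
 * load loop. The get() prototype is a hypothetical stand-in for the
 * generic DMA primitive used in the paper. */
extern void get(void *local_dest, const void *remote_src, int n_elements);
extern int *c[4];                       /* descriptor pointer array        */

void with_dma(int a0[8], const int b0[8], int r /* remote ID */)
{
    int temp[8];

    get(&temp[0], &c[r][0], 8);         /* single DMA transfer of 8 ints   */

    for (int i = 0; i < 8; i++)
        for (int j2 = 0; j2 < 8; j2++)
            a0[i] += b0[i] + temp[j2];
}
```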

7 EXPERIMENTS

Our overall parallelization algorithm is shown in Fig. 5b. We currently check if the loop trip count is greater than eight before continuing beyond step 2. We prototyped our algorithm in the SUIF 1.3 compiler. SUIF [19] was selected as it is the best publicly available auto-parallelizing C compiler, though it targets shared-memory platforms.

We evaluated the effectiveness of our parallelization scheme against two different benchmark sets: DSPstone [2] and UTDSP [7]. The programs were executed on a Transtech TS-P36N board with a cluster of four cross-connected 250 MHz TigerSHARC TS-101 DSPs, all sharing the same external bus and 128 MB of external SDRAM. The programs were compiled with the Analog Devices VisualDSP++ 2.0 Compiler (version 6.1.18) with full optimization; all timings are cycle accurate.

Due to its emphasis on code generation issues, DSPstone frequently considers artificially small data set sizes. Therefore, we have evaluated it using its original small data sets as well as a scaled version wherever appropriate. Unlike DSPstone, UTDSP comprises a set of applications as well as compute-intensive kernels, of which many are available in their original pointer-based form as well as in an array-based form.

7.1 Program Recovery

Fig. 6a shows the set of loop-based DSPstone programs. Initially, the compiler fails to parallelize these programs because they make extensive use of pointer arithmetic for array traversals, as shown in the second column. However, after applying array recovery (column 3), most of the programs become parallelizable (column 4). In fact, the only program that cannot be parallelized after array conversion (biquad) contains a cross-iteration data dependence that does not permit parallelization. adpcm is the only program in this benchmark set that cannot be recovered due to its complexity. The fifth column of Fig. 6a shows whether or not a program can be profitably parallelized. Programs comprising only very small loops, such as dot_product and matrix1x3, perform better when executed sequentially due to the overhead associated with parallel execution and are filtered out at stage 2 by our algorithm.

The impact of modulo removal can be seen in Fig. 6b. Four of the UTDSP programs (IIR, ADPCM, FIR, and LMSFIR) can be converted into a modulo-free form by our scheme. Modulo removal has a direct impact on the parallelizer's ability to successfully parallelize those programs: three out of four programs could be parallelized after the application of this transformation. ADPCM cannot be parallelized after modulo removal due to data dependences.

Although program recovery is used largely to facilitate parallelization and multiprocessor performance, it can impact sequential performance as well. The first two columns of each set of bars in Figs. 7 and 8 show the original sequential time and the speedup after program recovery. Three out of the eight DSPstone benchmarks benefit from this transformation, whereas only a single kernel (fir) experiences a performance degradation after program recovery. In fir2dim, lms, and matrix2, array recovery has enabled better data dependence analysis and allowed a tighter scheduling in each case. fir has a very small number of operations, such that the slight overhead of enumerating array subscripts has a disproportionate effect on its performance. Fig. 8 shows the impact of modulo removal on the performance of the UTDSP benchmark. Since the computation of a modulo is a comparatively expensive operation, its removal positively influences the performance of the three programs wherever it is applicable.

Fig. 6. Exploitable parallelism in DSPstone and UTDSP. (a) DSPstone and (b) UTDSP.

7.2 Data Parallelism, Data Distribution, and Address Resolution

The third column of each set of bars in Figs. 7 and 8 shows the effect of blindly using a single-address space approach to parallelization without data distribution on a multiple-address space machine. Not surprisingly, performance is universally poor. The fourth column in each figure shows the performance after applying data distribution, mapping, and address resolution. Although some programs experience a speedup over their sequential version (convolution and fir2dim), the overall performance is still disappointing. After a closer inspection of the generated assembly codes, it appears that the Analog Devices compiler cannot distinguish between local and remote data. It conservatively assumes all data is remote and generates "slow" accesses, i.e., double word (instead of quad word) accesses to local data are generated and an increased memory access latency is accounted for in the produced VLIW schedule. In addition, all remote memory transactions occur element-wise and do not effectively utilize the DMA engine.

7.3 Localization

The final columns of Figs. 7 and 8 show the performance after the locality optimizations are applied to the partitioned code. Accesses to local data are made explicit, so the compiler can identify local data and is able to generate tighter and more efficient schedules. In addition, remote memory accesses are grouped to utilize the DMA engine. In the case of DSPstone, linear or superlinear speedups are achieved for all programs bar one (fir), where the number of operations is very small. Superlinear speedup occurs in precisely those cases where program recovery has given a sequential improvement over the pointer-based code. The overall speedups vary between 1.9 (fir) and 6.5 (matrix2); their average is 4.28 on four processors.

The overall speedup for the UTDSP benchmarks is less dramatic, as the programs are more complex, including full applications, and have a greater communication overhead. These programs show speedups between 1.33 and 5.69, and an average speedup of 3.65. LMSFIR and Histogram fail to give significant speedup due to the lack of sufficient data parallelism inherent in the programs. Conversely, FIR, MULT(large), Compress, and JPEG Filter give superlinear speedup due to improved sequential performance of the programs after parallelization. As the loops are shorter after parallelization, it appears that the native loop unrolling algorithm performs better on the reduced trip count.

Fig. 7. Incremental performance development for profitably parallelizable DSPstone benchmark kernels after program recovery, parallelization, partitioning and address resolution, and localization.

Fig. 8. Incremental performance development for profitably parallelizable UTDSP benchmark kernels after program recovery, parallelization, partitioning and address resolution, and localization.

7.4 Scalability

An important characteristic of a parallel program is its scalability with the data set size. Fig. 9a shows the performance of the DSPstone programs for small and large data sets. The small data set sizes are default settings, which do not necessarily represent real-world conditions. For example, the two-dimensional image filter fir2dim is applied to a 4×4 image. Obviously, there is little parallelism available in such an artificially reduced data set. However, our parallelization approach still achieves a speedup on four out of eight programs with minimal data sets. n_real_updates and n_complex_updates come close to linear speedup, whereas both matrix multiplication implementations, matrix1 and matrix2, experience a superlinear speedup even for 10×10 matrices. The remaining four programs, convolution, fir, fir2dim, and lms, do not profit from parallelization for very small data set sizes (approximately 16 data elements).

While 19 of the 30 UTDSP programs are parallel, only 9 of the 19 are determined profitably parallelizable by our technique. The others are either sequential or exhibit too little work to parallelize. Fig. 9b shows how our approach scales with the number of processors for the Compress, FIR 256/64, and Mult_10_10 benchmarks. Mult_10_10 only contains a 10×10 array, yet after distributing the data across the processors, some speedup is available. In the other cases, with larger data sizes, the full potential of parallel execution can be exploited, giving linear speedup with the number of processors.

Fig. 9. Scalability of the novel parallelization approach with (a) the data set size and (b) the number of processors.

8 RELATED WORK

There is an extremely large body of work on compiling Fortran dialects for multiprocessors. A good overview can be found in [20]. Compiling for message-passing machines has largely focused on the HPF programming language [6]. The main challenge is correctly inserting efficient message-passing calls into the parallelized program [4], [6] without requiring complex runtime bookkeeping [5], [21].

Although compilers for distributed shared memory (DSM) must incorporate data distribution and data locality optimizations [8], they are not faced with the problem of multiple, but globally-addressable, address spaces. Compiling for DSM has moved from primarily loop-based parallelization [19] to approaches that combine data placement and loop optimization [22] to exploit both parallelism and locality.

Both message-passing and DSM platforms have benefited from the extensive work in automatic data partitioning [11], [12] and alignment [13], [14], [15], potentially removing the need for HPF directives (pragmas) for message-passing machines and reducing memory and coherence traffic in the case of DSMs.

The work closest to our approach, from the parallelizing Fortran community, is [23]. This work successfully examines auto-parallelizing techniques for the Cray T3D. To improve communication performance, it introduces private copies of shared data that must be kept consistent using a complex linear memory array access descriptor. In contrast, we do not keep copies of shared data; instead, we use an access descriptor as a means of having a global name for data. In [23], an analysis is developed for nearest-neighbor communication, but not for general communication. As our partitioning scheme exposes the processor ID, it eliminates the need for any array section analysis and handles general global communication.

In the area of auto-parallelizing C compilers, SUIF [19] is the most significant work, though it targets single-address space machines. Modulo recovery is considered in [24], where a large, highly specialized framework based on Diophantine equations is presented to solve modulo accesses. It, however, introduces floor, div, and ceiling functions, and its effect on other parts of the program is not considered. There is a large body of work on developing loop and data transformations to improve memory access [22], [25]. In [8], a data transformation, data tiling, is used to improve spatial locality, but the representation does not allow easy integration with other loop and data transformations.

As far as DSP parallelization is concerned, an early paper [26] described how such programs may be parallelized, but gave no details or experimental results. Similarly, in [27], an interesting overall parallelization framework is described, but no mechanism or details of how parallelization might take place is provided. In [28], the impact of different parallelization techniques is considered; however, this was user-directed and no automatic approach was provided. In [29], a semiautomatic parallelization method to enable design-space exploration of different multiprocessor configurations based on the MOVE architecture is presented. However, no integrated data partitioning strategy was available and data was allocated to a single processor in the example codes. Furthermore, in the experiments, communication was modeled in their simulator and, thus, the issue of mapping parallelism combined with distributed address spaces was not addressed.

9 CONCLUSION

Multiple-address space embedded systems have proven to be a challenge to compiler vendors and researchers due to the complexity of the memory model and the idiomatic programming style of DSP applications. This paper has developed an integrated approach that gives an average speedup of 3.78 on four processors when applied to 17 benchmarks from the DSPstone and UTDSP benchmark suites. This is a significant finding and suggests that multiprocessor DSPs can be a cost-effective solution to high-performance embedded applications and that compilers can exploit such architectures automatically. Future work will consider other forms of parallelism found in DSP applications and integrate this with further uniprocessor optimizations.

REFERENCES

[1] R. Leupers, "Novel Code Optimization Techniques for DSPs," Proc. Second European DSP Education and Research Conf., 1998.
[2] V. Zivojnovic, J.M. Velarde, C. Schlager, and H. Meyr, "DSPstone: A DSP-Oriented Benchmarking Methodology," Proc. Int'l Conf. Signal Processing Applications & Technology (ICSPAT '94), pp. 715-720, 1994.
[3] S.L. Scott, "Synchronization and Communication in the T3E Multiprocessor," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pp. 26-36, 1996.
[4] S. Chakrabarti, M. Gupta, and J.-D. Choi, "Global Communication Analysis and Optimization," Proc. SIGPLAN Conf. Programming Language Design and Implementation (PLDI '96), pp. 68-78, 1996.
[5] S. Hiranandani, K. Kennedy, and C.-W. Tseng, "Compiling Fortran D for MIMD Distributed-Memory Machines," Comm. ACM, vol. 35, no. 8, pp. 66-80, 1992.
[6] J. Mellor-Crummey, V. Adve, B. Broom, D. Chavarria-Miranda, R. Fowler, G. Jin, K. Kennedy, and Q. Yi, "Advanced Optimization Strategies in the Rice dHPF Compiler," Concurrency: Practice and Experience, vol. 14, nos. 8-9, pp. 741-767, 2002.
[7] C.G. Lee and M. Stoodley, "UTDSP Benchmark Suite," Univ. of Toronto, Canada, 1992, http://www.eecg.toronto.edu/corinna/DSP/infrastructure/UTDSP.html.
[8] J.M. Anderson, S.P. Amarasinghe, and M.S. Lam, "Data and Computation Transformations for Multiprocessors," Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '95), pp. 166-178, 1995.
[9] M.F.P. O'Boyle and P.M.W. Knijnenburg, "Integrating Loop and Data Transformations for Global Optimisation," J. Parallel and Distributed Computing, vol. 62, no. 4, pp. 563-590, 2002.
[10] B. Franke and M.F.P. O'Boyle, "Array Recovery and High-Level Transformations for DSP Applications," ACM Trans. Embedded Computing Systems, vol. 2, no. 2, pp. 132-162, 2003.
[11] B. Bixby, K. Kennedy, and U. Kremer, "Automatic Data Layout Using 0-1 Integer Programming," Proc. Parallel Architectures and Compilation Techniques Conf. (PACT '94), 1994.
[12] J. Garcia, E. Ayguadé, and J. Labarta, "A Framework for Integrating Data Alignment, Distribution, and Redistribution in Distributed Memory Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 4, Apr. 2001.
[13] D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill, "Solving Alignment Using Elementary Linear Algebra," Proc. Seventh Int'l Workshop Languages and Compilers for Parallel Computing (LCPC '94), pp. 46-60, 1994.
[14] K. Knobe, J. Lukas, and G. Steele, "Data Optimization: Allocation of Arrays to Reduce Communication on SIMD Machines," J. Parallel and Distributed Computing, vol. 8, no. 2, pp. 102-118, 1990.
[15] S. Chatterjee, J.R. Gilbert, L. Oliker, R. Schreiber, and T.J. Sheffler, "Algorithms for Automatic Alignment of Arrays," J. Parallel and Distributed Computing, vol. 38, no. 2, pp. 145-157, 1996.
[16] M.F.P. O'Boyle and F. Bodin, "Compiler Reduction of Synchronisation in Shared Virtual Memory Systems," Proc. Ninth Int'l Conf. Supercomputing (ICS '95), pp. 318-327, 1995.
[17] M.F.P. O'Boyle and E.A. Stohr, "Compile Time Barrier Synchronization Minimization," IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 6, pp. 529-543, June 2002.
[18] W. Pugh, "Counting Solutions to Presburger Formulas: How and Why," Proc. SIGPLAN Conf. Programming Language Design and Implementation (PLDI '94), pp. 121-134, 1994.
[19] M.W. Hall, J.M. Anderson, S.P. Amarasinghe, B.R. Murphy, S.-W. Liao, E. Bugnion, and M.S. Lam, "Maximizing Multiprocessor Performance with the SUIF Compiler," Computer, vol. 29, no. 12, pp. 84-89, Dec. 1996.
[20] R. Gupta, S. Pande, K. Psarris, and V. Sarkar, "Compilation Techniques for Parallel Systems," Parallel Computing, vol. 25, nos. 13-14, pp. 1741-1783, 1999.
[21] J.R. Larus, "Compiling for Shared-Memory and Message-Passing Computers," ACM Letters on Programming Languages and Systems, vol. 2, nos. 1-4, pp. 165-180, 1993.
[22] M. Kandemir, J. Ramanujam, and A. Choudhary, "Improving Cache Locality by a Combination of Loop and Data Transformations," IEEE Trans. Computers, vol. 48, no. 2, pp. 159-167, Feb. 1999.
[23] Y. Paek, A.G. Navarro, E.L. Zapata, and D.A. Padua, "Parallelization of Benchmarks for Scalable Shared-Memory Multiprocessors," Proc. Conf. Parallel Architectures and Compilation Techniques (PACT '98), 1998.
[24] F. Balasa, F.H.M. Franssen, F.V.M. Catthoor, and H.J. De Man, "Transformation of Nested Loops with Modulo Indexing to Affine Recurrences," Parallel Processing Letters, vol. 4, no. 3, pp. 271-280, 1994.
[25] S. Carr, K.S. McKinley, and C.-W. Tseng, "Compiler Optimizations for Improving Data Locality," Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pp. 252-262, 1994.
[26] J. Teich and L. Thiele, "Uniform Design of Parallel Programs for DSP," Proc. IEEE Int'l Symp. Circuits and Systems (ISCAS '91), pp. 344a-347a, 1991.
[27] A. Kalavade, J. Othmer, B. Ackland, and K.J. Singh, "Software Environment for a Multiprocessor DSP," Proc. 36th ACM/IEEE Design Automation Conf. (DAC '99), 1999.
[28] D.M. Lorts, "Combining Parallelization Techniques to Increase Adaptability and Efficiency of Multiprocessing DSP Systems," Proc. Ninth DSP Workshop (DSP 2000) / First Signal Processing Education Workshop (SPEd 2000), 2000.
[29] I. Karkowski and H. Corporaal, "Exploiting Fine- and Coarse-Grain Parallelism in Embedded Programs," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '98), pp. 60-67, 1998.

Björn Franke received the Diplom degree in computer science from the University of Dortmund in 1999, and the MSc and PhD degrees in computer science from the University of Edinburgh in 2000 and 2004, respectively. He is currently a lecturer in the Institute for Computing Systems Architecture (ICSA) at the University of Edinburgh, where he is a member of the compiler and architecture group. His main research interests are in compilers for high-performance embedded systems.

Michael F.P. O'Boyle received the PhD degree in computer science from the University of Manchester in July 1992. He was formerly a visiting scholar at Stanford University and a visiting professor at UPC, Barcelona. He is currently an EPSRC advanced research fellow and a reader at the University of Edinburgh. His main research interests are in adaptive compilation and automatic compilation for multicore architectures.
