
Wavefront Array Processors-Concept to Implementation




S.Y. Kung, S.C. Lo, S.N. Jean, and J.N. Hwang
University of Southern California

Most signal and image processing algorithms can be decomposed into computational wavefronts that can be processed on pipelined arrays.

The supervisory overhead incurred in general-purpose supercomputers often makes them too slow and expensive for real-time signal and image processing. To achieve a throughput rate adequate for these applications, the only feasible alternative appears to be massively concurrent processing by special-purpose hardware, that is, by array processors. Progress in VLSI technology has lowered implementation costs for large array processors to an acceptable level, and CAD techniques have facilitated speedy prototyping and implementation of application-oriented (or algorithm-oriented) array processors.

Digital signal and image processing encompasses a variety of mathematical and algorithmic techniques. Most signal and image processing algorithms are dominated by transform techniques, convolution and correlation filtering, and certain key linear algebraic methods. These algorithms possess properties such as regularity, recursiveness, and locality, and these properties can be exploited in array processor design. With VLSI it becomes feasible to construct an array processor that closely resembles the flow graph of a particular algorithm. This type of array maximizes the main strength of VLSI (intensive computing power) and yet circumvents its main weakness (restricted communication).

Parallel processing architectures. SIMD (single-instruction, multiple-data-stream) computers, MIMD (multiple-instruction, multiple-data-stream) computers, systolic arrays, and wavefront arrays are popular multiprocessors. It is important to clarify their similarities and differences.

SIMD arrays. An SIMD array is a synchronous array of processing elements (PEs) under the supervision of a single control unit.1 All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. Broadcasting of data is usually allowed in an SIMD array (see Figure 1a).

MIMD arrays. An MIMD computer consists of a number of PEs, each with its own control unit, program, and data.1 The main feature of an MIMD machine is that the overall processing task can be distributed among the PEs to increase processing parallelism. An MIMD machine may encounter communication bottlenecks when multiple PEs attempt to simultaneously access shared system resources. Nevertheless, the flexibility of the MIMD architecture often makes it essential for dealing with irregularly structured algorithms. A dataflow machine, an MIMD computer in which an instruction is ready for execution as soon as its



operands arrive, offers a solution to the problem of efficiently exploiting concurrency of computation on a large scale, and it is compatible with modern concepts of program structure.2

Systolic arrays. Two popular special-purpose VLSI array architectures are systolic and wavefront arrays, which boast massive concurrency derived from pipeline processing or parallel processing or both. A systolic array is a network of processors that rhythmically compute and pass data through the system. A systolic array is often algorithm-oriented and is used as an attached processor (i.e., with a host computer). It features the important properties of modularity, regularity, local interconnection, a high degree of pipelining, and highly synchronized multiprocessing. An extensive literature on systolic array processing exists; the reader is referred to Fisher and Kung3 and the references therein.

Wavefront arrays. The data movements in systolic arrays are controlled by global timing-reference "beats." The burden of synchronizing an entire systolic computing network becomes heavy for very large arrays. A simple solution is to take advantage of the dataflow computing principle, which is natural to signal processing algorithms and which leads the designer to wavefront array processing. There are two approaches to deriving wavefront array algorithms: one is to trace and pipeline the computational wavefronts; the other is based on a data flow graph (DFG) model. Conceptually, the requirement for correct timing in the systolic array is now replaced by a requirement for correct sequencing in the wavefront array.


Figure 1. A mesh-type SIMD array (a); a systolic/wavefront array (b).

Comparisons. To highlight the characteristic differences among the architectures cited above, we propose a classification as shown in Figure 2. Note that a systolic array has local instruction codes and that external data are piped into the array concurrently with the processing. SIMD and wavefront arrays can be regarded as somewhat more complex than systolic arrays. An SIMD array has control (instruction) buses and data buses (in lieu of the local instruction codes adopted in systolic arrays). A wavefront array, on the other hand, provides data-driven processing capability. MIMD multiprocessors generally offer all the features just mentioned, possibly with an additional feature: shared memories.

A mesh-type SIMD array is shown in

Figure 1a, while a systolic/wavefront array is shown in Figure 1b. Note that an SIMD array usually loads data into its local memories before the computation starts, while systolic and wavefront arrays usually pipe data from an outside host and also pipe the results back to the host. Dew and Manning4 compare SIMD arrays and systolic arrays for a vision preprocessing application. They report that local windowing operations can be effectively implemented on both systolic and SIMD arrays. However, for data-dependent operations such as a binary search correlator, the utilization of the SIMD array will be inferior to that of the systolic array. The efficiency of the systolic or wavefront array is due to the fact that the host handles image storage and can select the desired data and pipe them into the array.

The wavefront array combines the systolic pipelining principle with the dataflow computing concept. In fact, the wavefront array can be viewed as a static dataflow array that supports the direct hardware implementation of regular dataflow graphs. Exploitation of the dataflow principle makes the extraction of parallelism and the programming of wavefront arrays relatively simple.



Figure 2. Classification of SIMD machines, MIMD machines, systolic arrays, and wavefront arrays. (The classification crosses the timing scheme, globally synchronous versus data driven, with the data input scheme, pipelined through boundary PEs versus preloaded from a data bus: systolic arrays are globally synchronous with pipelined data and prestored local control (Warp is an example); SIMD machines are globally synchronous with preloaded data and control broadcast from a control unit (Illiac IV is an example); wavefront arrays are data driven with pipelined data and prestored local control (the MWAP is an example); MIMD machines are data driven with preloaded data and prestored local control (dataflow machines like the Manchester machine are examples).)

Figure 3. Wavefront processing for matrix multiplication. (Data are fed from memory modules along the left and top edges of the array; successive computational wavefronts, Front #1, Front #2, and so on, sweep diagonally across the PEs at times such as 1 + 4Δ, 1 + 4Δ + τ, 1 + 5Δ, 1 + 5Δ + τ. In the key, Δ denotes the unit time of data transfer and τ the unit time of arithmetic operation; dashed and solid lines distinguish the first and second waves.)


Wavefront array processor

An approach to deriving wavefront arrays is to trace the computational wavefronts and pipeline these fronts on a processor array. "Computational wavefront" means smooth data movement in a localized communication network. The computing network serves as a data-wave-propagating medium. A wavefront in a processor array corresponds to a mathematical recursion in the algorithm. Successive pipelining of wavefronts through the array will accomplish the computation of all recursions.

For example, let us examine how the matrix multiplication algorithm can be executed on a square, orthogonal N x N wavefront array (Figure 3). Let A = {aij}, B = {bij}, and C = A x B = {cij}, and let all be N x N matrices. The matrix A can be decomposed into columns Ai and the matrix B into rows Bj, and therefore

C = A1 * B1 + A2 * B2 + ... + AN * BN    (1)

where the product A1 * B1 is the "outer product." The matrix multiplication can then be carried out in N sets of wavefronts (recursions), each executing one outer product:

C(k) = C(k-1) + Ak * Bk

or, equivalently,

cij(k) = cij(k-1) + aik x bkj,    i, j = 1, 2, ..., N.

Let us now examine the computational wavefront for the first recursion in matrix multiplication. The elements of A are stored in memory modules to the left (in columns), and those of B in memory modules on the top (in rows). The process starts with PE (1,1), where c11(1) = c11(0) + a11 * b11 is computed. The computational activity then propagates to the neighboring PEs (1,2) and (2,1), which execute their respective operations. The next front of activity will be at PEs (3,1), (2,2), and (1,3). Thus, a computational wavefront that travels down the processor array is created. Once the wavefront sweeps through all the cells, the first recursion is complete. As the first wave propagates, we can execute an identical second recursion concurrently by pipelining a second wavefront immediately after the first one.

For example, the (1,1) processor will execute c11(2) = c11(1) + a12 * b21, and so on.

The separate roles of pipelined and parallel processing become evident when we carefully inspect how computational wavefronts that are to be processed in parallel are pipelined successively through the processor array. The pipelining is feasible because the wavefronts of two successive recursions never intersect. That is, different processors are used to execute different recursions at any given instant. The computational wavefronts are similar to electromagnetic wavefronts, since each processor acts as a secondary source and is responsible for the activation of the next front. This means that the computation is data-driven.
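The recursions above are easy to check in software. The following sketch (an illustrative Python model of our own, not code from the article; names and zero-based indexing are assumptions) performs one outer-product update per wavefront and records that front k reaches PE (i,j) at step i + j + k, which makes it visible that two successive fronts never occupy the same PE at the same instant:

import numpy as np

def wavefront_matmul(A, B):
    # Simulate N wavefronts; front k carries the recursion
    # c_ij(k) = c_ij(k-1) + a_ik * b_kj.
    N = A.shape[0]
    C = np.zeros((N, N))
    schedule = {}          # step -> list of (PE, front) pairs active then
    for k in range(N):     # front (recursion) number
        for i in range(N):
            for j in range(N):
                schedule.setdefault(i + j + k, []).append(((i, j), k))
                C[i, j] += A[i, k] * B[k, j]
    return C, schedule

A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)
C, schedule = wavefront_matmul(A, B)
assert np.allclose(C, A @ B)
print(schedule[2])  # fronts 0, 1, and 2 are active at disjoint PEs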

Note that the major difference between a wavefront array and a systolic array is the data-driven property.

A wavefront array equals a systolic array plus dataflow computing.

There is no global timing reference in a wavefront array, and yet the order of task sequencing is correctly followed. In the wavefront architecture, the information transfer between a PE and its immediate neighbors is by mutual convenience. Whenever data are available, the transmitting PE informs the receiver, and the receiver accepts the data whenever required. It then communicates with the sender to acknowledge that the data have been consumed. This scheme can be implemented by means of a simple handshaking protocol,5 which ensures that the computational wavefronts propagate in an orderly manner instead of crashing into one another. Since there is no need to synchronize the entire array, a wavefront array is truly architecturally scalable.

On the other hand, a wavefront array

and a systolic array are identical in terms of regularity, modularity, local interconnection, and pipelinability. They both consist of modular processing units with regular and (spatially) local interconnections. Moreover, their computing networks may be extended indefinitely. They exhibit a linear-rate speedup; that is, they achieve an O(M) speedup in terms of processing rates, where M is the number of PEs.

In summary, a simple way to relate the wavefront array to its systolic counterpart is

Wavefront array = systolic array + dataflow computing.

Algorithm mapping and programming

VLSI array processor technology is steadily advancing, and it is presently in transition from the research phase to the development phase. Therefore, we shall examine both the fundamental principles established by research and the implementation issues critical to development.

Mapping algorithms to systolic and wavefront arrays. As long as communication in VLSI remains restricted, locally interconnected arrays will be of great importance. An increase of efficiency can be expected if the algorithm arranges for a balanced distribution of work load while observing the requirement for locality, that is, for short communication paths. Such a load distribution and information flow serves as a guideline to the designer of VLSI algorithms and eventually leads to new architectural and language designs.

Given an algorithm, how can an array processor be systematically derived? A fundamental issue is how to express parallel algorithms in a notation that is easy to understand but yet can be compiled into efficient VLSI array processors. The ultimate design should begin with a powerful algorithmic notation that expresses the recurrence and parallelism associated with the description of the space-time activities. This description should be able to be converted into a VLSI hardware description or into executable array processor machine codes.

A VLSI algorithm is often very regular,

and the computation activities are expressible in terms of a simple grid model, as shown in Figure 4a. The computation is represented by nodes and the dependency of the computational nodes is represented by arcs. Such a representation is termed the dependence graph, or DG, of the algorithm. A DG is a directed graph, which is embedded in an index space and specifies the data dependencies of an algorithm. In a DG, nodes represent computations and arcs specify the data dependencies between computations.


Figure 4. Illustration of a linear projection with projection vector d (a); a linear schedule vector s and its hyperplanes (b).

In our notation, with respect to a dependence arc, the terminating node depends on the initiating node. In deriving an array for a given algorithm, we first derive a localized DG from the algorithm and then map the DG to a systolic array or directly to a wavefront array.

DG design. From the initial sequential description of an algorithm, we can derive a DG for that algorithm by first converting it to a single assignment form, in which any variable is assigned a unique value once in the algorithm. The single assignment form of an algorithm shows the data dependencies in it clearly. For regular and recursive algorithms, the DGs will also be regular and can be represented by a grid model; therefore, the nodes can be specified by simple indices such as (i,j,k).
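To make the idea concrete, here is a sketch (our own Python illustration; the article gives no code here) of matrix multiplication rewritten in single assignment form, where every intermediate value is indexed by (i,j,k) and assigned exactly once, so the dependence arcs of the DG can be read off directly:

import numpy as np

def matmul_single_assignment(A, B):
    # Single assignment form of C = A x B: each c[i, j, k] is written
    # exactly once, and node (i,j,k) of the DG depends only on (i,j,k-1).
    N = A.shape[0]
    c = np.zeros((N, N, N + 1))
    for i in range(N):
        for j in range(N):
            for k in range(1, N + 1):
                c[i, j, k] = c[i, j, k - 1] + A[i, k - 1] * B[k - 1, j]
    return c[:, :, N]

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(matmul_single_assignment(A, B), A @ B)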

A mapping methodology is used for mapping uniform (that is, shift-invariant) DGs onto processor arrays. (A DG is shift-invariant if the dependence arcs corresponding to all the nodes in the index space do not change with respect to the node positions.) Matrix multiplication, convolution, autoregressive filtering, discrete Fourier transforms, discrete Hadamard transforms, Hough transforms, least squares solutions, sorting, perspective transforms, LU decomposition, and QR decomposition all belong to this algorithm class. By exploiting the regularity of such algorithms, we can greatly simplify the array processor design for them.

A straightforward implementation of a

DG is to assign each node in the DG to a PE. This is not efficient, since each PE only executes one computation in the algorithm. Therefore, we would like to let each PE execute multiple nodes in the DG and yet retain all the parallelism in the DG. This calls for a mapping from the DG to the array processor. In the following, we describe mapping DGs to both systolic and wavefront arrays.

Mapping DGs to systolic arrays. In mapping a uniform (shift-invariant) DG to a systolic array, we need to specify the node assignment and the schedule for the DG.

The node assignment specifies how the nodes in the DG are assigned to the PEs in the array. A linear assignment (projection) of a DG is a linear mapping of the nodes of the DG to the PEs, in which nodes along a straight line are mapped to a PE. The projection direction is denoted by a vector d (see Figure 4a).

The schedule specifies the execution time for all the nodes in the DG. The scheduled execution time of a node is represented by a time index (that is, by an integer). A linear schedule, denoted by s, maps a set of parallel equitemporal hyperplanes to a set of linearly increased time indices, where s is the normal vector of the equitemporal hyperplanes (see Figure 4b). That is, the time index of a node can be mathematically represented by sTi, where i denotes the index of the node.

For a systolic array to be obtained, the projection vector d and the schedule vector s have to satisfy two constraints:

* sTe > 0. Here e denotes any edge in the DG. The number sTe denotes the number of delays (Ds) on the corresponding edge of the systolic array. The schedule vector s must obey the data dependencies of the DG; that is, if node i depends on the output of node j, then j must be scheduled before i.

* sTd > 0. The projection vector d and the schedule vector s cannot be orthogonal to each other; otherwise, sequential processing will result.
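Both constraints are mechanical to verify. The helper below (a hypothetical Python utility, not from the article) checks a candidate schedule vector s and projection vector d against the dependence edges of a shift-invariant DG and returns the delay count sTe for each edge:

import numpy as np

def check_systolic_mapping(s, d, edges):
    # Verify sTe > 0 for every dependence edge e and sTd > 0;
    # return the number of delays assigned to each projected edge.
    s, d = np.asarray(s), np.asarray(d)
    if s @ d <= 0:
        raise ValueError("s must not be orthogonal to d (need sTd > 0)")
    delays = {}
    for e in edges:
        n = int(s @ np.asarray(e))
        if n <= 0:
            raise ValueError(f"edge {e} violates the schedule (need sTe > 0)")
        delays[tuple(e)] = n
    return delays

# Dependence edges of a 3-D matrix-multiplication DG (an assumed example):
edges = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(check_systolic_mapping(s=(1, 1, 1), d=(0, 0, 1), edges=edges))
# {(1, 0, 0): 1, (0, 1, 0): 1, (0, 0, 1): 1}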

In systolic mapping, the following rules are adopted:

* The nodes in the systolic array must correspond to the projected nodes in the DG.

* The arcs in the systolic array must correspond to the projected components of arcs in the DG.

* The input data must be projected to the corresponding arcs in the systolic array.

Figure 5a shows a mapping of a DG to a systolic array. The DG in Figure 5, while not explicitly specified, represents a convolution-like algorithm.

Mapping DGs to wavefront arrays. Due to its dataflow nature, a wavefront array does not have a fixed schedule. Therefore, the operation of a wavefront array is dictated only by the data dependency structure and the initial data tokens. A wavefront array can be modeled by a dataflow graph, or DFG. A DFG is a weighted, directed graph

DFG = {N, A, D(a), Q(a), T(n)}

in which nodes N model computation and arcs A model communication links. Each


node n has an associated nonnegative real weight T(n) representing its computation time. Each arc a is associated with a nonnegative integer weight, D(a), representing the number of initial data tokens on the arc, and a positive integer weight, Q(a), representing the FIFO queue size of the arc. A node is enabled when all input arcs contain tokens and all output arcs contain empty queues. A node fires after it has been enabled for its computation time. Whenever a node fires, one input token is taken away from each input arc, and each output arc from the node is assigned one more token.

There exists a systematic way to map

DGs to DFGs. Recall that for a shift-invariant DG, some of its boundary nodes may appear to have a different dependency structure (e.g., fewer dependency arcs) than that of the internal nodes. For our mapping, it is necessary to enforce a uniform appearance by assigning some initializing data (usually a constant, e.g., zero) to the boundary nodes of the DG. After this is done, all the nodes have the same dependency arcs (see Figure 5b), and all the data input to boundary nodes are viewed as input data.

For the shift-invariant DG and a given projection direction d, we can derive the DFG in a manner similar to the systolic mapping. Each input data token in the DG is mapped to an initial token on the corresponding arc in the DFG. Here the queue size for each DFG arc is assumed to be large enough to accommodate the target algorithms. An example of mapping a DG to a DFG (with its initial tokens) is shown in Figure 5b. In contrast to the systolic mapping shown in Figure 5a, the DFG mapping does not need any schedule vector s, since the data-driven computing nature of the wavefront array obviates the need to specify the exact timing. Furthermore, based on the dataflow principle, an optimal schedule implied by the DG will be automatically followed. This is explained below.

Assume that each DG node is assigned to one PE and that all the input data are available, so that minimum computation time can be achieved. Suppose the projection direction d is chosen so there is a strict dependency among the nodes that are mapped to the same PE. Thus, the sequential processing among these nodes by the single PE should not in any way impose an extra slowdown in the execution time, and hence the resulting DFG can compute the same computation in minimum time. This provides a simple guideline for the selection of the projection direction d. (In fact, this rule is also useful for the systolic mapping.) Note that this guideline may be generalized to cover the nonhomogeneous DG and nonlinear assignment situations.6 A nonlinear assignment is a good choice if the nodes assigned to each PE have a strict data dependency. Note that a nonlinear assignment can be easily implemented on a programmable wavefront array.

Figure 5. Mapping a DG to a systolic array (a); mapping a DG to a DFG (wavefront array) (b).
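The data-driven firing rule lends itself to a very small simulator. The sketch below (our own toy Python model with unit computation times, not the article's implementation) fires any node whose input arcs all hold tokens and whose output arcs all have queue space, which is exactly the enabling condition defined earlier:

def simulate_dfg(nodes, arcs, steps):
    # arcs: dict (src, dst) -> [tokens, queue_size]
    for t in range(steps):
        enabled = [n for n in nodes
                   if all(arcs[a][0] > 0 for a in arcs if a[1] == n)
                   and all(arcs[a][0] < arcs[a][1] for a in arcs if a[0] == n)]
        for n in enabled:       # fire: consume one token per input arc,
            for a in arcs:      # produce one token per output arc
                if a[1] == n:
                    arcs[a][0] -= 1
                if a[0] == n:
                    arcs[a][0] += 1
        print(f"step {t}: fired {enabled}")

# Two PEs in a pipeline, one initial token, queues of size 1:
arcs = {("in", "p1"): [1, 1], ("p1", "p2"): [0, 1], ("p2", "out"): [0, 1]}
simulate_dfg(["p1", "p2"], arcs, steps=3)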

Sometimes the nodes in a DG may have different execution times, and these times may (or may not) depend on the input data. For a systolic design, such timing uncertainty will prevent the designer from seeking an optimal schedule. In this case, a wavefront design will be more appealing from a speed point of view. Analyzing the exact performance of a wavefront array, which is data-driven and sometimes data-dependent, is very difficult. A method that


provides an upper bound on the execution time of wavefront arrays for sparse matrix multiplications has been proposed by Melhem.4

Queues in the DFG. Using queues is a way to implement asynchronous communication in a wavefront array. Actually, a queue is a mechanism for storing and retrieving data. Queues can be implemented by software or hardware. For high processing speed, such as is required by a wavefront array, hardware queues are preferred. However, queues can also be implemented with memories by software, which has the advantage that queue lengths are not limited.

In the above discussion, we have assumed that the queues in a wavefront array (or DFG) are large enough for the target algorithms. Insufficient queue size usually results in an additional slowdown of the computation. Therefore, it is natural to ask how to determine the minimum queue size required for each DFG arc so the minimum computation time can be achieved. For simplicity, the DG node computation times are assumed to be data-independent. Hence, the DG can be scheduled a priori and the minimum computation time can be determined. Recall that each DFG arc represents a number of DG arcs. Suppose that a DG arc a is projected onto a DFG arc a'. To determine the minimum required queue size for a', we note the following:

* The scheduled completion time, t1, for the initiating node of a indicates when the output data of the node are produced (or put) on a'.

* The scheduled completion time, t2, for the terminating node of a indicates when the data are consumed from a'. Apparently, if T is the node computation time, then t2 - t1 + T represents the length of time a data token stays in a' and its two end nodes.

* The pipelining period, α, which is the time period between two consecutive data being put on a', can be determined from the schedule.

Thus, the queue size for a', Q, can be calculated as

Q = ⌈(t2 - t1 + T)/α⌉

where ⌈ ⌉ denotes the ceiling function.

If the queue size of a wavefront array is

less than the minimum required one, then the overall speed of the array will be slowed down. In the general case, when the DG is not shift-invariant or the node times are different, the projected DFG is not regular. Kung, Lewis, and Lo provide a detailed analysis of the timing of the DFG and the minimization of queues for optimal throughput.7 Their work is based on timed Petri net theories.
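For a concrete feel for the formula, a one-line helper suffices (a sketch with illustrative numbers of our own, not from the article):

import math

def min_queue_size(t1, t2, T, alpha):
    # Q = ceil((t2 - t1 + T) / alpha): t1 and t2 are the scheduled
    # completion times of the initiating and terminating nodes, T the
    # node computation time, alpha the pipelining period of arc a'.
    return math.ceil((t2 - t1 + T) / alpha)

# Data put on a' at t1 = 3, consumed at t2 = 7, unit node time,
# and one new datum every alpha = 2 time units:
print(min_queue_size(t1=3, t2=7, T=1, alpha=2))  # -> 3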

Programming. In this discussion, we assume that wavefront arrays have fixed interconnections between PEs and a fixed queue length for each data link. Programming a wavefront array means specifying the sequence of operations for each PE. Each operation includes the following specifications:

* the type of computation (addition, multiplication, division, and so on),

* the input data link (north, south, east, west, or internal register), and

* the output data link.

Note that an additional specification relating to when an operation actually occurs is required when one is programming systolic arrays with fixed interconnections. This is needed to ensure the correct timing of the computations in these arrays, but it is not required in wavefront array programming because of the wavefront array's data-driven nature. In this sense, programming a wavefront array is easier than programming a systolic array. From the viewpoint of algorithm mapping (or automatic language translation from a program on the host to a program for the array), both wavefront and systolic arrays require an assignment of the computational nodes of the DG to the PEs. For systolic arrays a (time) scheduling of computations is also necessary. Since a wavefront array has inherent self-timing, it needs no scheduling. In fact, it will adopt the optimal schedule.

A programming language for a wavefront array should be able to express parallel data-driven computing. One good example of such a language is Occam,8 which is designed to be the programming language for the Inmos Transputer and which is, essentially, a high-level language. Another language that can be used for a wavefront array is MDFL, the Matrix Data Flow Language,5 which uses the wavefront notion to reduce the complexity of parallel programming. Wavefront array programming is quite straightforward for those algorithms that are already expressed in terms of DFGs. For example, many 1-D or 2-D digital filters can be initially given in the DFG form. The programmer just needs to assign one PE to each DFG node, if possible, and write programs to execute the node functions. If the number of PEs is less than the number of DFG nodes, the programmer can group several DFG nodes and assign them to a single PE.

A matrix multiplication example. We can now give an example of Occam programming for a two-dimensional wavefront array for matrix multiplication. The computing network serves as a (data) wave propagating medium (see Figure 3 again) that can be implemented by using the "channels" in Occam.

For data input, one matrix enters the array of processors from the left column, while the other matrix enters from the top row. As the data values move right and down they are multiplied and accumulated. Finally, after the entire matrix has passed through the array, each processor contains an element of the final matrix. Again, all the PEs perform the same tasks of reading, multiplying, accumulating, and transmitting the data further right and down.

An Occam program for the main program, which specifies the array structure, is

CHAN vertical[n*(n + 1)]:
CHAN horizontal[n*(n + 1)]:
PAR i = [0 FOR n]
  PAR j = [0 FOR n]
    mult (vertical[(n*i) + j],
          vertical[(n*i) + j + 1],
          horizontal[(n*i) + j],
          horizontal[(n*(i + 1)) + j]):

The process mult, which describes the PE operations, is called n x n times and is shown below:

PROC mult (CHAN up, down, left, right) =
  VAR acc, a, b:
  SEQ
    acc := 0
    SEQ i = [0 FOR n]
      SEQ
        PAR
          up ? a
          left ? b
        acc := acc + a*b
        PAR
          down ! a
          right ! b:

More programming examples are presented in Kung et al.5 and Kung.6


System architecture design

The overall architecture of an array processor system can be divided into a hierarchy of levels:

* The system architecture level defines how the system appears to users and application programmers. It includes the characteristics of languages, the user interface, and the operating system.

* The array architecture level defines the interconnections between different arrays and the functional capabilities of the processors comprising the arrays.

* The PE architecture level defines the hardware modules for the PE nodes. It includes both the instruction implementation (fetch, decode, or execute mechanism) and the interfaces between individual building blocks (e.g., bus widths, queue lengths).

It should be noted that at all architectural levels the model of execution should be defined. All levels do not necessarily share the same model of execution. For example, we may have a dataflow model at the system level, allowing the user to see the system as a collection of processes and processors that operate on data availability, whereas we may have a control flow model at the underlying levels (e.g., the PEs may be conventional von Neumann machines). Or the system level may have a control flow model of execution (e.g., be programmed in Fortran) and the underlying levels may adopt a dataflow model of computation (e.g., the PEs may be NEC µPD7281 dataflow chips). In the wavefront array system proposed below, we adopt the dataflow model for the array architecture level and the control flow model for the PE architecture level.

A wavefront array processor may be

used either as an attached processor interfacing with a compatible host machine or as a stand-alone processor equipped with a global control processor. A system configuration for an array processor is proposed in Figure 6. The system consists of the following major components:

* processor array(s),
* interconnection network(s), and
* a host computer and interface unit.

In general, desirable features for an array processor system are high speed, flexibility, reliability, and cost-effectiveness. Let us now explore the key architecture design issues for each of the above components.

Figure 6. An array processor system consists of a host, an array control unit, an interface system, an interconnection network, and a processor array.

Processor arrays. A processor array comprises a number of PEs linked by a network with a regular topology. For fixed array structures, the 1-D (linear), 2-D (mesh or hexagonal), and 3-D cube-connected networks are the most popular. Many algorithms can indeed be mapped to these regular arrays. Other variants such as hypercubes9 and shuffles10 are also becoming popular. The choice of array structure depends on the communication required by the given algorithms and applications.

Figure 7. The proposed handshaking circuit, with glitch protection ability (a); the timing diagram of this circuit (b). (Key: DE = data enable; DR = data ready; DA = data available; DP = data processed; DS = data sent; DU = data used; POR = power-on reset.)

Dynamically interconnected or reconfigurable array structures allow an array to support a large class of algorithms. Such structures usually involve significant hardware overhead. However, various strategies for improving the speed and flexibility of (global) interconnection networks are available. Reconfigurability of array structures based on switching lattices has been proven to be useful for solving problems related to fault tolerance. A typical example is the CHiP project,11 which uses a programmable switch lattice embedded in the PE array.

PE architecture. The PE architecture should meet the requirements of the intended computing tasks. The key requirements for DSP applications, for example, are adequate word length, a fast multiply and accumulate, high-speed RAM, and fast coefficient table addressing. The functionality of the PE should be designed to support these needs. There are four main components in the PE:

* Arithmetic and logic unit. Since high throughput is usually demanded of a wavefront array processor, the ALU must rapidly compute frequently encountered operations. Fixed-point ALUs are cheaper to build, but floating-point ones provide higher precision and dynamic range, which are often required by DSP applications.

* Memory unit. Designs with separate, on-chip program and data memories are now becoming popular among digital signal processors. Although on-chip memories are smaller than off-chip ones, they do allow faster processing.

* Control unit. There are two approaches to control unit design. The first is the reduced instruction set computer (RISC) approach, which uses a small set of simple instructions and obtains a


simple control unit with a higher clock rate. The second is the complex instruction set computer (CISC) approach, which uses a large and complex instruction set and allows complicated tasks to be completed with fewer instructions. The current trend in VLSI implementations appears to be toward the RISC approach.

* I/O unit. The PE should be able to perform data transfers concurrently with processing. For each of the communication links, the transfer of data is controlled by a separate I/O controller, which handles the two-way handshaking functions.

Asynchronous communication protocols. One of the most important features that distinguish a wavefront array processor from other array processors is the data-driven operation of each of its PEs. To ensure correct sequencing and data transfers between adjacent PEs, handshaking protocols must be adopted to synchronize the operations. There are two types of asynchronous communication schemes: the one-way control scheme and the two-way control scheme. In the one-way control scheme, the sender sends data without waiting for the acknowledgment signal of the receiver. This method is suitable for a wavefront array processor only when large buffers are provided. The two-way control scheme, usually known as handshaking, is preferable for most wavefront array processors. A proposed handshaking circuit is shown in Figure 7a. This circuit can be considered an improved version of a previous design.5 This new design is more robust because its flip-flops are driven by internal clocks, and it is less sensitive to the glitch noise encountered in the communication links. The timing diagram of this circuit is shown in Figure 7b. Two rising-edge-triggered JK flip-flops and two falling-edge-triggered D flip-flops plus two AND gates are used to implement the circuit. The basic operations and protocols during one handshaking cycle are as follows:

(1) When DE = 1 and CK1 = falling, then DR = 1 (for one clock cycle) and data are on the bus.

(2) When DR = 1, DU1 = 1, and CK1 = rising, then DS1 = 1 and DE = 0.

(3) When DS1 = 1, Qb = 1, and CK2 = falling, then DS2 = 1.

(4) When DS2 = 1, DP = 0, and CK2 = rising, then DA = 1 and DU2 = 0, and data are latched on PE2.

(5) When DA = 1 and CK2 = falling, then DP = 1 (for one clock cycle) and data are used.

(6) When DS2 = 1, DP = 1, and CK2 = rising, then DA = 0 and DU2 = 1.

(7) When DU2 = 1 and CK1 = falling, then DU1 = 1.

(8) When DU1 = 1, DR = 0, and CK1 = rising, then DS1 = 0 and DE = 1.

If we define the handshaking communication overhead to be equal to the time interval between the rising edge of the DR flag and that of the DP flag, this time interval is determined by several delay factors: the flip-flop time delay dff, the propagation time delay dp, the phase difference between CK1 and CK2, and the delays introduced by the DP flag response after DA is set high. On average, this overhead is shorter than three clock periods, and in most cases both PEs continue their computations during the handshaking time interval. When the granularity of PEs is large (which is true for wavefront array processors), the handshaking time penalty is less significant.
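At the software level, the effect of this protocol is simply that a sender blocks until its previous datum has been consumed and a receiver blocks until a datum is ready. The sketch below (an abstract Python model of two-way handshaking, not the flip-flop circuit of Figure 7) captures that behavior with a one-place bounded queue:

import threading, queue

# A one-place bounded queue acts like a handshaken link: put() blocks
# until the receiver has taken the previous datum (the acknowledgment),
# and get() blocks until data are available (the data-ready condition).
link = queue.Queue(maxsize=1)

def sender():
    for x in range(5):
        link.put(x)
        print(f"PE1 sent {x}")

def receiver():
    for _ in range(5):
        x = link.get()
        print(f"PE2 consumed {x}")

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start(); t1.join(); t2.join()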

Figure 8. Three configurations for the interconnection network in an array processor system: intra-array communication among PEs (a); intra-array communication between PEs and memory blocks (b); inter-array communication between arrays and global memory blocks (array control units not shown) (c).

Block handshaking scheme. One way to reduce the handshaking overhead is to use a block handshaking scheme, in which a

block of data can be transmitted and received in only one handshaking operation. This method is useful for communication between systems operating with almost identical clock frequencies. The success of the block handshaking scheme relies on the assumption that the clock frequency and phase of the sending PE remain stable during the period of the block data transfer.

Two-level pipelining. In many cases further speed enhancement can be obtained by using pipelining in the ALU of the PE. If this is done, the handshaking scheme of participating PEs is the same as before. However, due to the uncertainty of a continuous supply of data into the pipe, the PE should record the time at which data enter the pipe so it can retrieve the processed data from the pipe later. If there are no data coming into the pipe, then the corresponding output is considered garbage. A simple way to implement this scheme is to employ a one-bit shift register of the same length as the pipeline in the PE. When valid (or invalid) data come into the pipe, a 1 (or a 0) is entered into the shift register. This 1 (or 0) is shifted along the shift register as the data move down the pipe. When the data come out of the pipe, the 1 (or 0) is shifted out of the register and used to gate the output data. In this way, the PE can retrieve the valid processed data.
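The valid-bit mechanism can be modeled in a few lines. In the sketch below (a hypothetical software model of our own, not a hardware description), a data pipeline of depth three is paired with a one-bit shift register of the same depth, and the emerging bit gates whether each output is kept or discarded as garbage:

from collections import deque

class TaggedPipeline:
    # ALU pipeline paired with a one-bit shift register: a 1 is shifted
    # in with valid data, a 0 with bubbles; the bit emerging with each
    # result gates whether the output is kept.
    def __init__(self, depth, op):
        self.stages = deque([None] * depth, maxlen=depth)
        self.valid = deque([0] * depth, maxlen=depth)
        self.op = op

    def step(self, datum=None):
        out, ok = self.stages[0], self.valid[0]
        self.stages.append(self.op(datum) if datum is not None else None)
        self.valid.append(1 if datum is not None else 0)
        return out if ok else None  # invalid slots come out as garbage

pipe = TaggedPipeline(depth=3, op=lambda x: x * x)
inputs = [2, None, 3, None, None, None, None]
print([pipe.step(x) for x in inputs])
# [None, None, None, 4, None, 9, None]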

Interconnection networks. The interconnection networks used in processor arrays can significantly affect their speed. Certain structured (intra-array) interconnection networks can be incorporated into a PE array to provide direct, global, and high-speed communications. There are two suitable ways to configure an intra-array interconnection network.

The first configuration (Figure 8a), in which a network is used to support direct communication among PEs, is appropriate when the number of PEs is equal to the number of memory blocks. Each PE is permanently linked to one and only one memory block, and some smart memory management must reside in the memory blocks to reformat the I/O data. Communication between PEs is established through the interconnection network, which can realize various interconnection patterns. When several computations in different PEs are successively performed using the same set of data, this structure is very efficient since the memory operations are usually slower than the functions of the PE and are not involved in the transfer of data between different PEs. This design is the one that has been chosen for most array computers currently in use or under development.12

The second configuration is often used

when the numbers of memory blocks and PEs are not equal. Usually, the number of memory blocks is greater than the number of PEs (e.g., in matrix manipulation algorithms). Figure 8b illustrates this configuration, in which the interconnection network is used to connect one or more memory blocks to one or more PEs.

When there are multiple arrays (or multiple clusters of PEs) in an array processor system, a bottleneck is often created if the host machine is asked to handle all the data transactions between the arrays. Therefore, for inter-array communication, a global interconnection network is added between the global memory blocks and the local memory blocks, while for intra-array communication, a local interconnection network is provided within each array or cluster (Figure 8c).

The configuration shown in Figure 8c, a hierarchical version of that shown in Figure 8b, illustrates this arrangement. In each subarray (cluster), a local interconnection network, which links the faster local memory blocks and the PE array, provides intra-array communication, while in the system as a whole, a global interconnection network, which talks through the local control unit to communicate with local memory, provides inter-array communication and also allows access to globally shared data.13 For applications in which inter-array communication is frequent, the local memory and local interconnection networks may be removed. This allows the global interconnection network to be used for both intra- and inter-array communication. However, the global interconnection network must then provide a complete partition capability to allow it to be used as either a local (intra-array) or a global (inter-array) network.14

Host computer. The host computer (or the array control unit)

* provides system monitoring, batch data storage, management, and data formatting,

* determines the schedule program that controls the interface unit and interconnection network, and



* generates global control codes and object codes for PEs.

The host can also perform data formatting of or data conversion between floating-point and fixed-point notations, bit-serial and bit-parallel representations, and so on.

Direct memory access schemes are often used for fast data transfer between the array and the host or the I/O units. The host determines and schedules the parallel processing tasks and matches them with the available array processor modules. It generates control codes to coordinate all system units. The array control unit follows the schedule commands, performs data rearrangement, and handles direct-data-transfer traffic. Sequence control guides the sequencing of operations. Storage management may be specified either by the programmer or by the system controller. System-controlled management is often safe and free from undesirable interference, but its efficiency of allocation is low. A stack-based (runtime) storage management scheme is a simple and useful alternative.

Operating system. Most conventional operating systems support features designed to provide a complete programming environment to the user. These features remain desirable in array processors. They include disk and terminal I/O supports, resource management facilities (for the CPU, arrays, memory, and I/O devices), multiuser and multitasking capabilities, and virtual memory management.

An operating system for an array processor involves one major extension to a conventional operating system: a driver for the arrays, which must be able to treat the arrays both as a whole (to allocate tasks to them) and as collections of processors (to load programs and data to individual PEs). The information the driver needs to perform these tasks is provided by a language compiler that performs dependence analysis.

In the environment of a special-purpose system, several software tools are needed:

* program development tools, that is, basic tools (editors and file managers) like those provided by a conventional operating system and additional tools (high-level languages and projection programs) to assist in mapping and matching algorithms to arrays,

* a testing and debugging tool, that is, a target (array) architecture simulator,

* code downloading tools, that is, tools to program ROMs for existing systems, and interfaces to drive CAD systems for custom/semicustom implementation, and

* runtime support tools, that is, a library of routines to support inter-PE and host-PE communications, basic control functions, and the management of global resources such as memory and external I/O.

Interface unit. The interface unit, which is connected to the host via the host bus or through DMA, downloads, uploads, and buffers array data, handles interrupts, and formats data. Since an array processor is used as an attached processor, the design of this unit is important. The interface unit is monitored by the host or array control unit according to the schedule program.

Why wavefront architectures?

Both wavefront and systolic arrays share the important common feature of using a large number of modular and locally interconnected processors for massively pipelined and parallel processing. They are, however, different in hardware design (most specifically in their clock and buffer arrangements) and in programming requirements.

Clock distribution. The clocking scheme is a critical factor in large-scale array systems. If clocking can be suitably handled, then synchronous systems usually yield a conceptually simpler design. However, when a fast system clock is used, global synchronization often imposes a heavy burden on the hardware because of clock skew.15 The clock skew increases with the size of the array and is especially troublesome in two- or higher-dimensional arrays. Furthermore, synchronization of the transfer of data among a large number of PEs may lead to large current surges as the components are simultaneously energized or change state. This problem can be alleviated in wavefront arrays, however, because of their asynchronous nature.

Processing speed. Wavefront arrays suffer from a fixed time-delay overhead resulting from handshaking, although a block handshaking scheme can be adopted to reduce it. Synchronous arrays, though they do not have this problem, can suffer a loss of speed: when the processing times of the PEs are not uniform, a synchronous array may have to accommodate the slowest PE by using a slower clock. In contrast, wavefront arrays, because of their data-driven nature, do not have to hold back faster PEs in order to accommodate slower ones. Wavefront arrays also yield higher speed when the computing times are data-dependent. For example, when an abundance of "zero" entries are encountered in sparse matrix multiplications, a "trivial" multiplication can be computed in much less time than a "nontrivial" multiplication. Finally, wavefront arrays benefit from various techniques that may be adopted to speed up processing time.16 A simulation of systolic and wavefront processing for a least squares minimization problem showed that the wavefront array was almost twice as fast as the systolic array.17

Programming. The programming of wavefront arrays is easier than that of systolic arrays because wavefront arrays require only the assignment of computations to PEs, whereas systolic arrays require both assignment and scheduling of computations.

Fault tolerance. Fault tolerance is an important concern in systolic and wavefront arrays because of the large number of PEs they may have. Fault tolerance techniques for arrays can be grouped into three categories:

* fabrication-time fault tolerance,
* compile-time fault tolerance, and
* runtime fault tolerance.

Neither systolic nor wavefront arrays


Figure 9. Schematic diagram of the PE design for the STC-RSRE wavefront array (reproduced from Davie, Higgins, and Cawthorn16).

possess any special advantages in fabrication-time and compile-time fault tolerance. However, wavefront arrays are superior to systolic arrays in runtime fault tolerance. Since a wavefront array is a data-driven computing network, an individual PE can isolate itself from the array by indicating to adjacent processors that it is not ready to accept data. Hence, when a fault occurs in a PE, the PE can isolate itself and perform self-testing. If the self-testing indicates that the fault has passed (was transient), the PE rejoins the array. If the self-testing indicates that the fault is permanent, the array is reconfigured to circumvent the faulty PE. In contrast, when a fault occurs in a PE in a systolic array, all the PEs in the array must be interrupted for self-testing. The occurrence of multiple faults is an additional problem for systolic arrays, since such faults can cause a control-interrupt-line contention problem.

Wafer-scale integration. WSI technology can offer significant performance enhancements over the conventional method of system building, in which individually packaged chips are mounted on a printed-circuit board. WSI's advantages include

* shorter interconnect distances between chips,

* the ability to mix semiconductor technologies,

* faster system clock rates, and

* the ability to perform dynamic interconnection by means of switching lattices.

Wavefront architectures' local interconnections and data-driven nature make WSI particularly attractive for use with them. With today's technologies, the development of WSI wavefront array processors is a realistic goal. The regularity, modularity, and data-driven nature of wavefront architectures should be able to be exploited to devise suitable fault tolerance techniques for WSI wavefront arrays. The combination of wavefront techniques and WSI technology appears very promising for future high-speed supercomputing.

Examples of wavefront array systems

Here, we will examine two wavefront array systems that have actually been constructed. We will also explore the possibility of utilizing commercially available VLSI microprocessors such as the NEC µPD7281 dataflow chip and the Inmos Transputer for constructing wavefront arrays.

STC-RSRE wavefront array processor system. The Standard Telecommunications Company and the Royal Signals and Radar Establishment in Britain have jointly developed a wavefront array processor system for adaptive beamforming. The system is reconfigurable for many distributed array processing applications.16

PE design. The system's PE is based on the TMS32010, with additional hardware to provide multiple I/O ports, a floating-point ALU, lookup tables, localized control, and a bit-serial diagnostic link. The PE includes program ROM, containing fixed algorithmic code, and program RAM, providing space in which to download programs for other algorithms. The schematic diagram of the PE is shown in Figure 9.

Array. When used for adaptive beamforming, the STC-RSRE system consists of 33 identical PEs, 21 of which are organized as a triangular wavefront array, which performs the adaptive beamforming function by means of the QR algorithm, and 12 of which perform data correction and other preprocessing functions (Figure 10). The system has been successfully applied to many real-time experiments. It is supported by an RF/IF (radio frequency/intermediate frequency) front-end subsystem, a zero-IF receiver, and A/D converters. The processor system is housed in a four-foot-high equipment


rack with power supplies and fans; its total power consumption is approximately 500 watts.

Figure 10. Overall configuration of the STC-RSRE adaptive array system (reproduced from Davie, Higgins, and Cawthorn16). (Key: RF/IF = radio frequency/intermediate frequency; ZIF = zero IF; PSU = power supply unit; DCWAP = data compensation WAP; TWAP = triangular WAP.)

A VLSI node processor. Any future realization of wavefront arrays for real-time signal processing requires the development of specialized VLSI processors to enable more compact implementation of systems and provide a throughput rate matched to modern radar, communications, and electronic countermeasure systems. As a part of the British Ministry of Defence's Very High Performance Integrated Circuit Program, recent work at STC has been directed toward the definition of a high-performance VLSI "node chip" that can serve as a programmable building block for real-time wavefront array subsystems.16

A number of key features have been included in the node chip specification:

* a high-throughput processor,

* on-board I/O control, program and data memory, arithmetic units, and parallel multiplier blocks,

* full handshake control to allow simple interconnection with neighboring node chips,

* an internal architecture optimized for floating-point arithmetic, and

* programmability.

The node chip overlays computation and dataflow by using FIFO buffering on both its input and output ports. By allowing computation and dataflow to take place concurrently, the node chip ensures maximum dataflow. The application of sophisticated algorithms to real-time signal processing problems often requires that processing parameters be varied in accordance with the required system performance or with environmental conditions (for example, that the dynamic bandwidth of a digital filter be controlled by rapid adjustment of the filter weighting coefficients). In addition to the usual data-I/O ports, the node chip provides ports for receiving control information from an external supervisory processor. The control data is propagated in synchronism with the data wavefronts passing through the array.

Memory-linked Wavefront Array Processor. The MWAP is a new wavefront array processor developed at the Applied Physics Laboratory of the Johns Hopkins University.18 As a result of its use of VHSICs (very high speed ICs), the MWAP achieves very high performance. Multiple MWAPs are connected on a ring network to form a large processing system. The MWAP not only serves as a module on a ring bus but also uses a modular structure for its PEs. The basic MWAP architecture (Figure 11) consists of a bus interface to a dual-port memory, multiple PE/dual-port-memory pairs, and a bus output interface.

Each MWAP PE consists of a control unit, an instruction unit, an instruction cache, a block of memory address registers, a floating-point multiplier, and a floating-point ALU. Once the instruction cache is loaded, program and data memory are separate. The key to this architecture is the memory addressing structure. All memory addressing is done by reference to a memory address register, which can be read, or read and then incremented, or read and then reset to a base address set up during program loading. The PE can simultaneously read or write data memory in both directions. Synchronization of the PEs in the MWAP is done by the right and left control flags in the memory addressing unit. A PE can always read the memory to either its right or left. However, it can only write into these memories when the control flag for them is reset. Each PE can set or reset the memory control flag in either memory. If the PE attempts a store to a memory that has its control flag set, program execution is suspended until the control flag is reset; then the store command is executed. In addition, the setting of a control flag in a memory causes an idle PE to begin program execution. This memory control structure results in high-speed, compact

31

Prooessor Subsystem

Page 15: Wavefront Array Processors-Concept to Implementation

Input

Key:

Bil = Bus interface InputBIO = Bus interface outputDPM = Dual-port memoryPE = Processing element

Control

IOutput

Figure 11. Basic architecture of the MWAP (reproduced from Dolecek"8).

programs with completeMWAP dataflowcontrol.
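The flag semantics can be summarized in a short C sketch. This is a hedged software analogy, assuming a pthreads simulation of one dual-port memory bank; the names (DualPortMemory, dpm_store, and so on) are invented here, and the real MWAP implements the suspend/resume behavior in hardware.

    #include <pthread.h>

    /* One dual-port memory bank shared by the PEs to its left and right.
       The control flag realizes the MWAP write rule: stores are allowed
       only while the flag is reset. */
    typedef struct {
        double          data[1024];
        int             ctrl_flag;
        pthread_mutex_t lock;
        pthread_cond_t  flag_reset;   /* signaled when the flag clears */
    } DualPortMemory;

    void dpm_init(DualPortMemory *m) {
        m->ctrl_flag = 0;
        pthread_mutex_init(&m->lock, NULL);
        pthread_cond_init(&m->flag_reset, NULL);
    }

    /* Reads are always permitted, in either direction. */
    double dpm_load(DualPortMemory *m, int addr) {
        return m->data[addr];
    }

    /* A store to a bank whose flag is set suspends the storing PE until
       the flag is reset; then the store completes. */
    void dpm_store(DualPortMemory *m, int addr, double v) {
        pthread_mutex_lock(&m->lock);
        while (m->ctrl_flag)
            pthread_cond_wait(&m->flag_reset, &m->lock);
        m->data[addr] = v;
        pthread_mutex_unlock(&m->lock);
    }

    /* Setting the flag hands the bank to the neighboring PE; in the
       MWAP this also starts an idle PE executing its program. */
    void dpm_set_flag(DualPortMemory *m) {
        pthread_mutex_lock(&m->lock);
        m->ctrl_flag = 1;
        pthread_mutex_unlock(&m->lock);
    }

    /* Resetting the flag releases any PE suspended in dpm_store. */
    void dpm_reset_flag(DualPortMemory *m) {
        pthread_mutex_lock(&m->lock);
        m->ctrl_flag = 0;
        pthread_cond_broadcast(&m->flag_reset);
        pthread_mutex_unlock(&m->lock);
    }

A producer PE writes a block, sets the flag to hand the bank to its neighbor, and continues; the consumer reads freely and resets the flag when done, which both releases any suspended store and wakes an idle PE, mirroring the dataflow discipline described above.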

Parallel 24-bit floating-point arithmetic was chosen for the prototype MWAP to allow a broad range of algorithms to be implemented. The dynamic range of MWAP floating-point numbers is 1.4 × 10⁻³⁹ to 1.7 × 10³⁸, which simplifies data scaling. The MWAP floating-point multiplier is capable of 10 million multiplications per second. And the TI VHSIC static RAM chip used in the current MWAP dual-port memory supports about 10 million accesses per second per PE. Thus, for real-number filtering operations, memory speed and multiplication speed are matched.
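The matching of the two 10-MHz rates can be seen in the inner loop of a real-number FIR filter, sketched below in illustrative C (not MWAP code; the array names are placeholders for the left and right dual-port memory banks, which the PE can read simultaneously).

    /* One output sample of an n-tap FIR filter. Each iteration performs
       one multiply and fetches one operand from each memory bank, so a
       bank sustaining about 10 million accesses per second keeps a
       10-million-multiply-per-second multiplier fully busy. */
    double fir_sample(const double *coeff,   /* stands for one DPM bank   */
                      const double *delay,   /* stands for the other bank */
                      int ntaps) {
        double acc = 0.0;
        for (int k = 0; k < ntaps; k++)
            acc += coeff[k] * delay[k];      /* one fetch per bank : one multiply */
        return acc;
    }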

Some possible wavefront architectures. A wavefront array processor can be considered a static dataflow machine and thus can be implemented with dataflow PEs. Two commercially available VLSI chips, the NEC µPD7281 dataflow chip19 and the Inmos Transputer, can be considered for this. Both provide a set of powerful processing primitives and a handshaking mechanism for communications.

Constructing a wavefront array with NEC chips. NEC's µPD7281 is a programmable device that features a dataflow architecture. It can help improve the tradeoff between speed and flexibility. The µPD7281 uses an internal circular pipeline and a powerful instruction set to allow high-end image processing. The disadvantage of this chip is that it was initially designed for linear arrays or rings, not for the wavefront array's mesh-connected, Illiac-IV-type architecture, and thus has only one input and one output bus. This problem can be solved by designing an interface that permits connection of a µPD7281 PE to its four neighbors.

Constructing a wavefront array with Inmos Transputers. Inmos's Transputer chip (T414 or T424) is an Occam-language-based design that provides hardware support for both concurrent computation and communication. The chip is a complete computer with four neighbor connections. It has been designed so that its external behavior corresponds to the formal model of an Occam process, and it adopts the now-popular RISC architecture. It has a 32-bit processor capable of 10 MIPS, 4K bytes of 50-ns static RAM, and, significantly, a variety of communications interfaces. These features make it a powerful building block for constructing concurrent processing networks. The Transputer's links are the hardware representation of the channels for process communication. There is an intimate relationship between Transputer channel links and the communication protocol envisaged for wavefront arrays. Transputer nets programmed in Occam may be quite naturally regarded as wavefront processors. However, the Transputer is significantly limited in many signal processing applications; a very useful enhancement would be the inclusion of a floating-point capability.
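The correspondence between Transputer links and wavefront handshaking can be made concrete with a small C model. The chan_send/chan_recv primitives below are stand-ins for Occam channel I/O (they are not a real Transputer or Occam API), and the PE routine shows one element of a matrix product computed in wavefront fashion: the PE fires only when both operands have arrived and forwards them onward, so the computational wavefront advances purely on data availability.

    #include <pthread.h>
    #include <stdbool.h>

    /* A one-place blocking channel, standing in for a Transputer link.
       (Occam channels are unbuffered rendezvous; one slot of buffering
       is close enough for this illustration.) */
    typedef struct {
        double          value;
        bool            full;
        pthread_mutex_t lock;
        pthread_cond_t  changed;
    } Chan;

    void chan_init(Chan *c) {
        c->full = false;
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->changed, NULL);
    }

    /* Send blocks until the previous datum has been consumed. */
    void chan_send(Chan *c, double v) {
        pthread_mutex_lock(&c->lock);
        while (c->full)
            pthread_cond_wait(&c->changed, &c->lock);
        c->value = v;
        c->full = true;
        pthread_cond_broadcast(&c->changed);
        pthread_mutex_unlock(&c->lock);
    }

    /* Receive blocks until a datum arrives: the wavefront handshake. */
    double chan_recv(Chan *c) {
        pthread_mutex_lock(&c->lock);
        while (!c->full)
            pthread_cond_wait(&c->changed, &c->lock);
        double v = c->value;
        c->full = false;
        pthread_cond_broadcast(&c->changed);
        pthread_mutex_unlock(&c->lock);
        return v;
    }

    /* PE(i,j) of a wavefront array computing C = A * B: it fires only
       when operands have arrived from the west and north, accumulates
       its inner product, and forwards the operands east and south. */
    void wavefront_pe(Chan *west, Chan *north, Chan *east, Chan *south,
                      int n, double *result) {
        double c = 0.0;
        for (int k = 0; k < n; k++) {
            double a = chan_recv(west);
            double b = chan_recv(north);
            c += a * b;
            chan_send(east, a);
            chan_send(south, b);
        }
        *result = c;
    }

Run with one thread per PE and shared Chan objects as links, such a net behaves like an Occam process network: no global clock is required, and a slow PE simply delays its neighbors through the blocking sends, which is the asynchronous, data-driven behavior the wavefront model calls for.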

Signal and image processing applications generally call for algorithms that are deterministic in both time and space. This allows development of a unified theoretical framework for architecture and algorithm design. The wavefront array is the outcome of the use of such a framework. It eliminates the need for the global control characteristic of the systolic array and thus permits a distributed and data-driven approach to the manipulation of the complex data dependencies in array processing. Moreover, it can easily cope with various programming requirements, with the need for fault tolerance, and with computing networks having either global or irregular interconnections. The power and flexibility of the wavefront array are demonstrated by its very broad application domain, including adaptive array processing, image and vision processing, seismic analysis, and medical signal processing. The trend toward the use of algorithmically specialized parallel computers, including both systolic and wavefront arrays, will continue. Such machines will play a large role in future supercomputing technology.

Acknowledgments

The authors wish to thank their colleagues at USC for their invaluable contributions to the research summarized in this article, especially C.W. Chang, P.S. Lewis, J.C. Lien, E. Manolakos, R.W. Stewart, S.W. Sun, and J. Vlontzos.

The research was supported in part by the National Science Foundation under Grant ECS-82-13358, by the Semiconductor Research Corporation under the USC SRC program, and by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization, and was administered through the Office of Naval Research under Contracts N00014-85-K-0469 and N00014-85-K-0599.

References

1. K. Hwang and F. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, 1984.

2. J.B. Dennis, "Data Flow Supercomputers," Computer, Nov. 1980, pp. 48-56.


3. A.L. Fisher and H.T. Kung, "Special-Purpose VLSI Architectures: General Discussions and a Case Study," in VLSI and Modern Signal Processing, S.Y. Kung, H.J. Whitehouse, and T. Kailath, eds., Prentice-Hall, Englewood Cliffs, N.J., 1985, pp. 153-169.

4. P. Dew and L. Manning, "Comparison of Systolic and SIMD Architectures for Computer Vision Computations," and R. Melhem, "Irregular Wavefronts in Data-Driven, Data-Dependent Computations," in Systolic Arrays, W. Moore, A. McCabe, and R. Urquhart, eds., Adam Hilger, Boston, Mass., 1987.

5. S.Y. Kung, K.S. Arun, R.J. Gal-Ezer, and D.V. Bhaskar Rao, "Wavefront Array Processor: Language, Architecture, and Applications," IEEE Trans. Computers (special issue on parallel and distributed computers), Nov. 1982, pp. 1054-1066.

6. S.Y. Kung, VLSI Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1987.

7. S.Y. Kung, P.S. Lewis, and S.C. Lo, "Performance Analysis and Optimization of VLSI Dataflow Arrays," J. Parallel and Distributed Computing, to appear.

8. P. Wilson, "Occam Architecture Eases System Design," Computer Design, Nov. 1983, p. 107.

9. C.L. Seitz, "The Cosmic Cube," Comm. ACM, Jan. 1985, pp. 22-33.

10. H.S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Trans. Computers, Feb. 1971, pp. 153-161.

11. L. Snyder, "Introduction to the Configurable, Highly Parallel Computer," Computer, Jan. 1982, pp. 47-56.

12. J. Lenfant, "Parallel Permutations of Data: A Benes Network Control Algorithm for Frequently Used Permutations," IEEE Trans. Computers, July 1978, pp. 637-647.

13. D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, "Cedar," tech. report, Dept. of Computer Science, Univ. of Illinois, Urbana, Feb. 1983.

14. T. Feng, "A Survey of Interconnection Networks," Computer, Dec. 1981, pp. 12-27.

15. A.L. Fisher and H.T. Kung, "Synchronizing Large VLSI Arrays," IEEE Trans. Computers, Aug. 1985, pp. 734-740.

16. E.B. Davie, D.G. Higgins, and C.D. Cawthorn, "An Advanced Adaptive Antenna Test-Bed Based on a Wavefront Array Processor System," Proc. Int'l Workshop Systolic Arrays, July 1986.

17. D.S. Broomhead et al., "A Practical Comparison of the Systolic and Wavefront Array Processing Architectures," Proc. IEEE ICASSP, Nov. 1985, pp. 296-299. (Also in Proc. IEEE Workshop VLSI Signal Processing, Nov. 1984.)

18. Q.E. Dolecek, "Parallel Processing Systems for VHSIC," tech. report, Applied Physics Laboratory, Johns Hopkins University, Laurel, Md., 1984, pp. 84-112. (Also in Proc. VHSIC Applications Workshop, 1984.)

19. µPD7281 User's Guide, NEC Electronics, Mountain View, Calif., 1985.

Sheng-chun Lo is a research assistant at the University of Southern California, where he is pursuing a PhD in electrical engineering. His interests are in parallel computer architectures, digital signal processing, and parallel algorithm design. Lo received a BS from the National Taiwan University in 1980 and an MS from the University of Southern California in 1984, both in electrical engineering.

Shiann-ning Jean is working toward a PhD at the University of Southern California. His interests include speech processing, algorithm mapping, array architectures, and array fault tolerance. Jean received the BS and MS degrees, both in electrical engineering, from the National Taiwan University in 1981 and 1983.

Sun-yuan Kung joined the faculty of electrical engineering systems at the University of Southern California, Los Angeles, in July 1977. He currently holds the rank of professor. From 1984 to 1985, he was associate editor for VLSI for the IEEE Transactions on Acoustics, Speech, and Signal Processing, and he presently serves on the IEEE ASSP Technical Committee on VLSI. He was general chairman of the IEEE Workshop on VLSI Signal Processing in 1982 and 1986, and he was the keynote speaker at the First International Workshop on Systolic Arrays in Oxford, England, in July 1986. His interests include linear systems approximation, spectrum analysis, digital signal processing, and VLSI array processors.

A member of the ACM and a senior member of the IEEE, Kung received his PhD in electrical engineering from Stanford University in 1977.

Jenq-neng Hwang is a research assistant in the Signal and Image Processing Institute of the Department of Electrical Engineering at the University of Southern California, where he is working toward the PhD degree. His interests include parallel algorithm design, VLSI array architectures for image and vision processing, and neural networks. Hwang received the BS and MS degrees, both in electrical engineering, from the National Taiwan University in 1981 and 1983.

Readers may write to Sun-yuan Kung at the Signal and Image Processing Institute, Dept. of Electrical Engineering, University of Southern California, University Park, Mail Stop MC-0272, Los Angeles, CA 90089.
