
SINGLE INSTRUCTION STREAM - MULTIPLE DATA STREAM MACHINE INTERCONNECTION NETWORK DESIGN*

Howard Jay Siegel
Department of Electrical Engineering
Princeton University
Princeton, New Jersey 08540

Abstract - An SIMD machine must have an interconnection network to pass data between processing elements. We introduce a model of SIMD machines which allows a formal mathematical analysis and comparison of different interconnection networks. Five interconnection networks that have been proposed in the literature are defined in terms of our model. They include a network similar to the one used in the STARAN, a network similar to the one recommended by Feng to implement data manipulating functions, the Illiac IV network, and the Perfect Shuffle. The networks are evaluated in terms of the upper and lower bounds on the time required for each network to simulate the actions of the others. It is usually impractical to implement all the interconnections that may be needed by the machine to perform a large variety of computations, so the ability of a network to simulate other interconnections is important. The methods used to prove the lower bounds and to construct the simulation algorithms to demonstrate the upper bounds can be generalized and applied to the analysis of other networks.

I. Introduction

One aspect of the design of SIMD (single instruction stream - multiple data stream [8]) or array machines is the construction of an interconnection network to pass data from one processor to another. One way to view the structure of an SIMD machine is as a set of N processing elements (where each processing element consists of a processor with its own memory), interconnected by a network, and fed instructions by a control unit.

*This work was supported by NSF Grant DCR74-21939. The article is a revised summary of Princeton University, Department of Electrical Engineering, Computer Science Laboratory Technical Report 198.

Author's current address is Purdue University, School of Electrical Engineering, West Lafayette, Indiana 47907.

The network connects each processing element to some subset of the other processing elements. The connections are represented by a set of interconnection functions. Only one interconnection function of the network can be used at a time; i.e., at any time, each processing element is connected to only one other processing element. A transfer instruction causes data to be moved from each processing element to the processing element to which it is connected. To move data between two processing elements that are not directly connected, the data may be passed through intermediary processing elements by executing a programmed sequence utilizing the interconnection functions in that network.

When building an SIMD machine, an interconnection network must be implemented. To choose which interconnection functions to include in the network, the system designer must consider the types of problems the machine will be used to solve. Generally, it is not possible to include all of the interconnection functions desired. Therefore, those that will be used most often would be implemented and used to simulate the other interconnections that may be required. Also, an SIMD machine may be designed as a general purpose machine, to handle a large variety of tasks. Thus, it is very important for the system designer to consider the ability of a set of interconnection functions to simulate other interconnection functions.

In this paper we shall develop a realistic model of SIMD machines and use it to evaluate interconnection networks. We shall discuss five particular interconnection networks and show them to be equivalent in the sense that each can simulate the actions of the others. The networks we will examine are: the Cube network, a network similar to the one implemented in the STARAN machine [3]; the PM2I network, a network similar to the one used by Feng to implement data manipulating functions [6], [7]; the Illiac network [1], [5]; the Perfect Shuffle, which has been popularized by Stone [18]; and the WPM2I network, a variation of PM2I which was introduced in [14]. These networks are analyzed in terms of the time complexity required for one network to simulate another.

A model-independent lower time bound for each simulation shall be presented. Many of these lower bounds are proved in [14]. In this paper we shall prove only those lower bound results which are better than those presented in [14].


The upper time bound for each simulation shall be demonstrated by an algorithm that performs the simulation. The methods used to construct these algorithms can also be used to write algorithms to simulate interconnections not presented here. The algorithms we shall present can be directly implemented on an SIMD machine that satisfies the assumptions we shall make in section IV.

II. The Model

Our model of an SIMD machine consists of four parts: processing elements, interconnection network, machine instructions, and masking schemes. Each processing element (PE) is a processor together with its own memory, a set of at least three fast access registers (A, B, and C), and a data transfer register (DTR). The DTR of each PE is connected to the DTR's of the other PE's via the interconnection network. When an interconnection function is executed, it is the DTR contents of each PE that are transferred.

There are $N$ PE's, each assigned an address from 0 to $N-1$. We assume that $N = 2^m$; i.e., $\log_2 N = m$. We also assume that $\mathrm{PE}_i$ has a register ADDRESS that contains the integer $i$. Let ADDRESS($j$) be the $j$th bit of ADDRESS.

Each PE is always in either active or inactive mode. If a PE is active it executes the instructions broadcast to it by the control unit. If a PE is inactive it will not execute the instructions broadcast to it.

An interconnection network is a set of interconnection functions, each a bijection on the set of PE addresses. When an interconnection function $f$ is applied, $\mathrm{PE}_i$ copies the contents of its DTR into the DTR of $\mathrm{PE}_{f(i)}$. This occurs for all $i$ simultaneously, for $0 \le i < N$ and $\mathrm{PE}_i$ active. Thus, saying an interconnection network maps the address $x$ to the address $y$ is equivalent to saying that it causes $\mathrm{PE}_x$ to pass its data to $\mathrm{PE}_y$. Note that an inactive PE may receive data from another PE if an interconnection function is executed, but it cannot send data.

To pass data from one PE to another PE a programmed sequence of interconnection functions must be executed. This sequence of functions moves the data from one PE's DTR to another's by a single transfer or by passing the data through intermediary registers.

For example, let one of the interconnection functions $f$ in a network be defined by $f(x) = (x+1) \bmod N$, where $x$ is a PE address. Then when $f$ (the cycle function) is applied, PE number 0 transfers the contents of its DTR to the DTR of PE number $f(0) = 1$, PE number 1 transfers the contents of its DTR to the DTR of PE number $f(1) = 2$, ..., and PE number $N-1$ transfers the contents of its DTR to the DTR of PE number $f(N-1) = 0$. To pass data from $\mathrm{PE}_i$ to $\mathrm{PE}_{(i+2) \bmod N}$, $0 \le i < N$, $f$ may be executed twice.
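The transfer semantics and the cycle example above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `transfer` and `cycle` are ours, and we assume that a DTR which receives no data keeps its previous contents.

```python
def transfer(dtr, f, active=None):
    """Execute interconnection function f once.

    Every active PE i copies its DTR into the DTR of PE f(i).
    All copies read the old DTR contents, so the move is simultaneous.
    Assumption: a DTR that receives nothing keeps its old value.
    """
    n = len(dtr)
    if active is None:
        active = [True] * n          # default: all PE's active
    new = list(dtr)
    for i in range(n):
        if active[i]:
            new[f(i)] = dtr[i]
    return new


N = 8


def cycle(x):
    """The cycle function f(x) = (x + 1) mod N from the example."""
    return (x + 1) % N


dtr = list(range(N))                 # DTR of PE i initially holds i
once = transfer(dtr, cycle)
twice = transfer(once, cycle)
print(twice)                         # data from PE i now sits in PE (i+2) mod N
```

With all PE's active the function is a bijection, so no data is lost in either step.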

In section III five particular interconnection networks will be defined.

The machine instructions are those operations that each processor can perform on data in its individual memory or registers. We assume there is a separate control unit (CU) computer which stores programs and broadcasts instructions and data. All active PE's execute the same instruction at the same time, but on possibly different data.

Actual SIMD machines allow data to be moved among the memory, the fast access registers, and the DTR of a single PE. All we assume, without loss of generality, is that data may be moved among the registers. The notation X ← Y means the contents of register Y are copied into register X, where X and Y could be A, B, C, or DTR.

A masking scheme is a method for determining which PE's will be active at a given point in time. An SIMD machine may have several different masking schemes. Each mask partitions the set of PE addresses into those PE's that will be active and those that will be inactive.

If PE address masks are used, an m-position mask will accompany each instruction and will determine which PE's are active, i.e., execute that instruction. Each position of the mask is either a 0, 1 or X ("don't care"), and the only PE's that will be active are those that match the mask for each of the m bit positions of their address. For example, if N = 8 (so m = 3) and the mask is 1X0, then only PE's 6 (110) and 4 (100) would be active. Superscripts will be used as repetition factors, e.g., $X^3 0 1^2$ would be XXX011. Square brackets will be used to denote a mask. For example, executing the instruction "DTR ← A [$X^{m-1}0$]" would cause each even numbered PE to load its DTR with the contents of its A register. This scheme was presented and discussed in [14].

Data conditional masks are the implicit result of executing "if-then-else" statements that involve local data in each PE's registers or memory. This type of masking is used in such machines as the Illiac IV ([1], [5], [12]) and PEPE [20]. Whenever a conditional statement is executed each PE may be executing it with different data, so the outcome may differ from one PE to the next. Thus, as a result of the conditional each PE will set an internal flag so that it will be active for either the "then" or the "else," but not both. The execution of the "else" statements must follow the "then" statements; i.e., they cannot be executed simultaneously. For example, as a result of executing the statement: "if A > B then C ← A else C ← B" each PE will load its C register with the maximum of its A and B registers; i.e., some PE's will execute "C ← A," and then the rest will execute "C ← B." Thus, for SIMD machines, data conditional masks and "if-then-else" statements are the same.
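The two-phase execution of a data conditional statement can be illustrated with a short Python sketch. The representation of a PE as a dictionary of registers and the helper name `if_then_else` are assumptions made only for this illustration.

```python
def if_then_else(pes, condition, then_op, else_op):
    """Data conditional masking: every PE evaluates the condition on its
    own registers and sets a flag; the "then" statements are executed by
    the flagged PE's first, and the "else" statements by the rest afterwards."""
    flags = [condition(pe) for pe in pes]
    for pe, flag in zip(pes, flags):      # "then" phase
        if flag:
            then_op(pe)
    for pe, flag in zip(pes, flags):      # "else" phase, strictly after "then"
        if not flag:
            else_op(pe)
    return flags


def a_greater_than_b(pe):
    return pe["A"] > pe["B"]


def load_c_from_a(pe):
    pe["C"] = pe["A"]


def load_c_from_b(pe):
    pe["C"] = pe["B"]


# Four hypothetical PE's with different local data in A and B.
pes = [{"A": a, "B": b, "C": None} for a, b in [(3, 5), (9, 2), (4, 4), (0, 7)]]
flags = if_then_else(pes, a_greater_than_b, load_c_from_a, load_c_from_b)
print([pe["C"] for pe in pes])            # each C holds max(A, B)
```

Only the second PE is active in the "then" phase here; the other three execute the "else" statement.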

PE address masks and data conditional masks will be the only masking schemes used in this paper. PE address masks provide a concise method to activate $3^m$ different sets of PE's. Data conditional statements are an essential part of all programming languages, so it is fair to assume they would be present in all SIMD machines. The results of this paper would still be valid even if only data conditional masks were used. This is because if each PE knows its own address, then data conditional masks could be used to simulate PE address masks using no additional interprocessor data transfers.
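As an illustration of PE address masks, the following Python sketch expands repetition factors and decides which PE's match a mask by testing the bits of each address, which is the same test a data conditional on ADDRESS($j$) would perform locally. The textual form `X^3 0 1^2` for superscripts and the function names are our own conventions, not the paper's.

```python
import re
from itertools import product


def expand(mask):
    """Expand repetition factors, e.g. 'X^3 0 1^2' -> 'XXX011'."""
    pairs = re.findall(r"([01X])(?:\^(\d+))?", mask)
    return "".join(symbol * int(rep or 1) for symbol, rep in pairs)


def is_active(address, mask, m):
    """A PE is active when every address bit matches the mask position
    (X matches both 0 and 1). The leftmost mask position is bit m-1."""
    bits = format(address, f"0{m}b")
    return all(k in ("X", b) for k, b in zip(mask, bits))


def active_set(mask, m):
    mask = expand(mask)
    if len(mask) != m:
        raise ValueError("mask must have exactly m positions")
    return [a for a in range(2 ** m) if is_active(a, mask, m)]


print(expand("X^3 0 1^2"))       # the repetition-factor example from the text
print(active_set("1X0", 3))      # mask 1X0 with N = 8
print(active_set("X^2 0", 3))    # [X^(m-1) 0]: the even numbered PE's

# Every one of the 3**m masks selects a different set of PE's.
distinct = {tuple(active_set("".join(p), 3)) for p in product("01X", repeat=3)}
print(len(distinct))
```

For m = 3 the last line counts 27 distinct active sets, in agreement with the $3^m$ figure above.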

Whenever an interconnection function is executed, all active PE's pass their data at the same time. Since each interconnection function is a bijection, this transfer of data occurs without conflict if all PE's are active. It is possible, however, that masking can cause transfers of data no longer to represent bijections on the PE addresses. Such data transfers would destroy data.

For example, let N = 8 and let the interconnection function be $f(x) = (x+1) \bmod 8$. Suppose $f$ is executed with the PE address mask [0XX]. Then $f(4) = 4$, since PE number 4 is not active, and $f(3) = 4$, since PE number 3 is active. Thus, this data transfer is not a bijection, so it destroys data. In this case the original contents of the DTR of PE number 4 is destroyed by the data transferred into that DTR from PE number 3. In order to have saved this data the DTR contents of PE number 4 would have to have been copied into a register or memory location of that PE before the data transfer instruction was executed.
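The destructive transfer in this example can be reproduced with a small Python sketch. The list-based register model and the variable names are assumptions for illustration; the register A copy stands for the "A ← DTR" save suggested in the text.

```python
N = 8


def f(x):
    """The interconnection function f(x) = (x + 1) mod 8."""
    return (x + 1) % N


dtr = list(range(N))                 # DTR of PE i initially holds i
active = [a < 4 for a in range(N)]   # mask [0XX]: high-order address bit is 0

a_reg = list(dtr)                    # "A <- DTR" executed first, to save the data

new_dtr = list(dtr)
for i in range(N):
    if active[i]:                    # only active PE's send; any PE may receive
        new_dtr[f(i)] = dtr[i]

lost = sorted(set(dtr) - set(new_dtr))
print(new_dtr)                       # PE 4 now holds the data sent by PE 3
print(lost)                          # the value that no DTR holds any more
print(a_reg[4])                      # still recoverable from register A of PE 4
```

The masked transfer maps both 3 and 4 to PE 4, so the word originally in PE 4 survives only in the saved register.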

Formally, an SIMD machine can be represented as the 4-tuple (N,F,I,M), where:

(1) N is a positive integer, representing the number of processing elements in the machine;

(2) F is the interconnection network, where each interconnection function is a bijection on the set {0,1,...,N-1};

(3) I is the set of machine instructions, instructions that are executed by each active PE and act on data within that PE;

(4) M is the set of masking schemes, where each mask partitions the set {0,1,...,N-1} into the set of active PE's and the set of inactive PE's.

By specifying N, F, I, and M, a particular SIMD machine architecture can be modeled.

III. Interconnection Networks

Let the binary representation of a PE address be $p_{m-1}p_{m-2} \ldots p_1p_0$, let $\bar{p}_i$ be the complement of $p_i$, and let the integer $n$ be the square root of $N$.

(1) Perfect Shuffle (PS). This network consists of a shuffle function and an exchange function. The shuffle is defined by:

$$s(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-2}p_{m-3} \ldots p_1p_0p_{m-1}$$

and the exchange is defined by:

$$e(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-1}p_{m-2} \ldots p_1\bar{p}_0$$

The shuffle function is a left rotation of the bits of each address. The exchange function complements the low order (0th) bit of each address. For example, $s(3) = 6$ and $e(6) = 7$, for N = 8. The shuffle can be thought of as the result of perfectly shuffling a deck of cards (i.e., 0 → 0, N/2 → 1, 1 → 2, N/2+1 → 3, etc.) (see [9], [10], [14], [18]).

This network has been shown to be quite useful by Stone in [18]. It is also the basis of Lawrie's "omega" network [11].

(2) Illiac. This network has four functions defined as follows (recall $n$ is the square root of $N$):

$$I_{+1}(x) = x + 1 \bmod N$$
$$I_{-1}(x) = x - 1 \bmod N$$
$$I_{+n}(x) = x + n \bmod N$$
$$I_{-n}(x) = x - n \bmod N$$

For example, if N = 16, $I_{+n}(0) = 4$. When we discuss the Illiac we shall assume $m$ is even, that is, $n = 2^{m/2}$ is an integer. If the PE's are considered as an $n \times n$ array, then each PE will be connected to its north, south, east and west neighbors (see [1], [5], [12], [14], [17]).

This network is implemented in the Illiac IV system. The ability of this system to perform various tasks is described in [5].

(3) Cube. This network consists of $m$ functions defined by:

$$c_i(p_{m-1} \ldots p_{i+1}p_ip_{i-1} \ldots p_0) = p_{m-1} \ldots p_{i+1}\bar{p}_ip_{i-1} \ldots p_0$$

for $0 \le i < m$. The Cube function $c_i$ complements the $i$th bit of each address. For example, $c_2(7) = 3$. When the PE addresses are considered as the corners of an m-dimensional cube this network connects each PE to its $m$ neighbors (see [14]). Note that $c_0$ and the Perfect Shuffle exchange function $e$ are identical.

The network used in the STARAN is a wired series of Cube functions (see [3]). In [2] and [4] the applicability of this network to practical problems is discussed. A version of this type of network was used as part of a parallel machine simulation in [13].
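The Perfect Shuffle, Illiac, and Cube functions described so far act on addresses as simple bit or modular operations. The Python sketch below is one way to express them; the function names and the dictionary returned by `illiac` are our own choices, not notation from the paper.

```python
def shuffle(x, m):
    """s: left-rotate the m-bit address x by one bit position."""
    return ((x << 1) | (x >> (m - 1))) & ((1 << m) - 1)


def exchange(x):
    """e: complement the low-order (0th) bit of the address."""
    return x ^ 1


def cube(i, x):
    """c_i: complement bit i of the address."""
    return x ^ (1 << i)


def illiac(m):
    """Return the four Illiac functions for N = 2**m PE's (m even)."""
    if m % 2:
        raise ValueError("the Illiac network assumes m is even")
    big_n, n = 1 << m, 1 << (m // 2)
    return {
        "+1": lambda x: (x + 1) % big_n,
        "-1": lambda x: (x - 1) % big_n,
        "+n": lambda x: (x + n) % big_n,
        "-n": lambda x: (x - n) % big_n,
    }


# The examples given in the text.
print(shuffle(3, 3), exchange(6))   # s(3) and e(6) for N = 8
print(cube(2, 7))                   # c_2(7)
print(illiac(4)["+n"](0))           # I_{+n}(0) for N = 16
```

These calls reproduce the values 6, 7, 3, and 4 quoted above, and `cube(0, x)` coincides with `exchange(x)` for every address.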

(4) Plus-Minus $2^i$ (PM2I). This network consists of $2m$ functions defined by:

$$t_{+i}(x) = x + 2^i \bmod N$$
$$t_{-i}(x) = x - 2^i \bmod N$$

for $0 \le i < m$. For example, $t_{+1}(2) = 4$ if $N > 4$. Note that the Illiac network is a subset of this network. Various properties of the PM2I network can be found in [14].

The network recommended by Feng to implement data manipulating functions is a wired series of PM2I functions [6]. The various data manipulating functions that this network can perform are discussed in [6] and [7].
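A minimal Python sketch of the PM2I functions follows, assuming the modular form $x \pm 2^i \bmod N$. The single function `pm2i` with a `sign` argument is our own packaging of $t_{+i}$ and $t_{-i}$. The sketch also checks two relationships: the Illiac functions are the PM2I functions with $i = 0$ and $i = m/2$, and complementing the highest-order bit is the same as adding $2^{m-1}$.

```python
def pm2i(sign, i, x, m):
    """t_{+i} (sign = +1) or t_{-i} (sign = -1): add or subtract 2**i mod N."""
    return (x + sign * (1 << i)) % (1 << m)


print(pm2i(+1, 1, 2, 3))            # t_{+1}(2) for N = 8
print(format(pm2i(-1, 1, 0b001, 3), "03b"))   # t_{-1}(001) for N = 8

# Illiac as a subset of PM2I, checked for N = 16 (m = 4, n = 4).
m = 4
big_n, n = 1 << m, 1 << (m // 2)
illiac_is_subset = all(
    pm2i(+1, 0, x, m) == (x + 1) % big_n
    and pm2i(-1, 0, x, m) == (x - 1) % big_n
    and pm2i(+1, m // 2, x, m) == (x + n) % big_n
    and pm2i(-1, m // 2, x, m) == (x - n) % big_n
    for x in range(big_n)
)

# The highest-order Cube function equals t_{+(m-1)} and t_{-(m-1)}.
top_bit_matches = all(
    x ^ (1 << (m - 1)) == pm2i(+1, m - 1, x, m) == pm2i(-1, m - 1, x, m)
    for x in range(big_n)
)
print(illiac_is_subset, top_bit_matches)
```

Both checks hold, which is why the case $c_{m-1}$ needs no simulation by PM2I.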

(5) Wrap-around Plus Minus $2^i$ (WPM2I). This network consists of $2m$ functions defined by:

for $0 \le i < m$. WPM2I is like PM2I, except any "carry" or "borrow" will "wrap-around" to the $p_{i-1}$ bit position. Note that any "carry" or "borrow" cannot affect $p_i$. For example, if N = 8 and m = 3, then $w_{-1}(001) = 110$, whereas $t_{-1}(001) = 111$.

The WPM2I network was introduced in [14]. It is a variation of the PM2I which has the ability to simulate any other interconnection function when the networks are treated as sets of permutations on the integers from 0 to N-1. In terms of group theory, WPM2I can generate the entire group of permutations on N elements. Of the five networks presented here, only WPM2I has this ability (see [14]).
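The wrap-around behavior can be sketched in Python under one reading of the description above. This is an assumption on our part, not the formal definition: the address is rotated so that bit $i$ becomes the low-order bit, 1 is added or subtracted mod N, and the result is rotated back. The names `rotate_left` and `wpm2i` are ours. The reading is checked against the example $w_{-1}(001) = 110$.

```python
def rotate_left(x, k, m):
    """Rotate the m-bit value x left by k positions."""
    k %= m
    return ((x << k) | (x >> (m - k))) & ((1 << m) - 1)


def wpm2i(sign, i, x, m):
    """Assumed reading of w_{+i} (sign = +1) and w_{-i} (sign = -1).

    Rotating right by i puts bit i in the low-order position, so a carry
    or borrow runs up through bit m-1, wraps around to bit 0, and stops
    before it could return to bit i.
    """
    rotated = rotate_left(x, m - i, m)          # right rotation by i
    rotated = (rotated + sign) % (1 << m)
    return rotate_left(rotated, i, m)


print(format(wpm2i(-1, 1, 0b001, 3), "03b"))    # w_{-1}(001) for N = 8

# Each function is a bijection, and w_{+0}, w_{-0} agree with t_{+0}, t_{-0}.
m = 4
all_bijections = all(
    sorted(wpm2i(s, i, x, m) for x in range(1 << m)) == list(range(1 << m))
    for s in (+1, -1) for i in range(m)
)
matches_pm2i_at_zero = all(
    wpm2i(s, 0, x, m) == (x + s) % (1 << m)
    for s in (+1, -1) for x in range(1 << m)
)
print(all_bijections, matches_pm2i_at_zero)
```

Under this reading, for m = 4, $w_{-0}$ followed by $w_{+2}$ maps 1111 to 0011, and $w_{+2}$ followed by $w_{-0}$ maps 1100 to 0000.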

IV. Simulation Results

The designers of SIMD machines must choose a set of interconnection functions to implement, and they will either base their choice on the type of computations the system will be expected to perform or assume they are building a general purpose machine. The number of functions that will be included in the network will be constrained by such factors as cost and hardware complexity. Therefore, it is quite important for the designer to consider the ability of the network that is chosen to simulate other functions.

In this section we compare the simulation ability of five different types of interconnection networks that have been proposed in the literature and have been shown to be useful. The lower bounds on simulation times and the simulation algorithms that follow demonstrate techniques that can be used to compare and analyze other networks.

We use these specific simulations to demonstrate our methods for several reasons. There is little in the literature directly comparing the abilities of these types of networks. The following theorem provides a means for such a comparison. In addition, by using these simulations to demonstrate the techniques, the system designer may observe the minimum number of data transfers needed if a network presented here was implemented and it was then found necessary to simulate the actions of one of the networks that we have defined. The designer is also provided with an algorithm to perform the simulation. Since these networks have been shown to be useful it is very possible that any network implemented may have to simulate one of them.

The lower bound results are valid for all models of SIMD machines. The only assumptions made for Theorem 1 are:

(1) that at any given point in time a PE must be either active or inactive;

(2) that the interconnection function, of the network to be simulated, which requires the most time to simulate, will determine the lower time bound for the network; and

(3) that when an interconnection function is simulated, its effect on all PE's must be simulated.

The model-independence is significant because it means that the results and the methods used to obtain them apply to real machines.


The lower bounds are in terms of the number of times interconnection functions must be executed in order to perform the simulation. Recall that the transferring of data from $\mathrm{PE}_x$ to $\mathrm{PE}_y$ will also be referred to as mapping the address $x$ to $y$. The mappings will be described by logical or arithmetic operations on the $m$ bits of the PE addresses.

Theorem 1 explores the lower bounds on the time required for each network defined in section III to simulate the other networks. In Theorem 2 the upper bounds on the simulation time are analyzed.

Many of the lower bound results were presented in [14]. We will sketch the proofs of only those new results which provide tighter bounds.

Theorem 1: In the following table the entry in row x, column y, is a lower time bound for network x to simulate network y. An * indicates that the proof of the bound is sketched in [14].


Proof: The notation "x → y" means "the case where x is used to simulate y."

PS → Cube: Observe that $c_1(1^{m-2}01) = 1^m$ and $c_1(1^m) = 1^{m-2}01$. At least $m-1$ shuffles must be executed to map $1^{m-2}01$ to $1^m$, as we must move the 0 to the 0th bit position so that the exchange can change it. The only way to perform this mapping in $m$ steps is to execute $m-1$ shuffles followed by one exchange. To map $1^m$ to $1^{m-2}01$ at least one shuffle must be used after an exchange is executed. Thus no single sequence of $m$ steps performs both mappings. Therefore, at least $m+1$ steps are required.
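The two single-address mappings used in this argument can be checked for small $m$ by a breadth-first search over shuffle and exchange steps. The Python sketch below is only a sanity check under our own naming (`min_steps`); it treats each address separately and so does not by itself establish the $m+1$ bound, which relies on one instruction sequence having to serve both mappings.

```python
from collections import deque


def shuffle(x, m):
    """Left-rotate the m-bit address x by one bit position."""
    return ((x << 1) | (x >> (m - 1))) & ((1 << m) - 1)


def min_steps(src, dst, m):
    """Fewest shuffle/exchange executions that map address src to dst."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            return dist[x]
        for y in (shuffle(x, m), x ^ 1):     # shuffle, then exchange
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


for m in (3, 4):
    all_ones = (1 << m) - 1                  # the address 1^m
    zero_at_bit_1 = all_ones ^ 0b10          # the address 1^(m-2) 0 1
    print(m,
          min_steps(zero_at_bit_1, all_ones, m),
          min_steps(all_ones, zero_at_bit_1, m))
```

For m = 3 and m = 4 the first mapping needs exactly $m$ steps, while the reverse mapping needs an exchange followed by a shuffle.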

Illiac → Cube: Let $d(x,y) = |x-y|$, the absolute difference of $x$ and $y$. The function $d$ is a metric (see [14]). Let $j = (m/2)-1$. $d(0,c_j(0)) = n/2$. $d(x,I_{+n}(x)) = d(x,I_{-n}(x)) = n$, $0 \le x < N$, so $I_{+n}$ and $I_{-n}$ cannot be used to move a distance of $n/2$. $d(x,I_{+1}(x)) = d(x,I_{-1}(x)) = 1$, $0 \le x < N$. Therefore, the only way to map 0 to $n/2$ in $n/2$ steps is to execute $I_{+1}$ $n/2$ times. But $c_j(1^m) = 1^{m/2}01^{(m/2)-1}$ and no subsequence of $(I_{+1})^{n/2}$ can perform this mapping. Thus, at least $(n/2)+1$ steps are required.

PM2I → Cube: For $0 \le j < m-1$ and $0 \le i < m$, $c_j \ne t_{\pm i}$; thus, at least two steps are required.

PM2I → PS: An interconnection function $f$ has the effect of adding $x$ distinct integers, mod N, to $x$ different addresses, one to each address, if $f(k_i) - k_i = q_i$, such that $k_i \ne k_j$ and $q_i \ne q_j$ for $i \ne j$, $0 \le i, j < x$. Each execution of a PM2I function can add either a mod N integer, if the PE is active, or nothing, if the PE is inactive, to the set of PE addresses. Thus, the number of distinct integers added to addresses after $\log_2 x$ executions of distinct PM2I functions is at most $x$. The shuffle function has the effect of adding $N-1$ distinct integers to the set of addresses. Thus, the PM2I network requires at least $\lceil \log_2(N-1) \rceil = m$ steps to simulate the shuffle.
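The count of $N-1$ distinct integers added by the shuffle can be confirmed numerically. The following Python sketch, with the helper name `distinct_offsets` chosen by us, collects the values $(s(x) - x) \bmod N$ over all addresses and compares $\lceil \log_2(N-1) \rceil$ with $m$.

```python
from math import ceil, log2


def shuffle(x, m):
    """Left-rotate the m-bit address x by one bit position."""
    return ((x << 1) | (x >> (m - 1))) & ((1 << m) - 1)


def distinct_offsets(m):
    """The set of integers (mod N) that the shuffle adds to addresses."""
    big_n = 1 << m
    return {(shuffle(x, m) - x) % big_n for x in range(big_n)}


for m in range(2, 7):
    big_n = 1 << m
    offsets = distinct_offsets(m)
    # Columns: m, number of distinct offsets, N - 1, ceil(log2(N - 1))
    print(m, len(offsets), big_n - 1, ceil(log2(big_n - 1)))
```

For N = 8 the offsets are {0, 1, 2, 3, 5, 6, 7}; only $N/2$ never occurs. A single PM2I function executed under a mask contributes at most the two offsets 0 and $\pm 2^i$.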

PM2I → Illiac: $I_{\pm 1} = t_{\pm 0}$, $I_{\pm n} = t_{\pm(m/2)}$.

PM2I → WPM2I: For $0 < j < m$ and $0 \le i < m$, $w_{\pm j} \ne t_{\pm i}$. Thus, at least two steps are required.

WPM2I → Cube: For $0 \le j < m$ and $0 \le i < m$, $c_j \ne w_{\pm i}$. Thus, at least two steps are required.

WPM2I → Illiac: $I_{+n}(1^m) = 0^{m/2}1^{m/2}$. The only way WPM2I can perform this mapping in two steps is $w_{-0}$ followed by $w_{+(m/2)}$. $I_{+n}(1^{m/2}0^{m/2}) = 0^m$. The only way WPM2I can perform this mapping in two steps is $w_{+(m/2)}$ followed by $w_{-0}$. Thus, at least three steps are required.

WPM2I → PM2I: Follows from the WPM2I → Illiac analysis. □

In Theorem 2 we demonstrate methods to construct algorithms to simulate particular interconnections. (In [16] algorithms to simulate arbitrary interconnections are presented.) The algorithms that follow have more than theoretical significance. Given an SIMD machine which satisfies the assumptions we will make, these algorithms can actually be used to perform the various simulations.

We make the following assumptions:

(1) All results are in terms of the model presented in section II, where $N = 2^m$, F will vary, I includes instructions for moving data between the DTR and the other registers of the same PE, and M = {PE address masks, data conditional masks}. (Recall that data conditional statements can be used to simulate PE address masks without using any additional interprocessor data transfers.)

(2) Time bounds are in terms of the number of executions of interconnection functions.

(3) When simulating the interconnection function $f$ the data to be transferred starts in the DTR of $\mathrm{PE}_x$ and must end in the DTR of $\mathrm{PE}_{f(x)}$, $0 \le x < N$.

(4) The interconnection function is to be simulated as if it were executed with all PE's being active. In [15] it is shown how this restriction can be removed.

When PE address masks and data conditional masks are used together, the PE address masks accompany each instruction in the "then" block and in the "else" block. Thus, in order for a PE to be active it must be in active mode as a result of the conditional and match the PE address mask accompanying the instruction. The notation A ↔ B is an abbreviation for registers A and B switching their contents using a third register.

After each algorithm an example is given to demonstrate how the algorithm operates. For the examples we assume that the original contents of the DTR of $\mathrm{PE}_i$ is the integer $i$, all addresses and integers will be in binary, and unless otherwise stated, N will equal 8.

Theorem 2: In the following table the entries in row x, column y, are lower and upper bounds on the time required for network x to simulate network y given the above assumptions. Each upper bound is based on the time complexity of the algorithm presented to do that simulation.

Proof: The notation "x → y" means "the case where x is used to simulate y." In [15] we discuss each algorithm and prove that it is correct.

PM2I → PS: For the exchange see the PM2I → Cube analysis, since $c_0 = e$.

For the shuffle:

PM2I → Cube:

PM2I → Illiac:

PM2I → WPM2I: For $w_{+0}$ use $t_{+0}$ and for $w_{-0}$ use $t_{-0}$.

For $w_{+i}$, $0 < i < m$ ($w_{-i}$ similar):

PS → PM2I:

PS → Cube:

PS → WPM2I:

PS → Illiac:

Cube → PM2I:

Cube → PS:

Cube → WPM2I: For $w_{+0}$ and $w_{-0}$ see the Cube → PM2I analysis since $w_{+0} = t_{+0}$ and $w_{-0} = t_{-0}$.

Example of Cube

Cube → Illiac: Follows from the Cube → PM2I analysis.

WPM2I → PM2I: For $t_{+0}$ use $w_{+0}$ and for $t_{-0}$ use $w_{-0}$. For $t_{+i}$, $0 < i < m$ ($t_{-i}$ is similar):

WPM2I → PS: For the exchange see the WPM2I → Cube analysis, since $c_0 = e$.

For the shuffle: Same as PM2I → PS, using $w_{+i}$ in place of $t_{+i}$ and $w_{-i}$ in place of $t_{-i}$.

WPM2I → Cube: For $c_i$, $0 < i < m$:

Example of WPM2I → c .

WPM2I → Illiac: Follows from the WPM2I → PM2I analysis.

Illiac → PM2I: For $t_{+i}$, $m/2 \le i < m$ ($t_{-i}$ is similar):

for j = 1 until $2^{i-(m/2)}$ do $I_{+n}$ [$X^m$]

Executing $I_{+n}$ $2^{i-(m/2)}$ times is equivalent to adding $2^i$, which is equivalent to $t_{+i}$, $m/2 \le i < m$.

For $t_{+i}$, $0 \le i < m/2$ ($t_{-i}$ is similar):

for j = 1 until $2^i$ do $I_{+1}$ [$X^m$]

Executing $I_{+1}$ $2^i$ times is equivalent to adding $2^i$, which is equivalent to $t_{+i}$, $0 \le i < m/2$.
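The two loops above translate directly into a Python sketch that simulates $t_{\pm i}$ with repeated Illiac transfers and counts the transfers used. The function name `simulate_pm2i_with_illiac` and the list model of the DTR's are assumptions for illustration; all PE's are active, matching the mask [$X^m$].

```python
def simulate_pm2i_with_illiac(sign, i, dtr, m):
    """Simulate t_{+i} (sign = +1) or t_{-i} (sign = -1) using only
    Illiac transfers, with all PE's active.

    Returns the final DTR contents and the number of transfers executed.
    """
    big_n, n = 1 << m, 1 << (m // 2)
    if i >= m // 2:
        step, count = sign * n, 1 << (i - m // 2)   # repeat I_{+n} or I_{-n}
    else:
        step, count = sign, 1 << i                  # repeat I_{+1} or I_{-1}
    for _ in range(count):
        new = [None] * big_n
        for x in range(big_n):
            new[(x + step) % big_n] = dtr[x]        # one Illiac transfer
        dtr = new
    return dtr, count


m = 4                                               # N = 16, n = 4
start = list(range(1 << m))                         # DTR of PE x holds x
for i in range(m):
    result, count = simulate_pm2i_with_illiac(+1, i, start, m)
    print(i, count)                                 # transfers used for t_{+i}
```

For N = 16 the costliest cases are $i = 1$ and $i = 3$, each needing two transfers; in general at most $n/2$ transfers are used.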

Illiac → PS: For the exchange see the Illiac → Cube analysis, since $c_0 = e$.

For the shuffle: See Orcutt's thesis [12], section III.


V. Conclusions

A model of SIMD machines, designed to reflect all of the flexibility of real SIMD machines, was presented. Five different interconnection networks that have been proposed in the literature were defined in terms of the model and evaluated. The networks were analyzed in terms of the time required for each network to simulate the others. A lower time bound for each simulation was presented and an upper time bound was demonstrated by an algorithm that performed the simulation. In most cases tight bounds were found.

An SIMD machine designer must choose a set of interconnection functions, i.e., an interconnection network, to implement. It is not possible to include all of the interconnections an SIMD machine will need to perform a large variety of computations. Thus, when choosing an interconnection network, a designer must consider the ability of the network to simulate other interconnection functions.

If an SIMD machine is being designed for a specific task, then the peculiarities of that task must be considered when choosing an interconnection network to implement. If one assumes that the machine will be a general purpose one, then the results of the theorem indicate that a hybrid network consisting of the PM2I functions and the shuffle function would be quite powerful in terms of simulation ability. This hybrid would be able to simulate any network discussed here in at most 2 steps.

The methods used to prove the lower bounds and to construct the simulation algorithms can be used to analyze and compare other networks and hybrids of the networks presented here. To construct the simulation algorithms one must consider and keep track of the flow of N data words passing through N processing elements. In addition, one must determine which data may get destroyed by a data transfer that is not a bijection and save that data in such a way that it can be identified and reloaded later.

The table in Theorem 2 provides comparison information to aid the system designer in choosing a network from among those discussed. The methods presented provide tools for the designer to use to evaluate other networks.

VI. Acknowledgements

I would like to thank Professor J. D. Ullman for his help and guidance with this research. I would also like to thank L. J. Siegel for her comments.


Illiac → Cube:

Illiac → WPM2I:

VII. References

[1] G. H. Barnes, et al., "The ILLIAC IV computer," IEEE Trans. Comput., Vol. C-17 (Aug., 1968), pp. 746-757.

[2] K. E. Batcher, "STARAN/RADCAP hardware architecture," Proceedings of the 1973 Sagamore Computer Conference on Parallel Processing, pp. 147-152.

[3] K. E. Batcher, The Multi-Dimensional Access Memory in STARAN, submitted to the IEEE Trans. Comput. Special Issue on Parallel Processing; summary in the Proceedings of the 1975 Sagamore Conference on Parallel Processing, page 167.

[4] L. H. Bauer, "Implementation of data manipulating functions on the STARAN associative processor," Proceedings of the 1974 Sagamore Computer Conference on Parallel Processing, pp. 209-227.

[5] W. J. Bouknight, et al., "The Illiac IV system," Proceedings of the IEEE, Vol. 60, No. 4 (Apr., 1972), pp. 369-388.

[6] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comput., Vol. C-23 (Mar., 1974), pp. 309-318.

[7] T. Feng, Parallel Processing Characteristics and Implementation of Data Manipulating Functions, Dept. of Electrical and Computer Engineering, Syracuse University, RADC-TR-73189 (July, 1973).

[8] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, Vol. 54, No. 12 (Dec., 1966), pp. 1901-1909.

[9] S. W. Golomb, "Permutations by cutting and shuffling," SIAM Review, Vol. 3, No. 4 (Oct., 1961), pp. 293-297.

[10] P. B. Johnson, "Congruences and card shuffling," American Mathematical Monthly, Vol. 63 (Dec., 1956), pp. 718-719.

[11] D. E. Lawrie, Memory-Processor Connection Networks, Dept. of Computer Science, University of Illinois, Rep. 557 (Feb., 1973).

[12] S. E. Orcutt, Computer Organization and Algorithms for Very-High Speed Computation, Dept. of Computer Science, Stanford University, Ph.D. Thesis (Sept., 1974).

[13] D. Rahmlow, "Parasim," Princeton University, unpublished paper (1974).

[14] H. J. Siegel, "Analysis Techniques for SIMD Machine Interconnection Networks and the Effects of Processor Address Masks," Proceedings of the 1975 Sagamore Conference on Parallel Processing, pp. 106-109.

[15] H. J. Siegel, SIMD Machine Interconnection Network Design, Princeton University, Department of Electrical Engineering, Computer Science Laboratory Technical Report 198 (Jan., 1976).

[16] H. J. Siegel, Single Instruction Stream - Multiple Data Stream Machine Interconnection Network Universality, Princeton University, Department of Electrical Engineering, Computer Science Laboratory Technical Report (Aug., 1976).

[17] D. L. Slotnick, et al., "The SOLOMON computer," 1962 Fall Joint Computer Conf., AFIPS Proc., Vol. 22 (1962), pp. 97-107.

[18] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., Vol. C-20 (Feb., 1971), pp. 153-161.

[19] R. C. Swanson, "Interconnections for parallel memories to unscramble p-ordered vectors," IEEE Trans. Comput., Vol. C-23, No. 11 (Nov., 1974), pp. 1105-1115.

[20] D. E. Wilson, "The PEPE support software system," Sixth Annual IEEE Computer Society International Conference (1972), pp. 61-64.
