T o App ear in IEEE T r ansactions on Computers , 1992T o App ear in IEEE T r ansactions on Computers, 1992 W a velength Division Mul tiple A ccess Channel Hyper cube Pr ocessor Inter

To Appear in IEEE Transactions on Computers, 1992

Wavelength Division Multiple Access Channel Hypercube

Processor Interconnection

Patrick W. Dowd

1

Department of Electrical and Computer Engineering

State University of New York at Bu�alo

Bu�alo, NY 14260

[email protected]�alo.edu

Abstract

A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high

capacity channels available through optical �ber communication. A highly fault tolerant system is created with good

performance characteristics at a reduction in system complexity. The system capitalizes of the self-routing characteristic

of wavelength division multiple access to improve performance and reduce complexity.

A hypercube based structure is introduced, where optical multiple access channels span the dimensional axes. This

severely reduces the required degree, since only one I/O port is required per dimension. However, good performance

is maintained through the high capacity characteristics of optical communication. The reduction in degree is shown to

have signi�cant system complexity implications.

Four star-coupled con�gurations are studied as the basis for the optical multiple access channels, three of which

exhibit the optical self-routing characteristic. A performance analysis shows that through the integration of agile sources

or receivers, and wavelength division multiple access, systems can be developed with signi�cant increases in performance

yet at a reduction in communication subsystem complexity.

Index Terms: parallel computer architecture, computer communication, optical �ber communication, wavelength

division multiplexing.

I. Introduction

This paper examines techniques for interprocessor communication in a distributed memory, MIMD (multiple

instruction, multiple data) environment. The primary emphasis is on the reduction of interconnection network

complexity, while maintaining a speci�ed performance level.

A hypercube based structure is introduced, di�ering from other previously proposed schemes through its form

of physical interconnection: optical multiple access channels. The resulting structures have unity distance

connections between all nodes which share a common dimensional axis. The unity distance connections are

achieved through two mechanisms: the creation of multiple subchannels on a single physical channel, and

the access arbitration of each subchannel. This paper demonstrates that through the incorporation of optical

interconnections, a wide range of system parameters (performance and complexity) can be supported. This

hypercube based structure with wavelength division multiple access channels is denoted as WMCH throughout

this paper.

Six choices of optical multiple access channel implementations are examined: bidirectional bus, bidirectional

bus with control, dual unidirectional bus, folded unidirectional bus, doubly folded unidirectional bus, and star-

coupled. The fanout capability of the con�gurations are examined, limited by two major factors: the saturation

tra�c of the channel and the optical power budget. The saturation tra�c limits the number of interconnected

processors through the tra�c requirements of each node and the capacity of the channel. The second factor is the

optical power budget, which limits the number of interconnected processors through the physical characteristics

1

This work was supported by the National Science Foundation under Grant CCR-9010774.

1

of the optical devices. This paper illustrates that a star-coupled system o�ers far greater fanout than the other

choices, and is used as the basis in the remainder of the paper.

Section 2.2.2 introduces four mechanism of attaining multiple subchannels through wavelength division mul-

tiplexing (WDM) to target each node. With this approach, a processor does not receive and examine all the

data transmitted across a channel, but only data destined to it. This achieves optical self-routing across the

channel. The system is not fully self-routing since a packet must be forwarded from source to destination via

intermediate nodes. This requires the packet to undergo o/e and e/o (optical/electrical and electrical/optical)

conversions at a dimensional boundary for routing purposes. However, due to the topological characteristics of

the structure, the number of intermediate nodes have been severely reduced due to the low average distance of

the WMCH. For example, this paper illustrates that a structure supporting up to 64k nodes with a maximum

of one intermediate hop. This characteristic not only improves the performance, but signi�cantly reduces the

system complexity.

The access arbitration of the subchannels is discussed through two cases: a �xed allocation (TDMA), and

a random access (Slotted ALOHA) scheme. The two cases were chosen to provide a bounds for the resulting

performance. A class of protocols known as Demand Assignment Multiple Access (DAMA) have been introduced

in recent years to harness the high capacity nature of optical �ber [13]. The protocols exploited the unidirectional

nature of optical �ber, but ignored the physical characteristics of the optical devices. This paper shows that the

assumptions made by many such protocols regarding the channel con�guration are unreasonable, and highly

restricts the maximum system size due to power budget considerations. This limited system size has reduced

the attractiveness of such con�gurations.

The WMCH is a topological extension of the Spanning Multiaccess Channel Hypercube (SMCH) [8], the Gen-

eralized Hypercube (GHC) [3], and the Spanning Bus Hypercube [35]. However, a signi�cant improvement in

performance is achieved through the physical interconnection.

Topologically, the WMCH is equivalent to the SMCH and the GHC: a processor has a unity distance connection

to all processors which share a dimensional axis. However, the degree is severely reduced from that of the

GHC due to the multiple access channels, since only one I/O port is required per dimension. Performance is

maintained through the high capacity characteristics of optical communication. The reduction in degree has

signi�cant system complexity implications.

Wittie introduced the idea of using buses to provide interprocessor communication in a hypercube based struc-

ture [35]. However, due to advances in technology, many of the problems addressed in [35] have ceased to be

limiting factors. The goal of this work is to obtain the system complexity bene�ts of [35] and the topological

characteristics of [3,8]. Refer to [26] and [28] for an analysis of the adaptability with changing tra�c conditions,

and the synchronization of multiple slotted channels for the the spanning bus hypercube, respectively.

The format of the paper is as follows. The structure de�nitions of the WMCH are provided in Section 2, with a

discussion of the physical interconnection (possible channel con�gurations). The six previously mentioned optical

channel con�gurations are examined, followed by a more detailed study of four star-coupled con�gurations.

Three of the four star-coupled systems have the optical self-routing characteristic. A structure analysis is

presented in Section 2.3, comparing the resulting performance when TDMA or Slotted Aloha is used as the

media access control protocol.

Section 3 provides a comparison between other proposed interconnection techniques. Topological characteristics

(such as the degree, diameter, average distance, tra�c density and fault tolerance) of the boolean hypercube

(BC), generalized hypercube (GHC), nearest-neighbor mesh hypercube (NNMH) andWMCH are compared. This

is followed by a comparison of the expected delay and complexity.

II. WMCH Structure

The interconnection structure is de�ned in Section 2.1.1. Six possible channel con�gurations are examined in

terms of fanout, and shown to be limited by the optical power budget. Following this section, the paper focuses

2

on star-coupled interconnection due to its fanout superiority. Four star-coupled con�gurations are studied, of

which three exhibit the optical self-routing characteristic. The impact of the subchannel media access control

protocol is investigated in Section 2.3.

A. Topology

The following section de�nes the structure of the interconnection network in graph theoretic terms. This is

followed by an examination of its routing capability and structural characteristics.

x03x13x02x12x01x11x00x10

20x

21x213212211210

203202201200

2x32x22x12x0

1x3

10x

11x

103

113

00x

01x

1x21x11x0

0x30x20x10x0

112111110

102101100

013012011

003002001000

010

[0,X](0,3)(0,2)(0,1)(0,0)

(1,0) (1,1) (1,2) (1,3) [1,X]

[2,X](2,3)(2,2)(2,1)(2,0)

[3,X](3,3)(3,2)(3,1)(3,0)

[X,3][X,2][X,1][X,0]Star Couplers

OpticalPassive

Figure 1: WMCH, a hypercube based structure with optical multiple access channels spanning each dimensional axis. (a) N =

4� 3� 2, (b) N = 4� 4. The multiple access control is achieved through wavelength division multiple access coupled with either

Slotted Aloha or TDMA. The solid lines represent bundles of �bers to support point-to-point connections between the nodes and

the star coupler.

1) Interconnection Structure: A mixed radix system is used to represent the node numbering. Let N be a

decimal integer represented as a product of r factors:

N =

r

Y

i=1

m

i

A number P , 0 � P � (N � 1), can be represented as an r-tuple (p

r

; p

r�1

; : : : ; p

1

) where 0 � p

i

� (m

i

� 1).

There is a weight w

i

associated with each p

i

such that P =

r

X

i=1

p

i

w

i

, where w

i

=

i�1

Y

j=1

m

j

for all i, 1 � i � r. This

approach (and the schemes introduced in Section 3.1) are assumed to have N =

r

Y

i=1

m

i

nodes, and a processor

is denoted by the r-tuple P = (p

r

; : : : ; p

1

).

The WMCH is a hypercube based interconnection scheme in which interprocessor communication is based on

multiaccess channels spanning all dimensional axes (Figure 1). A channel is denoted by [c

r

: : : c

i+1

c

i

c

i�1

: : : c

1

],

an r-tuple in the mixed radix system. A channel spanning the i

th

dimension is indicated by c

i

= X for only

one 1 � i � r, and 0 � c

j

� (m

j

� 1) for all j, j 6= i and 1 � j � r. The X in position i of the r-tuple indicates

that the channel spans dimension i, as shown in Figure 1.

3

A processor (p

r

; : : : ; p

1

) is connected to an i-dimensional channel [c

r

: : : c

i+1

c

i

c

i�1

: : : c

1

] if c

i

= X and p

j

= c

j

for

all 1 � j � r, and j 6= i. Each processor will be connected to r such channels, all spanning di�erent dimensions.

Each i dimensional channel has m

i

processors attached where the processors have identical addresses except in

position i of the r-tuple, allowing simple, totally distributed routing schemes [7,8].

A WMCH with r = 3 and N = 4� 3� 2 is illustrated in Figure 1(a), and a 2-dimensional example with N = 4

2

is shown in Figure 1(b). When m

i

= m for all 1 � i � r, the total number of processors is N = m

r

, with a

total of

Nr

m

physical channels. A channel is tapped by all processors with identical r-tuple addresses, except

in the i

th

digit. For example, processor (213) in Figure 1(a) is attached to [x13], [2x3], and [21x]; and channel

[x12] attaches processors (012), (112) and (212). An i-dimensional channel is tapped by m

i

processors. The

total number of channels spanning dimension i is N=m

i

, giving the total number of channels in the system as:

L = N

r

X

i=1

1

m

i

(1)

A processor P = (p

r

: : : p

1

) is connected to a total of r channels, therefore the degree of this structure is r. The

distance between nodes is the Hamming distance between the source and destination address. The Hamming

distance between two nodes whose addresses di�er by 1 digit is unity, and the total Hamming distance is the

total number of di�ering address digits. This structure has a diameter of r, since a maximum of r digits may

di�er between node r-tuple addresses.

2) Structure Characteristics: This section examines the routing capability and structural characteristics of

the WMCH.

a) Routing: Due to the regularity of the structure, a routing scheme can be implemented without global

information.

A packet is transmitted on an i-dimensional channel if the i

th

digit of the packet destination address needs to

be aligned, since this is the only digit which di�ers among processors attached to that channel. If the i

th

digit

of the packet destination and node address match, the node accepts the packet. A digit by digit comparison is

then performed between the destination and node addresses. A packet reaches its destination when all digits

match. Otherwise, the accepted packet is routed to the queue of port j, where j is a dimension in which the

destination and node addresses di�er.

If the node and destination addresses di�er by d digits, there are d disjoint paths, all distance d. This is

illustrated in Figure 1(a). If node (003) wishes to transmit a packet to (210), there exist 3 disjoint paths of

distance 3. The packet could follow:

(003)! (203)! (213)! (210)

(003)! (013)! (010)! (210)

(003)! (000)! (200)! (210)

This leads to an alternate path routing scheme in which a packet may be queued in any one other d � 1

dimensions if the �rst choice queue length is larger than a certain threshold. Two possible variations place the

packet in the �rst of the d dimensions in which the queue length is below the threshold, or in the smallest of

the d queues. The average packet delay can be reduced since backlogged or faulty nodes are avoided.

b) Topology Characteristics: This section analyzes the structural characteristics of the WMCH. The degree

(k), diameter (d

max

), average distance (d), and tra�c density (�) are studied.

4

Degree and Diameter: A node P = (p

r

: : : p

1

) has one port for each incident multiple access channels. The

channels span di�erent dimensional axes. Since the structure is r-dimensional, the degree is r.

The distance between nodes is the Hamming distance between the source and destination address. A node is

at distance d from its destination when there are d di�ering digits between the r-tuple of the current node and

destination addresses. Since there can be at most r di�ering digits (the length of the r-tuple), the diameter of

the structure is r.

Average Distance: As will be discussed in the Structure Comparison section, the Generalized Hypercube (GHC)

contains direct links to all the nodes sharing a dimensional axis. Because of this connection scheme, the WMCH

has the same average distance as a GHC of similar size and con�guration.

To simplify the calculations when comparing WMCH with other interconnection schemes, it is assumed that

m

i

= m for all i, 1 � i � r. With this assumption, N = m

r

. This is a useful assumption, because it is necessary

for a balanced system. Furthermore, the value of m for all channels is to be maximized, bound by the fanout

limitations to obtain minimum cost. The average distance can now be written, as derived in [3,8]:

d =

�

r(m � 1)

m

� �

m

r

m

r

� 1

�

(2)

which is approximately

r(m� 1)

m

for large networks. The average distance is almost constant with increased

m, and increases as log

m

(N ) with increased r.

Tra�c Density: The next examined characteristic is the average tra�c density of each channel. The tra�c

density is de�ned as the product of the average distance and the total number of nodes contained within the

system divided by the total number of communication channels. With the WMCH, there are are S subchannel

per physical link. Using earlier results, the average tra�c density is

� =

dN

NrS=m

(3)

=

m � 1

S

�

N

N � 1

�

(4)

since there are a total of

Nr

m

channels in the system. When N � 1, � �

m � 1

S

.

B. WMCH Physical Interconnection

The following section investigates the implementation of optical multiple access channels. Section 2.2.1 (Optical

Power Budget) discusses six possible channel con�gurations: the bidirectional bus, bidirectional bus with control,

dual unidirectional bus, folded unidirectional bus, doubly folded unidirectional bus, and star-coupled. Star

coupled systems are examined in greater detail in the next section, due to their superior fanout capability.

Optical interconnects have been studied to achieve the cost and performance objectives discussed in the in-

troduction [9,10]. There are many desirable characteristics of optical interconnects: increased fanout, a large

bandwidth, high reliability, support longer interconnection lengths, exhibit low power requirements, and immu-

nity to EMI with reduced crosstalk.

Optical interconnections can appear at di�erent levels of the system design, providing chip-to-chip, module-to-

module, board-to-board and node-to-node communication [1,2,6,10,15-20,31,32]. A trade-o� between perfor-

mance and cost is possible. A decision must be made whether metal or optical interconnects are to be used in

5

an application. Di�culty arises because many parameters must be considered such as required power, speed,

complexity, length, fanout, reliability and cost.

Architectural constraints have shifted when optical interconnects are incorporated. Suppose the metal intercon-

nections of a parallel system (a hypercube, for example) were replaced with optical interconnects in a one-to-one

fashion. The performance improvement compared to the resulting cost might not seem worth the e�ort (this

would result in many expensive and underutilized channels). This paper introduces a technique that e�ciently

adopts this emerging technology.

Changes in computer interconnection are possible with optical components because of the relaxed fanout and

distance requirements. Processors can be designed at an increased physical distance, yet achieve superior

performance owing to the bandwidth-distance capability of optical communication. The propagation delay of

an optical link does not vary with changes in fanout. The optical fanout is not bound by capacitance but by

the power that must be delivered to each receiver to maintain a speci�ed bit-error-rate, referred as the optical

power budget (OPB).

The number of stations attached to an optical multiple access channel is bound by the saturation tra�c and the

OPB. However, because of the large bandwidth, the principle limitation in fanout is from power considerations.

The following section examines possible implementations of the optical multiple access channel, and in particular,

examines the fanout capability.

Node m-1Node 1

T R RT T R

Node 0

CNTRL

T R RT T RC C C

...Node 0 Node 1 Node m-1

Node m-1

R T

RT

Node 1

R T

RT

Node 0

R T

RT

Node m-1

RT

Node 1

RT

Node 0

RT

Node m-1

RRT

Node 1

RRT

Node 0

RRT

F

R

TNode m-1Node 1

F

R

T

F

R

TNode 0

Figure 2: Optical multiple access channel con�gurations: (a) bidirectional bus, (b) bidirectional bus with control, (c) dual unidi-

rectional bus, (d) folded unidirectional bus, (e) doubly folded unidirectional bus, and (f) star-coupled (SC

0

).

1) Optical Power Budget: Figure 2 illustrates six typical optical bus con�gurations: bidirectional bus (BB),

bidirectional bus with control (BBC), dual unidirectional bus (DUB), folded unidirectional bus (FUB), doubly

6

folded unidirectional bus (DFUB), and star-coupled (SC). In addition to the interconnection cost, the choice of

channel has a large impact on the system con�guration. This is because of the wide variation in optical power

budget.

The optical power budget limits the fanout, or the maximum number of processors that can be attached to

a channel. The OPB is important because of its impact on system cost. This is an extension of the usual

trade-o�s in degree and diameter in static interconnection networks: to reduce the diameter (for performance

reasons), requires an increase in degree. This increase in degree has signi�cant system cost implications: not

only the increase in cost for another I/O port, but modi�cation to the existing network is required. When the

fanout limits have been reached, a method to increase system size is through the increase in I/O ports. The

structure changes from a single bus, to either a hierarchical or regular structure.

The following elements must be considered to obtain the optical power budget:

2 power coupled from source

2 waveguide length and losses

2 insertion and reciprocity losses of the optical couplers

2 active or passive couplers

2 receiver power requirements

The optical power budget of the structures has been considered in detail in [10], and only the results are

summarized here. The expected fanout for two bus topologies are listed in Table 1. This case assumes laser

diode sources and avalanche photodiode detectors with characteristics +7dBm and �57dBm, respectively. The

remainder of the characteristics are as follows. It is assumed that the �ber has a loss of 0.2dB per kilometer

with a maximum length of 100m. The typical insertion loss for a commercially available connector is taken

to be �0:75dB, and a directional coupler insertion loss of �1:00dB is assumed. Table 1 list the resulting

maximum fanout for the characteristics just described. The terms �

o

and �

i

denoted the reciprocity factors of

the outbound and inbound couplers, respectively. The case of �

o

= 0:5 implies a 50% coupling of optical power

between waveguides in a coupler. The "-A" and "-P" terms in Table 1 denote active or passive directional

couplers, respectively.

The DFUB channel con�guration was not included in Table 1 because its optical fanout is almost identical to

the FUB channel con�guration since the coupling losses are more dominant than waveguide losses. The value

of the maximum number of channels attachable to a channel is highly dependent on the characteristics of the

devices. As the device characteristics improve, the number of nodes increases. The optical fanout of the star

coupled system is about 256 [2,15]

There has been much e�ort placed in examining optical bus-based systems, and protocols to provide access

arbitration [13,25,33]. However, the low fanout has reduced their attractiveness with systems requiring inter-

connection of greater than a dozen nodes.

The following section describes possible star-coupled con�gurations. The remainder of the paper focuses on a

star-coupled implementation because of its large fanout capability. Four examples of possible implementations

of star coupled systems are introduced in the next section.

2) Star-coupled Interconnection: Structures that capitalize on the exibility of agile distributed feedback

laser diodes and wavelength tunable �lters are examined. Non-coherent, wavelength tunable �lters can be

constructed in a variety of techniques: wavelength dependence of interferometric phenomena, for example Fabry-

Perot and Mach-Zehnder approaches; wavelength dependence of coupling through acousto-optic or electro-optic

techniques; and resonant ampli�cation which provides gain as well as wavelength selectivity.

The remainder of this section is organized as follows. The functional characteristics of optical devices are dis-

cussed initially. Four possible implementations of star coupled systems are introduced, followed by a discussion

on the communication protocol requirement for media access control.

7

Fabry-Perot wavelength tunable �lters have been constructed where a resonant cavity selectively interferes

with an incoming signal. Tunability is achieved through variations in the cavity length, the mirror re ectivity,

and through piezoelectric or electrostatic controls. Electro-optic tunable �lters have less tuning range (16 nm

with a bandwidth of 1 nm) than acousto-optic �lters. Acousto-optic devices have greater tunability, the entire

1:3 � 1:56�m range, with a �lter bandwidth of 1 nm. The setup delay for an acousto-optic device is on the

order of �S, whereas electro-optic device have on the order of nS setup time [23]. Tuning the acousto-optic

devices occur through changes in the acoustic frequency (10-300 MHz), and precise tuning is possible. The

electro-optic devices needs a drive voltage of about 10 v. Acousto-optic devices have the additional bene�t that

when multiple acoustical waves are superimposed, multiple subchannels can be extracted. For example, suppose

a common channel to be received by all nodes is needed for reservation and other control purposes. One of

the available channels could be deemed the common broadcast control channel, and a �lter of this type could

extract both the control channel and its speci�c data channel. This is achieved with a single �lter through the

superposition of two acoustic waves at the appropriate frequencies.

a) Star-coupled Examples: Four examples of possible implementations of star coupled systems are intro-

duced in the next section.

Single-channel star-coupled system: A star coupler broadcasts a message to all processors. The coupler uniformly

distributes all incoming optical power among the output waveguides. Figure 2(f) illustrates a possible channel

design with m processors attached (SC

0

).

All transmitters and receivers with SC

0

operate on the same wavelength. All nodes receives all tra�c. After a

node receives a packet (o/e and serial to parallel conversion), the header of the packet is decoded to determine

its destination address. The packet is discarded if the destination �eld and the local processor addresses do not

match.

Node m-1

Node m-1

Node 1

Node 1

Node 0

Node 0

Node m-1Node 1Node 0

...

...

...

T

R

F

T

R

F

T

R

F

T

R

F

T

R

F

T

R

F

m x 1

WDMF

R

T

F

R

T

F

R

T

Figure 3: Physical con�gurations of star-coupled optical multiple access channels. (a) Multiple subchannels via agile sources

(SC

1

), (b) Multiple subchannels via agile receivers (SC

2

), (c) Multiple subchannels via agile sources, with an m� 1 coupler and a

Wavelength Division Demultiplexer in place of a star-coupler (SC

3

).

8

Multiple-subchannel star-coupled systems: Two systems considered next are also based on an optical star cou-

pler, but either have agile transmitters or receivers. SC

1

is illustrated in Figure 3(a) and SC

2

by 3(b). The

two schemes assume the available optical bandwidth is partitioned into a comb of narrow subchannels, using

Wavelength Division Multiple Access (WDMA). WDMA di�ers from frequency division multiple access in the

subchannel spacing. The key to the two approaches is in the agility of either the sources or receivers.

SC

1

and SC

2

o�er the advantage of optical self-routing. Only the tra�c destined for a particular node is

delivered to that node. An SC

0

node must examine all transmitted tra�c to identity destined data. The

volume of data each node must process is greatly reduced with SC

1

and SC

2

, reducing the electronic bottleneck

problem. Assuming uniform destination distribution, a particular node in SC

1

and SC

2

processes 1=m

th

the

total tra�c as a node in SC

0

due to the subchannels. This characteristic has both performance and system cost

implications.

The optical self-routing characteristic is achieved through tunable receivers and transmitters. Tunable trans-

mitters and �xed wavelength receivers are used in SC

1

. A node n

i

has an agreed upon subchannel destination

address s

i

. A subchannel s

i

has a distinct wavelength �

i

, where �

i

6= �

j

for all i and j such that i 6= j,

0 � i � (m � 1), and 0 � j � (m � 1).

For node n

j

to communicate with node n

i

, n

j

tunes its transmitter to �

i

, and begins the transmission according

to the access protocol. The �xed wavelength receiver subchannel s

i

of n

i

has an optical �lter that blocks all

wavelengths except �

i

. After �ltering, the signal is then demodulated.

Each node in Figure 3(b) has a �xed (but distinct) wavelength transmitter and a tunable receiver (SC

2

). After

agreeing to accept a packet from station n

j

, node n

i

tunes its receiver �lter to s

j

and extracts �

j

. The media

access control protocol is more complex than that required by SC

1

since a transmitting station must �rst

inform the receiving station to expect the packet. For example, this could be achieved either through a separate

channel, through polling, or through a �xed assignment mechanism. This is a method to circumvent the agility

delay characteristics of laser diodes: if the agile laser diodes cannot switch su�ciently fast (on the order of nS)

and their latency impacts the overall performance, SC

2

is attractive since it allows the sources to remain set to

one particular wavelength.

A problem with all three of the approaches is collision. Stations in SC

1

cannot detect the transmission of an

outgoing packet, so overlapping transmissions occur unless a reservation scheme is used. A di�erence in the

three cases is the throughput. Collisions occur in SC

0

with any overlapped transmissions, whereas collisions

only occur in SC

1

and SC

2

when the same node is targeted. The star coupler in SC

1

and SC

2

can be viewed

as a non-blocking interconnection network, and SC

0

appears as a logical bus.

Multiple-subchannel wavelength division multiplexed system: The function of the star coupler is to distribute the

incoming optical power uniformly among the output ports. The power distribution, together with the limit of

agility of the sources and receivers, restrict the maximum number of processors on the star coupled systems.

The limit of the agility with currently constructed devices (estimated to be about 128 [20,23,24]) and power

budget limitations restrict the number to about 128-256 processor nodes [1,2,10,16,17]. See [34] for a review

of this technology. The system SC

3

, illustrated in Figure 3(c), avoids distributing the optical power among

all output ports, and optically routes the data to the destination node. In SC

1

and SC

2

, although optical

self-routing was also achieved, the optical power was distributed across all nodes limiting the fanout.

As in the three previous cases, there is no intermediate o=e and e=o conversions and destination decode with SC

3

.

The data is routed completely in its optical state. As with SC

1

and SC

2

, self-routing is achieved: the only data

delivered to a node is destined to that node. However, the optical power is not diluted in SC

3

which eases the

power budget limitation, and increases the fanout. All power other than insertion and coupling losses is routed

to the appropriate destination node. This is achieved through wavelength division multiplexing/demultiplexing.

Note that the wavelength-division demultiplexer is a passive component, as is the m� 1 coupler. This has fault

tolerance implications since a faulty transmitter or receiver isolates only the local node.

The operation of the m � 1 coupler is to funnel all incoming tra�c onto the same waveguide, and deliver it to

the WDM demultiplexer. Only transmissions simultaneously targeted to the same node collide, and all other

tra�c continues to pass una�ected by the collision.

9

A feature of this approach is the reduction in importance of topology. Much e�ort during the past decade was

placed in the examination of dynamic and static topologies [3,4,5,8,12,21,26-30,33]. In the four cases under

discussion, the limitation to system size is not bound by the tra�c generated at each node. The principal

limitations are the agility of the optical sources as in SC

1

and SC

3

; the optical detectors in SC

2

; and the

optical power budget in all four cases but to a lesser extent in SC

3

. SC

3

is an approach to increase the upper

limit of attachable processors because of the power budget considerations.

3) Subchannel Access Control Protocols: The structures have two levels of access protocols. The �rst is

wavelength division multiple access (WDMA), which partitions the enormous bandwidth of a single channel

into multiple, usable, subchannels. The object of this is to exploit the self-routing characteristic between source

and destination along a single optical channel. A receiver subsystem processes only tra�c destined to it, or

tra�c that it must forward along another dimension of the hypercube. Since the volume of tra�c processed is

reduced, the design requirements on this subsystem is eased. Essentially, this is an approach to obtain the high

capacity characteristics of optical communication, and circumvent the speed mismatch between electronic and

optical components.

A second advantage of WDMA is that although a system results with a low degree, a large number of nodes

are at a unity distance. For example, with a degree k (a k-dimensional structure) with m processors along each

dimension, a node has k(m � 1) neighbors at a distance of one. As illustrated in later sections, this achieves

a small diameter and average distance (relative to other interconnection topologies), yet achieves a low system

cost and complexity.

A second level media access protocol provides access in the time domain along each subchannel. This section

examines a few of the requirements on such a protocol and examines the behavior of Slotted ALOHA and Time

Division Multiple Access (TDMA).

A system must either allow collisions and rely on a positive acknowledgment scheme, or avoid collisions through

some reservation or �xed allocation scheme. Consider the case where each node has a reserved receiver sub-

channel (SC

1

). It is possible to have a reservation period for each round on each subchannel to request access.

A round could be constructed in two phases: (1) Initial request (reservation) phase, and (2) Data transmission

from each requesting node.

The di�culty with this approach is the synchronization with other subchannels. If each of the intervals are not

of �xed duration (�xed round length), a node would need to be able to detect the initiation of a new reservation

cycle for each subchannel. A node can only receive data along its receiver subchannel.

A possible solution is to have a TDMA subslot for reservation, where the destined node willACK if the request is

successful. If the source node detects no response from the target node, it inhibits according to a predetermined

algorithm. A second possibility is to allow collisions of the data tra�c, and ACK when successful. The second

approach could be slotted so corruption of a packet in midstream will not occur.

The critical resources to be optimized by multiple access protocols have shifted emphasis in this environment. In

the past, bandwidth was the critical resource and protocols were designed to maximize its utilization. That issue

is less of a constraint today due the the large available bandwidth. With the proposed approach, bandwidth is

not the limiting factor: it is the mismatch of speeds with the interface electronics. The objects pursued in this

paper is to obtain good performance (in terms of latency and throughput) while keeping the interface electronics

(communication subsystem) within their bounds of achievable speed. The self-routing characteristic of WDMA

aids this e�ort, since a node does not have to examine every packet transmitted along a channel.

a) Random Media Access Protocol: A possible protocol for SC

1

is described next. A slot is constructed

of two phases: the data transmission and acknowledgment subslots. A source node transmits a packet to

the destination (target) node during the data transmission subslot; and the destination node transmits an

acknowledgment to the source node during the ACK subslot. Since a source node cannot simultaneously

transmit to more than one destination nodes during a slot, the ACK subslot is collisionless. The ACK subslot is

composed of the time the receiver needs to verify the CRC, decode the source address, tune its agile transmitter,

10

and transmit the ACK.

Slotted ALOHA (with an elongated slot for the ACK transmission) is a possible choice for the access control

mechanism. A node assumes a collision has occurred if an ACK is not received. Note that CSMA is not possible

with SC

1

- SC

3

because the media cannot be sensed since each node only receives a particular subchannel.

The expected number of collisions is determined next when Slotted ALOHA is implemented with a total of

m stations. Assume that the source targets are uniformly distributed (for backlogged station n

i

, p(i; j) =

�

1

m�1

i 6= j

0 i = j

, where p(i; j) is the probability that node n

i

will transmit to n

j

).

Let Y denote the number of backlogged nodes. Consider a heavily backlogged system of Y = m�1 and m = S.

The probability that there is no transmission to node n

j

is p

(o)

j

=

�

1�

1

m � 1

�

m�1

, and the probability of

a single transmission is p

(1)

j

=

�

1�

1

m � 1

�

m�2

. In general, the probability that of Y backlogged stations, y

packets are targeted to station n

j

, y � Y , is p

(y)

j

=

�

Y

y

�

p

y

(1 � p)

Y�y

. De�ne the probability of collision as

p

(c)

j

=

m�1

X

y=2

p

(y)

j

, so p

(c)

j

= 1�

�

2m � 3

m� 1

��

1�

1

m � 1

�

m�2

in this example. Table 2(a) lists the probabilities for

m 2 f8; 16; 32; 64g.

Suppose the backlog was Y packets, p

(0)

j

=

�

1�

1

m� 1

�

Y

, p

(1)

j

= Y

�

1

m� 1

��

1�

1

m � 1

�

Y�1

, and p

(c)

j

=

1�

�

1 +

Y � 1

m � 1

��

1�

1

m � 1

�

Y�1

. Table 2(b) lists the probabilities with m = 64 for di�ering values of Y . Table

2(b) shows that 97% transmissions are successful in the case of Y = 4, 91% when Y = 8, 80% when Y = 16,

and 62% when Y = 32.

Denote the propagation delay from station n

i

to the star coupler as �

i

, and � = 2maxf�

i

; 1 � i � mg. B

denotes the data-rate of the subchannel (bits/sec), P the packet size (including preamble and header), so the

transmission time of a packet is T

p

= P=B. The total time of a slot is the transmission time of the packet,

ACK (T

a

), propagation delays (2� ), and the additional overhead of o/e conversion, address decode, checksum

computation and laser diode agility delay (�

p

+ �

a

). The total slot time is T

s

= T

p

+ T

a

+ 2� + �

p

+ �

a

.

The receiving station actually does not have to construct an ACK packet, but must transmit a burst at the

wavelength of the source for detection. Note that the time to recon�gure a laser diode is on the same order as

the propagation delay. In the case of positive acknowledgments, the total throughput would be

mp

(1)

j

T

s

packets

per second, and the subchannel utilization is U

j

=

�

T

p

T

s

�

p

(1)

j

.

b) Fixed Media Access Protocol: A �xed allocation scheme is examined because of its implementational

simplicity. There is no need for an acknowledgment mechanism, since it is collisionless, and has a bounded

service time. Its primary weakness is its behavior with bursty tra�c, and its primary strength is its excellent

performance with regular and heavy tra�c.

In the following discussion, only SC

0

and SC

1

are considered, both systems with m nodes. As before, packets

are a �xed length of P bits, and a subchannel �

i

has a data-rate of B bits/sec. The total data rate of an

individual node in SC

1

is still B bits/sec, since simultaneous subchannel transmission is not possible. The

packet transmission time for both cases is T

p

= P=B, and the arrival process from each node is assumed Poisson

with parameter �

o

. Note that this model does not block the generation of additional tra�c when a packet is

queued. This approach is taken since the model is for distributed memory parallel computer systems, where a

processor does not wait until a response is returned but either continues with the current process or context

switches to another process.

11

The delay is comprised of three components: the queueing delay, the transmission time, and the user slot

synchronization [14,22]. The average delay for the M/D/1 queueing model of TDMA is

t =

�

2

=�

2(1� �)

+ T

p

+

m

2

T

p

(5)

where the three terms represent the three components described above, � = m�T

p

, and � is the channel arrival

rate generated per node.

The average delay of SC

0

is

t

0

= T

p

h

m

2

+ 1

i

+

�

o

m

2

T

2

p

2[1� �

o

mT

p

]

(6)

since � = �

o

where �

o

denotes the total arrival rate generated by each node.

Assume there are S = m subchannels with SC

1

, each representing the path to a particular node. A packet

transmission along wavelength �

i

is destined to node n

i

, 0 � i � m � 1. A uniform tra�c is assumed, so the

rate at which node n

j

generates tra�c for n

i

is � = �

o

=(m�1), where i 6= j, and � is the total tra�c generated

by a node.

Node m

j

must maintainm� 1 separate bu�ers, one for each of the possible destination nodes. This additional

complexity, as compared with SC

0

which requires a single, longer, bu�er, is the price for the additional per-

formance demonstrated. However, the additional complexity in the transmission subsection of a node may be

o�set by the reduction in complexity in the receiver since only data destined to a node is routed to it, reducing

the volume of data a receiver must process.

rT

pT

Node 0

Node 3

Node 2

Node 10

m-1

2

1

Target/Destination SubchannelSource Nodes

Node 1

Node 4

Node 3

Node 2

Node 2

Node 5

Node 4

Node 3

Node 0

Node 1

Node m-1

Node m-2

Figure 4: Time-Space diagram of TDMA with SC

1

and N = S.

TDMA for SC

1

in this example is illustrated in Figure 4, which shows the behavior in each of the subchannels

and subslots, where T

r

is the time of a round, and T

s

is the time of a slot. In this case, T

s

= T

p

. In general,

the S subchannels (this assumes that m = S) are divided into m � 1 slots per round, so the total length of a

round is T

r

= T

p

(m�1). Determining the slot which is assigned to a particular source-destination pair is simple

and decentralized. Consider a node n

j

, 0 � j � m � 1. Figure 4 illustrates an approach where n

j

transmits to

n

j+1

during slot 1, n

j+2

during slot 2. In general, node n

j

transmits to n

(j+i)Mod(m)

during time-slot i for all

0 � i � m� 1.

The delay a packet incurs is similar to Equation (5), but with the modi�ed arrival rate. The average delay with

SC

1

, including the queueing delay for a particular subchannel is

t

1

=

1

2

T

p

[m+ 1] +

�

o

T

2

p

(m � 1)

2[1� �

o

T

p

]

(7)

12

0

20

40

60

80

100

120

140

0 0.2 0.4 0.6 0.8 1Average Arrival Rate

ts01(x)ts02(x)ts11(x)ts12(x)N=16

N=64

N=64

N=16

SC

SC

0

1

Figure 5: Average delay verses arrival rate for SC

0

and SC

1

with TDMA for N 2 f16;64g.

Figure 5 plots the average delay of SC

0

and SC

1

with a normalized service time (T

p

= 1), for m = 16 and

m = 64. The packet arrival rate is the per node probability of generating a new packet during a time slot. This

graph illustrates the performance improvement achievable through agile laser diode sources in SC

1

: a much

improved tra�c saturation point. Though the aggregate data rates of the individual communication links are

identical in both cases, the delay of SC

1

is much less and has a much greater capacity before saturation than

SC

0

because of the availability of multiple subchannels.

0

20

40

60

80

100

120

140

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Average Arrival Rate

N=16

Interleaved InterleavedTDMASlotted

Aloha

N=256

N=64

Figure 6: Simulation of SC

1

with Slotted Aloha and TDMA as the media access control protocol for N 2 f16;64;256g. Average

delay verses new packet generation probability (per node per slot).

Figure 6 graphs the results of a simulation study of SC

1

with Slotted Aloha and TDMA as the media access

control protocol, showing the average delay incurred as a function of the packet generation probability. Three

system sizes were considered: m 2 f16; 32; 64g. The delay is plotted as a function of individual packet generation

rates. Slotted Aloha protocol was implemented as described in the previous section, however a feedback mech-

anism was added to increase channel stability. This feedback mechanism modi�es a transmission probability

(TP). TP denotes the probability that a backlogged station will transmit during a particular slot. The TP was

decreased upon collision detection (absence of an ACK), and increased upon successful transmission. The graph

rea�rms the intuitively expected result: Slotted Aloha performs well with light tra�c, but saturated rapidly.

The TDMA approach can support a higher tra�c level, although with greater average delay with lighter tra�c.

This following section compares the WMCH with other interconnection schemes. Performance and complexity

issues are examined. In particular, the variation of the metrics are studied with increases in system size.

13

III. Behavior Comparison

Two system parameters vary when comparing a performance characteristic between hypercube structures with

increases in system size: the dimension, and the number of nodes along each dimension (width). The BC is a

special case that may only vary in dimension. Figures of merit with variations in both m and r are compared

in the following section. Equal sized networks are assumed in the comparison.

(1111)

(1010)

(0011)

(0000)

(0100)

(0101)

(1101)(0110)

(1000)

(1100)

(0001)

(1110)

(1011)

(0111)

(0010)

(000)

(010)

(100) (101) (102) (103)

(013)(012)(011)

(003)

(210) (211) (212) (213)

(203)(202)(201)(200)

(110 (111) (112) (113)

(210) (211) (212)

(202)(201)(200)

(110) (111) (112) (113)

(000)(003)

(013)(012)(011)(010)

(100) (101) (102) (103)

(203)

(213)

(c)(b)(a)

Figure 7: Topologies for comparison purposes. (a) boolean cube, N = 2

4

; (b) nearest neighbor mesh hypercube,N = 4� 3� 2; (c)

generalized hypercube, N = 4� 3� 2.

A. Structure De�nitions

The systems for comparison are de�ned in this section: the boolean cube (BC), nearest-neighbor mesh hypercube

(NNMH), and the generalized hypercube (GHC). A comparison with the WMCH then follows.

1) Boolean Cube: This system has N = 2

r

processors located on the vertices of an r-dimensional cube, since

m

i

= 2 for all 1 � i � r. Each node is represented as an r-tuple (p

r

: : : p

1

), where p

i

2 f0; 1g for all 1 � i � r.

A node (p

r

: : : p

1

) has a point-to-point communication link to (p

r

: : : p

i+1

[(p

i

+ 1)MOD(2)]p

i�1

: : : p

1

) for all i,

1 � i � r. Figure 7(a) illustrates a BC when r = 4.

2) Nearest Neighbor Mesh Hypercube: Nodes of a Nearest Neighbor Mesh Hypercube (NNMH) are ar-

ranged into an r-dimensional hypercube with m

i

nodes along each of the i dimensions. A node is connected

to its two nearest neighbors in each dimension. Node (p

r

: : : p

1

), where p

i

2 f0; 1; : : :m

i

� 1g, is connected

to (p

r

: : : p

i+1

[(p

i

� 1)MOD(m

i

)]p

i�1

: : : p

1

) and (p

r

: : : p

i+1

[(p

i

+ 1)MOD(m

i

)]p

i�1

: : : p

1

) for all i, 1 � i � r.

Figure 7(b) illustrates an example of this network when r = 3, and N = 4 � 3 � 2. When m

i

= m for all

1 � i � r, the total number of processors is N = m

r

.

3) Generalized Hypercube: A processor (p

r

: : : p

1

) of the Generalized Hypercube (GHC) is connected to all

processors that share its common dimensional axes, (p

r

: : : p

i+1

[(p

i

+ x)MOD(m

i

)]p

i�1

: : : p

1

) for all 1 � i � r,

and 1 � x � (m

i

� 1).

A GHC structure consists of r dimensions with m

i

nodes along the i

th

dimension. A node in a particular axis

has a direct point-to-point connection to all nodes along the same axis [3]. It has been determined in [3] that

an optimum structure occurs when m

i

= m = N

1=r

, for all 1 � i � r. The total number of nodes will then be

N = m

r

. An N = 4� 3� 2 network is illustrated in Figure 7(c).

14

0

10

20

30

40

50

60

10 100 1000 10000 100000N

GHC(1)

GHC(2)

BC

WMCH

NNMH(2)

NNMH(1)

Figure 8: Variation in degree with increased system size. (1) Expansion via r, with constant m = 8, (2) Expansion via m, with

constant r = 3.

B. Structure Comparison

In addition to typical topology characteristics (such as degree, diameter, and average distance), other metrics

of comparison are used in the following section to determine the advantages and disadvantages of the BC, GHC,

NNMH, WMCH. In particular, a model for the system complexity is introduced, followed by a description of

marginal analysis.

1) Degree: There usually is a trade o� between degree and diameter in interconnection networks. An

increase in degree, k, tends to reduce the diameter hence increasing performance at an increased node complexity.

For a network where each node has a large degree, the cost of the I=O ports become the dominating cost of the

network. The following sections shows that using multiple access channels, as in the WMCH, allows for both a

low degree and a small diameter.

The BC requires a port for each of the r dimensions of the cube, hence k = r. The NNMH has a degree of

k = 2r, with two links to each of its nearest neighbors in all the r dimensions. While this may seem greater

than the degree of a BC, the NNMH may have a lower degree for equal sized networks, as illustrated in Figure

8. The GHC requires a degree of m

i

� 1 for each of its r dimensions. When m

i

= m for all 1 � i � r, the

degree is k = r(m � 1). This large degree is due to the topology de�nition of a unity distance connection

between all nodes sharing a common dimensional axis. The WMCH achieves a unity distance connection with

all dimensional axes neighbors with a degree of k = r.

The degree is not constant for the BC and GHC, implying that network expansion requires modi�cation to

existing nodes. Expansion without modi�cation is possible with the NNMH and WMCH through an expansion

of m.

Figure 8 (and the graphs to follow) plots the increase in degree for increases in N under two cases: (1) expansion

via r, with constant m = 8; and (2) expansion via m, with constant r = 3. Due to the large capacity and high

fanout, the WMCH degree does not exceed 2 in this example. In the range under consideration (8 � N � 65536),

the WMCH has the form:

N =

�

m if N � 256

m

2

if N > 256

(8)

assuming a maximumfanout of 256. It is possible to force the NNMH to remain a 2-dimensional structure thereby

keeping the degree low. However, the resulting performance (diameter and average distance) is degraded with

increases in system size. The format of the WMCH de�ned in Equation (8) is used throughout the remainder

of the paper.

Through case (1), all schemes vary as log(N ), but with scaling factors of

1

log(2)

,

2

log(m)

and

(m � 1)

log(m)

for the

15

BC, NNMH and GHC, respectively. Through case (2), the degree of the NNMH remains constant while the

degree of the GHC varies as (N

1=r

� 1)r. The WMCH degree shows the least sensitivity to increases in N .

0

5

10

15

20

25

30

35

10 100 1000 10000 100000N

NNMH(2)

NNMH(1)

BC

WMCH

GHC(1)

GHC(2)

Figure 9: Variation in diameter with increased system size. (1) Expansion via r, with constant m = 8, (2) Expansion via m, with

constant r = 3.

2) Diameter: The distance between nodes in the BC, the GHC, and the WMCH is the hamming distance

between the source and destination addresses, since the distance between all nodes sharing a common dimen-

sional axis is unity. Since there can be at most r di�ering digits, r is the diameter of the networks. For the

NNMH, a packet traverses a maximum of

l

m

2

m

links per dimension, so the maximum distance in this network

is

l

m

2

m

r.

Through case (1), the diameter of the BC and WMCH is equals to the degree of the structure and vary as

described in the previous section. This illustrates the sensitivity of the BC in both degree and diameter with

increased N . The diameter of the NNMH varies as log(N ), with a scaling factor of

m

2log(m)

(Figure 9).

The diameter of the WMCH and GHC remain constant with increases in m. The NNMH diameter is sensitive to

increases in m. Figure 9 shows the diameter of the NNMH to vary linearly with m, or as

rN

1=r

2

. The diameter

of the GHC and WMCH remain constant at r with increases in m.

3) Average Distance: The average distance is useful since it provides a measure of the expected packet

delay. Let K

i

represent the number of nodes at a distance i. The average distance d is de�ned as:

d =

1

N � 1

r

X

i=1

iK

i

(9)

As derived in [3,8], the number of nodes at a given distance for the BC, WMCH, and the GHC is:

K

i

=

�

r

i

�

(m � 1)

i

(10)

since for i di�ering digits, each able to vary in (m � 1) ways, the enumeration is (m � 1)

i

. The combination of

i di�ering digits of r leads to Equation (10).

The GHC and the WMCH have an average distance of

(m � 1)rm

r�1

m

r

� 1

which approaches

r(m � 1)

m

for large

16

networks [3,8]. For the BC, m = 2 and the average distance is

rN

2(N � 1)

, or approximately

r

2

when N is large.

The NNMH has an average distance of

d =

r

m

m�1

X

i=0

min(i;m� 1) (11)

which is mr=4 when m is even, or

�

m

4

�

1

4m

�

r when m is odd, giving a total average distance of approximately

mr

4

.

0

2

4

6

8

10

12

14

16

10 100 1000 10000 100000N

NNMH(2)

GHC(1)

GHC(2)

WMCH

BC

NNMH(1)

Figure 10: Variation in average distance with increased system size. (1) Expansion via r, with constant m = 8, (2) Expansion via

m, with constant r = 3.

Figure 10 shows the variation in average distance with increased N . The BC, NNMH, WMCH, and the GHC

vary linearly with r. With variations in m, the NNMH shows the greatest sensitivity, varying linearly with m,

while the GHC and WMCH are approximately constant.

Equation (9) assumes that the packets are evenly distributed throughout the network. An operating system

may try to place computationally interrelated activity in nodes close in distance to minimize the packet delay.

Suppose a task is optimally mapped into a locality of @ nodes. Denote the maximum distance of this set of

nodes as d

@

, such that @ �

d

@

X

i=1

K

i

and @ >

d

@

�1

X

i=1

K

i

.

Denote the probability � of a node communicating with nodes at a distance less than or equal to d

@

, the average

distance is given by:

d =

"

�

P

d

@

j=1

K

j

#

d

@

X

i=1

iK

i

+

"

1� �

N �

P

d

@

j=1

K

j

#

r

X

i=d

@

+1

iK

i

(12)

With a small locality, the graph of Figure 10 may provide a pessimistic assessment of average distance for the

BC, NNMH and GHC. For example, many algorithms have been developed for the NNMH which only require

communication between neighbors. A very low average distance results, independent of system size.

4) Tra�c Density: The average tra�c density provides a measure of the potential packet delay by illus-

trating how much of the total packet tra�c each link must support. The average tra�c density is de�ned as the

product of the average distance and the total number of nodes, divided by L, the total number of communication

links.

17

The tra�c density of the topologies under consideration:

�

BC

=

�

N

N � 1

�

(13)

�

NNMH

=

m

4

(14)

�

GHC

=

2

m

�

N

N � 1

�

(15)

�

WMCH

=

(m � 1)

S

�

N

N � 1

�

(16)

since the BC has L = r2

r�1

communication channels, the NNMH has L = rN links, and the GHC has L =

r(m � 1)N

2

links.

0

0.5

1

1.5

2

2.5

3

3.5

4

10 100 1000 10000 100000N

NNMH(2)

NNMH(1)

BC

WMCH

GHC(1)GHC(2)

Figure 11: Variation in tra�c density with increased system size. (1) Expansion via r, with constantm = 8, (2) Expansion via m,

with constant r = 3.

Figure 11 plots the tra�c density with increases to system size. The tra�c density of the BC is insensitive to

variations in network size, and approaches 1 for large networks. The GHC has an extremely low tra�c density,

implying the large degree required for a low diameter has created a large number of channels with low channel

utilization. The WMCH has a tra�c density that approaches that of the BC when S = m.

5) Fault Tolerance: Two forms of connectivity are considered: node and link connectivity. Node (link)

connectivity is de�ned as the minimum number of faulty nodes (links) to disconnect the network.

Due to the regularity of the hypercube structures, node connectivity and link connectivity are both equal to

the degree of the structure. This is not true for the WMCH: a structure of degree r has a node connectivity of

r(m � 1).

There are d disjoint paths of equal length between two nodes at a distance d, d � r, for the BC,WMCH, NNMH,

and GHC. Although the degree of the node bounds the number of minimum length disjoint paths between source

and destination nodes, node connectivity limits the total number of paths available for packet routing where

the path may not be minimum length.

The link connectivity for the hypercube structures is identical to the node connectivity, except for the WMCH.

Assuming the worst case situation for the WMCH where a disruption in the channel causes a total loss of use,

the link connectivity is r. In general, this is unlikely since the channel is a passive component.

18

(a)

Node Processor Node Processor Node Processor

(b) (c)

Figure 12: Node model with three forms of internal node interconnection: (a) I(k) = k + 1; (b) I(k) = (k + 1)log

2

(k + 1); (c)

I(k) = (k+ 1)

2

6) Complexity: A function that provides a measure of network complexity is proposed in order to study

its behavior with increased network size. Figure 12 illustrates a model of a node, with three forms of internal

interconnection. Each node consists of three major components: the node processor (np), communication

processors interface (cp), and the local node interconnection (ni). The system cost is obtained by including the

cost of the communication channels (cc). The following assumes a system size of N nodes at a degree of k.

The cost of the node processor is neutral since equal sized networks are used in the comparison. However, the

term c

np

N is included in Equation (18) to represent the cost of N node processors.

A communication processor is responsible for capturing and bu�ering packets during transmission and reception,

providing media access control and logical link control (OSI levels 1-2). The complexity re ects the hardware

required to interface the communication channels with the local interconnection. Each node has k ports to

interface, so the complexity is O(k), and the complexity is modeled as kc

cp

N .

A performance/complexity trade-o� is present within the local interconnection. As shown in [7], if Q represents

an average latency delay per hop in routing packets in a network, the average packet delay increases by (d+1)Q,

where d is the average distance. This may become a signi�cant factor in packet delay, as demonstrated by the

initial models of the Intel iPSC. The structure of the local interconnection at each node can be enhanced to

reduce this delay. Figure 12 illustrates possible implementations of the internal interconnection. For example,

the local interconnections could be constructed as a dynamic multistage interconnection network, as illustrated

in Figure 12(b), and the complexity could be taken to be proportional to the number of 2�2 switching elements.

The local interconnection cost is modeled as c

ni

I(k)N , where I(k) represents the complexity of the local inter-

connection depending on its implementation. Three forms of I(k) are considered:

I(k) =

8

<

:

k + 1 (17:1)

(k + 1) log

2

(k + 1) (17:2)

(k + 1)

2

(17:3)

(17)

Equation (17.1) is the complexity of polling through a central controller as shown in Figure 12(a). Equation

(17.2) could be achieved by multistage switching elements as illustrated by Figure 12(b), and Equation (17.3)

could be achieved through a crossbar switch as shown in Figure 12(c).

Figure 12(a), modeled by Equation (17.1), would be similar to the initial implementation of the Intel iPSC.

Equations (17.2) and (17.3) attempt to improve the latency delay per hop through a more complex routing

structure. The remaining component of the system complexity function is the communication channels. This

term is a function of the (sub)channel bandwidth.

The complexity terms of the local interconnection, the channels and the channel interface of Equation (18) are

scaled by a term modeling the economies of scale of bandwidth within the communication links. The object

is study the relationship between the relative costs of metal and optical interconnects. A main advantage of

19

optical links is that the bandwidth can increase far beyond the bandwidth of a metal interconnect and still

retain the economies of scale properties. Since the crossover point is constantly moving, due to improvements

in technology, the increases in complexity of systems with optical interconnects are studied, using the metal

complexity as a reference. The B term in Equation (18) represents the relative increase bandwidth between

optical and metal interconnects, and � models the economies of scale.

Putting the complexity components of the node together, the system complexity model for a given topology is:

C = c

np

N + B

�

[c

cp

Nk + c

ni

NI(k) + c

cc

L] (18)

where L denotes the number of communication channels. Equation (18) is used to model the relative complexity

of systems with optical and metal interconnects.

This function re ects the number of nodes, degree per node, internal structure and channel implementation.

This equation also models the returns to scale of bandwidth: up to a point, doubling the channel bandwidth

does not double the cost. The total number of I=O ports are included in the parenthesis since a port may

increase in cost in order to handle the increased tra�c through the higher bandwidth channels.

N

30

25

20

15

10

5

00 10000 20000 30000 40000 50000 60000 70000

GHC(2) GHC(1)

BCNNMH(1)

NNMH(2)WMCH

Figure 13: Variation in system complexity with increases in system size. Relative bandwidth of optical links taken to be 10.

Figure 13 provides a comparison of the complexity with increases in system size. The relative bandwidth between

the metal and optical interconnects is taken to be 10 and � = 0:3. Section 3.2.8 examines the relationship of

complexity with relative bandwidths of 1, 10, 100 and 1000.

7) Delay Analysis: The average time a packet takes to travel from the source node to the destination node

is determined next. The development is continued with m

i

= m for all 1 � i � r. The model is based on an

M=M=1 system in which the output queue of each of the k transmit ports of a node receive packets from the

local node processor, and also packets being routed from the other k � 1 ports. The following assumptions are

made:

2 Each node is equally likely to be targeted as the destination of a packet.

2 Poisson arrivals of �

h

packets per second from all local node processors.

2 Uniform routing is performed without regard to queue lengths.

2 An accepted packet, that has not reached its destination, is routed on one other k � 1 ports with equal

probability.

Let 1=�B denote the mean service time, where B is the channel bandwidth, and 1=� is the average packet size.

The utilization factor is de�ned as � =

�

�

with � arrivals per second.

20

The total delay if a latency delay per hop Q, incurred by the communication processor for internal routing, is

taken into consideration is t = Q+

L

X

i=1

�

i

�

�

1

�B � �

i

+Q

�

, where �

i

is the arrival rate along channel i, L is the

total number of channels, � is the network load factor, which is equal to � = �=d, and � is the total arrival rate

along all the channels. With a regular network, all channels have equal capacity. There is equal tra�c on all

channels, because of the uniform routing, �

i

= �

j

for all i and j. The total arrival rate is � = L�

i

, reducing the

above equation to:

t =

d

�B � �

i

+Q(d+ 1) (19)

The average channel arrival rate for each scheme is determined next. New arrivals, �

h

, generated by each local

node processor are evenly distributed over its k I/O ports. Furthermore, assume that a packet is not routed

out on the same port from which it arrived. Relating the arrival rate of port i with the local node processor

arrivals, and the packets arriving along the other k � 1 ports:

�

i

=

�

h

k

+

1� p

k � 1

2

4

k

X

j=1

�

j

� �

i

3

5

(20)

The term 1� p denotes the probability that a packet has not reached its destination, where p = 1=d. The total

arrival rates of all ports except i, which have not reached their destination, are divided among k � 1 ports.

Equation (20) reduces to:

�

i

=

d

k

�

h

(21)

This result is expected, since the channel arrival rate is greater with a larger average distance, and smaller for a

larger degree. Equation (19) and (21) can be used to provide the saturation tra�c at the queueing leg for each

scheme. The saturation tra�c for each scheme is:

�

sat

=

k

d

h

1

�B

i

(22)

Equation (22) shows that the saturation tra�c can be increased with high capacity channels, higher degree,

and low average distance.

As seen in Equation (19), the delay includes a constant term (d+ 1)Q. In this analysis, the Q term is assumed

negligible when compared to the queueing delay. Since the constant delay varies linearly with d, interconnection

networks with a large average distance are most sensitive to this term which can be signi�cant.

With the characteristics of each scheme derived earlier, the packet delay can now be listed, utilizing Equations

(20) and (21):

t

BC

=

r=�B

2� �=�B

(23)

t

NNMH

=

2rm=�B

8� �m=�B

(24)

21

t

GHC

=

r(m � 1)=�B

m � �=�B

(25)

From Section 2.2.3,

t

WMCH

=

d

2

�

1

�B

o

�

[m+ 1] +

�d(m � 1)=(1=�B

o

)

2

2[1� �=�B

o

]

(26)

where B

o

denotes the data-rate along an optical subchannel. The subscript h has been eliminated from �

h

in

Equations (23)-(26), since there should be no confusion between arrivals.

0

20

40

60

80

100

120

140

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Average Arrival Rate

Analytic Model

Simulation

N=256

N=16

N=64

Figure 14: Comparison of WMCH analytic model with simulation results. Two dimensional structures with N 2 f16;32;64g.

Figure 14 graphs a comparison of the analytic model of the WMCH to simulation results. System sizes of

N = f16; 64; 256g were considered with 2-dimensional structures and S = m. The graph restricted itself to two

dimensional structures because there is little advantage in increasing the dimension beyond two. If the optical

power budget restricts the fanout to 256 nodes per star coupled system, a total system size of 65536 nodes can

be supported. This approach provides an approach that is scalable from small systems to massively parallel.

0

20

40

60

80

100

0 0.5 1 1.5 2 2.5 3 3.5 4Average Arrival Rate

BC NNMH GHCWMCH

(B=1)

WMCH (B=10)

Figure 15: Variation in packet delay with increasing arrival rate for the BC, GHC, NNMH and WMCH with N = 512. The optical

subchannels are taken to have a relative bandwidth with metal interconnects of 1 and 10.

Figure 15 graphs the average packet delay of the BC, GHC, NNMH and WMCH with N = 256. The slot

length and average packet size has been normalized to 1. The optical subchannels are taken to have a relative

bandwidth with metal interconnects of 1 and 10.

8) Marginal Complexity: The impact of network expansion should be taken into consideration when exam-

ining performance and complexity measures. The marginal expansion of a network must be carried out while

22

preserving the network structure [11].

Typically, topologies cannot be incrementally increased in size. Marginal Analysis denotes the incremental

increase to performance or cost as the system is expanded from size N to N + �N . For example, the BC

can only be expanded by doubling the system size. Through marginal analysis, the question of whether the

performance also doubles is addressed. Furthermore, marginal analysis identi�es cases when the node complexity

becomes increasingly more costly with increased system size.

Suppose the total number of processors are denoted as N =

r

Y

i=1

m

i

, which is often the case in regular hypercube

based static interconnection networks [3,8]. Expansion can be achieved through increases in r or m

i

for any

1 � i � r. Some topologies, such as the BC, can only be expanded through increases in r.

With expansion through r,

N +�N =

r+1

Y

i=1

m

i

(27)

�N = N (m

r+1

� 1) (28)

With expansion through an increase in m

i

to m

0

i

,

N +�N = m

1

: : :m

i�1

m

0

i

m

i+1

: : :m

r

(29)

�N = N

�

m

0

i

m

i

� 1

�

(30)

If the performance of a structure of size N is denoted as P(N), the Marginal Performance (MP) of the structure

is de�ned as the incremental increase in performance when the network is expanded to size N +�N :

MP =

P (N +�N )� P (N )

�N

(31)

For example, a measure of performance could be the product of the saturation tra�c and the total number of

processors. The Marginal Cost (MC) is similarly de�ned:

MC =

C(N +�N )�C(N )

�N

(32)

The MC is obtained for the BC, NNMH, GHC andWMCH using the complexity function de�ned in Section 3.2.6.

Figure 16 illustrates the di�culty of obtaining the excellent topological characteristics of the GHC. With both

approaches of expansion, the incremental complexity of the GHC increases rapidly.

Figure 17 graphs the marginal complexity of the WMCH when B 2 f1; 10; 100; 1000g. The surge in each trace is

due to the increase in degree as N exceeds 256 nodes. This graph illustrates the economies of scale in bandwidth

available with optical �ber links.

IV. Conclusion

A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances

in high capacity channels available through �ber optics, and wavelength division multiple access protocols.

23

0

100

200

300

400

500

600

100 1000 10000 100000N

GHC(2)

GHC(1)

WMCH

NNMH(1)

NNMH(2)

Figure 16: Variations in marginal complexitywith increases in system size. (1) Expansion via r, with constantm = 8, (2) Expansion

via m, with constant r = 3. The optical subchannels are taken to have a relative bandwidth with metal interconnects of 1.

0

20

40

60

80

100

100 1000 10000 100000N

B=1000

B=100

B=10

B=1

Figure 17: Variations in WMCH marginal complexityB 2 f1;10;100;1000g.

A highly fault tolerant system is created with good performance characteristics at a signi�cantly reduced cost.

The system capitalizes of the self-routing characteristic of wavelength division multiple access to improve per-

formance and reduce complexity.

Although the cost of the WMCH is signi�cantly lower than comparable architectures, the performance charac-

teristics of average distance, average packet delay, diameter, and delay are shown to be equal or better than the

studied networks.

This multiple access approach provides a more e�cient resource allocation, where the network performs well

under dynamically changing loads. The law of large numbers allows the balancing of temporary peaks in station

tra�c. This is a more e�cient approach than providing a large number of point-to-point channels with large

capacities to support peak tra�c.

In addition to the performance bene�ts optical interconnects provide, there is also the additional bene�t of relax-

ing the packaging requirements of the system design. Relaxing the interconnect physical distance requirements

at a particular data-rate has signi�cant system cost implications.

Four star-coupled systems were studied, of which three exhibited the optical self-routing characteristic. A

performance analysis showed that through the integration of agile sources or receivers, and wavelength divi-

sion multiple access, systems can be developed with signi�cant increases in performance yet at a reduction in

communication subsystem complexity.

24

V. References

1. E. Arthurs et al., "Multiwavelength Optical Crossconnect for Parallel Processing Computers," Electron.

Lett., vol. 24, pp. 119-120, 1988.

2. E. Arthurs, M.S. Goodman, H. Kobrinski, and M.P. Veechi, "HYPASS: An Optoelectronic Hybrid Package

Switching System," IEEE Jour. Sel. Areas Commun., vol. 6, pp. 1500-1510, Dec. 1988.

3. L.N. Bhuyan and D. P. Agrawal, "Generalized Hypercube and Hyperbus Structures for a Computer

Network," IEEE Trans. Comput., vol. c-33, pp. 323-333, Apr. 1984.

4. G.E. Carlson, J.E. Cruthirds, H. B. Sexton, and C. G. Wright, "Interconnection Networks Based on a

Generalization of Cube Connected Cycles," IEEE Trans. Comput., vol. c-34, pp. 769-773, Aug. 1985.

5. G. Chiola, M.A. Marsan and G. Balbo, "Product-Form Solution Techniques for the Performance Analysis

of Multiple-Bus Multiprocessor Systems with Non-uniform Memory Reference," IEEE Trans. Comput.,

vol. 37, pp. 532-540, May 1988.

6. B.D. Clymer and J.W. Goodman, "Optical clock distribution to silicon chips," Optical Engineering, vol.

25, pp. 1103 - 1108, Oct. 1986.

7. P.W. Dowd and K. Jabbour, "Performance Evaluation of the Spanning Multiaccess Channel Hypercube

Interconnection Network," IEE Proceedings, vol. 134, Pt. E, pp. 295-302, Nov. 1987.

8. P.W. Dowd and K. Jabbour, "Spanning Multiaccess Channel Hypercube Computer Interconnection,"

IEEE Trans. Comput., vol. 37, pp. 1137-1142, Sept. 1988.

9. P.W. Dowd, "Optical Bus and Star Coupled Parallel Interconnection", Proc. Fourth Ann. Symposium on

Parallel Processing, Fullerton, CA., pp. 824-838, April 1990.

10. P.W. Dowd, "Optical Interconnections for Computer Communication," IBM Technical Report TR01.A961,

Endicott NY, Apr. 1989.

11. P.W. Dowd, M. Dowd and K. Jabbour, "Static Interconnection Network Extensibility based on a Marginal

Performance/Cost Analysis," IEE Proceedings, vol. 136, Part E, pp. 9-15, Jan. 1989.

12. T. Feng, "A Survey of Interconnection Networks," Computer, pp. 12-27, Dec. 1981.

13. M. Fine and F.A. Tobagi, "Demand Assignment Multiple Access Schemes in Broadcast Bus Local Area

Networks," IEEE Trans. Comput., vol c-33, pp. 1130-1159, Dec. 1984.

14. M.J. Ferguson, "An Approximate Analysis of Delay for Fixed and Variable Length Packets in an Unslotted

ALOHA Channel," IEEE Trans. Commun., vol. COM-25, pp. 644-654, July 1977.

15. M.S. Goodman, "Multiwavelength Networks and New Approaches to Packet Switching," IEEE Commun.

Mag., pp. 27 - 35, Oct. 1989.

16. D.H. Hartman, "Digital high speed interconnects: a study of the optical alternative," Optical Engineering,

vol. 25, pp. 1086 - 1102, Oct. 1986.

17. P.R. Haugen, S. Rychnovsky, A. Husain and L.D. Hutcheson, "Optical interconnects for high speed com-

puting," Optical Engineering, vol. 25, pp. 1076 - 1085, Oct. 1986.

18. A.M. Hill, "One-Sided Rearrangeable Optical Switching Network," IEEE Jour. Lightwave. Tech., vol.

LT-4, pp. 785-789, July 1986.

19. H.S. Hinton, "A Non-Blocking Optical Interconnection Network using Directional Couplers," Proc. IEEE

Global Telecommunications Conference, vol. 2, pp. 885 - 889, Nov. 1984.

20. R.G. Hunsperger, Integrated Optics: Theory and Technology, Springer - Verlag, 1984.

21. K.B. Irani and I.H. Onyuksel, "A Closed-Form Solution for the Performance Analysis of Multiple-Bus

Multiprocessor Systems," IEEE Trans. Comput., vol. c-33, pp. 1004-1012, Nov. 1984.

25

22. L. Kleinrock, S.S. Lam, "Packet Switching in a Multiaccess Broadcast Channel: Performance Evaluation,"

IEEE Trans. Commun., vol. com-23, pp. 410-423, Apr. 1975.

23. H. Kobrinski and M. Yin, "Wavelength-Tunable Optical Filters: Applications and Technology," IEEE

Commun. Mag., pp. 53 - 35, Oct. 1989.

24. T.P Lee and C.E. Zah, "Wavelength-Tunable and Single Frequency Semiconductor Lasers for Photonic

Communications Networks," IEEE Commun. Mag., pp. 42 - 52, Oct. 1989.

25. N.F. Maxemchuk, "Twelve RandomAccess Strategies for Fiber-Optic Networks," IEEE Trans. Commun.,

vol. 36, pp. 942-950, Aug. 1988.

26. P.K. McKinley, Group Communication in Bus-Based Computer Networks, Ph.D. thesis, University of

Illinois at Urbana-Champaign, Urbana, Illinois, 1989.

27. T.N. Mudge and H.B. Al-Sadoun, "A Semi-Markov Model for the Performance of Multiple-Bus Systems,"

IEEE Trans. Comput., vol. c-34, pp. 934-942, Oct. 1985.

28. Y. Ofek, The Topology, Algorithms, and Analysis of a Synchronous Optical Hypergraph Architecture, Ph.D.

thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1989.

29. F.P. Preparata and J. Viullemin, "The Cube-Connected Cycles: A versatile network for parallel compu-

tation," Commum. Ass. Comput. Mach., vol. 24, pp. 300-309, May 1981.

30. D.A. Reed and H.D. Schwetman, "Cost-Performance Bounds for Multimicrocomputer Networks," IEEE

Trans. Comput., vol. c-32, pp.83-95, Jan. 1983.

31. A.A. Sawchuk and T. C. Strand, "Digital Optical Computing," Proc. IEEE, vol. 72, pp. 758 - 779, July

1984.

32. R.A. Spanke and V. E. Benes, "An N-stage Planar Optical Permutation Network," Applied Optics, vol.

26, Apr. 1987.

33. F.A. Tobagi, "Multiaccess Protocols in Packet Communication Systems," IEEE Trans. Commun., vol.

com-28, pp. 468-488, Apr. 1980.

34. S.S. Wagner and H.L Lemberg, "Technology and System Issues for a WDM-Based Fiber Loop Architec-

ture," IEEE Jour. Light. Tech., vol. 7, pp. 1759-1768, Nov. 1989.

35. L.D. Wittie, "Communication Structures for Large Networks of Microcomputers," IEEE Trans. Comput.,

vol c-30, pp. 264-272, Apr. 1981.

26

Documents

T o App ear in IEEE T r ansactions on Computers , 1992T o App ear in IEEE T r ansactions on Computers, 1992 W a velength Division Mul tiple A ccess Channel Hyper cube Pr ocessor Inter