To Appear in IEEE Transactions on Computers, 1992
Wavelength Division Multiple Access Channel Hypercube
Processor Interconnection
Patrick W. Dowd
1
Department of Electrical and Computer Engineering
State University of New York at Buffalo
Buffalo, NY 14260
[email protected]�alo.edu
Abstract
A multiprocessor system with a large number of nodes can be built at low cost by exploiting recent advances in the high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. The system capitalizes on the self-routing characteristic of wavelength division multiple access to improve performance and reduce complexity.
A hypercube based structure is introduced, where optical multiple access channels span the dimensional axes. This severely reduces the required degree, since only one I/O port is required per dimension. However, good performance is maintained through the high capacity characteristics of optical communication. The reduction in degree is shown to have significant system complexity implications.
Four star-coupled configurations are studied as the basis for the optical multiple access channels, three of which exhibit the optical self-routing characteristic. A performance analysis shows that through the integration of agile sources or receivers, and wavelength division multiple access, systems can be developed with significant increases in performance yet at a reduction in communication subsystem complexity.
Index Terms: parallel computer architecture, computer communication, optical fiber communication, wavelength division multiplexing.
I. Introduction
This paper examines techniques for interprocessor communication in a distributed memory, MIMD (multiple instruction, multiple data) environment. The primary emphasis is on the reduction of interconnection network complexity, while maintaining a specified performance level.
A hypercube based structure is introduced, differing from other previously proposed schemes through its form of physical interconnection: optical multiple access channels. The resulting structures have unity distance connections between all nodes which share a common dimensional axis. The unity distance connections are achieved through two mechanisms: the creation of multiple subchannels on a single physical channel, and the access arbitration of each subchannel. This paper demonstrates that through the incorporation of optical interconnections, a wide range of system parameters (performance and complexity) can be supported. This hypercube based structure with wavelength division multiple access channels is denoted as WMCH throughout this paper.
Six choices of optical multiple access channel implementations are examined: bidirectional bus, bidirectional bus with control, dual unidirectional bus, folded unidirectional bus, doubly folded unidirectional bus, and star-coupled. The fanout capability of the configurations is examined; it is limited by two major factors: the saturation traffic of the channel and the optical power budget. The saturation traffic limits the number of interconnected processors through the traffic requirements of each node and the capacity of the channel. The optical power budget limits the number of interconnected processors through the physical characteristics of the optical devices. This paper illustrates that a star-coupled system offers far greater fanout than the other choices, and it is used as the basis in the remainder of the paper.
1 This work was supported by the National Science Foundation under Grant CCR-9010774.
Section 2.2.2 introduces four mechanisms for attaining multiple subchannels through wavelength division multiplexing (WDM) to target each node. With this approach, a processor does not receive and examine all the data transmitted across a channel, but only data destined to it. This achieves optical self-routing across the channel. The system is not fully self-routing since a packet must be forwarded from source to destination via intermediate nodes. This requires the packet to undergo o/e and e/o (optical/electrical and electrical/optical) conversions at a dimensional boundary for routing purposes. However, the number of intermediate nodes is severely reduced by the low average distance of the WMCH, a consequence of its topological characteristics. For example, this paper illustrates a structure supporting up to 64k nodes with a maximum of one intermediate hop. This characteristic not only improves the performance, but significantly reduces the system complexity.
The access arbitration of the subchannels is discussed through two cases: a fixed allocation (TDMA), and a random access (Slotted ALOHA) scheme. The two cases were chosen to provide bounds for the resulting performance. A class of protocols known as Demand Assignment Multiple Access (DAMA) has been introduced in recent years to harness the high capacity nature of optical fiber [13]. The protocols exploited the unidirectional nature of optical fiber, but ignored the physical characteristics of the optical devices. This paper shows that the assumptions made by many such protocols regarding the channel configuration are unreasonable, and severely restrict the maximum system size due to power budget considerations. This limited system size has reduced the attractiveness of such configurations.
The WMCH is a topological extension of the Spanning Multiaccess Channel Hypercube (SMCH) [8], the Generalized Hypercube (GHC) [3], and the Spanning Bus Hypercube [35]. However, a significant improvement in performance is achieved through the physical interconnection.
Topologically, the WMCH is equivalent to the SMCH and the GHC: a processor has a unity distance connection to all processors which share a dimensional axis. However, the degree is severely reduced from that of the GHC due to the multiple access channels, since only one I/O port is required per dimension. Performance is maintained through the high capacity characteristics of optical communication. The reduction in degree has significant system complexity implications.
Wittie introduced the idea of using buses to provide interprocessor communication in a hypercube based structure [35]. However, due to advances in technology, many of the problems addressed in [35] have ceased to be limiting factors. The goal of this work is to obtain the system complexity benefits of [35] and the topological characteristics of [3,8]. Refer to [26] and [28] for an analysis of the adaptability with changing traffic conditions, and the synchronization of multiple slotted channels for the spanning bus hypercube, respectively.
The format of the paper is as follows. The structure definitions of the WMCH are provided in Section 2, with a discussion of the physical interconnection (possible channel configurations). The six previously mentioned optical channel configurations are examined, followed by a more detailed study of four star-coupled configurations. Three of the four star-coupled systems have the optical self-routing characteristic. A structure analysis is presented in Section 2.3, comparing the resulting performance when TDMA or Slotted Aloha is used as the media access control protocol.
Section 3 provides a comparison with other proposed interconnection techniques. Topological characteristics (such as the degree, diameter, average distance, traffic density and fault tolerance) of the boolean hypercube (BC), generalized hypercube (GHC), nearest-neighbor mesh hypercube (NNMH) and WMCH are compared. This is followed by a comparison of the expected delay and complexity.
II. WMCH Structure
The interconnection structure is defined in Section 2.1.1. Six possible channel configurations are examined in terms of fanout, and shown to be limited by the optical power budget. Following this section, the paper focuses on star-coupled interconnection due to its fanout superiority. Four star-coupled configurations are studied, of which three exhibit the optical self-routing characteristic. The impact of the subchannel media access control protocol is investigated in Section 2.3.
A. Topology
The following section defines the structure of the interconnection network in graph theoretic terms. This is followed by an examination of its routing capability and structural characteristics.
Figure 1: WMCH, a hypercube based structure with optical multiple access channels spanning each dimensional axis. (a) N = 4 × 3 × 2, (b) N = 4 × 4. The multiple access control is achieved through wavelength division multiple access coupled with either Slotted Aloha or TDMA. The solid lines represent bundles of fibers to support point-to-point connections between the nodes and the star coupler.
1) Interconnection Structure: A mixed radix system is used to represent the node numbering. Let N be a decimal integer represented as a product of r factors:
$$N = \prod_{i=1}^{r} m_i$$
A number $P$, $0 \le P \le (N-1)$, can be represented as an r-tuple $(p_r, p_{r-1}, \ldots, p_1)$ where $0 \le p_i \le (m_i - 1)$. There is a weight $w_i$ associated with each $p_i$ such that $P = \sum_{i=1}^{r} p_i w_i$, where $w_i = \prod_{j=1}^{i-1} m_j$ for all $i$, $1 \le i \le r$. This approach (and the schemes introduced in Section 3.1) is assumed to have $N = \prod_{i=1}^{r} m_i$ nodes, and a processor is denoted by the r-tuple $P = (p_r, \ldots, p_1)$.
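The mixed radix numbering can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function names and the convention of passing the radices most-significant first, as $(m_r, \ldots, m_1)$, are assumptions made for the example.

```python
from math import prod

def to_tuple(P, radices):
    """Convert integer P to the r-tuple (p_r, ..., p_1).

    `radices` is given most-significant first: (m_r, ..., m_1).
    """
    N = prod(radices)
    if not 0 <= P < N:
        raise ValueError("P must satisfy 0 <= P <= N - 1")
    digits = []
    for m in reversed(radices):       # peel off p_1 first, then p_2, ...
        P, p = divmod(P, m)
        digits.append(p)
    return tuple(reversed(digits))    # back to (p_r, ..., p_1)

def from_tuple(digits, radices):
    """Recover P = sum(p_i * w_i), with w_i = m_1 * ... * m_(i-1)."""
    P, w = 0, 1
    for p, m in zip(reversed(digits), reversed(radices)):
        P += p * w                    # w is the weight w_i of digit p_i
        w *= m
    return P
```

With the hypothetical radices `(3, 2, 4)`, `to_tuple(23, (3, 2, 4))` yields `(2, 1, 3)` and `from_tuple((2, 1, 3), (3, 2, 4))` recovers 23.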
The WMCH is a hypercube based interconnection scheme in which interprocessor communication is based on multiaccess channels spanning all dimensional axes (Figure 1). A channel is denoted by $[c_r \ldots c_{i+1} c_i c_{i-1} \ldots c_1]$, an r-tuple in the mixed radix system. A channel spanning the $i$th dimension is indicated by $c_i = X$ for only one $1 \le i \le r$, and $0 \le c_j \le (m_j - 1)$ for all $j$, $j \ne i$ and $1 \le j \le r$. The X in position i of the r-tuple indicates that the channel spans dimension i, as shown in Figure 1.
A processor $(p_r, \ldots, p_1)$ is connected to an i-dimensional channel $[c_r \ldots c_{i+1} c_i c_{i-1} \ldots c_1]$ if $c_i = X$ and $p_j = c_j$ for all $1 \le j \le r$, and $j \ne i$. Each processor will be connected to r such channels, all spanning different dimensions. Each i-dimensional channel has $m_i$ processors attached where the processors have identical addresses except in position i of the r-tuple, allowing simple, totally distributed routing schemes [7,8].
A WMCH with r = 3 and N = 4 × 3 × 2 is illustrated in Figure 1(a), and a 2-dimensional example with $N = 4^2$ is shown in Figure 1(b). When $m_i = m$ for all $1 \le i \le r$, the total number of processors is $N = m^r$, with a total of $\frac{Nr}{m}$ physical channels. A channel is tapped by all processors with identical r-tuple addresses, except in the $i$th digit. For example, processor (213) in Figure 1(a) is attached to [x13], [2x3], and [21x]; and channel [x12] attaches processors (012), (112) and (212). An i-dimensional channel is tapped by $m_i$ processors. The total number of channels spanning dimension i is $N/m_i$, giving the total number of channels in the system as:
$$L = N \sum_{i=1}^{r} \frac{1}{m_i} \qquad (1)$$
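The attachment rule and the channel count of Equation (1) can be checked with a short Python sketch. The helper names are hypothetical, and the radices `(3, 2, 4)` (listed in r-tuple order, most significant digit first) are an assumption inferred from the examples quoted for Figure 1(a), where channel [x12] attaches three processors and the last digit of (213) reaches 3.

```python
from itertools import product

def channels_of(node):
    """Channels tapped by a processor: replace one digit at a time with 'X'."""
    return [node[:i] + ("X",) + node[i + 1:] for i in range(len(node))]

def processors_on(channel, radices):
    """Processors attached to a channel: every value of the 'X' digit."""
    i = channel.index("X")
    return [channel[:i] + (p,) + channel[i + 1:] for p in range(radices[i])]

def all_channels(radices):
    """Enumerate every distinct channel in the structure."""
    chans = set()
    for node in product(*(range(m) for m in radices)):
        chans.update(channels_of(node))
    return chans

radices = (3, 2, 4)   # assumed (m_3, m_2, m_1) for Figure 1(a)
print(channels_of((2, 1, 3)))
print(processors_on(("X", 1, 2), radices))
print(len(all_channels(radices)))   # compare with N * sum(1/m_i)
```

For these radices the enumeration gives 26 channels, matching 24 × (1/3 + 1/2 + 1/4) from Equation (1).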
A processor $P = (p_r \ldots p_1)$ is connected to a total of r channels, therefore the degree of this structure is r. The distance between nodes is the Hamming distance between the source and destination address. The Hamming distance between two nodes whose addresses differ by 1 digit is unity, and the total Hamming distance is the total number of differing address digits. This structure has a diameter of r, since a maximum of r digits may differ between node r-tuple addresses.
2) Structure Characteristics: This section examines the routing capability and structural characteristics of
the WMCH.
a) Routing: Due to the regularity of the structure, a routing scheme can be implemented without global
information.
A packet is transmitted on an i-dimensional channel if the $i$th digit of the packet destination address needs to be aligned, since this is the only digit which differs among processors attached to that channel. If the $i$th digit of the packet destination and node address match, the node accepts the packet. A digit by digit comparison is then performed between the destination and node addresses. A packet reaches its destination when all digits match. Otherwise, the accepted packet is routed to the queue of port j, where j is a dimension in which the destination and node addresses differ.
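A minimal sketch of this digit-by-digit forwarding decision follows. The function names are hypothetical, and always correcting the leftmost differing digit first is an assumption made for the example; the text allows any differing dimension to be chosen.

```python
def route_step(node, dest):
    """Return None if the packet has arrived, else (position, next_node).

    `position` is the index in the r-tuple of the digit corrected by
    this hop; position 0 holds p_r.
    """
    for i, (p, q) in enumerate(zip(node, dest)):
        if p != q:
            # One hop on the channel spanning this dimension aligns digit i.
            return i, node[:i] + (q,) + node[i + 1:]
    return None

def route(src, dest):
    """Follow route_step until all digits match; return the visited nodes."""
    path = [src]
    step = route_step(src, dest)
    while step is not None:
        path.append(step[1])
        step = route_step(step[1], dest)
    return path

print(route((0, 0, 3), (2, 1, 0)))
```

Each hop needs only the local address and the destination address, which is why no global routing information is required.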
If the node and destination addresses differ by d digits, there are d disjoint paths, all of distance d. This is illustrated in Figure 1(a). If node (003) wishes to transmit a packet to (210), there exist 3 disjoint paths of distance 3. The packet could follow:
(003) → (203) → (213) → (210)
(003) → (013) → (010) → (210)
(003) → (000) → (200) → (210)
This leads to an alternate path routing scheme in which a packet may be queued in any one of the other d − 1 dimensions if the first choice queue length is larger than a certain threshold. Two possible variations place the packet in the first of the d dimensions in which the queue length is below the threshold, or in the shortest of the d queues. The average packet delay can be reduced since backlogged or faulty nodes are avoided.
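One way to enumerate d disjoint paths is to correct the differing digits in each cyclic rotation of a fixed order. The sketch below is an illustration under that assumption (the source does not state how the paths are constructed). `pick_dimension` is a hypothetical helper that combines the two queue-selection variations, falling back to the shortest queue when no queue is below the threshold.

```python
def disjoint_paths(src, dest):
    """Return d paths of length d whose intermediate nodes never coincide."""
    diff = [i for i in range(len(src)) if src[i] != dest[i]]
    paths = []
    for k in range(len(diff)):
        order = diff[k:] + diff[:k]          # cyclic rotation of the digit order
        node, path = src, [src]
        for i in order:
            node = node[:i] + (dest[i],) + node[i + 1:]
            path.append(node)
        paths.append(path)
    return paths

def pick_dimension(candidates, queue_len, threshold):
    """First candidate whose queue is below the threshold, else the shortest queue."""
    for i in candidates:
        if queue_len[i] < threshold:
            return i
    return min(candidates, key=lambda i: queue_len[i])

for p in disjoint_paths((0, 0, 3), (2, 1, 0)):
    print(p)
```

For source (003) and destination (210) the rotations reproduce the three paths listed above.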
b) Topology Characteristics: This section analyzes the structural characteristics of the WMCH. The degree ($k$), diameter ($d_{\max}$), average distance ($\bar{d}$), and traffic density ($\rho$) are studied.
Degree and Diameter: A node $P = (p_r \ldots p_1)$ has one port for each incident multiple access channel. The channels span different dimensional axes. Since the structure is r-dimensional, the degree is r.
The distance between nodes is the Hamming distance between the source and destination address. A node is at distance d from its destination when there are d differing digits between the r-tuple of the current node and destination addresses. Since there can be at most r differing digits (the length of the r-tuple), the diameter of the structure is r.
Average Distance: As will be discussed in the Structure Comparison section, the Generalized Hypercube (GHC) contains direct links to all the nodes sharing a dimensional axis. Because of this connection scheme, the WMCH has the same average distance as a GHC of similar size and configuration.
To simplify the calculations when comparing WMCH with other interconnection schemes, it is assumed that $m_i = m$ for all $i$, $1 \le i \le r$. With this assumption, $N = m^r$. This is a useful assumption, because it is necessary for a balanced system. Furthermore, the value of m for all channels is to be maximized, bound by the fanout limitations, to obtain minimum cost. The average distance can now be written, as derived in [3,8]:
$$\bar{d} = \left[\frac{r(m-1)}{m}\right]\left[\frac{m^r}{m^r-1}\right] \qquad (2)$$
which is approximately $\frac{r(m-1)}{m}$ for large networks. The average distance is almost constant with increased m, and increases as $\log_m(N)$ with increased r.
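The closed form for the average distance can be cross-checked by brute force on small structures. The sketch below assumes the equal-radix case ($m_i = m$) and averages the Hamming distance from one node to every other node; the function names are illustrative only.

```python
from itertools import product

def avg_distance_formula(m, r):
    """Average distance [r(m-1)/m] * [m^r / (m^r - 1)]."""
    return (r * (m - 1) / m) * (m**r / (m**r - 1))

def avg_distance_bruteforce(m, r):
    """Mean Hamming distance from one node to all N - 1 other nodes."""
    nodes = list(product(range(m), repeat=r))
    src = nodes[0]                    # the structure is symmetric, so any source works
    total = sum(sum(a != b for a, b in zip(src, n)) for n in nodes[1:])
    return total / (len(nodes) - 1)

for m, r in [(4, 2), (3, 3), (8, 2)]:
    print(m, r, avg_distance_formula(m, r), avg_distance_bruteforce(m, r))
```

Both functions agree on these cases; for m = 4 and r = 2 the average distance is 1.6, already close to the large-network approximation r(m − 1)/m = 1.5.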
Traffic Density: The next examined characteristic is the average traffic density of each channel. The traffic density is defined as the product of the average distance and the total number of nodes contained within the system divided by the total number of communication channels. With the WMCH, there are S subchannels per physical link. Using earlier results, the average traffic density is
$$\rho = \frac{\bar{d} N}{N r S / m} \qquad (3)$$
$$\phantom{\rho} = \frac{m-1}{S}\left(\frac{N}{N-1}\right) \qquad (4)$$
since there are a total of $\frac{Nr}{m}$ channels in the system. When $N \gg 1$, $\rho \approx \frac{m-1}{S}$.
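The step from Equation (3) to Equation (4) can be written out by substituting the average distance of Equation (2) with $N = m^r$. The symbols $\rho$ (traffic density) and $\bar{d}$ (average distance) are used here for the two quantities defined in this section:

```latex
\begin{align*}
\rho &= \frac{\bar{d}\,N}{N r S / m}
      = \frac{\bar{d}\,m}{r S} \\
     &= \frac{m}{r S}\cdot\frac{r(m-1)}{m}\cdot\frac{m^{r}}{m^{r}-1}
      = \frac{m-1}{S}\left(\frac{N}{N-1}\right), \qquad N = m^{r}.
\end{align*}
```

As an illustrative case (values chosen only for the example), m = 4, r = 2 and S = 4 give $\rho = \frac{3}{4}\cdot\frac{16}{15} = 0.8$.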
B. WMCH Physical Interconnection
The following section investigates the implementation of optical multiple access channels. Section 2.2.1 (Optical Power Budget) discusses six possible channel configurations: the bidirectional bus, bidirectional bus with control, dual unidirectional bus, folded unidirectional bus, doubly folded unidirectional bus, and star-coupled. Star-coupled systems are examined in greater detail in the next section, due to their superior fanout capability.
Optical interconnects have been studied to achieve the cost and performance objectives discussed in the introduction [9,10]. There are many desirable characteristics of optical interconnects: increased fanout, large bandwidth, high reliability, support for longer interconnection lengths, low power requirements, and immunity to EMI with reduced crosstalk.
Optical interconnections can appear at different levels of the system design, providing chip-to-chip, module-to-module, board-to-board and node-to-node communication [1,2,6,10,15-20,31,32]. A trade-off between performance and cost is possible. A decision must be made whether metal or optical interconnects are to be used in an application. Difficulty arises because many parameters must be considered such as required power, speed, complexity, length, fanout, reliability and cost.
Architectural constraints shift when optical interconnects are incorporated. Suppose the metal interconnections of a parallel system (a hypercube, for example) were replaced with optical interconnects in a one-to-one fashion. The performance improvement compared to the resulting cost might not seem worth the effort (this would result in many expensive and underutilized channels). This paper introduces a technique that efficiently adopts this emerging technology.
Changes in computer interconnection are possible with optical components because of the relaxed fanout and distance requirements. Processors can be placed at an increased physical distance, yet achieve superior performance owing to the bandwidth-distance capability of optical communication. The propagation delay of an optical link does not vary with changes in fanout. The optical fanout is not bound by capacitance but by the power that must be delivered to each receiver to maintain a specified bit-error-rate, referred to as the optical power budget (OPB).
The number of stations attached to an optical multiple access channel is bound by the saturation traffic and the OPB. However, because of the large bandwidth, the principal limitation in fanout is from power considerations. The following section examines possible implementations of the optical multiple access channel, and in particular, examines the fanout capability.
Figure 2: Optical multiple access channel configurations: (a) bidirectional bus, (b) bidirectional bus with control, (c) dual unidirectional bus, (d) folded unidirectional bus, (e) doubly folded unidirectional bus, and (f) star-coupled ($SC_0$).
1) Optical Power Budget: Figure 2 illustrates six typical optical bus configurations: bidirectional bus (BB), bidirectional bus with control (BBC), dual unidirectional bus (DUB), folded unidirectional bus (FUB), doubly folded unidirectional bus (DFUB), and star-coupled (SC). In addition to the interconnection cost, the choice of channel has a large impact on the system configuration. This is because of the wide variation in optical power budget.
The optical power budget limits the fanout, or the maximum number of processors that can be attached to a channel. The OPB is important because of its impact on system cost. This is an extension of the usual trade-offs in degree and diameter in static interconnection networks: reducing the diameter (for performance reasons) requires an increase in degree. This increase in degree has significant system cost implications: not only does another I/O port add cost, but the existing network must also be modified. When the fanout limits have been reached, a method to increase system size is through the increase in I/O ports. The structure changes from a single bus to either a hierarchical or regular structure.
The following elements must be considered to obtain the optical power budget:
• power coupled from source
• waveguide length and losses
• insertion and reciprocity losses of the optical couplers
• active or passive couplers
• receiver power requirements
The optical power budget of the structures has been considered in detail in [10], and only the results are summarized here. The expected fanout for two bus topologies is listed in Table 1. This case assumes laser diode sources and avalanche photodiode detectors with characteristics +7 dBm and −57 dBm, respectively. The remainder of the characteristics are as follows. It is assumed that the fiber has a loss of 0.2 dB per kilometer with a maximum length of 100 m. The typical insertion loss for a commercially available connector is taken to be −0.75 dB, and a directional coupler insertion loss of −1.00 dB is assumed. Table 1 lists the resulting maximum fanout for the characteristics just described. The terms with subscripts $o$ and $i$ denote the reciprocity factors of the outbound and inbound couplers, respectively. An outbound reciprocity factor of 0.5 implies a 50% coupling of optical power between waveguides in a coupler. The "-A" and "-P" terms in Table 1 denote active or passive directional couplers, respectively.
The DFUB channel configuration was not included in Table 1 because its optical fanout is almost identical to that of the FUB channel configuration, since the coupling losses are more dominant than the waveguide losses. The maximum number of nodes attachable to a channel is highly dependent on the characteristics of the devices. As the device characteristics improve, the number of nodes increases. The optical fanout of the star-coupled system is about 256 [2,15].
Much effort has been placed in examining optical bus-based systems, and protocols to provide access arbitration [13,25,33]. However, the low fanout has reduced their attractiveness for systems requiring interconnection of more than a dozen nodes.
The following section describes possible star-coupled configurations. The remainder of the paper focuses on a star-coupled implementation because of its large fanout capability. Four examples of possible implementations of star-coupled systems are introduced in the next section.
2) Star-coupled Interconnection: Structures that capitalize on the flexibility of agile distributed feedback laser diodes and wavelength tunable filters are examined. Non-coherent, wavelength tunable filters can be constructed through a variety of techniques: wavelength dependence of interferometric phenomena, for example Fabry-Perot and Mach-Zehnder approaches; wavelength dependence of coupling through acousto-optic or electro-optic techniques; and resonant amplification which provides gain as well as wavelength selectivity.
The remainder of this section is organized as follows. The functional characteristics of optical devices are dis-
cussed initially. Four possible implementations of star coupled systems are introduced, followed by a discussion
on the communication protocol requirement for media access control.
Fabry-Perot wavelength tunable filters have been constructed where a resonant cavity selectively interferes with an incoming signal. Tunability is achieved through variations in the cavity length, the mirror reflectivity, and through piezoelectric or electrostatic controls. Electro-optic tunable filters have less tuning range (16 nm with a bandwidth of 1 nm) than acousto-optic filters. Acousto-optic devices have greater tunability, covering the entire 1.3 to 1.56 μm range with a filter bandwidth of 1 nm. The setup delay for an acousto-optic device is on the order of μs, whereas electro-optic devices have setup times on the order of ns [23]. Tuning of the acousto-optic devices occurs through changes in the acoustic frequency (10-300 MHz), and precise tuning is possible. The electro-optic devices need a drive voltage of about 10 V. Acousto-optic devices have the additional benefit that when multiple acoustical waves are superimposed, multiple subchannels can be extracted. For example, suppose a common channel to be received by all nodes is needed for reservation and other control purposes. One of the available channels could be deemed the common broadcast control channel, and a filter of this type could extract both the control channel and its specific data channel. This is achieved with a single filter through the superposition of two acoustic waves at the appropriate frequencies.
a) Star-coupled Examples: Four examples of possible implementations of star-coupled systems are introduced in this section.
Single-channel star-coupled system: A star coupler broadcasts a message to all processors. The coupler uniformly distributes all incoming optical power among the output waveguides. Figure 2(f) illustrates a possible channel design with m processors attached ($SC_0$).
All transmitters and receivers with $SC_0$ operate on the same wavelength. All nodes receive all traffic. After a node receives a packet (o/e and serial to parallel conversion), the header of the packet is decoded to determine its destination address. The packet is discarded if the destination field and the local processor address do not match.
Figure 3: Physical configurations of star-coupled optical multiple access channels. (a) Multiple subchannels via agile sources ($SC_1$), (b) Multiple subchannels via agile receivers ($SC_2$), (c) Multiple subchannels via agile sources, with an m × 1 coupler and a Wavelength Division Demultiplexer in place of a star-coupler ($SC_3$).
Multiple-subchannel star-coupled systems: The two systems considered next are also based on an optical star coupler, but have either agile transmitters or agile receivers. $SC_1$ is illustrated in Figure 3(a) and $SC_2$ in Figure 3(b). The two schemes assume the available optical bandwidth is partitioned into a comb of narrow subchannels, using Wavelength Division Multiple Access (WDMA). WDMA differs from frequency division multiple access in the subchannel spacing. The key to the two approaches is in the agility of either the sources or receivers.
$SC_1$ and $SC_2$ offer the advantage of optical self-routing. Only the traffic destined for a particular node is delivered to that node. An $SC_0$ node must examine all transmitted traffic to identify destined data. The volume of data each node must process is greatly reduced with $SC_1$ and $SC_2$, reducing the electronic bottleneck problem. Assuming uniform destination distribution, a node in $SC_1$ or $SC_2$ processes $1/m$ of the total traffic processed by a node in $SC_0$, due to the subchannels. This characteristic has both performance and system cost implications.
The optical self-routing characteristic is achieved through tunable receivers or transmitters. Tunable transmitters and fixed wavelength receivers are used in $SC_1$. A node $n_i$ has an agreed upon subchannel destination address $s_i$. A subchannel $s_i$ has a distinct wavelength $\lambda_i$, where $\lambda_i \ne \lambda_j$ for all $i$ and $j$ such that $i \ne j$, $0 \le i \le (m-1)$, and $0 \le j \le (m-1)$.
For node $n_j$ to communicate with node $n_i$, $n_j$ tunes its transmitter to $\lambda_i$, and begins the transmission according to the access protocol. The fixed wavelength receiver of $n_i$ (subchannel $s_i$) has an optical filter that blocks all wavelengths except $\lambda_i$. After filtering, the signal is demodulated.
Each node in Figure 3(b) has a fixed (but distinct) wavelength transmitter and a tunable receiver ($SC_2$). After agreeing to accept a packet from station $n_j$, node $n_i$ tunes its receiver filter to $s_j$ and extracts $\lambda_j$. The media access control protocol is more complex than that required by $SC_1$ since a transmitting station must first inform the receiving station to expect the packet. For example, this could be achieved either through a separate channel, through polling, or through a fixed assignment mechanism. This is a method to circumvent the agility delay characteristics of laser diodes: if the agile laser diodes cannot switch sufficiently fast (on the order of ns) and their latency impacts the overall performance, $SC_2$ is attractive since it allows the sources to remain set to one particular wavelength.
A problem with all three of the approaches is collision. Stations in $SC_1$ cannot detect the transmission of an outgoing packet, so overlapping transmissions occur unless a reservation scheme is used. A difference in the three cases is the throughput. Collisions occur in $SC_0$ with any overlapped transmissions, whereas collisions only occur in $SC_1$ and $SC_2$ when the same node is targeted. The star coupler in $SC_1$ and $SC_2$ can be viewed as a non-blocking interconnection network, and $SC_0$ appears as a logical bus.
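The difference in collision behavior can be illustrated with a small slot-level model. This is a simplified sketch, not the performance analysis of the paper: it assumes slotted time, at most one packet per source per slot, and represents each transmission as a hypothetical `(source, destination)` pair.

```python
from collections import Counter

def delivered_sc0(transmissions):
    """Single shared wavelength: any overlap in a slot destroys every packet."""
    return list(transmissions) if len(transmissions) == 1 else []

def delivered_wdma(transmissions):
    """Per-destination subchannels: only packets aimed at the same node collide."""
    hits = Counter(dst for _, dst in transmissions)
    return [(src, dst) for src, dst in transmissions if hits[dst] == 1]

slot = [(0, 1), (2, 3), (4, 3)]   # nodes 2 and 4 both target node 3
print(delivered_sc0(slot))        # nothing survives on the logical bus
print(delivered_wdma(slot))       # the packet for node 1 is unaffected
```

In the WDMA case the two packets for node 3 collide while the packet for node 1 is delivered, which reflects the non-blocking view of the star coupler described above.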
Multiple-subchannel wavelength division multiplexed system: The function of the star coupler is to distribute the incoming optical power uniformly among the output ports. The power distribution, together with the limit of agility of the sources and receivers, restricts the maximum number of processors on the star-coupled systems. The limit of the agility with currently constructed devices (estimated to be about 128 [20,23,24]) and power budget limitations restrict the number to about 128-256 processor nodes [1,2,10,16,17]. See [34] for a review of this technology. The system $SC_3$, illustrated in Figure 3(c), avoids distributing the optical power among all output ports, and optically routes the data to the destination node. In $SC_1$ and $SC_2$, although optical self-routing was also achieved, the optical power was distributed across all nodes, limiting the fanout.
As in the three previous cases, there are no intermediate o/e and e/o conversions or destination decoding with $SC_3$. The data is routed completely in its optical state. As with $SC_1$ and $SC_2$, self-routing is achieved: the only data delivered to a node is destined to that node. However, the optical power is not diluted in $SC_3$, which eases the power budget limitation and increases the fanout. All power other than insertion and coupling losses is routed to the appropriate destination node. This is achieved through wavelength division multiplexing/demultiplexing. Note that the wavelength-division demultiplexer is a passive component, as is the m × 1 coupler. This has fault tolerance implications since a faulty transmitter or receiver isolates only the local node.
The operation of the m × 1 coupler is to funnel all incoming traffic onto the same waveguide, and deliver it to the WDM demultiplexer. Only transmissions simultaneously targeted to the same node collide, and all other traffic continues to pass unaffected by the collision.
A feature of this approach is the reduction in importance of topology. Much effort during the past decade was placed in the examination of dynamic and static topologies [3,4,5,8,12,21,26-30,33]. In the four cases under discussion, the system size is not bound by the traffic generated at each node. The principal limitations are the agility of the optical sources in $SC_1$ and $SC_3$; the agility of the optical detectors in $SC_2$; and the optical power budget in all four cases, but to a lesser extent in $SC_3$. $SC_3$ is an approach that raises the upper limit of attachable processors imposed by power budget considerations.
3) Subchannel Access Control Protocols: The structures have two levels of access protocols. The first is
wavelength division multiple access (WDMA), which partitions the enormous bandwidth of a single channel
into multiple usable subchannels. The object of this is to exploit the self-routing characteristic between source
and destination along a single optical channel. A receiver subsystem processes only traffic destined to it, or
traffic that it must forward along another dimension of the hypercube. Since the volume of traffic processed is
reduced, the design requirements on this subsystem are eased. Essentially, this is an approach to obtain the high
capacity characteristics of optical communication, and circumvent the speed mismatch between electronic and
optical components.
A second advantage of WDMA is that although a system results with a low degree, a large number of nodes
are at a unity distance. For example, with a degree $k$ (a $k$-dimensional structure) with $m$ processors along each
dimension, a node has $k(m-1)$ neighbors at a distance of one. As illustrated in later sections, this achieves
a small diameter and average distance (relative to other interconnection topologies), yet achieves a low system
cost and complexity.
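The neighbor count can be checked by enumeration. The following is a minimal sketch, assuming nodes are addressed as $k$-digit tuples with digits in $0 \ldots m-1$; the function name `unit_distance_neighbors` is illustrative and not from the paper.

```python
from itertools import product

def unit_distance_neighbors(node, m):
    """Return all nodes that differ from `node` in exactly one digit.

    Each such node shares one multiple access channel with `node`,
    so it is reachable in a single hop.
    """
    neighbors = []
    for axis in range(len(node)):
        for value in range(m):
            if value != node[axis]:
                neighbors.append(node[:axis] + (value,) + node[axis + 1:])
    return neighbors

# Example values (assumed for illustration): k = 3 dimensions, m = 4 nodes per dimension.
k, m = 3, 4
for node in product(range(m), repeat=k):
    # Every node should have k(m - 1) neighbors at distance one.
    assert len(unit_distance_neighbors(node, m)) == k * (m - 1)
```

Each node uses only $k$ ports here, one per dimension, yet reaches $k(m-1)$ nodes directly.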
A second level media access protocol provides access in the time domain along each subchannel. This section
examines a few of the requirements on such a protocol and examines the behavior of Slotted ALOHA and Time
Division Multiple Access (TDMA).
A system must either allow collisions and rely on a positive acknowledgment scheme, or avoid collisions through
some reservation or fixed allocation scheme. Consider the case where each node has a reserved receiver subchannel
(SC$_1$). It is possible to have a reservation period for each round on each subchannel to request access.
A round could be constructed in two phases: (1) an initial request (reservation) phase, and (2) data transmission
from each requesting node.
The difficulty with this approach is the synchronization with other subchannels. If the intervals are not
of fixed duration (fixed round length), a node would need to be able to detect the initiation of a new reservation
cycle for each subchannel. However, a node can only receive data along its own receiver subchannel.
A possible solution is to have a TDMA subslot for reservation, where the destined node will ACK if the request is
successful. If the source node detects no response from the target node, it inhibits transmission according to a predetermined
algorithm. A second possibility is to allow collisions of the data traffic, and ACK when successful. The second
approach could be slotted so that corruption of a packet in midstream will not occur.
The critical resources to be optimized by multiple access protocols have shifted emphasis in this environment. In
the past, bandwidth was the critical resource and protocols were designed to maximize its utilization. That issue
is less of a constraint today due to the large available bandwidth. With the proposed approach, bandwidth is
not the limiting factor: it is the mismatch of speeds with the interface electronics. The objective pursued in this
paper is to obtain good performance (in terms of latency and throughput) while keeping the interface electronics
(communication subsystem) within their bounds of achievable speed. The self-routing characteristic of WDMA
aids this effort, since a node does not have to examine every packet transmitted along a channel.
a) Random Media Access Protocol: A possible protocol for SC$_1$ is described next. A slot is constructed
of two phases: the data transmission and acknowledgment subslots. A source node transmits a packet to
the destination (target) node during the data transmission subslot, and the destination node transmits an
acknowledgment to the source node during the ACK subslot. Since a source node cannot simultaneously
transmit to more than one destination node during a slot, the ACK subslot is collisionless. The ACK subslot is
composed of the time the receiver needs to verify the CRC, decode the source address, tune its agile transmitter,
and transmit the ACK.
Slotted ALOHA (with an elongated slot for the ACK transmission) is a possible choice for the access control
mechanism. A node assumes a collision has occurred if an ACK is not received. Note that CSMA is not possible
with SC$_1$ through SC$_3$ because the medium cannot be sensed, since each node receives only a particular subchannel.
The expected number of collisions is determined next when Slotted ALOHA is implemented with a total of
$m$ stations. Assume that the source targets are uniformly distributed: for backlogged station $n_i$,
$$ p(i,j) = \begin{cases} \frac{1}{m-1} & i \neq j \\ 0 & i = j \end{cases} $$
where $p(i,j)$ is the probability that node $n_i$ will transmit to $n_j$.
Let $Y$ denote the number of backlogged nodes. Consider a heavily backlogged system of $Y = m-1$ and $m = S$.
The probability that there is no transmission to node $n_j$ is $p_j^{(0)} = \left(1 - \frac{1}{m-1}\right)^{m-1}$, and the probability of
a single transmission is $p_j^{(1)} = \left(1 - \frac{1}{m-1}\right)^{m-2}$. In general, the probability that, of $Y$ backlogged stations, $y$
packets are targeted to station $n_j$, $y \le Y$, is $p_j^{(y)} = \binom{Y}{y} p^y (1-p)^{Y-y}$, with $p = \frac{1}{m-1}$. Define the probability of collision as
$p_j^{(c)} = \sum_{y=2}^{m-1} p_j^{(y)}$, so $p_j^{(c)} = 1 - \left(\frac{2m-3}{m-1}\right)\left(1 - \frac{1}{m-1}\right)^{m-2}$ in this example. Table 2(a) lists the probabilities for
$m \in \{8, 16, 32, 64\}$.
Suppose the backlog was $Y$ packets. Then $p_j^{(0)} = \left(1 - \frac{1}{m-1}\right)^{Y}$, $p_j^{(1)} = Y\left(\frac{1}{m-1}\right)\left(1 - \frac{1}{m-1}\right)^{Y-1}$, and
$p_j^{(c)} = 1 - \left(1 + \frac{Y-1}{m-1}\right)\left(1 - \frac{1}{m-1}\right)^{Y-1}$. Table 2(b) lists the probabilities with $m = 64$ for differing values of $Y$. Table
2(b) shows that 97% of transmissions are successful in the case of $Y = 4$, 91% when $Y = 8$, 80% when $Y = 16$,
and 62% when $Y = 32$.
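The binomial model above can be evaluated numerically. The following is a minimal sketch, assuming each of the $Y$ backlogged stations independently targets node $n_j$ with probability $p = 1/(m-1)$; the function names are illustrative, and the sketch is not claimed to reproduce the entries of Table 2.

```python
from math import comb

def target_probabilities(m, Y):
    """Return (p0, p1, pc): probabilities of zero, exactly one, and two or
    more of Y backlogged stations transmitting to a given node."""
    p = 1.0 / (m - 1)

    def p_y(y):
        # Binomial probability that exactly y of the Y stations pick this node.
        return comb(Y, y) * p**y * (1 - p) ** (Y - y)

    p0 = p_y(0)
    p1 = p_y(1)
    pc = sum(p_y(y) for y in range(2, Y + 1))
    return p0, p1, pc

def collision_closed_form(m, Y):
    """Closed form 1 - (1 + (Y-1)/(m-1)) (1 - 1/(m-1))^(Y-1)."""
    p = 1.0 / (m - 1)
    return 1 - (1 + (Y - 1) * p) * (1 - p) ** (Y - 1)

for Y in (4, 8, 16, 32):
    p0, p1, pc = target_probabilities(64, Y)
    print(Y, round(p0, 4), round(p1, 4), round(pc, 4))
```

Summing the binomial terms and evaluating the closed form give the same collision probability, which is a useful check on the reconstruction of $p_j^{(c)}$.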
Denote the propagation delay from station $n_i$ to the star coupler as $\tau_i$, and $\tau = 2\max\{\tau_i;\ 1 \le i \le m\}$. $B$
denotes the data rate of the subchannel (bits/sec) and $P$ the packet size (including preamble and header), so the
transmission time of a packet is $T_p = P/B$. The total time of a slot is the transmission time of the packet, the
ACK ($T_a$), propagation delays ($2\tau$), and the additional overhead of o/e conversion, address decode, checksum
computation and laser diode agility delay ($\delta_p + \delta_a$). The total slot time is $T_s = T_p + T_a + 2\tau + \delta_p + \delta_a$.
The receiving station actually does not have to construct an ACK packet, but must transmit a burst at the
wavelength of the source for detection. Note that the time to reconfigure a laser diode is on the same order as
the propagation delay. In the case of positive acknowledgments, the total throughput would be $\frac{m\,p_j^{(1)}}{T_s}$ packets
per second, and the subchannel utilization is $U_j = \left(\frac{T_p}{T_s}\right) p_j^{(1)}$.
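The slot budget and the resulting utilization can be sketched as follows. This is only an illustration: the argument names `tau`, `delta_p` and `delta_a` stand for the propagation and processing overhead terms described above, and the numeric values are arbitrary placeholders rather than measurements.

```python
def slot_time(T_p, T_a, tau, delta_p, delta_a):
    """Total slot time: packet, ACK, propagation (2*tau) and processing overhead."""
    return T_p + T_a + 2 * tau + delta_p + delta_a

def subchannel_utilization(T_p, T_s, p1):
    """Fraction of a slot carrying a successfully received packet."""
    return (T_p / T_s) * p1

def total_throughput(m, p1, T_s):
    """Packets per second over all m subchannels with positive acknowledgments."""
    return m * p1 / T_s

# Placeholder values, normalized so that the packet time is 1.
T_s = slot_time(T_p=1.0, T_a=0.1, tau=0.2, delta_p=0.05, delta_a=0.05)
print(T_s, subchannel_utilization(1.0, T_s, p1=0.5), total_throughput(16, 0.5, T_s))
```

The sketch makes the overhead cost visible: any time in the slot beyond $T_p$ lowers utilization even when every slot carries exactly one packet.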
b) Fixed Media Access Protocol: A fixed allocation scheme is examined because of its implementational
simplicity. There is no need for an acknowledgment mechanism, since it is collisionless, and has a bounded
service time. Its primary weakness is its behavior with bursty traffic, and its primary strength is its excellent
performance with regular and heavy traffic.
In the following discussion, only SC$_0$ and SC$_1$ are considered, both systems with $m$ nodes. As before, packets
are a fixed length of $P$ bits, and a subchannel $\lambda_i$ has a data rate of $B$ bits/sec. The total data rate of an
individual node in SC$_1$ is still $B$ bits/sec, since simultaneous subchannel transmission is not possible. The
packet transmission time for both cases is $T_p = P/B$, and the arrival process from each node is assumed Poisson
with parameter $\lambda_o$. Note that this model does not block the generation of additional traffic when a packet is
queued. This approach is taken since the model is for distributed memory parallel computer systems, where a
processor does not wait until a response is returned but either continues with the current process or context
switches to another process.
The delay comprises three components: the queueing delay, the transmission time, and the user slot
synchronization [14,22]. The average delay for the M/D/1 queueing model of TDMA is
$$ t = \frac{\rho^2/\lambda}{2(1-\rho)} + T_p + \frac{m}{2} T_p \qquad (5) $$
where the three terms represent the three components described above, $\rho = m \lambda T_p$, and $\lambda$ is the channel arrival
rate generated per node.
The average delay of SC$_0$ is
$$ t_0 = T_p \left[\frac{m}{2} + 1\right] + \frac{\lambda_o m^2 T_p^2}{2[1 - \lambda_o m T_p]} \qquad (6) $$
since $\lambda = \lambda_o$, where $\lambda_o$ denotes the total arrival rate generated by each node.
Assume there are $S = m$ subchannels with SC$_1$, each representing the path to a particular node. A packet
transmission along wavelength $\lambda_i$ is destined to node $n_i$, $0 \le i \le m-1$. Uniform traffic is assumed, so the
rate at which node $n_j$ generates traffic for $n_i$ is $\lambda = \lambda_o/(m-1)$, where $i \neq j$ and $\lambda_o$ is the total traffic generated
by a node.
Node $n_j$ must maintain $m-1$ separate buffers, one for each of the possible destination nodes. This additional
complexity, as compared with SC$_0$ which requires a single, longer buffer, is the price for the additional
performance demonstrated. However, the additional complexity in the transmission subsection of a node may be
offset by the reduction in complexity in the receiver, since only data destined to a node is routed to it, reducing
the volume of data a receiver must process.
Figure 4: Time-space diagram of TDMA with SC$_1$ and $N = S$.
TDMA for SC$_1$ in this example is illustrated in Figure 4, which shows the behavior in each of the subchannels
and subslots, where $T_r$ is the time of a round, and $T_s$ is the time of a slot. In this case, $T_s = T_p$. In general,
the $S$ subchannels (this assumes that $m = S$) are divided into $m-1$ slots per round, so the total length of a
round is $T_r = T_p(m-1)$. Determining the slot which is assigned to a particular source-destination pair is simple
and decentralized. Consider a node $n_j$, $0 \le j \le m-1$. Figure 4 illustrates an approach where $n_j$ transmits to
$n_{j+1}$ during slot 1 and to $n_{j+2}$ during slot 2. In general, node $n_j$ transmits to $n_{(j+i)\,\mathrm{Mod}\,(m)}$ during time-slot $i$ for all
$1 \le i \le m-1$.
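The decentralized slot assignment can be sketched in a few lines. This is a minimal illustration, assuming slots within a round are numbered from 1 to $m-1$ and nodes from 0 to $m-1$; the function names are hypothetical.

```python
def tdma_target(j, slot, m):
    """Destination node (and hence subchannel) used by node j in a given slot."""
    return (j + slot) % m

def round_schedule(m):
    """Map each slot 1..m-1 to the list of targets, indexed by source node."""
    return {slot: [tdma_target(j, slot, m) for j in range(m)]
            for slot in range(1, m)}

m = 8
for slot, targets in round_schedule(m).items():
    # Each subchannel carries exactly one transmission per slot (collision free).
    assert sorted(targets) == list(range(m))
    # No node is ever scheduled to transmit to itself.
    assert all(target != source for source, target in enumerate(targets))
```

Because every node computes its own target from its index and the slot number alone, no central scheduler or reservation exchange is needed.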
The delay a packet incurs is similar to Equation (5), but with the modified arrival rate. The average delay with
SC$_1$, including the queueing delay for a particular subchannel, is
$$ t_1 = \frac{1}{2} T_p [m+1] + \frac{\lambda_o T_p^2 (m-1)}{2[1 - \lambda_o T_p]} \qquad (7) $$
Figure 5: Average delay versus arrival rate for SC$_0$ and SC$_1$ with TDMA for $N \in \{16, 64\}$.
Figure 5 plots the average delay of SC$_0$ and SC$_1$ with a normalized service time ($T_p = 1$), for $m = 16$ and
$m = 64$. The packet arrival rate is the per node probability of generating a new packet during a time slot. This
graph illustrates the performance improvement achievable through agile laser diode sources in SC$_1$: a much
improved traffic saturation point. Though the aggregate data rates of the individual communication links are
identical in both cases, the delay of SC$_1$ is much less and has a much greater capacity before saturation than
SC$_0$ because of the availability of multiple subchannels.
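The two delay expressions, Equations (6) and (7), can be compared directly. The following is a minimal sketch with $T_p$ normalized to 1 by default; returning infinity at or beyond saturation is a modeling convenience assumed here, not something stated in the paper.

```python
def tdma_delay_sc0(lam_o, m, T_p=1.0):
    """Equation (6): average TDMA delay with a single shared channel (SC_0)."""
    rho = lam_o * m * T_p          # one channel carries the traffic of all m nodes
    if rho >= 1.0:
        return float("inf")        # saturated
    return T_p * (m / 2 + 1) + lam_o * m**2 * T_p**2 / (2 * (1 - rho))

def tdma_delay_sc1(lam_o, m, T_p=1.0):
    """Equation (7): average TDMA delay with one subchannel per destination (SC_1)."""
    rho = lam_o * T_p              # per-pair rate lam_o/(m-1) over m-1 slots
    if rho >= 1.0:
        return float("inf")
    return 0.5 * T_p * (m + 1) + lam_o * T_p**2 * (m - 1) / (2 * (1 - rho))

for lam_o in (0.01, 0.05, 0.5):
    print(lam_o, tdma_delay_sc0(lam_o, 16), tdma_delay_sc1(lam_o, 16))
```

With $m = 16$, the SC$_0$ expression saturates once $\lambda_o$ reaches $1/16$, while the SC$_1$ expression remains finite for any $\lambda_o < 1$, which is the behavior described for Figure 5.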
Figure 6: Simulation of SC$_1$ with Slotted Aloha and TDMA as the media access control protocol for $N \in \{16, 64, 256\}$. Average
delay versus new packet generation probability (per node per slot).
Figure 6 graphs the results of a simulation study of SC$_1$ with Slotted Aloha and TDMA as the media access
control protocol, showing the average delay incurred as a function of the packet generation probability. Three
system sizes were considered: $m \in \{16, 32, 64\}$. The delay is plotted as a function of individual packet generation
rates. The Slotted Aloha protocol was implemented as described in the previous section; however, a feedback
mechanism was added to increase channel stability. This feedback mechanism modifies a transmission probability
(TP). TP denotes the probability that a backlogged station will transmit during a particular slot. The TP was
decreased upon collision detection (absence of an ACK), and increased upon successful transmission. The graph
reaffirms the intuitively expected result: Slotted Aloha performs well with light traffic, but saturates rapidly.
The TDMA approach can support a higher traffic level, although with greater average delay under lighter traffic.
The following section compares the WMCH with other interconnection schemes. Performance and complexity
issues are examined. In particular, the variation of the metrics is studied with increases in system size.
III. Behavior Comparison
Two system parameters vary when comparing a performance characteristic between hypercube structures with
increases in system size: the dimension, and the number of nodes along each dimension (width). The BC is a
special case that may only vary in dimension. Figures of merit with variations in both m and r are compared
in the following section. Equal sized networks are assumed in the comparison.
Figure 7: Topologies for comparison purposes. (a) boolean cube, $N = 2^4$; (b) nearest neighbor mesh hypercube, $N = 4 \times 3 \times 2$; (c)
generalized hypercube, $N = 4 \times 3 \times 2$.
A. Structure Definitions
The systems for comparison are defined in this section: the boolean cube (BC), nearest-neighbor mesh hypercube
(NNMH), and the generalized hypercube (GHC). A comparison with the WMCH then follows.
1) Boolean Cube: This system has $N = 2^r$ processors located on the vertices of an $r$-dimensional cube, since
$m_i = 2$ for all $1 \le i \le r$. Each node is represented as an $r$-tuple $(p_r \ldots p_1)$, where $p_i \in \{0, 1\}$ for all $1 \le i \le r$.
A node $(p_r \ldots p_1)$ has a point-to-point communication link to $(p_r \ldots p_{i+1}\,[(p_i + 1)\,\mathrm{MOD}\,(2)]\,p_{i-1} \ldots p_1)$ for all $i$,
$1 \le i \le r$. Figure 7(a) illustrates a BC when $r = 4$.
2) Nearest Neighbor Mesh Hypercube: Nodes of a Nearest Neighbor Mesh Hypercube (NNMH) are arranged
into an $r$-dimensional hypercube with $m_i$ nodes along dimension $i$. A node is connected
to its two nearest neighbors in each dimension. Node $(p_r \ldots p_1)$, where $p_i \in \{0, 1, \ldots, m_i - 1\}$, is connected
to $(p_r \ldots p_{i+1}\,[(p_i - 1)\,\mathrm{MOD}\,(m_i)]\,p_{i-1} \ldots p_1)$ and $(p_r \ldots p_{i+1}\,[(p_i + 1)\,\mathrm{MOD}\,(m_i)]\,p_{i-1} \ldots p_1)$ for all $i$, $1 \le i \le r$.
Figure 7(b) illustrates an example of this network when $r = 3$ and $N = 4 \times 3 \times 2$. When $m_i = m$ for all
$1 \le i \le r$, the total number of processors is $N = m^r$.
3) Generalized Hypercube: A processor $(p_r \ldots p_1)$ of the Generalized Hypercube (GHC) is connected to all
processors that share its common dimensional axes, $(p_r \ldots p_{i+1}\,[(p_i + x)\,\mathrm{MOD}\,(m_i)]\,p_{i-1} \ldots p_1)$ for all $1 \le i \le r$,
and $1 \le x \le (m_i - 1)$.
A GHC structure consists of $r$ dimensions with $m_i$ nodes along the $i^{th}$ dimension. A node in a particular axis
has a direct point-to-point connection to all nodes along the same axis [3]. It has been determined in [3] that
an optimum structure occurs when $m_i = m = N^{1/r}$, for all $1 \le i \le r$. The total number of nodes will then be
$N = m^r$. An $N = 4 \times 3 \times 2$ network is illustrated in Figure 7(c).
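The three connection rules translate directly into neighbor-generating functions. The following is a minimal sketch, assuming node addresses are Python tuples and `m` is a tuple holding the width of each dimension; the function names are illustrative. Duplicate neighbors in the NNMH (which arise when a dimension has only two nodes) are merged.

```python
def bc_neighbors(p):
    """Boolean cube: flip one binary digit at a time."""
    return [p[:i] + ((p[i] + 1) % 2,) + p[i + 1:] for i in range(len(p))]

def nnmh_neighbors(p, m):
    """Nearest neighbor mesh hypercube: step -1 or +1 (mod m_i) in one dimension."""
    found = set()
    for i, m_i in enumerate(m):
        for step in (-1, 1):
            q = p[:i] + ((p[i] + step) % m_i,) + p[i + 1:]
            if q != p:
                found.add(q)
    return sorted(found)

def ghc_neighbors(p, m):
    """Generalized hypercube: every other node along each dimensional axis."""
    return [p[:i] + ((p[i] + x) % m_i,) + p[i + 1:]
            for i, m_i in enumerate(m)
            for x in range(1, m_i)]

# The N = 4 x 3 x 2 example of Figure 7(b) and 7(c), and the r = 4 boolean cube.
print(len(bc_neighbors((0, 0, 0, 0))))
print(len(nnmh_neighbors((0, 0, 0), (4, 3, 2))))
print(len(ghc_neighbors((0, 0, 0), (4, 3, 2))))
```

The WMCH reaches the same neighbor set as `ghc_neighbors`, but through one multiple access channel per dimension rather than one point-to-point link per neighbor.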
Figure 8: Variation in degree with increased system size. (1) Expansion via $r$, with constant $m = 8$; (2) expansion via $m$, with
constant $r = 3$.
B. Structure Comparison
In addition to typical topology characteristics (such as degree, diameter, and average distance), other metrics
of comparison are used in the following section to determine the advantages and disadvantages of the BC, GHC,
NNMH, WMCH. In particular, a model for the system complexity is introduced, followed by a description of
marginal analysis.
1) Degree: There usually is a trade-off between degree and diameter in interconnection networks. An
increase in degree, $k$, tends to reduce the diameter, hence increasing performance at an increased node complexity.
For a network where each node has a large degree, the cost of the I/O ports becomes the dominating cost of the
network. The following sections show that using multiple access channels, as in the WMCH, allows for both a
low degree and a small diameter.
The BC requires a port for each of the $r$ dimensions of the cube, hence $k = r$. The NNMH has a degree of
$k = 2r$, with two links to each of its nearest neighbors in all the $r$ dimensions. While this may seem greater
than the degree of a BC, the NNMH may have a lower degree for equal sized networks, as illustrated in Figure
8. The GHC requires a degree of $m_i - 1$ for each of its $r$ dimensions. When $m_i = m$ for all $1 \le i \le r$, the
degree is $k = r(m-1)$. This large degree is due to the topology definition of a unity distance connection
between all nodes sharing a common dimensional axis. The WMCH achieves a unity distance connection with
all dimensional axes neighbors with a degree of $k = r$.
The degree is not constant for the BC and GHC, implying that network expansion requires modification to
existing nodes. Expansion without modification is possible with the NNMH and WMCH through an expansion
of $m$.
Figure 8 (and the graphs to follow) plots the increase in degree for increases in $N$ under two cases: (1) expansion
via $r$, with constant $m = 8$; and (2) expansion via $m$, with constant $r = 3$. Due to the large capacity and high
fanout, the WMCH degree does not exceed 2 in this example. In the range under consideration ($8 \le N \le 65536$),
the WMCH has the form:
$$ N = \begin{cases} m & \text{if } N \le 256 \\ m^2 & \text{if } N > 256 \end{cases} \qquad (8) $$
assuming a maximum fanout of 256. It is possible to force the NNMH to remain a 2-dimensional structure, thereby
keeping the degree low. However, the resulting performance (diameter and average distance) is degraded with
increases in system size. The format of the WMCH defined in Equation (8) is used throughout the remainder
of the paper.
Through case (1), all schemes vary as $\log(N)$, but with scaling factors of $\frac{1}{\log(2)}$, $\frac{2}{\log(m)}$ and $\frac{(m-1)}{\log(m)}$ for the
BC, NNMH and GHC, respectively. Through case (2), the degree of the NNMH remains constant while the
degree of the GHC varies as $(N^{1/r} - 1)r$. The WMCH degree shows the least sensitivity to increases in $N$.
Figure 9: Variation in diameter with increased system size. (1) Expansion via $r$, with constant $m = 8$; (2) expansion via $m$, with
constant $r = 3$.
2) Diameter: The distance between nodes in the BC, the GHC, and the WMCH is the Hamming distance
between the source and destination addresses, since the distance between all nodes sharing a common
dimensional axis is unity. Since there can be at most $r$ differing digits, $r$ is the diameter of the networks. For the
NNMH, a packet traverses a maximum of $\lceil \frac{m}{2} \rceil$ links per dimension, so the maximum distance in this network
is $\lceil \frac{m}{2} \rceil r$.
Through case (1), the diameter of the BC and WMCH is equal to the degree of the structure and varies as
described in the previous section. This illustrates the sensitivity of the BC in both degree and diameter with
increased $N$. The diameter of the NNMH varies as $\log(N)$, with a scaling factor of $\frac{m}{2\log(m)}$ (Figure 9).
The NNMH diameter is sensitive to increases in $m$. Figure 9 shows the diameter of the NNMH to vary linearly
with $m$, or as $\frac{rN^{1/r}}{2}$. The diameter
of the GHC and WMCH remains constant at $r$ with increases in $m$.
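The degree and diameter expressions for the four structures can be tabulated together. The following is a minimal sketch for the uniform case $m_i = m$; the function name and the string labels are illustrative, and the BC entry assumes $m = 2$.

```python
from math import ceil

def degree_and_diameter(topology, m, r):
    """Return (degree k, diameter) for an r-dimensional structure of width m."""
    if topology == "BC":        # boolean cube, defined only for m = 2
        return r, r
    if topology == "NNMH":      # two ports per dimension, at most ceil(m/2) hops each
        return 2 * r, ceil(m / 2) * r
    if topology == "GHC":       # a dedicated link to each axis neighbor
        return r * (m - 1), r
    if topology == "WMCH":      # one multiple access channel per dimension
        return r, r
    raise ValueError(f"unknown topology: {topology}")

# Case (2) of the figures: r = 3, growing m.
for m in (4, 8, 16):
    print(m, [degree_and_diameter(t, m, 3) for t in ("NNMH", "GHC", "WMCH")])
```

The printout shows the trade-off discussed above: the GHC keeps the diameter at $r$ at the cost of a degree growing with $m$, the NNMH keeps the degree fixed while its diameter grows with $m$, and the WMCH holds both at $r$.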
3) Average Distance: The average distance is useful since it provides a measure of the expected packet
delay. Let $K_i$ represent the number of nodes at a distance $i$. The average distance $d$ is defined as:
$$ d = \frac{1}{N-1} \sum_{i=1}^{r} i K_i \qquad (9) $$
As derived in [3,8], the number of nodes at a given distance for the BC, WMCH, and the GHC is:
$$ K_i = \binom{r}{i} (m-1)^i \qquad (10) $$
since for $i$ differing digits, each able to vary in $(m-1)$ ways, the enumeration is $(m-1)^i$. The combination of
$i$ differing digits of $r$ leads to Equation (10).
The GHC and the WMCH have an average distance of $\frac{(m-1)\,r\,m^{r-1}}{m^r - 1}$, which approaches $\frac{r(m-1)}{m}$ for large
networks [3,8]. For the BC, $m = 2$ and the average distance is $\frac{rN}{2(N-1)}$, or approximately $\frac{r}{2}$ when $N$ is large.
The NNMH has an average distance of
$$ d = \frac{r}{m} \sum_{i=0}^{m-1} \min(i, m-i) \qquad (11) $$
which is $mr/4$ when $m$ is even, or $\left(\frac{m}{4} - \frac{1}{4m}\right) r$ when $m$ is odd, giving a total average distance of approximately
$\frac{mr}{4}$.
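The closed forms above can be checked against the defining sums. The following is a minimal sketch; the function names are illustrative, and the NNMH sum uses the wraparound distance $\min(i, m-i)$ along each ring, which is the form that yields the stated values $mr/4$ ($m$ even) and $(m/4 - 1/(4m))r$ ($m$ odd).

```python
from math import comb

def avg_distance_hamming(m, r):
    """Equation (9) with K_i = C(r, i) (m-1)^i, for the BC (m = 2), GHC and WMCH."""
    N = m**r
    return sum(i * comb(r, i) * (m - 1) ** i for i in range(1, r + 1)) / (N - 1)

def avg_distance_closed(m, r):
    """Closed form (m-1) r m^(r-1) / (m^r - 1)."""
    return (m - 1) * r * m ** (r - 1) / (m**r - 1)

def avg_distance_nnmh(m, r):
    """Equation (11), summing ring distances min(i, m - i) in each dimension."""
    return (r / m) * sum(min(i, m - i) for i in range(m))

print(avg_distance_hamming(8, 3), avg_distance_closed(8, 3))
print(avg_distance_nnmh(8, 3), avg_distance_nnmh(5, 2))
```

For $m = 8$ and $r = 3$ the Hamming-distance structures stay below $r = 3$ hops on average, while the NNMH average of $mr/4$ grows linearly with $m$.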
Figure 10: Variation in average distance with increased system size. (1) Expansion via $r$, with constant $m = 8$; (2) expansion via
$m$, with constant $r = 3$.
Figure 10 shows the variation in average distance with increased N . The BC, NNMH, WMCH, and the GHC
vary linearly with r. With variations in m, the NNMH shows the greatest sensitivity, varying linearly with m,
while the GHC and WMCH are approximately constant.
Equation (9) assumes that the packets are evenly distributed throughout the network. An operating system
may try to place computationally interrelated activity in nodes close in distance to minimize the packet delay.
Suppose a task is optimally mapped into a locality of $\partial$ nodes. Denote the maximum distance of this set of
nodes as $d_\partial$, such that $\partial \le \sum_{i=1}^{d_\partial} K_i$ and $\partial > \sum_{i=1}^{d_\partial - 1} K_i$.
Denoting by $\phi$ the probability of a node communicating with nodes at a distance less than or equal to $d_\partial$, the average
distance is given by:
$$ d = \left[\frac{\phi}{\sum_{j=1}^{d_\partial} K_j}\right] \sum_{i=1}^{d_\partial} i K_i + \left[\frac{1-\phi}{N - \sum_{j=1}^{d_\partial} K_j}\right] \sum_{i=d_\partial+1}^{r} i K_i \qquad (12) $$
With a small locality, the graph of Figure 10 may provide a pessimistic assessment of average distance for the
BC, NNMH and GHC. For example, many algorithms have been developed for the NNMH which only require
communication between neighbors. A very low average distance results, independent of system size.
4) Traffic Density: The average traffic density provides a measure of the potential packet delay by illustrating
how much of the total packet traffic each link must support. The average traffic density is defined as the
product of the average distance and the total number of nodes, divided by $L$, the total number of communication
links.
The traffic densities of the topologies under consideration are:
$$ \rho_{BC} = \left(\frac{N}{N-1}\right) \qquad (13) $$
$$ \rho_{NNMH} = \frac{m}{4} \qquad (14) $$
$$ \rho_{GHC} = \frac{2}{m}\left(\frac{N}{N-1}\right) \qquad (15) $$
$$ \rho_{WMCH} = \frac{(m-1)}{S}\left(\frac{N}{N-1}\right) \qquad (16) $$
since the BC has $L = r2^{r-1}$ communication channels, the NNMH has $L = rN$ links, and the GHC has $L = \frac{r(m-1)N}{2}$
links.
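Equations (13) to (15) follow from the definition (average distance times $N$, divided by $L$) and can be recomputed from the link counts. The following is a minimal sketch; the function names are illustrative, the NNMH uses the approximate average distance $mr/4$, and the WMCH entry simply evaluates Equation (16) because its channel count is not restated in this passage.

```python
def density_bc(r):
    """Boolean cube: d = rN / (2(N-1)), L = r 2^(r-1)."""
    N = 2**r
    d = r * N / (2 * (N - 1))
    L = r * 2 ** (r - 1)
    return d * N / L

def density_nnmh(m, r):
    """NNMH: d ~ mr/4, L = rN."""
    N = m**r
    return (m * r / 4) * N / (r * N)

def density_ghc(m, r):
    """GHC: d = (m-1) r m^(r-1) / (N-1), L = r (m-1) N / 2."""
    N = m**r
    d = (m - 1) * r * m ** (r - 1) / (N - 1)
    L = r * (m - 1) * N / 2
    return d * N / L

def density_wmch(m, r, S):
    """WMCH: Equation (16) evaluated directly, with S subchannels per channel."""
    N = m**r
    return (m - 1) / S * N / (N - 1)

print(density_bc(9), density_nnmh(8, 3), density_ghc(8, 3), density_wmch(8, 3, 8))
```

With $S = m$ the WMCH value is $(m-1)/m \cdot N/(N-1)$, slightly below the BC value of $N/(N-1)$.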
Figure 11: Variation in traffic density with increased system size. (1) Expansion via $r$, with constant $m = 8$; (2) expansion via $m$,
with constant $r = 3$.
Figure 11 plots the traffic density with increases in system size. The traffic density of the BC is insensitive to
variations in network size, and approaches 1 for large networks. The GHC has an extremely low traffic density,
implying that the large degree required for a low diameter has created a large number of channels with low channel
utilization. The WMCH has a traffic density that approaches that of the BC when $S = m$.
5) Fault Tolerance: Two forms of connectivity are considered: node and link connectivity. Node (link)
connectivity is defined as the minimum number of faulty nodes (links) required to disconnect the network.
Due to the regularity of the hypercube structures, node connectivity and link connectivity are both equal to
the degree of the structure. This is not true for the WMCH: a structure of degree $r$ has a node connectivity of
$r(m-1)$.
There are $d$ disjoint paths of equal length between two nodes at a distance $d$, $d \le r$, for the BC, WMCH, NNMH,
and GHC. Although the degree of the node bounds the number of minimum length disjoint paths between source
and destination nodes, node connectivity limits the total number of paths available for packet routing where
the path may not be minimum length.
The link connectivity for the hypercube structures is identical to the node connectivity, except for the WMCH.
Assuming the worst case situation for the WMCH, where a disruption in the channel causes a total loss of use,
the link connectivity is $r$. In general, this is unlikely since the channel is a passive component.
Figure 12: Node model with three forms of internal node interconnection: (a) $I(k) = k + 1$; (b) $I(k) = (k + 1)\log_2(k + 1)$; (c)
$I(k) = (k + 1)^2$.
6) Complexity: A function that provides a measure of network complexity is proposed in order to study
its behavior with increased network size. Figure 12 illustrates a model of a node, with three forms of internal
interconnection. Each node consists of three major components: the node processor (np), communication
processor interface (cp), and the local node interconnection (ni). The system cost is obtained by including the
cost of the communication channels (cc). The following assumes a system size of $N$ nodes at a degree of $k$.
The cost of the node processor is neutral since equal sized networks are used in the comparison. However, the
term $c_{np}N$ is included in Equation (18) to represent the cost of $N$ node processors.
A communication processor is responsible for capturing and buffering packets during transmission and reception,
providing media access control and logical link control (OSI levels 1-2). The complexity reflects the hardware
required to interface the communication channels with the local interconnection. Each node has $k$ ports to
interface, so the complexity is $O(k)$, and the complexity is modeled as $k c_{cp} N$.
A performance/complexity trade-off is present within the local interconnection. As shown in [7], if $Q$ represents
an average latency delay per hop in routing packets in a network, the average packet delay increases by $(d+1)Q$,
where $d$ is the average distance. This may become a significant factor in packet delay, as demonstrated by the
initial models of the Intel iPSC. The structure of the local interconnection at each node can be enhanced to
reduce this delay. Figure 12 illustrates possible implementations of the internal interconnection. For example,
the local interconnections could be constructed as a dynamic multistage interconnection network, as illustrated
in Figure 12(b), and the complexity could be taken to be proportional to the number of $2 \times 2$ switching elements.
The local interconnection cost is modeled as $c_{ni} I(k) N$, where $I(k)$ represents the complexity of the local
interconnection depending on its implementation. Three forms of $I(k)$ are considered:
$$ I(k) = \begin{cases} k + 1 & (17.1) \\ (k + 1)\log_2(k + 1) & (17.2) \\ (k + 1)^2 & (17.3) \end{cases} \qquad (17) $$
Equation (17.1) is the complexity of polling through a central controller as shown in Figure 12(a). Equation
(17.2) could be achieved by multistage switching elements as illustrated by Figure 12(b), and Equation (17.3)
could be achieved through a crossbar switch as shown in Figure 12(c).
Figure 12(a), modeled by Equation (17.1), would be similar to the initial implementation of the Intel iPSC.
Equations (17.2) and (17.3) attempt to improve the latency delay per hop through a more complex routing
structure. The remaining component of the system complexity function is the communication channels. This
term is a function of the (sub)channel bandwidth.
The complexity terms of the local interconnection, the channels and the channel interface of Equation (18) are
scaled by a term modeling the economies of scale of bandwidth within the communication links. The objective
is to study the relationship between the relative costs of metal and optical interconnects. A main advantage of
optical links is that the bandwidth can increase far beyond the bandwidth of a metal interconnect and still
retain the economies of scale properties. Since the crossover point is constantly moving, due to improvements
in technology, the increases in complexity of systems with optical interconnects are studied, using the metal
complexity as a reference. The $B$ term in Equation (18) represents the relative increase in bandwidth between
optical and metal interconnects, and $\alpha$ models the economies of scale.
Putting the complexity components of the node together, the system complexity model for a given topology is:
$$ C = c_{np} N + B^{\alpha} \left[ c_{cp} N k + c_{ni} N I(k) + c_{cc} L \right] \qquad (18) $$
where $L$ denotes the number of communication channels. Equation (18) is used to model the relative complexity
of systems with optical and metal interconnects.
This function reflects the number of nodes, degree per node, internal structure and channel implementation.
This equation also models the returns to scale of bandwidth: up to a point, doubling the channel bandwidth
does not double the cost. The total number of I/O ports is included inside the bracketed term, since a port may
increase in cost in order to handle the increased traffic through the higher bandwidth channels.
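The complexity model of Equation (18) can be sketched as a small function. This is only an illustration: the unit cost coefficients default to 1 as placeholders (the paper's values are not given in this passage), the scale exponent is written `alpha`, and the labels for the three forms of $I(k)$ are hypothetical names for Equations (17.1) to (17.3).

```python
from math import log2

def local_interconnect(k, form):
    """I(k) for the three local interconnection forms of Equation (17)."""
    if form == "polling":       # (17.1) central controller
        return k + 1
    if form == "multistage":    # (17.2) multistage switching elements
        return (k + 1) * log2(k + 1)
    if form == "crossbar":      # (17.3) crossbar switch
        return (k + 1) ** 2
    raise ValueError(f"unknown form: {form}")

def system_complexity(N, k, L, B=1.0, alpha=0.3, form="polling",
                      c_np=1.0, c_cp=1.0, c_ni=1.0, c_cc=1.0):
    """Equation (18): C = c_np N + B^alpha [c_cp N k + c_ni N I(k) + c_cc L]."""
    scaled = c_cp * N * k + c_ni * N * local_interconnect(k, form) + c_cc * L
    return c_np * N + B**alpha * scaled

# Boolean cube with r = 4 (N = 16, k = 4, L = r 2^(r-1) = 32), metal reference B = 1.
print(system_complexity(N=16, k=4, L=32))
# Same structure with a relative bandwidth of 10.
print(system_complexity(N=16, k=4, L=32, B=10.0))
```

Because $B$ enters as $B^{\alpha}$ with $\alpha < 1$, a tenfold bandwidth increase raises the bracketed cost terms by roughly a factor of two rather than ten.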
Figure 13: Variation in system complexity with increases in system size. Relative bandwidth of optical links taken to be 10.
Figure 13 provides a comparison of the complexity with increases in system size. The relative bandwidth between
the metal and optical interconnects is taken to be 10 and $\alpha = 0.3$. Section 3.2.8 examines the relationship of
complexity with relative bandwidths of 1, 10, 100 and 1000.
7) Delay Analysis: The average time a packet takes to travel from the source node to the destination node
is determined next. The development is continued with $m_i = m$ for all $1 \le i \le r$. The model is based on an
M/M/1 system in which the output queue of each of the $k$ transmit ports of a node receives packets from the
local node processor, and also packets being routed from the other $k-1$ ports. The following assumptions are
made:
- Each node is equally likely to be targeted as the destination of a packet.
- Poisson arrivals of $\lambda_h$ packets per second from all local node processors.
- Uniform routing is performed without regard to queue lengths.
- An accepted packet that has not reached its destination is routed on one of the other $k-1$ ports with equal
probability.
Let $1/\mu B$ denote the mean service time, where $B$ is the channel bandwidth, and $1/\mu$ is the average packet size.
The utilization factor is defined as $\rho = \frac{\lambda}{\mu}$ with $\lambda$ arrivals per second.
The total delay, if a latency delay per hop $Q$ incurred by the communication processor for internal routing is
taken into consideration, is $t = Q + \sum_{i=1}^{L} \frac{\lambda_i}{\gamma} \left[ \frac{1}{\mu B - \lambda_i} + Q \right]$, where $\lambda_i$ is the arrival rate along channel $i$, $L$ is the
total number of channels, $\gamma$ is the network load factor, which is equal to $\gamma = \lambda/d$, and $\lambda$ is the total arrival rate
along all the channels. With a regular network, all channels have equal capacity. There is equal traffic on all
channels because of the uniform routing, so $\lambda_i = \lambda_j$ for all $i$ and $j$. The total arrival rate is $\lambda = L\lambda_i$, reducing the
above equation to:
$$ t = \frac{d}{\mu B - \lambda_i} + Q(d+1) \qquad (19) $$
The average channel arrival rate for each scheme is determined next. New arrivals, $\lambda_h$, generated by each local
node processor are evenly distributed over its $k$ I/O ports. Furthermore, assume that a packet is not routed
out on the same port from which it arrived. Relating the arrival rate of port $i$ with the local node processor
arrivals, and the packets arriving along the other $k-1$ ports:
$$ \lambda_i = \frac{\lambda_h}{k} + \frac{1-p}{k-1} \left[ \sum_{j=1}^{k} \lambda_j - \lambda_i \right] \qquad (20) $$
The term $1-p$ denotes the probability that a packet has not reached its destination, where $p = 1/d$. The total
arrival rates of all ports except $i$, which have not reached their destination, are divided among $k-1$ ports.
Equation (20) reduces to:
$$ \lambda_i = \frac{d}{k} \lambda_h \qquad (21) $$
This result is expected, since the channel arrival rate is greater with a larger average distance, and smaller for a
larger degree. Equations (19) and (21) can be used to provide the saturation traffic at the queueing leg for each
scheme, which is reached when $\lambda_i = \mu B$. The saturation traffic for each scheme is:
$$ \lambda_{sat} = \frac{k}{d} \left[ \mu B \right] \qquad (22) $$
Equation (22) shows that the saturation traffic can be increased with high capacity channels, higher degree,
and low average distance.
As seen in Equation (19), the delay includes a constant term $(d+1)Q$. In this analysis, the $Q$ term is assumed
negligible when compared to the queueing delay. Since the constant delay varies linearly with $d$, interconnection
networks with a large average distance are most sensitive to this term, which can be significant.
With the characteristics of each scheme derived earlier, the packet delay can now be listed, utilizing Equations
(19) and (21):
$$t_{BC} = \frac{r/\mu B}{2 - \lambda/\mu B} \qquad (23)$$
$$t_{NNMH} = \frac{2rm/\mu B}{8 - \lambda m/\mu B} \qquad (24)$$
$$t_{GHC} = \frac{r(m-1)/\mu B}{m - \lambda/\mu B} \qquad (25)$$
From Section 2.2.3,
$$t_{WMCH} = \frac{d}{2}\left[\frac{1}{\mu B_o}\right][m+1] + \frac{\lambda d(m-1)/(1/\mu B_o)^2}{2[1 - \lambda/\mu B_o]} \qquad (26)$$
where $B_o$ denotes the data-rate along an optical subchannel. The subscript $h$ has been eliminated from
$\lambda_h$ in Equations (23)-(26), since there should be no confusion between arrivals.
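Equations (23)-(25) can be cross-checked against the general delay expression with the $Q$ term neglected. The sketch below is illustrative only. The average distance and degree used for each topology (BC: $d = r/2$, $k = r$; NNMH: $d = rm/4$, $k = 2r$; GHC: $d = r(m-1)/m$, $k = r(m-1)$) are assumptions chosen so that Equations (19) and (21) reproduce the closed forms, since the section that derives them is not repeated here. All function names are hypothetical, and `mu_B` stands for $\mu B$.

```python
def delay_from_eq19(d, k, lam, mu_B=1.0):
    """Equation (19) with Q = 0 and the channel rate from Equation (21)."""
    lam_i = d * lam / k
    return d / (mu_B - lam_i)


def t_bc(r, lam, mu_B=1.0):
    """Equation (23)."""
    return (r / mu_B) / (2 - lam / mu_B)


def t_nnmh(r, m, lam, mu_B=1.0):
    """Equation (24)."""
    return (2 * r * m / mu_B) / (8 - lam * m / mu_B)


def t_ghc(r, m, lam, mu_B=1.0):
    """Equation (25)."""
    return (r * (m - 1) / mu_B) / (m - lam / mu_B)


lam = 0.5  # node arrival rate, below saturation for every case used here

# BC with r = 8 (assumed d = r/2, k = r).
assert abs(t_bc(8, lam) - delay_from_eq19(d=4.0, k=8, lam=lam)) < 1e-9

# NNMH with r = 2, m = 4 (assumed d = r*m/4, k = 2r).
assert abs(t_nnmh(2, 4, lam) - delay_from_eq19(d=2.0, k=4, lam=lam)) < 1e-9

# GHC with r = 2, m = 16 (assumed d = r(m-1)/m, k = r(m-1)).
assert abs(t_ghc(2, 16, lam) - delay_from_eq19(d=30 / 16, k=30, lam=lam)) < 1e-9
```

The expressions are only meaningful below saturation, where each denominator is positive.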
[Figure 14 plot: x-axis, Average Arrival Rate (0.1 to 1); Analytic Model and Simulation traces for N = 16, N = 64 and N = 256.]
Figure 14: Comparison of WMCH analytic model with simulation results. Two dimensional structures with $N \in \{16, 64, 256\}$.
Figure 14 graphs a comparison of the analytic model of the WMCH to simulation results. System sizes of
$N \in \{16, 64, 256\}$ were considered with 2-dimensional structures and $S = m$. The graph is restricted to two
dimensional structures because there is little advantage in increasing the dimension beyond two. If the optical
power budget restricts the fanout to 256 nodes per star coupled system, a total system size of 65536 nodes can
be supported. This approach is scalable from small systems to massively parallel ones.
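The system sizes quoted above follow from the product structure of the network: with the same number of nodes per star coupler along each dimension, the total size is the fanout raised to the number of dimensions. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def max_system_size(fanout, dimensions):
    """Total nodes when every dimension spans `fanout` nodes per star coupler."""
    return fanout ** dimensions


# Two-dimensional structures with 4, 8 and 16 nodes per dimension.
assert [max_system_size(m, 2) for m in (4, 8, 16)] == [16, 64, 256]

# A power-budget limit of 256 nodes per star coupler, in two dimensions.
assert max_system_size(256, 2) == 65536
```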
[Figure 15 plot: x-axis, Average Arrival Rate (0 to 4); traces for BC, NNMH, GHC, WMCH (B=1) and WMCH (B=10).]
Figure 15: Variation in packet delay with increasing arrival rate for the BC, GHC, NNMH and WMCH with N = 512. The optical
subchannels are taken to have a bandwidth of 1 and 10 relative to metal interconnects.
Figure 15 graphs the average packet delay of the BC, GHC, NNMH and WMCH with N = 256. The slot
length and average packet size have been normalized to 1. The optical subchannels are taken to have a
bandwidth of 1 and 10 relative to metal interconnects.
8) Marginal Complexity: The impact of network expansion should be taken into consideration when examining
performance and complexity measures. The marginal expansion of a network must be carried out while
preserving the network structure [11].
Typically, topologies cannot be incrementally increased in size. Marginal Analysis denotes the incremental
increase to performance or cost as the system is expanded from size $N$ to $N + \Delta N$. For example, the BC
can only be expanded by doubling the system size. Through marginal analysis, the question of whether the
performance also doubles is addressed. Furthermore, marginal analysis identifies cases when the node complexity
becomes increasingly costly with increased system size.
Suppose the total number of processors is denoted as $N = \prod_{i=1}^{r} m_i$, which is often the case in regular
hypercube based static interconnection networks [3,8]. Expansion can be achieved through increases in $r$ or $m_i$
for any $1 \le i \le r$. Some topologies, such as the BC, can only be expanded through increases in $r$.
With expansion through $r$,
$$N + \Delta N = \prod_{i=1}^{r+1} m_i \qquad (27)$$
$$\Delta N = N(m_{r+1} - 1) \qquad (28)$$
With expansion through an increase in $m_i$ to $m'_i$,
$$N + \Delta N = m_1 \cdots m_{i-1}\, m'_i\, m_{i+1} \cdots m_r \qquad (29)$$
$$\Delta N = N\left(\frac{m'_i}{m_i} - 1\right) \qquad (30)$$
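The two expansion modes can be compared with a short sketch. The function names and the example radices below are hypothetical. The first function follows Equation (28) and the second follows Equation (30), written with integer arithmetic.

```python
from math import prod


def delta_n_via_r(radices, new_radix):
    """Equation (28): nodes added by appending a dimension of size new_radix."""
    return prod(radices) * (new_radix - 1)


def delta_n_via_m(radices, i, new_m):
    """Equation (30): nodes added by growing dimension i (0-based) to new_m."""
    n = prod(radices)
    # radices[i] divides n exactly, so the integer division loses nothing.
    return n * new_m // radices[i] - n


# A binary cube (every m_i = 2) with N = 256 can only grow by doubling.
assert delta_n_via_r([2] * 8, 2) == 256

# An 8 x 8 x 8 structure grown to 9 x 8 x 8 adds one 8 x 8 plane of nodes.
assert delta_n_via_m([8, 8, 8], 0, 9) == 64

# The same structure grown through r, with m = 8, adds 7N nodes in one step.
assert delta_n_via_r([8, 8, 8], 8) == 7 * 512
```

Expansion through $m_i$ allows much finer increments than expansion through $r$, which is why the two modes are treated separately.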
If the performance of a structure of size $N$ is denoted as $P(N)$, the Marginal Performance (MP) of the structure
is defined as the incremental increase in performance when the network is expanded to size $N + \Delta N$:
$$MP = \frac{P(N + \Delta N) - P(N)}{\Delta N} \qquad (31)$$
For example, a measure of performance could be the product of the saturation traffic and the total number of
processors. The Marginal Cost (MC) is similarly defined:
$$MC = \frac{C(N + \Delta N) - C(N)}{\Delta N} \qquad (32)$$
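A minimal sketch of the marginal measure follows, applied to the example performance metric mentioned above (saturation traffic times the number of processors). The function names are hypothetical. The BC saturation traffic of $2\mu B$ per node is an assumption read off the denominator of Equation (23), and `mu_B` is normalized to 1.

```python
def marginal(metric, n, delta_n):
    """Equations (31)/(32): change in a metric per added node."""
    return (metric(n + delta_n) - metric(n)) / delta_n


def bc_performance(n, mu_B=1.0):
    """Assumed BC metric: per-node saturation traffic (2 * mu_B) times n nodes."""
    return 2.0 * mu_B * n


# The BC expands only by doubling, so delta_n = n.
n = 256
mp = marginal(bc_performance, n, delta_n=n)

# Under this metric, performance doubles when the system size doubles.
assert bc_performance(2 * n) == 2 * bc_performance(n)
assert mp == 2.0
```

The same `marginal` helper applies to a cost function $C(N)$, since Equation (32) has the same form as Equation (31).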
The MC is obtained for the BC, NNMH, GHC and WMCH using the complexity function defined in Section 3.2.6.
Figure 16 illustrates the difficulty of obtaining the excellent topological characteristics of the GHC. With both
approaches of expansion, the incremental complexity of the GHC increases rapidly.
Figure 17 graphs the marginal complexity of the WMCH when $B \in \{1, 10, 100, 1000\}$. The surge in each trace is
due to the increase in degree as $N$ exceeds 256 nodes. This graph illustrates the economies of scale in bandwidth
available with optical fiber links.
IV. Conclusion
A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances
in high capacity channels available through fiber optics, and wavelength division multiple access protocols.
[Figure 16 plot: x-axis, N (100 to 100000, logarithmic); traces for GHC(1), GHC(2), WMCH, NNMH(1) and NNMH(2).]
Figure 16: Variations in marginal complexity with increases in system size. (1) Expansion via r, with constant m = 8; (2) expansion
via m, with constant r = 3. The optical subchannels are taken to have a bandwidth of 1 relative to metal interconnects.
[Figure 17 plot: x-axis, N (100 to 100000, logarithmic); traces for B=1, B=10, B=100 and B=1000.]
Figure 17: Variations in WMCH marginal complexity for $B \in \{1, 10, 100, 1000\}$.
A highly fault tolerant system is created with good performance characteristics at a significantly reduced cost.
The system capitalizes on the self-routing characteristic of wavelength division multiple access to improve
performance and reduce complexity.
Although the cost of the WMCH is significantly lower than that of comparable architectures, the performance
characteristics of average distance, average packet delay, diameter, and delay are shown to be equal to or better
than those of the studied networks.
This multiple access approach provides a more efficient resource allocation, where the network performs well
under dynamically changing loads. The law of large numbers allows the balancing of temporary peaks in station
traffic. This is a more efficient approach than providing a large number of point-to-point channels with large
capacities to support peak traffic.
Beyond the performance benefits they provide, optical interconnects also relax the packaging requirements of
the system design. Relaxing the interconnect physical distance requirements at a particular data-rate has
significant system cost implications.
Four star-coupled systems were studied, of which three exhibited the optical self-routing characteristic. A
performance analysis showed that through the integration of agile sources or receivers, and wavelength division
multiple access, systems can be developed with significant increases in performance yet at a reduction in
communication subsystem complexity.
V. References
1. E. Arthurs et al., "Multiwavelength Optical Crossconnect for Parallel Processing Computers," Electron.
Lett., vol. 24, pp. 119-120, 1988.
2. E. Arthurs, M.S. Goodman, H. Kobrinski, and M.P. Vecchi, "HYPASS: An Optoelectronic Hybrid Packet
Switching System," IEEE Jour. Sel. Areas Commun., vol. 6, pp. 1500-1510, Dec. 1988.
3. L.N. Bhuyan and D. P. Agrawal, "Generalized Hypercube and Hyperbus Structures for a Computer
Network," IEEE Trans. Comput., vol. c-33, pp. 323-333, Apr. 1984.
4. G.E. Carlson, J.E. Cruthirds, H. B. Sexton, and C. G. Wright, "Interconnection Networks Based on a
Generalization of Cube Connected Cycles," IEEE Trans. Comput., vol. c-34, pp. 769-773, Aug. 1985.
5. G. Chiola, M.A. Marsan and G. Balbo, "Product-Form Solution Techniques for the Performance Analysis
of Multiple-Bus Multiprocessor Systems with Non-uniform Memory Reference," IEEE Trans. Comput.,
vol. 37, pp. 532-540, May 1988.
6. B.D. Clymer and J.W. Goodman, "Optical clock distribution to silicon chips," Optical Engineering, vol.
25, pp. 1103 - 1108, Oct. 1986.
7. P.W. Dowd and K. Jabbour, "Performance Evaluation of the Spanning Multiaccess Channel Hypercube
Interconnection Network," IEE Proceedings, vol. 134, Pt. E, pp. 295-302, Nov. 1987.
8. P.W. Dowd and K. Jabbour, "Spanning Multiaccess Channel Hypercube Computer Interconnection,"
IEEE Trans. Comput., vol. 37, pp. 1137-1142, Sept. 1988.
9. P.W. Dowd, "Optical Bus and Star Coupled Parallel Interconnection", Proc. Fourth Ann. Symposium on
Parallel Processing, Fullerton, CA., pp. 824-838, April 1990.
10. P.W. Dowd, "Optical Interconnections for Computer Communication," IBM Technical Report TR01.A961,
Endicott NY, Apr. 1989.
11. P.W. Dowd, M. Dowd and K. Jabbour, "Static Interconnection Network Extensibility based on a Marginal
Performance/Cost Analysis," IEE Proceedings, vol. 136, Part E, pp. 9-15, Jan. 1989.
12. T. Feng, "A Survey of Interconnection Networks," Computer, pp. 12-27, Dec. 1981.
13. M. Fine and F.A. Tobagi, "Demand Assignment Multiple Access Schemes in Broadcast Bus Local Area
Networks," IEEE Trans. Comput., vol. c-33, pp. 1130-1159, Dec. 1984.
14. M.J. Ferguson, "An Approximate Analysis of Delay for Fixed and Variable Length Packets in an Unslotted
ALOHA Channel," IEEE Trans. Commun., vol. COM-25, pp. 644-654, July 1977.
15. M.S. Goodman, "Multiwavelength Networks and New Approaches to Packet Switching," IEEE Commun.
Mag., pp. 27 - 35, Oct. 1989.
16. D.H. Hartman, "Digital high speed interconnects: a study of the optical alternative," Optical Engineering,
vol. 25, pp. 1086 - 1102, Oct. 1986.
17. P.R. Haugen, S. Rychnovsky, A. Husain and L.D. Hutcheson, "Optical interconnects for high speed
computing," Optical Engineering, vol. 25, pp. 1076 - 1085, Oct. 1986.
18. A.M. Hill, "One-Sided Rearrangeable Optical Switching Network," IEEE Jour. Lightwave. Tech., vol.
LT-4, pp. 785-789, July 1986.
19. H.S. Hinton, "A Non-Blocking Optical Interconnection Network using Directional Couplers," Proc. IEEE
Global Telecommunications Conference, vol. 2, pp. 885 - 889, Nov. 1984.
20. R.G. Hunsperger, Integrated Optics: Theory and Technology, Springer - Verlag, 1984.
21. K.B. Irani and I.H. Onyuksel, "A Closed-Form Solution for the Performance Analysis of Multiple-Bus
Multiprocessor Systems," IEEE Trans. Comput., vol. c-33, pp. 1004-1012, Nov. 1984.
22. L. Kleinrock and S.S. Lam, "Packet Switching in a Multiaccess Broadcast Channel: Performance Evaluation,"
IEEE Trans. Commun., vol. com-23, pp. 410-423, Apr. 1975.
23. H. Kobrinski and M. Yin, "Wavelength-Tunable Optical Filters: Applications and Technology," IEEE
Commun. Mag., pp. 53 - 35, Oct. 1989.
24. T.P. Lee and C.E. Zah, "Wavelength-Tunable and Single Frequency Semiconductor Lasers for Photonic
Communications Networks," IEEE Commun. Mag., pp. 42 - 52, Oct. 1989.
25. N.F. Maxemchuk, "Twelve Random Access Strategies for Fiber-Optic Networks," IEEE Trans. Commun.,
vol. 36, pp. 942-950, Aug. 1988.
26. P.K. McKinley, Group Communication in Bus-Based Computer Networks, Ph.D. thesis, University of
Illinois at Urbana-Champaign, Urbana, Illinois, 1989.
27. T.N. Mudge and H.B. Al-Sadoun, "A Semi-Markov Model for the Performance of Multiple-Bus Systems,"
IEEE Trans. Comput., vol. c-34, pp. 934-942, Oct. 1985.
28. Y. Ofek, The Topology, Algorithms, and Analysis of a Synchronous Optical Hypergraph Architecture, Ph.D.
thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1989.
29. F.P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A versatile network for parallel
computation," Commun. Ass. Comput. Mach., vol. 24, pp. 300-309, May 1981.
30. D.A. Reed and H.D. Schwetman, "Cost-Performance Bounds for Multimicrocomputer Networks," IEEE
Trans. Comput., vol. c-32, pp. 83-95, Jan. 1983.
31. A.A. Sawchuk and T. C. Strand, "Digital Optical Computing," Proc. IEEE, vol. 72, pp. 758 - 779, July
1984.
32. R.A. Spanke and V. E. Benes, "An N-stage Planar Optical Permutation Network," Applied Optics, vol.
26, Apr. 1987.
33. F.A. Tobagi, "Multiaccess Protocols in Packet Communication Systems," IEEE Trans. Commun., vol.
com-28, pp. 468-488, Apr. 1980.
34. S.S. Wagner and H.L. Lemberg, "Technology and System Issues for a WDM-Based Fiber Loop
Architecture," IEEE Jour. Light. Tech., vol. 7, pp. 1759-1768, Nov. 1989.
35. L.D. Wittie, "Communication Structures for Large Networks of Microcomputers," IEEE Trans. Comput.,
vol. c-30, pp. 264-272, Apr. 1981.