Bit-serial architecture for optical computingstrukov/ece594BWinter2011/optical.pdf · Bit-serial architecture for optical computing Vincent P. Heuring, Harry F. Jordan, and Jonathan

Bit-serial architecture for optical computing

Vincent P. Heuring, Harry F. Jordan, and Jonathan P. Pratt

The design of a complete, stored-program digital optical computer is described. A fully functional,

proof-of-principle prototype can be achieved by using LiNbO 3 directional couplers as logic elements and

fiber-optic delay lines as memory elements. The key design issues are computation in a realm wherepropagation delays are much greater than logic delays and implementation of circuits withoutfip-flops. The techniques developed to address these issues yield architectures that do not change astheir clocking speed is scaled upward and the size is scaled downward proportionally; these are called

speed-scalable architectures. Signal amplitude restoration and resynchronization are accomplished bythe novel technique of switching in a fresh copy of the system clock. Device characteristics that areimportant to the proof-of-principle demonstration are discussed, including the special properties andlimitations that are important when designing with them. Design principles are exemplified by thedesign of an n-bit counter. Following this, the design for a stored-program bit-serial computer isdescribed. We estimate that the described prototype architecture can be operated in the 100-MHz regionwith off-the-shelf components, and in the 0. 1-1-THz region with foreseeable future components.

I. Introduction

We describe an architecture for a realizable, stored-program, general-purpose digital computer con-structed from optical components. There have beena number of proposals for optical computer architec-tures in the literature, but most are either architec-tures for special-purpose machines,"2 or they rely ondevices that are in the research category.3 A keyfeature of this machine is that it incorporates a storedprogram that it can manipulate as data, which thussupports the compiler and high-level language hierar-chy that forms the basis of computer science. For acomputer architect, the stored program, not thearithmetic capability, distinguishes a computer fromother digital systems. While a powerful, high-speedarithmetic unit may be an extremely valuable device,the ability to store and manipulate machine instruc-tions as data is the foundation of the many levels ofhardware and software complexities that characterizegeneral-purpose computing. From past experience,quantitative advances in storage capacity and access

When this work was performed, the authors were with theDepartment of Electrical and Computer Engineering, Center forOptoelectronic Computing Systems, University of Colorado, Boul-der, Colorado 80309-0425. J. P. Pratt is now with the Universityof Colorado Health Science Center, Department of Radiology, BoxA0344, 200 East Ninth Avenue, Denver, Colorado 80262.

Received 13 September 1991.0003-6935/92/173213-12$05.00/0.33 1992 Optical Society of America.

speed, arithmetic capability, etc., follow the qualita-tive advance to self-contained, stored-program opera-tion.

Our objective is to design a proof-of-principle ma-chine that can be constructed from commerciallyavailable components. Such a design permits com-puter architects to become involved in optical com-puter design earlier than would be possible if theywere to wait for the arrival of a more familiar-lookingoptical technology such as integrated optics. It thusputs the architect in a position to influence the designof future optical computing components. As forpractical application, a bit-serial design such as theone proposed here should have immediate applicationin fiber-optic communications systems that alreadyoperate in the serial mode.4

Given the goal of designing a realizable stored-program computer from commercially available com-ponents, we faced some of the same constraints thatwere faced by early architects of electronic com-puters5 6: a limitation on the number and kinds ofdevices that could be used in a realizable implementa-tion and the consequent need to implement memorywith less expensive components. (Here we use theterm expensive in its larger sense, i.e., the cost interms of price, size, power dissipation, etc.) Practicalelectronic computers have been constructed with asfew as 15 flip-flops and a diode array, with a rotatingmagnetic drum for memory.6 Circulating mercurydelay lines were also employed.7 These early ma-chines generally used a bit-serial approach, since, in

10 June 1992 / Vol. 31, No. 17 / APPLIED OPTICS 3213

general, a bit-serial design that uses M devices for acertain functionality, for example, an adder, willrequire M*N devices to implement the functionalityin parallel for word size N. Huang3 mentions thelack of cascadable logic gates and optical memorydevices as two of the chief bottlenecks in building anoptical computer. Our studies have shown that theLiNbO 3 directional coupler will serve as a logic device,and a glass fiber loop will serve as a delay-linememory. Subsections L.A and I.B describe thesedevices as used as components in an optical computer.Subsections I.C-I.E address some of the design issuesinherent in architectures that are constrained by thespeed of light, or more precisely, architectures thatare confined to the surface of a relativistic light cone.Section II describes the implementation of a counteras an example of a design that uses these components.Section III discusses the actual computer design.Finally, in Section IV we discuss the probable operat-ing speed of the computer with present day compo-nents, and make estimates of possible future operat-ing speeds.

A. LiNbO 3 Switch as a Logic Element

The LiNbO3 electro-optic switch is a guided wavedevice that is produced by diffusing precisely placedTi waveguides into high-purity LiNbO3 crystals, asshown schematically in Fig.1(a).8 In the absence ofan externally applied voltage VAC, but with the correctbias voltages VDC1 and VDC2, light entering at point Awill couple to the neighboring waveguide and emergeat E. Likewise, light entering at B will emerge at D.This is referred to as the cross state. Application ofthe correct voltage at VAC will cause the light enteringat A to emerge at D and light entering at B to emergeat E. This is referred to as the bar state. Typicalcommercially available devices,9"10are several centime-ters in length and are terminated in fiber pigtails forsystem interconnection. As systems designers, weare most interested in system level characteristics,rather than device physics, such as the distinctionbetween A13 and Mach-Zehnder switching."1 Systemspecifications for a commercial switch are9: =1300 nm; switching speed 0.3 ns; switching voltageVA = 4.5 V; bias voltages VDC1 and VDC2 5 V; on-offcross talk typically > 25 dB; and insertion loss < 5dB. The switching is nonlinear with a functionalitythat depends on electrode configuration and geome-try." It ranges from sinusoidal, I = [1 - cos('rrVAc/Vmax)] /2, to a plateau function, in which I is relativelyinsensitive to the value of VAC > V. Even thesinusoidal characteristic provides relative insensitiv-ity to VAc in the region of saturation, since (IVAC )Vmax 0. In our application VAc is generated by

a photodiode-amplifier combination, thereby convert-ing the switch to a five-terminal optical device, withoptical terminals labeled A through E, as shown inFig.1(b). In addition the amplifier was designed sothat it saturates at V,, at a relatively low-input lightintensity. The cost of such a device, with its associ-ated drive electronics, is $2000. (Although it might

VDC2

A _Ii

B -

VAc

1VDC1

(a)

VDC271

A t-_ ~-figaUa

B

Fig. 1. (a) LiNbO3component.

4VDC1

(b)

switch as a device, (b) LiNbO3 switch as a

be argued that this is an optoelectronic device ratherthan an all-optical one, we take the view that, sincethe electronics are limited to one part of the device,the device can be considered as all-optical.) We dotake advantage of the electronics to shape the incom-ing pulse in the manner described in Ref. 12.

From the point of view of a computer architect, thedevice can be modeled as two interconnected 1 x 2multiplexers; its functionality and symbolism aredescribed in Fig. 2. Figure 3 demonstrates the logicfunctions that can be synthesized from the direc-tional coupler in a clocked system. As Fig. 3 shows,the switch has considerable utility as a design compo-nent and is, in fact, a complete logic element. This isfortunate, because there currently is a paucity ofoff-the-shelf optical components that are suitable foruse in computer design. Because the use of a singlecomponent in so many different ways complicates theunderstanding of the circuit designs that employ it,

A D=AC+BCBI=AC+BCC 2

Fig. 2. Optical switch.

3214 APPLIED OPTICS / Vol. 31, No. 17 / 10 June 1992

D

=:_Z� E

==Z:::� D

E

Fig. 3. Optical components as logic elements: MUX, multi-plexer; DeMUX, demultiplexer.

we take greater than usual pains in describing thecircuits below. Figure 3 also shows two additionaloptical circuit elements, which consist of fixed 3-dBdirectional couplers. When used as a splitter, theypermit fan-out of the optical signal. They can alsoserve as the optical equivalent of the wired OR gate.The splitter is essential to the design as the onlymeans of fan-out. The wired OR can be replaced by atrue OR gate made from an active switch wheninterference between coherent input signals is aconcern.

B. Optical Delay-Line Memory

The optical delay-line memory is simply a loop ofoptical fiber whose length is 1 = nl,, where I is the looplength, n is the number of bits stored in the loop, and4, is the distance traveled by the optical pulse duringone clock period. Furthermore, 1, = rv, where 7 isthe clock period and v is the speed of light in themedium. v is related to the speed of light in vacuo, c,by v = c/,q, where -q is the refractive index in themedium. In terms of the clock frequency v, v = 1/,so I = nc/vrl. For glass fibers in which i = 1.5, 102416-bit words are stored in a loop 3.3 m long at a clockfrequency of 1 THz = 1012 Hz, and in a loop 33 kmlong at a clock frequency of 100 MHz. There is anupper bound to the number of bits that can be storedin such a loop before the thermal coefficients of thelength and the refractive index of the fiber causesignificant changes in the optical length of the fiberunless these effects are compensated for. A typicalglass fiber has a linear thermal-expansion coefficientof 1 x 10-6 0C-1, and a thermal coefficient with arefractive index ranging from -10 to + 19 x 10-6'C-', with values of + 10 being typical of fused silica.Thus an uncompensated fiber loop has an upperbound on the number of bits that can be stored of

105. This bound can be increased considerably byone or more of the following means: (a) carefultemperature control, (b) encoding the clock signal inthe data being circulated in the loop, (c) controllingthe properties of the fiber so that length changescaused by thermal effects are compensated for byrefractive index changes, and (d) measuring the driftand compensating for it by adjustments in the clockfrequency. Changes in optical length may also becorrected by clocking the system with a clock signalthat has been subjected to the same temporal pertur-bation as the signals in the memory loop. Thesematters are considered in more detail in Ref. 13.

C. Synchronization and Amplitude Restoration in theDelay-Line Memory

All passive devices exhibit loss. In commercial fi-bers, this loss may range from 0.5 to 10 dB/km.LiNbO3 component insertion loss may range up to 5dB. Thus some means of amplitude restorationmust be provided. Furthermore a means must beprovided to insert and retrieve information in theloop. The circuit of Fig. 4 provides both synchroniza-tion and amplitude restoration by interrupting theloop at switch S1 to permit the bit stream to refreshitself by switching a copy of the clock signal at A intothe loop for every bit stored. The circled A with theN beneath it represents a loop of N-bit length; splitterS2 permits sampling of the signal. Initializing dataare written into the loop at SIB. The total lossaround the loop is 5 dB at S1 and 3 dB at S2, for atotal loss of 8 dB. These losses are compensated forby the electronic gain at control terminal SIC.

This technique of inserting a copy of the clocksignal to restore amplitude and timing is extremelyimportant to the entire design. The more usualtechnique employed in optical systems, that of usingoptical amplification, suffers from the disadvantagethat cross talk and accumulated noise are also ampli-fied. In addition, amplification does nothing to resyn-chronize the system. The technique is similar toclock gating used in some electronic systems forsynchronization. The use of this technique to re-store optical power level is novel.

The computation of power budgeting, loss estima-tion, and cross-talk influence is important in a designemploying dozens or hundreds of components. Wehave developed algorithms for computing power lossand accumulated cross talk in complex optical sys-tems such as the one described here.14 By usingthese algorithms, we find that the number of switchesin the architecture described below must be increased

OutputClock -Data in

NFig. 4. Simple delay-line memory.


from approximately 48 to 76 to restore power toadequate levels, assuming the cross talk and lossfigures quoted above.

D. Clock

Clocking is provided by a 1300-nm laser that ismodulated at the desired clock frequency. The dutycycle should be adjusted to < 50%. The electronicsat the control terminal of the LiNbO3 switch willstretch the pulse somewhat. This is because theswitch should be in the desired state before thearrival of an input pulse and should stay in thedesired state until after the pulse has left the device.This prevents the occurrence of glitches and runtpulses. An apt analogy can be drawn with a railroadtrack switch, which also must be in the correct statebefore the arrival of the train and which must stay inthe correct state until after the train has crossed theswitch. A given device and clock speed will have aparticular combination of duty cycle and delay thatare optimum. The principal factors to be consideredwhen determining duty cycle and delay are rise andfall times, propagation time through the electronicsat terminal C, and pulse jitter. The latter becomessignificant at clock rates of > 100 MHz, at whichdetectors begin to exhibit data-dependent delays.Measurements of these parameters on prototype driveelectronic modules show rise and fall times of < 1 nsand end-to-end propagation times of 4 ns (see Ref.15).

E. Effect of Delay on Design

A key feature of speed-scalable architectures is thatthe design technique employed does not use flip-flopsor latches for synchronization. Rather, temporalsynchronization is included as a parameter duringthe design process. Fiber lengths are calculated sothat pulses arrive at key places in the circuit (i.e.,switches, splitters, and combiners) in synchronization.Although the lack of flip-flops may seem arbitrary orunusual to those accustomed to electronic design, ithas a firm basis. It must be remembered that activeflip-flop designs incorporate positive feedback loops,such as cross-coupled NAND gates. Information prop-agates around these loops no faster than the speed oflight. The idea of static storage rests on the assump-tion that the propagation speed is much faster than

Count(a)

.~ - - - - -- .--- ----..-.. . ..... ..... . .. ......... ................ ....... .......... .. ......... . .. .C1y

Count

that at which changes are made to the flip-flop's state.It is precisely this assumption that should be reinter-preted in an optical architecture. The representa-tion of the memory function entirely in terms ofrecirculating loops is therefore justified on fundamen-tal grounds.

Delays in fibers and devices can be computed fromthe dimensions and the refractive index of the mate-rial. Delays in the detector-amplifier circuits of theswitch must be measured. In practice, an opticallength is computed for each fiber of the device,including the control fiber, and is included in thedesign calculation. Delays for the control input in-clude both the actual optical delay and the delayintroduced by the drive electronics and detector.We find it expedient to do the first design iterationassuming no delays anywhere except where they aredesired. The effect of delays is then taken intoaccount when calculating fiber interconnect lengthsby using an algorithm that represents the design as adirected graph.16

We have incorporated this algorithm and the onedescribed above that calculates the effects of powerloss and cross talk on system operation with a graphi-cally oriented optical computer-aided design systemthat runs on the Apple Macintosh computer.7

Copies of the software are available upon request.For any given set of components there is an upper

bound on the clock speed. This upper bound may befixed by clock rise and fall time, control electronicsrise and fall time, device rise and fall time, orminimum delay. The minimum delay through acircuit becomes important when designing circuitswith feedback, such as carry loops. In a design withsuch a loop, i.e., loop C in Fig. 5(a), the total delayaround the loop must be exactly equal to the clockperiod , since a bit in the delay loop must travel theloop and arrive back at 1 just in time to switch thenext incoming pulse. Therefore the calculated delay-loop length must be greater than or equal to theoptical lengths of the fibers. If the loop length isgreater, then fiber must be added to the loop to makethe length equal. Notice that the optical length ofthe path through 1 is greater than the path lengththrough S2 because of the additional delay in theformer that is due to the control electronics atterminal C.

I 1_I

(b)Fig. 5. (a) A 4-bit binary counter. (b) Simulation of the operation of the counter.


I I

This requirement that the carry-loop timing besuch that the carry bit recirculates in exactly suffi-cient time to arrive at the switch just as the next bitarrives at the switch is analogous to carry propaga-tion in a ripple-carry adder. An even more aptanalogy is to the carry flip-flop in a serial adder. Itmay be noted in this connection that for arithmeticcomputations that take place on the surface of a lightcone, as optical computations do, serial and paralleladditions take exactly the same time to performbecause of carry propagation and therefore, in thiscase, parallel computation offers no advantage overserial computation. This is not to say that there areno solutions to the carry propagation problem butrather that the speed of light offers fundamentallimitations to the extent of parallelism that can beachieved. On the other hand optical computer de-signs such as the one described here are inherentlypipelined, and thus can exploit all the advantages thatpipelined architectures afford. In addition, whereinformation does not interact, one can certainly paral-lelize large sections of such an architecture. Forexample the memory loop can be duplicated so as toprovide access to an entire machine word simulta-neously. In light of the discussion of carry propaga-tion above, any fraction of the word parallelism can bemoved between the time and space domains.

We have seen that creation of a serial memory looprequires only a single optical switch and delay line.However, the contents of such a loop should bereadily alterable in order for the loop to be useful incomputer design. The simple loop only allows bits tobe oRed in by means of the terminal adjacent to theclock input. A more general memory loop is illus-trated in Fig. 6. A second switch is added to providefor the insertion of new data into the loop. When thecontrol input of switch 2 is not asserted, the resultingcross state of the switch completes the simple delayloop. However, when the control input is assertednew data from terminal B are switched into the loop.The timing of the new data must match that of the oldin order for regeneration to continue.

II. Serial N-Bit Counter

Figure 5 shows the design and simulation of a 4-bitcounter. Input pulses to be counted are expected tooccur synchronously with the first pulse of eachgroup of 4 clock pulses. Arrival of an input pulseswitches a copy of the arriving clock pulse to fiber E ofS1. This pulse is split at S4 and sent to S2 and S3.Accumulated bits circulate in loop M, which is amemory loop similar to that discussed above. AnN-bit counter will have a delay loop containing N bits

Output

Data

Write '

Fig. 6. Delay-Line memory with write capability.

so, for this 4-bit counter, loop M has a delay of 4 A.Notice that if no bit is arriving at S3C from loop Mwhen the input bit arrives at S3A the arriving bit isswitched into loop M. However, if a bit is arriving atS3C from loop M while an input bit is arriving at S3Athen the bit from M is not copied back into M, and theinput bit at S2A is switched back into the carry loop,loop C, where it will emerge again one clock periodlater, and the process will repeat as shown in Fig.5(b).In the absence of input or carry bits the bit pattern atM will continue to circulate in the manner describedin Subsection I.B. When the loop is filled with 1's,then the next arriving bit will overflow the loop,resetting it to all 0's, and the process will repeat.We describe the construction and operation of 50-MHz and 100-MHz counters in Refs. 18 and 19.

Section III shows how fiber delay loops, LiNbO3switches, splitters, and combiners are used in thedesign of a simple bit-serial computer.

111. SCAMP: A Simple Bit-Serial Optical Computer

A. General Description

The design goal of the SCAMP (Serial ComputerAssuming Multiplexers Perfect; an early hedge intro-duced by the principal designer) is to create a stored-program optical computer with a general-purposeinstruction set that uses few enough components tobe realizable at the current state of optical technology.The architecture operates serially to reduce the com-ponent count to a surprisingly low number. Only

50 optical switches are needed in the ideal design.The minimality of the design is underscored when itis realized that every simple logic function combiningsignals requires a switch, with the minor exception ofthe wired OR mentioned above. On the other hand,data storage is inexpensive, since it consists of anoptical fiber with one switch for regeneration and onefor writing into it. The switch count is independentof the size of the storage.

Five processor data structures are implementedwith optical delay loops:

Main memory (MEM), 2N words;Memory counter (MC), WordSize bits;Accumulator (ACC), WordSize bits;Instruction register (IR), WordSize bits;Program counter (PC), WordSize bits.

An interesting aspect of this serial design is that theword size may be changed without architecturalmodifications. Hence, wherever possible, the wordsize of the computer is generalized to WordSize bits.Since the SCAMP is a single-address machine, N bitsof each instruction contain a memory address and 6bits are used for control. This leads to a minorrestriction on the word length:

WordSize 2 N + 6 bits.

The data paths between the memory structures of theSCAMP are illustrated in Fig. 7. Arrows indicatethe possible directions of transmission. The arith-


Fig. 7. SCAMP data paths.

metic and logic unit (ALU) serves as a transfer point,routing the signals according to the control bits of thecurrent instruction. In Fig. 8 all the connectionsbetween the SCAMP units are shown.

B. Instruction Set Processor Description of the SCAMPAn instruction set processor (ISP) description of theSCAMP is presented in Fig. 9. The registers andflags are defined under the heading Processor State.The main memory is defined under the headingMemory State. The 16-bit word of this exampleallows for a 10-bit address and thus, 1024 memorywords. The fields of the instruction word are de-fined under the heading Instruction Format, and thesequence of events that produces instruction execu-tion is described in the instruction interpretationsection. At the bottom of Fig. 9 the nine instruc-tions are named, and their effects are described.

C. Main Memory

Some discussion of memory access is required beforethe overall design is discussed. When a large quan-

tity of data is circulating in a delay loop, there must bea means accessing a specific element or word. Onepossibility is to embed location information in thecirculating data, a technique that is often applied torotating magnetic media. Another option is to keeptrack of the currently accessible element throughprecise external timing. The latter method was cho-sen for the SCAMP because of its simplicity. Theprecise timing does not present a problem since theclock that regenerates the memory loop is the sameone that drives the counting circuits.

Thus the SCAMP's memory consists of a 2 N wordmemory loop that is timed by an N-bit counter. Thecirculation of one group of WordSize bits in thememory loop corresponds to an increment in thecounter. This memory counter is used to indicatewhich memory word is currently available at a giventap in the loop. Obtaining a word at a given addressbecomes a matter of waiting for the memory counterto equal that address. In the SCAMP the addresscomparator unit indicates when there is a matchbetween the memory count and the desired address.

D. Instruction Execution

The SCAMP goes through four phases to executeeach instruction: instruction search, instructionfetch, operand search, and instruction execution.During the instruction-search phase a match is soughtbetween the program counter and the memorycounter; the address fields of the MC and PC signalsare compared until they are equal. A match isindicated by a pulse on the memory-found, or MEMF,signal line. When asserted during an instructionsearch, MEMF causes the current memory word to befetched and sent to the instruction register.

After an instruction has been fetched the operand-search phase begins. This is similar to the instruc-tion-search phase except that the address field of theinstruction register rather than the program counteris compared with the memory counter address field.The operand/instruction (OPR/INST') signal is usedto differentiate these two states. OPR/INST' ema-nates from a circuit that is toggled, after a neededdelay, by the MEMF pulse. When MEMF is assertedduring the operand-search phase, instruction execu-tion is initiated.

At the beginning of the instruction-execution phasethe six control bits of the instruction are extractedand synchronized. Three of these bits are used toselect a source signal, one bit is used to select adestination, and two bits are used to control theprogram counter. Table I summarizes the uses ofthese bits. The operand is fetched from memory(the MEM signal) and combined with the accumula-tor (the ACC signal) to produce eight source arith-metic signals. The three source bits and one destina-tion bit are used to choose one of the ALU signals andsend it to either the accumulator or the memory.In the latter case the original operand is replaced bythe ALU result.

The methods of PC control are enumerated inTable I. For sequential operation the PC is incre-


Fig. 8. SCAMP signal connections.

Registers and flags are defined.

IR <15:0>ACC <15:0>PC <15:0>PCad <9:0> := PC <9:0>result <15:0>carryflagzeroflag

The instruction register.The accumulator.The program counter.Address field of program counter.Temporary ALU result state.Temporary carry result state.Temporary zero result state.

Memory State

M[0:1023] <15:0> The main memory, 1024 sixteen bit words.

Instruction Format

op <5:0> := IR <15:10>source <2:0> := op <5:3>dest := op <2>PCcon <1:0> := op <5:4>address <9:0> := IR <9:0>

Instruction Interpretation

Instruction bits are defined.

The operation code of the instruction.Source calculation field.Destination bit.Specifies PC control: branching, skipping, etc.Address field of instruction.

Instruction cycle sequence is defined.

(IR - M[PCad]; next PC - PC + 1;instruction-execution;carryflag *- (ACC + M[address]) > 65535;

next zeroflag -- (result = 0);

dest - (ACC - result);

- dest (M[address] - result);

Fetch instruction, increment PC.

Do computation.

Addition determines carryflag state.

Result determines zeroflag state.

Destination is accumulator.Destination is memory.

next (PCcon = 1) A - zeroflag -< (PC - PC +);Skip on result not zero.

(PCcon = 3) A carryflag - (PC - PC + 1) )

Instruction Set Effects

instruction-execution := (Clr ( := source = 0 ) - (result - 0);

Not ( := source = 1 ) -4 (result 4- - M[address]);

Skip on carry.

and mnemonics of instructions.

Clear - set to zero.

Invert Memory.

Or ( := source = 2 ) -4 (result <- ACC V M[address]); OR Memory and Accumulator.

MoveA ( := source = 3) - (result (- ACC); Move Accumulator.

RoR ( := source = 4 ) -4 (result -- ACC <0>ACC <15:1>); Rotate Accumulator right.

Add ( := source = 5 ) -* (result (- ACC + M[address]); Add Memory and Accumulator.

MoveM ( := source = 6 ) - (result - M[address]); Move Memory.

And ( := source = 7 ) -4 (result c- ACC A M[address]); AND Memory and Accumulator.

Jmp ( := op = 30 ) -4 (PC - IR) ) Unconditional branch.

Fig. 9. ISP description of the SCAMP with a 16-bit word.

mented. In an unconditional branch the contents ofthe instruction register replace the contents of thePC. The control bits that go along have no effectsince they are not in the address field. Finally,conditional branching can take place with skip-on-not-zero or skip-on-carry. Carry and zero conditions arecomputed for every instruction, and since they aretransitory states they must be acted on as they arecomputed. While this has some obvious disadvan-tages, there are some advantages as well. For exam-ple, since ACC + MEM is always computed, it ispossible to move an operand to the accumulator andskip on the carry that results from the add of ACC +MEM in one instruction.

Table 1. SCAMP Opcode Bits

Source Destination PC Control

543 2 10

000 Zero 0 MEM 00 PC -PC + 1001 MEM' 1 ACC 01 IF DEST = THEN PC

PC + ELSE PC -PC + 2010 ACC V MEM 10 PC -IR011 ACC 11 IF CARRY THEN PC 100 ACC rotated PC + 2 ELSE PC - PC + I

right 1 bit101 ACC + MEM110MEM111 ACC A MEM


Processor State

CK 2 WCCK ~ K0 1 ~ ~ C

WS

Fig. 10. WCK generator.

E. System Timing

Figure 10 illustrates the word clock generator. Thiscircuit produces a pulse once every WordSize (whichis abbreviated as WS in the figure) clock cycles, whereWordSize is the number of bits per word in theSCAMP. As its name implies, the word clock signal(WCK) is used primarily for signaling word bound-aries in various data streams.

In Fig. 10, assuming that the clock begins cleanly atsome initial time, switch 1 permits only the first pulseto pass. When applied to switch 2, which is initiallyin the cross state, this start pulse is injected into aword-sized fiber loop that terminates at the controlinput of switch 2. This loop is simply an applicationof the memory delay line described above; the singleinitial pulse is refreshed at each circulation, and theloop is tapped to provide the necessary number ofWCK lines.

Figure 11 illustrates the pulse-driven toggle circuitthat is used to differentiate between instruction andoperand fetches. The delay introduced in the tog-gling MEMF pulse is 1 word period. This providessufficient time for the address of the next phase to bemade valid.

The toggle circuit is most easily viewed as a specialcase of the general memory structure of Fig. 6. Whenno T signal is applied, switch 4 remains in the crossstate and a 1-bit loop is formed around switch 3. Thecomplement of the output signal, which is a by-product of the regeneration at switch 3, is applied toswitch 4 as an alternate input. A properly timedpulse at T sends the complement signal through toswitch 3, where it becomes the new output.

Figure 12 is a simple circuit for converting amemory-found pulse into either an instruction-foundor an operand-found pulse, depending on the machinestate indicated by OPR/INST'. Switch 5 acts as adecoder; when OPR/INST' is low, the resulting crossstate passes MEMF to the instruction-found (INSTF)line; when OPR/INST' is a pulse train, MEMF isrouted to the operand-found (OPRF) line.

Figure 13 illustrates a pulse stretcher that gener-ates a necessary timing signal. The effect is simplyto take a single OPRF pulse and convert it into asequence of WordSize pulses. The pulse-stretchingelement is represented by the square surrounding PS.

CK out l OPR/INST

Ir ouMLEMF AT

WS-1

Fig. 11. OPR/INST toggle.

5EMO t INSTFNMEMF-L ~OPRFOPR/INST

Fig. 12. MEMF decoder.

The stretched OPRF signal holds the switch in its barstate long enough to send the desired number of clockpulses to the output.

A sample timing diagram for integrating the timingsignals into an overview of the instruction cycle ispresented in Fig. 14. At the beginning of the instruc-tion cycle OPR/INST' is low, indicating that aninstruction is being sought. After an integral num-ber of word cycles the address comparator indicatesthat the instruction has been found by sending apulse on MEMF. This generates an INSTF pulseand causes OPR/INST' to go high at the beginning ofthe next word cycle. MEMF is pulsed again whenthe operand is found. This generates the pulse onOPRF and sends OPR/INST' low at the beginning ofthe following word cycle.

F. Memory Counter

Figure 15 shows the memory counter. The memorycounter is a free-running version of the binary counterthat is described above. WCK is used to incrementthe memory counter at the beginning of each wordperiod. The least significant N bits of the countindicate which memory word is currently accessible.The counter has WordSize bits instead of N bits sothat its increment time of WordSize cycles matchesthe WordSize cycles of each word in memory; N isunlikely to divide evenly into WordSize, so the extrabits are appended to maintain synchronization be-tween the count and the circulating memory words.

G. Registers

The program counter is diagrammed in Fig. 16. Theprogram counter holds the address of the next instruc-tion to be executed. The OPRF line increments thecount once during each instruction cycle to facilitatenormal sequential operation. The SKIP line pro-vides an additional increment (during a differentword period) if a skip condition specified by aninstruction succeeds. Branching is accomplished byforcing a different count value into the WordSize fiberloop. The CTL line is the source of the branchaddress.

The memory and accumulator loops are illustratedin Fig. 17. These are identical in operation to thestructure of Fig. 6. The memory loop holds pro-

6

AK OPRFS

OPRE

WSFig. 13. OPRF pulse stretcher.


TC

WCK

MEMF

INSTF

OPRF

OPR

cK

- 1n LKjj16CLS - 1m LK + 16CLS

Intuto erhIIsrcin Fth Mmr erh I I nsuto xcto

A n d _ _~~~~t".1 -I-OPR/NS ) ~tf lAoL;; (16) -iFS Mln -t (16) -

1~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Beginning Time - , End ofof Instruction Instruction

Fig. 14. Timing example: WordSize = 16 bits, N = 10.

grams and data for the SCAMP and is WordSize X 2 N

bits long. The accumulator loop is a WordSize bitregister whose function is typical of accumulators insingle-address architectures. These loops are writ-ten to when OPRFS is asserted. The state of theDSTSEL line determines whether the memory loopor the accumulator loop receives the ALU result senton DEST.

The instruction loop of Figure 18 holds a copy ofthe current instruction during its execution. Theinstruction is obtained directly from the memory loopwhen INSTF is asserted. This register is anotherinstance of the memory construct of Fig. 6.

H. System Operation

As described above, the address comparator, which isshown in Fig. 19, is used to indicate when a desiredword in the memory loop has become accessible. Inan instruction search the address field of the memorycount is compared with the address field of theprogram count during each word period, and a matchis signaled by a pulse on MEMF. During an operandfetch it is the address portion of the current instruc-tion that is compared with the memory count.

In Fig. 19 the OPR/INST' signal applied to switch22 determines which comparison is to be made; in thecross state the PC is compared, whereas in the barstate CTL is passed. Switch 23 performs a bit-by-bitcomparison of the MC and the control signal. Eachmatching bit passes a pulse on to switch 24. Switch

OPRFSKIP

CISL

PCSELO

PCSELI

.PC

Fig. 16. Program counter.

24 ANDs each input bit with its own result from theprevious clock cycle. Any missing pulses in the datastream turn off all subsequent output. A compari-son is initiated by the oRing of the WCK to the controlof switch 24. This passes the result of the first bitcomparison. From then on the process must beself-sustaining; any failed bit comparisons turn offthe output for the duration of the word cycle. Sincean address consists of only the least significant N bitsof a word, the WCK is applied to switch 25 after adelay of N bits to pass the result of the comparison asMEMF. An additional delay of WordSize N is addedto the latter to align the pulse on a word boundary.Figure 20 illustrates the control signal extractors.These have the effect of pulling the six serial opcodebits from the current instruction word and synchro-nizing them so that they may be applied in parallel.OPRF, which indicates that an operand has beenfound, is used to initiate the extraction and subse-quent instruction execution.

Synchronization of the opcode sequence is accom-plished simply by sending delayed versions of theCTL signal to switches 26-30. The result is that thesix sequential opcode bits are made to arrive simulta-neously at the switches, where they may be clockedthrough by a single OPRF pulse.

I. Arithmetic and Logic Unit

Although it appears complex, the ALU pictured inFig. 21 is actually a straighforward application of

DEST

WCK

Fig. 15. Memory counter.

.MC

DEST

OPRFS

DSTSEL

*- MCFig.17. Memory and accumulator loops.

tion selection line.

0- ACC

4@_*eACCadvWS-1

DSTSEL is the destina-


C

20MEM

INSTF

WS

Fig. 18. Instruction loop.

Boolean algebra. Figure 22 breaks down the addi-tion circuit. It is seen that the basic logic functionsAND and OR are required. Since these-functions arethemselves useful ALU outputs, it is desirable toshare them in order to reduce the number of compo-nents. The ALU simultaneously performs eight op-erations on the current memory word and the accumu-lator. One of these eight results is sent out as theDEST signal. The carry from the addition and thezero state of DEST are computed so that skip-on-not-zero and skip-on-carry may be part of the programcounter control.

Switches 34, 36, 39, 37, 40, and 41 in Fig. 21 makeup an eight-to-one multiplexer. The three controlbits SRCSELO, SRCSEL1, and SRCSEL2 determinewhich of the eight 'computations-signals applied tothe inputs of switches 34, 36, 39, and 43 is routed toDEST. In order, the eight applied signals are MEM',MEM + ACC, ACC, MEM AND ACC, MEM, MEM ORACC, 0, and ACC rotated right one bit. Switches 32and 33 make up the adder. The previous carry isapplied to the control input of switch 32. Switch 35provides MEM AND ACC, a result that is also used inthe carry computation. Switches 38 and 42 computethe carry. The WCK applied to the control input ofswitch 42 forces the initial carry to be zero. Thisalso permits the final carry to be sent to the skipdetection switches. Switch 44 is the nonzero detec-tor. Any pulses in the DEST signal while OPRFS isasserted result in a constantselect circuit. The latter isand 46. If the instructioncondition by sending a pulsedelayed so that it arrives attime to clock the condition

pulse stream to the skipmade up of switches 45commanded a skip on

on PCSEL1, the pulse isswitch 46 at the correctthrough. PCSELO ap-

plied to switch 45 selects between thecarry detection lines.

nonzero and

J. Input and Output

The design of an input-output (I/O) subsystem musttake into account the possibility of operation of thecomputer at data rates that are considerably higherthan those possible with a conventional electronicdigital computer. The present design employs mem-ory-mapped I/O. This permits the same subsystem

Add, (E MC 2

M C 2 _ 24 -2 A4MEMF22 S-N

CIL Add COPR/INST WC A

N

Fig. 19. Address comparator. CTL is the control line.

OPRF m

Is W WfS~ WS1SRCSELO SRCSEL1 SRCSEL2 DSTSEL PCSELO PCSEL1

Fig. 20. Control signal extractors.

to do both program I/O and program loading. Thecircuit in Fig. 23 accomplishes this as follows: thehost computer interface sends a bit to the SCAMP bylatching it on the Dout line. This modulates a lightsource that is connected to switch 47. The loop oflength m + 1 circulates a single pulse that arrives atswitch 47 in time to clock consecutive data bitsthrough to the main memory loop of the SCAMP.Since the m + 1 bit loop is one bit longer than theSCAMP memory, each circulation causes a synchroni-zation with a later (or higher) bit in the SCAMPmemory. In this way bits are sent to the SCAMPwith a frequency corresponding to the circulation ofthe memory loop, and not the (much faster) SCAMPclock frequency.

In Fig. 23 the Next-bit line of the host computerinterface receives a stretched version of the pulsecirculating in the m + 1 bit loop. This pulse tells theinterface that a bit has been read and that the next bitshould be placed on the Dout line.

A similar procedure applies to data bits traveling inthe other direction. The Next-bit signal also tellsthe host computer interface that a bit from SCAMPmemory has been clocked through switch 49 and isavailable for reading on the Din line.

IV. Present and Future Operating Speed

The maximum operating speed of a machine that usesLiNbO3 switches or devices of similar logical function-ality is determined by two factors: switching timeand switch latency. The switching time specified forthe AT&T switch9 is 0.3 ns. However, this speed isconsiderably faster than the switching speed of thedrive electronics. Measurements of switching timesin the drive electronics indicate rise-fall times ofapproximately 1 ns (see Ref. 15). Therefore switch-

MEM 423

MEM it=NMEM| sm

ACC

ACC - I SA C SL

MEM SR S I-S

MEM . RSE0SCSL MACCM >_ B y

W AC A

44 PCSELo-( , wOPRFS arr Wt45 .A

Fig. 21. ALU.


An7 --I- Sn = AnBnCnI+ .B& +n- TI oot-I1 non1

Cn- I

FIl Adder

A n B An B +A 1+Bn C -1

Bn

A0 An+Bn Cn - 1(An+Bn)B0ncoi

Carry Generation

Fig. 22. Addition in the ALU.

ing time limitations will place an upper bound of < 1GHz on the system operating speed. Switch latencywill also place an upper bound on the system opera-tion, as discussed in Subsection I.E. In a systemwith feedback the propagation time through a switchmust be less than or equal to the delay imposed by thesmallest feedback loop. In the current design thereare several 1-bit feedback loops. Thus the systemclock period must be greater than or equal to themaximum latency path through a switch. We havedesigned a system to measure this latency,12 and findthat the maximum latency path through the switch isfrom terminal D or E to terminal C. Much of thislatency is introduced through the drive electronics.We observe latencies of 4-6 ns in the drive electronicsand 1-2 ns through fibers and the switch proper.This points to a maximum clock frequency in the100-MHz region. As verification we constructed acounter similar to the one described in Fig. 5 andoperated it at 50 MHz.1 8

Estimates of possible future speeds and ultimatespeed limitations are fraught with difficulty. Never-theless we venture some speculation on the matter.One advantage of speed-scalable architectures suchas the one described here is that no architecturalchange need be made to the design as the clock speedis raised, provided that switching time and switchlatency are sufficiently fast and short, respectively.So switching time and switch latency provide thelimiting factors on the system clock speed. We havediscussed possible ultimate foreseeable switchingspeeds and switch latencies with several researchersin the field of directional coupler design and fabrica-tion.20 The consensus was that the appropriate de-vices would be directional couplers with integratedshort-path interconnections and integrated drive elec-

(ost Compuig 23ut SCAfePInterface

-Din7 Next _bt

m+1

Optical C ctons 6-

Fig. 23. I/O for the SCAMP.

tronics. Such designs could be envisioned as havingboth switching times and latencies in the 1-10 psrange. This would lead to clock speeds in the 0.1-1THz range. A 1-bit loop would have a length in the2-0.2 mm range at these clock frequencies; sucha loop could be integrated easily with the switch.Even so, latency may well become the limiting factor.In latency-limited systems, it is possible to time-multiplex multiple systems on the same hardware, asdiscussed in Ref. 21.

A departure from the use of the direction couplerexchange switch as the logic element, and the use ofother kinds of logic elements would lead to minorchanges in the architecture. Such changes wouldopen the design to other switching technologies. Itis physically possible to produce and propagate 10-fsec pulses, which translates to a clock rate of 100Tbits/sec. Haner and Warren22 have actually demon-strated a 100-fsec resolution in a time-compressedwaveform, which suggests that 10 Tbits/sec mayactually be achieved in practice. Islam et al.2 3 havedemonstrated optical switching with solitons of 300fsec. Thus optical switching and transmission maynot be too disparate in attainable speeds.

V. Summary and Conclusions

We have described an optical computer that is basedon fiber-optic delay-line storage, optoelectronic direc-tional coupler logic and switching, and fiber intercon-nections. Such a system bears a distinct similarityto early electronic computers that is due, in bothcases, to the problems involved in building and reli-ably operating large numbers of active devices in animmature technology. Several new aspects of thearchitecture are dictated by the use of optics. Thefact that signal propagation is of the same speed asswitching gives a major geometrical component to thearchitecture. The signal propagation time is usedeven as the basis for information storage, so that eachsignal path in the machine must be considered as astorage element, or, at least, as modifying the behav-ior of a storage element that is the source or destina-tion of the signal.

The use of optoelectronic directional couplers asthe active elements of the machine leads to a uniquesolution of the signal amplitude and timing restora-tion problem. Since the switching of a signal iseasier than its amplification, the technique of supply-ing multiple copies of the system clock, and switchingone of these into a signal path by using a degradedcontrol signal, is used to restore both amplitude andtiming characteristics of signals. Another system-specific technique of interest in the optical designworld is the matching of optical speeds to electronicI/O speeds by transferring a single bit per memorycycle between electronic and optical systems.

The impetus for the design and implementation ofan all-optical, stored-program computer arises fromthe long-term goal of developing architectures anddesign techniques that scale with speed, that is,speed-scalable architectures: architectures that are


invariant to an increase in system speed coupled witha corresponding decrease in size. In addition thereare two shorter term subgoals. It is a short route toan operating optical computer within the currenttechnological constraints, which thus allows com-puter architects to begin doing optical computerdesign and thus to obtain feedback essential to thedesign process from operating prototypes. Such abit-serial system would also have immediate applica-tion in the control of fiber-optic communicationssystems, which already operate in a serial mode. Cur-rent techniques of controlling such systems involveconverting the serial information to electronic form,processing it at electronic speeds, and converting itback to optical form for transmission. The results ofboth computer architecture and high-speed opticalcomponent research need to be combined to make anall-optical, high-speed controller a reality.

The weak link in the current system design is theoptoelectronic logic element. Detecting an opticalsignal, electronically amplifying it, and using it tocontrol the state of a directional coupler is a signifi-cant problem. Electronics is at least limited in ex-tent, and its natural pulse-spreading tendency can beused to make the control pulse overlap the duration ofthe signals being switched. But it is clear that thiswork should supply stimulus to, and anticipate re-sults from, the many high-speed, optically switched,optical elements currently being developed in theresearch laboratories. Present devices limit operat-ing speeds to the 100-MHz region; however, futuredevices could increase this speed by a factor of 1000 ormore. The devices used in these experiments are notready for use in systems, but it is clear that anysystem operating at rates that are near 1 Tbit/sec willpose serious competition to electronic processors.Though, in one sense, it is too early to designcommercially successful optical computers, this workis aimed at helping develop a body of computerarchitecture knowledge in order to successfully ex-ploit the new devices as they move out of the researchlaboratories.

This work was supported in part by NationalScience Foundation grant CDR 8622236 and by theColorado Advanced Technology Institute, an agencyof the state of Colorado (grant 270150).

References1. S. D. Smith, A. C. Walker, B. S. Wherrett, F. A. P. Tooley,

J. G. H. Mathew, M. R. Taghizadah, and I. Janossy, "Cascad-able digital optical logic circuit elements in the visible andinfrared: demonstration of some first optical circuits," Appl.Opt. 25, 1586-1593 (1986).

2. J. Tanida and Y. Ichioka, "OPALS: optical parallel arraylogic system," Appl. Opt. 25, 1565-1570 (1986).

3. A. Huang, "Architectural considerations involved in the de-sign of an optical digital computer," Proc. IEEE 72, 780-786(1984).

4. J. R. Sauer, "An optoelectronic multi-Gb/s packet switching

network," Optoelectronic Computing Systems Center Tech.Rep. 89-06 (Optoelectronic Computing Systems Center, Univer-sity of Colorado, Boulder, Colo. 1989).

5. F. C. Williams, T. Kilburn, and G. C. Toothill, "Universalhigh-speed digital computers: a small-scale experimentalmachine," Proc. Inst. Electr. Eng. 98, 107-120 (1951).

6. S. P. Frankel, "The logical design of a simple general purposecomputer," IRE Trans. Electron. Comput. 9(3), 5-14 (1957).

7. J. H. Wilkinson, "The pilot ACE," in Computer Structures:Readings and Examples, C. G. Bell and A. Newell, eds.(McGraw-Hill, New York, 1971).

8. S. K. Korotky, G. Eisenstein, R. C. Alferness, J. J. Veselka,L. L. Buhl, G. T. Harvey, and P. H. Read, "Fully connectorizedhigh-speed Ti:LiNbO3 switch/modulator for time-division mul-tiplexing and data encoding," IEEE J. Lightwave Technol.LT-3, 1-6 (1985).

9. "2 x 2 lithium niobate polarization-independent electro-optical switch," AT&T engineering prototype specificationA4335 (AT&T Technologies, Allentown, Pa., January 1987).

10. "OGW 2 x 2 switch," Data Sheet (Crystal Technology Inc.,Palo Alto, Calif., 1987).

11. S. K. Korotky and R. C. Alferness, "Waveguide electro-opticdevices for optical fiber communication," in Optical FiberTelecommunications, I. P. Kaminow and S. E. Miller, eds.(Academic, New York, 1988).

12. V. P. Heuring, "Systems considerations in the implementationof a bit serial optical computer," Opt. Eng. 30, 1931-1935(1991).

13. D. B. Sarrazin, H. F. Jordan, and V. P. Heuring, "Fiber opticdelay line memory," Appl. Opt. 29, 627-637 (1990).

14. V. P. Heuring and J. P. Pratt, "Designing digital opticalcomputing systems: power distribution and crosstalk," inOptics for Computers: Architectures and Technologies, G. J.Lebreton, ed., Proc. Soc. Photo-Opt. Instrum. Eng. 1505,124-129 (1991).

15. A. F. Benner, H. F. Jordan, and V. P. Heuring, "Digital opticalcomputer with optically switched directional couplers," Opt.Eng. 30, 1936-1941 (1991).

16. J. P. Pratt and V. P. Heuring, "Delay synchronization intime-of-flight optical systems," Appl. Opt. to be published.

17. J. P. Pratt, "Hatch software and users manual," Optoelec-tronic Computing Systems Center Tech. Rep. 89-31 (Optoelec-tronic Computing Systems Center, University of Colorado,Boulder, Colo., 1989).

18. A. F. Benner, J. Bowman, T. Erkkila, R. J. Feuerstein, V. P.Heuring, H. F. Jordan, J. Sauer, and T. Soukup, "Digitaloptical counter using directional coupler switches," Appl. Opt.30, 4179-4189 (1991).

19. R. J. Feuerstein, T. Soukup, and V. P. Heuring, "100 MHzoptical counter using directional coupler switches," Opt. Lett.16, 1599-1601 (1991).

20. A. Mickelson and G. Moddel, Department of Electrical andComputer Engineering, University of Colorado, Boulder, Colo-rado 80309-0525 (personal communication, June 1990).

21. H. F. Jordan and V. P. Heuring, "Time-multiplexed opticalcomputers," presented at Supercomputing '91, Albuquerque,N.M., 1991.

22. M. Haner and W. S. Warren, "Generation of arbitrarily shapedpicosecond optical pulses using an integrated electroopticwaveguide modulator," Appl. Opt. 26, 3687-3694 (1987).

23. M. N. Islam, C. E. Soccolich, and D. A. B. Miller, AT&T BellLaboratories, Holmdel, N.J. 07733-1988 (personal communica-tion on low-energy ultrafast fiber soliton logic gates, January1990).


Documents

Bit-serial architecture for optical computingstrukov/ece594BWinter2011/optical.pdf · Bit-serial architecture for optical computing Vincent P. Heuring, Harry F. Jordan, and Jonathan