

CONTROL OF A PARTITIONABLE MULTIMICROPROCESSOR SYSTEM

Howard Jay Siegel, Philip T. Mueller, Jr., and Harold E. Smalley, Jr.
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907

Abstract: A dynamically reconfigurable large-scale multimicroprocessor system capable of operating as one or more independent SIMD (single instruction stream - multiple data stream) machines and/or MIMD (multiple instruction stream - multiple data stream) machines is described. The system consists of a Parallel Computation Unit, which contains N processors, N memory modules, and an interconnection network; Q Micro Controllers, each controlling N/Q processors; a Memory Management System, to load the N memory modules; and a System Control Unit, to coordinate the other system components. The way in which the Micro Controllers can form varying size groups of processors in this partitionable SIMD/MIMD environment is discussed.

I. Introduction

As a result of the microprocessor revolution it is now feasible to build a dynamically reconfigurable large-scale multimicroprocessor system capable of performing image processing tasks more rapidly than previously possible. There are many image processing tasks which can be performed on a parallel processing system, but are prohibitively expensive to perform on a conventional computer system due to the large amount of time required to do the tasks (e.g., remote sensing by satellite [16]). In addition, a multimicroprocessor system can use parallelism to perform the real-time image processing required for such applications as robot (machine) vision, automatic guidance of aircraft and spacecraft, and air traffic control.

Two types of parallel processing systems are single instruction stream - multiple data stream (SIMD) machines and multiple instruction stream - multiple data stream (MIMD) machines. An SIMD machine typically consists of a set of N processors, N memories, an interconnection network, and a control unit. The control unit broadcasts instructions to the processors and all active ("turned on") processors execute the same instruction at the same time. Thus, a single stream of instructions drives all the processors. Each processor executes instructions using data taken from a memory to which only it is connected. This provides a multiple data stream. The interconnection network allows interprocessor communications. Examples of such machines are the Illiac IV [2,6] and STARAN [3,4]. An MIMD machine typically consists of N processors and N memories, where each processor may follow an independent instruction stream. Hence, there are multiple instruction streams. As with SIMD architectures, there is a multiple data stream and an interconnection network. Examples of such machines are C.mmp [29] and Cm* [27].

This work was supported in part by the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under Grant No. AFOSR-78-3581. The United States Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation hereon.

Due to the low cost of microprocessors, computer system designers have been considering various multimicrocomputer architectures, such as [9-11,13,14,26,27]. The system described here differs from others in that:

(1) it may be partitioned to operate as many independent SIMD machines following the same or different instruction streams;

(2) parts (or all) of the system may be operating as independent MIMD machines, while the rest of the system is operating as one or more SIMD machines;

(3) the processors used for performing the computations can transfer data simultaneously; and

(4) a variety of problems in image processing and pattern recognition will be used to guide the design choices.

This paper describes a method to control this partitionable multiprocessor system.

II. Parallelism in Image Processing

SIMD machines can be used to do "local" processing of segments of images in parallel. For example, the image can be segmented, and each processor assigned a segment. Then, following the same set of instructions, such tasks as line thinning, threshold dependent operations, and gap filling can be done for all segments of the image simultaneously. Also in SIMD mode, matrix arithmetic used in image processing for such tasks as statistical pattern recognition and fast Fourier transforms can be done efficiently. MIMD machines can be used to perform different "global" image processing tasks in parallel, using multiple copies of the image or one or more shared copies. For example, in cases where the goal is to locate two or more distinct objects in an image, each object can be assigned a processor or set of processors to search for it.

There are also tasks which require parallel processing in both SIMD and MIMD modes. As a simple example consider the task of determining if a line drawing contains a square. In SIMD mode a parallel processing system can segment the image and each processor can locally determine which points in its segment, if any, are possible corners of squares. The system can then switch to MIMD mode, where each corner will be assigned to a processor which examines the image globally to determine if the corner is actually part of a square. Examples of more complicated tasks involve syntactic pattern recognition.

Developments in recent years have shown the importance of parallelism to image processing, using both cellular logic arrays (e.g., CLIP [25]) and SIMD systems (e.g., STARAN [15]). In the design of the system proposed here various image processing tasks have been and will be considered. The philosophy of examining the problem and then designing the machine which can best solve the problem, under certain economic and technological constraints, will be used. It is felt that this will lead to a system that will function efficiently not only for image processing, but for a large class of similar computational problems, such as speech processing, remote sensing using multispectral data, and waveform processing in biomedical engineering.

III. System Overview

There are many problems involved in the design of PASM, a partitionable SIMD/MIMD system. It is to be a multimicroprocessor system capable of being dynamically reconfigured as one or more independent SIMD machines and/or MIMD machines. Figure 1 is a block diagram of the basic system components: the Parallel Computation Unit, the Micro Controllers, the Control Disk, the Memory Management System, the Memory Disk, and the System Control Unit.

Figure 1: Block diagram overview of PASM.

The heart of the system is the Parallel Computation Unit (PCU), which contains N processors, N memory modules, and an interconnection network. The PCU processors are microprocessors that perform the actual SIMD and MIMD computations. The PCU memory modules are used by the PCU processors for data storage in SIMD mode and both data and instruction storage in MIMD mode. The interconnection network provides a means of communication among the PCU processors and memory modules.

The Micro Controllers are a set of microprocessors which broadcast instructions to the PCU processors in SIMD mode and orchestrate the activities of the PCU processors in MIMD mode. The Control Disk stores the control instructions for the Micro Controllers as well as the programs for the PCU processors in SIMD mode. The Memory Management System controls the loading of the PCU memory modules with data for PCU processors operating in SIMD mode and with data and instructions for PCU processors operating in MIMD mode. The Memory Disk stores these data and instruction files. The System Control Unit is a conventional machine, such as a PDP-11, and is responsible for the overall coordination of the activities of the other components of PASM.

PASM was developed as a result of the work done in [5,17-20,24]. This paper concentrates on the Micro Controllers, only briefly describing the other system components. More details about the rest of PASM are in [21,23].

IV. System Control Unit

The System Control Unit is a conventional uniprocessor, such as a PDP-11/70 or 11/45. It is responsible for orchestrating the Memory Management System and the Micro Controllers. In addition, the System Control Unit is capable of functioning as a serial processor, independent of the rest of PASM. It can handle such tasks as program development and file system supervision while the rest of PASM is executing a parallel computation. In order to perform all of these functions, the System Control Unit will contain the PASM operating system and language compilers.

V. Parallel Computation Unit

Two alternative methods for organizing the processors, memory modules, and interconnection network of the PCU are being considered. The processor-to-memory approach shown in Figure 2 physically locates the interconnection network between the processors and the memory modules.

Figure 2: Processor-to-memory configuration of the Parallel Computation Unit.

The processors communicate with each other through the memory modules. A pair of memory units is used for each memory module so that data can be moved between one memory and the Memory Disk while the microprocessor operates on data in the other memory. The Memory Management System will control this. The processing element-to-processing element (PE-to-PE) approach shown in Figure 3 directly links each processor to a memory module. The processors communicate through the interconnection network.

Figure 3: PE-to-PE configuration of the Parallel Computation Unit.

In both the processor-to-memory and PE-to-PE configurations the processors and memory modules are physically numbered (addressed) from 0 to $N-1$, where $N = 2^n$. The interconnection network can be partitioned into independent subnetworks of varying sizes, which are powers of two. The only constraint is that the physical addresses of the P processors and memory modules in a partition have the same $\log_2 N - \log_2 P$ low-order bits.

The networks being considered are based on recirculating and multistage implementations of the "PM2I" network, using a highly flexible control structure [21,23]. The recirculating PM2I network is used in SIMD machine algorithms in [18] and its ability to operate in a partitioned environment is discussed in [5]. The capabilities of a multistage PM2I network are studied in [7] and the way in which such a network can be partitioned is analyzed in [24].
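The partitioning constraint stated above (a partition of P processors must share its $\log_2 N - \log_2 P$ low-order address bits) can be checked mechanically. The following is a minimal sketch, not part of the PASM design itself; the function name `is_valid_partition` and its argument layout are assumptions made for illustration.

```python
def is_valid_partition(addresses, n):
    """Return True if the physical addresses form a legal partition.

    Assumes N = 2**n processors in total. A partition of P = 2**p
    processors is legal when all P addresses agree in their n - p
    low-order bits. (Hypothetical helper, for illustration only.)
    """
    addrs = set(addresses)
    size = len(addrs)
    if size == 0 or size & (size - 1):           # size must be a power of two
        return False
    if any(a < 0 or a >= 2 ** n for a in addrs):
        return False
    p = size.bit_length() - 1                    # size == 2**p
    low_mask = (1 << (n - p)) - 1                # selects the n - p low-order bits
    return len({a & low_mask for a in addrs}) == 1
```

For instance, with n = 10 the addresses 0, 256, 512, 768 share their eight low-order bits and form a legal partition of size four, whereas 0, 1, 2, 3 do not.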

The N PCU processors will be constructed from "off-the-shelf" microprocessor chips. Bit-slice microprogrammable microprocessors will be used to design processors appropriate to the system needs. Features such as a hardware stack are being considered [22]. Recall that each PCU memory module consists of two memory units. Each of these memory units will be constructed from "off-the-shelf" random access memory chips. Such problems as the exact architecture of the processors, the size of the memory modules, and the choice of the processor-to-memory or PE-to-PE configuration are currently being investigated.

VI. Memory Management System

A mechanism for supplying data to the PCU memory units in an efficient manner is of great importance. The mechanism used here is the Memory Management System. The Memory Management System can be thought of as an intelligent connection network which acts to route data from the Memory Disk system to a set of PCU memory units as it is needed. The system supplies only data to PCU processors operating in SIMD mode and supplies both instructions and data to PCU processors operating in MIMD mode. Data associated with distinct SIMD jobs are maintained as blocks on the Memory Disk. In the case of MIMD jobs such blocks contain both instructions and data. Prior to job execution the appropriate PCU memory units are loaded from the Memory Disk. It will not be necessary for PCU processors to remain idle during the loading and/or unloading procedure if double buffering is employed. This technique involves the loading and/or unloading of one PCU memory unit while the other memory unit in that same module is used by an executing job. It is often useful to load several memory units with a common block of data from the disk. To facilitate this, a memory data bus is used to interconnect the PCU memory units with the Memory Disk system. Each memory unit to receive a transmitted block of memory data is enabled by the Memory Management System. The details of this system and implementations of it involving parallel memory devices are currently being developed.
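The double-buffering idea and the broadcast of a common block to several enabled memory units can be sketched in software. This is only an illustrative model under stated assumptions: the class `MemoryModule`, the unit labels "A" and "B" for the two PCU memory units, and the function `broadcast_load` are hypothetical names, and the real mechanism is hardware whose details the paper leaves open.

```python
class MemoryModule:
    """Model of one PCU memory module made of two memory units."""

    def __init__(self):
        self.units = {"A": None, "B": None}
        self.active = "A"                  # unit currently used by the processor

    def idle_unit(self):
        return "B" if self.active == "A" else "A"

    def load_idle(self, block):
        """Memory Management System side: fill the unit not in use."""
        self.units[self.idle_unit()] = block

    def swap(self):
        """Switch the processor over to the freshly loaded unit."""
        self.active = self.idle_unit()

    def read_active(self):
        return self.units[self.active]


def broadcast_load(modules, enabled, block):
    """Load one block into the idle unit of every enabled module,
    mimicking a single transfer over the shared memory data bus."""
    for index in enabled:
        modules[index].load_idle(block)
```

A job running out of the active unit is undisturbed by `broadcast_load`; it sees the new block only after `swap` is called.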

VII. Micro Controllers

A. Partitioning

Many computations can be more efficiently executed if the N PCU processors are partitioned into many smaller groups of processors, each group behaving like an SIMD or an MIMD machine. This requires a flexible control scheme, capable of providing independent instruction streams to groups of different sizes. For example, there may be: (1) four groups of processors, each of size N/16, all following one instruction stream, multiplying four pairs of matrices of the same size; (2) two groups of processors, each of size N/8, each acting as an independent SIMD machine, each processing copies of the same image, but using different programs; and (3) one group of N/2 processors, in MIMD mode, doing a syntactic pattern recognition type of task. Furthermore, the three collections of processors in this example may be all operating on the same image, different spectral views of the same image, or on different images.

The basic method to provide multiple controllers to broadcast instructions so that the system can be partitioned into independent SIMD and MIMD machines is shown in Figure 4. There are $Q = 2^q$ Micro Controllers, physically addressed (numbered) from 0 to $Q-1$. Each controls N/Q PCU processors. Possible values for N and Q are 1024 and 16, respectively.

A Micro Controller is a microprocessor which is attached to a memory module. Each memory module consists of a pair of memories so that memory loading and computations can be overlapped. In SIMD mode, each Micro Controller fetches instructions from its memory module, executing the control flow instructions (e.g., branches) and broadcasting the data processing instructions to the PCU processors. The addresses of the N/Q processors which are connected to a Micro Controller must all have the same low-order q bits so that the interconnection network can operate in the partitioned environment. The value of these low-order bits is the address of the Micro Controller.

An SIMD machine of size MN/Q, where $M = 2^m$ and $1 \le M \le Q$, is obtained by loading M Micro Controllers with the same instructions. The physical addresses of these Micro Controllers must have the same low-order $q - m$ bits since the physical addresses of all PCU processors in a partition of size MN/Q must have the same low-order $q - m$ bits in order for the interconnection network to function properly.
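The two addressing rules above can be illustrated with a short sketch: a processor's Micro Controller is given by the low-order q bits of the processor address, and an SIMD machine of size MN/Q uses the M Micro Controllers that share q - m low-order bits. The function names `controller_of` and `controller_group` are assumptions introduced for this example.

```python
def controller_of(processor_address, q):
    """Micro Controller address = low-order q bits of the processor address."""
    return processor_address & ((1 << q) - 1)


def controller_group(low_bits, m, q):
    """Physical addresses of the M = 2**m Micro Controllers whose
    q - m low-order address bits equal `low_bits` (hypothetical helper)."""
    if not 0 <= m <= q:
        raise ValueError("m must satisfy 0 <= m <= q")
    step = 1 << (q - m)
    if not 0 <= low_bits < step:
        raise ValueError("low_bits must fit in q - m bits")
    return [low_bits + i * step for i in range(1 << m)]
```

With Q = 16 (q = 4), `controller_group(1, 2, 4)` lists the four Micro Controllers that would be loaded with the same instructions to form an SIMD machine of size 4N/Q.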

In each partition the PCU processors and Micro Controllers are assigned logical addresses. Consider an arbitrary partition of size $2^p$, $n - q \le p \le n$. The PCU processors in this partition are numbered (addressed) logically from 0 to $2^p - 1$, even though their physical addresses are of the form $i \cdot 2^{n-p} + J$, for $0 \le i < 2^p$, where J is fixed at an integer from 0 to $2^{n-p} - 1$. Similarly, the Micro Controllers in the partition are logically addressed from 0 to $2^{p-(n-q)} - 1$, even though their physical addresses are of the form $i \cdot 2^{n-p} + L$, for $0 \le i < 2^{p-(n-q)}$ and $L = J$.

This is based on the fact that the only constraint on partitioning is that the physical addresses of the $2^p$ processors in a partition have the same $n-p$ low-order bits. For example, if N=1024 and Q=16, a partition of size 256 may consist of PCU processors whose physical addresses are 1, 5, 9, ..., 1021, and Micro Controllers whose physical addresses are 1, 5, 9, and 13. The PASM language compilers and operating system are used to convert from logical to physical addresses. Thus, a system user deals only with logical addresses.
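The logical-to-physical conversion can be sketched as follows. This is an illustration rather than the PASM compiler's actual method: it assumes that logical address i corresponds to the i-th physical address of the form given above, and the function names are hypothetical.

```python
def processor_physical(logical, p, J, n):
    """Physical address of a logical PCU processor in a partition of
    2**p processors whose n - p low-order address bits equal J."""
    if not 0 <= logical < 2 ** p:
        raise ValueError("logical address outside the partition")
    if not 0 <= J < 2 ** (n - p):
        raise ValueError("J must fit in n - p bits")
    return logical * 2 ** (n - p) + J


def controller_physical(logical, p, J, n, q):
    """Physical address of a logical Micro Controller in the same
    partition (requires n - q <= p <= n)."""
    if not n - q <= p <= n:
        raise ValueError("partition smaller than one Micro Controller group")
    if not 0 <= logical < 2 ** (p - (n - q)):
        raise ValueError("logical address outside the partition")
    return logical * 2 ** (n - p) + J
```

Using the example in the text (N = 1024, Q = 16, partition size 256, J = 1), the calls `processor_physical(i, 8, 1, 10)` for i = 0, ..., 255 give 1, 5, 9, ..., 1021, and `controller_physical(i, 8, 1, 10, 4)` for i = 0, ..., 3 give 1, 5, 9, 13.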

B. Communications with System Control Unit

As mentioned previously, when large SIMD jobs are run, that is, jobs which require more than N/Q processors, more than one Micro Controller executes the same set of instructions. Since each Micro Controller has its own memory, if more than one Micro Controller is to be used then several memories must be loaded with the same set of instructions. Another occasion which requires several Micro Controller memories to contain the same instructions is when the same program is to be run for several different sets of data. For example, suppose 16 different pictures are to be processed using a program that requires 64 processors for each picture. Each Micro Controller will be executing the same program, but each will be working on a different picture, i.e., each Micro Controller memory will contain the same set of instructions, but each set of 64 PCU memories will contain a different picture.

The best way to load several Micro Controller memories with the same set of instructions is for the System Control Unit to load them all at the same time. This may be accomplished by connecting the Control Disk to all of the Micro Controller memory modules via a bus as shown in Figure 4. Each memory unit is either enabled or disabled, depending on the contents of two mask registers called the Micro Controller Memory Load (MCML) registers. One mask register is for the A memory units, the other is for the B memory units. A memory unit is enabled if its corresponding bit is a "1," otherwise it is disabled. Using these two Q-bit registers any arbitrary set of memory units can be loaded simultaneously, even mixtures of A and B memories. The number of Micro Controllers is relatively small, e.g., 16, so these registers will not be excessively long.

Figure 4: Micro Controllers.
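As an illustration of how the two MCML registers select an arbitrary mixture of A and B memory units, the sketch below builds the two Q-bit masks from a list of (Micro Controller, unit) pairs. The function name `mcml_masks` and the bit ordering (bit i, counted from the least significant bit, corresponds to Micro Controller i) are assumptions, since the paper does not specify a layout.

```python
def mcml_masks(targets, Q):
    """Return (mask_a, mask_b) as integers for the units to be loaded.

    `targets` is an iterable of (controller, unit) pairs with unit in
    {"A", "B"}. A set bit enables the corresponding memory unit.
    """
    masks = {"A": 0, "B": 0}
    for controller, unit in targets:
        if not 0 <= controller < Q:
            raise ValueError("controller address out of range")
        if unit not in masks:
            raise ValueError("unit must be 'A' or 'B'")
        masks[unit] |= 1 << controller
    return masks["A"], masks["B"]
```

For example, loading the A memories of Micro Controllers 1, 5, and 13 together with the B memory of Micro Controller 9 would, under this bit ordering, require one bus transfer with both masks set accordingly.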

Two more registers of length Q are required for communications between the System Control Unit and the Micro Controllers. One specifies which memory the Micro Controller is to use, i.e., its A memory or its B memory, and is called the Micro Controller Memory Select (MCMS) register. A "1" in the i-th bit means Micro Controller i is to use the A memory, while a "0" in the i-th bit means Micro Controller i is to use the B memory. The other register contains the go/done status of the Micro Controllers and is called the Micro Controller Status (MCS) register. A "0" in the i-th bit means that Micro Controller i is done. When the i-th bit is set to a "1" the Micro Controller sets its program counter to zero and begins executing or broadcasting the contents of the memory unit of its module that is specified by the MCMS register. When the Micro Controller is done it sets its MCS register bit to "0" and sends an interrupt to the System Control Unit to inform it that a Micro Controller has finished. This register is slightly more complex than the other registers mentioned since both the System Control Unit and the Micro Controllers must be able to read and modify it.
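The go/done handshake through the MCMS and MCS registers can be modeled with a small sketch. The class `ControlRegisters`, its method names, and the `interrupts` list (a stand-in for the interrupt line to the System Control Unit) are hypothetical; the bit ordering is again assumed to place Micro Controller i at bit i.

```python
class ControlRegisters:
    """Software model of the Q-bit MCMS and MCS registers."""

    def __init__(self, Q):
        self.Q = Q
        self.mcms = 0          # bit i = 1: MC i uses memory A; 0: memory B
        self.mcs = 0           # bit i = 1: MC i is running; 0: done
        self.interrupts = []   # stand-in for interrupts sent to the SCU

    def start(self, i, unit):
        """System Control Unit side: select a memory unit, then set 'go'."""
        if unit == "A":
            self.mcms |= 1 << i
        else:
            self.mcms &= ~(1 << i)
        self.mcs |= 1 << i

    def selected_unit(self, i):
        return "A" if (self.mcms >> i) & 1 else "B"

    def is_running(self, i):
        return bool((self.mcs >> i) & 1)

    def finish(self, i):
        """Micro Controller side: clear the status bit, interrupt the SCU."""
        self.mcs &= ~(1 << i)
        self.interrupts.append(i)
```

The model makes visible why the MCS register needs two writers: `start` is invoked by the System Control Unit, while `finish` is invoked by the Micro Controller itself.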

C. Communications among Micro Controllers

When SIMD jobs which require more than one Micro Controller are run, how should the Micro Controllers communicate with each other when trying to execute statements such as "if any" or "if all"? The task of computing an "if any" type of statement that involves several Micro Controllers can be handled using the existing hardware. This can be accomplished by applying a recursive doubling algorithm to sets of the PCU processors. This software approach is not discussed here, but an inexpensive, faster hardware approach is explored.

One method is to set up a bus which connects the Micro Controllers. Another method is for the Micro Controllers to send a request to the System Control Unit to make the proper evaluation, and then send control signals back to the Micro Controllers. The bus interconnection is used here since it is faster and does not require interrupting the System Control Unit.

The following implementation of the bus interconnection requires that each Micro Controller has access to the job identification number (ID) for the job it is running. This requirement can be met easily by having some fixed location in each Micro Controller memory unit contain the ID of the job contained in that memory unit. The ID for a job on a Micro Controller may range from 0 to 2Q-1 since there may be at most 2Q jobs, i.e., one in each of the 2Q memory units.

With this piece of data it is simple to set up an interconnection system using the Micro Controller Communication Bus (MCCB). When an "if any" type instruction is encountered each Micro Controller associated with the job sends a request to the bus controller to use the communication bus. When one of the Micro Controllers becomes the first item in the queue the bus controller sends that Micro Controller a "permission to use the bus" signal so the Micro Controller may broadcast its job ID to all of the Micro Controllers (including itself) via the MCCB ID bus. The ID bus need only be q+1 bits wide since the range of IDs is 0 to 2Q-1. If a Micro Controller is running the job with the ID which is on the ID bus it then puts its local results onto the MCCB data bus. Then, while all this information is on the bus, all of the Micro Controllers associated with the job read the data and take the appropriate action. Each Micro Controller serviced removes itself from the communication bus queue.

The MCCB data bus is one bit wide and will be constructed using "wired and" technology, i.e., the bus is a Q input "wired and" gate. This allows all of the Micro Controllers associated with a job to put their data on the bus simultaneously. For example, in the case of the "if any" instruction, when the job ID appears on the ID bus each Micro Controller puts its local results on the data bus. A "1" is sent if none of its PCU processors met the condition, a "0" is sent if any of its PCU processors met the condition. A Micro Controller which does not match the job ID will present a "1" to the data bus. If any PCU processor running the job meets the condition the bus will be "0"; however, if no PCU processor meets the condition the bus will be "1." All of the Micro Controllers will then have access to this information, which will be needed to execute the conditional branch in the common instruction stream.

Another example is the case of the "if all" instruction. In this case a Micro Controller presents a "1" to the data bus if all of its PCU processors meet the condition. If any PCU processor does not meet the condition a "0" is presented. Again Micro Controllers not involved will present a "1." If the data bus is a "1" all of the PCU processors running the job met the condition.

Figure 5: Micro Controller (MC) MCCB Interface.
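The "wired and" evaluation of "if any" and "if all" described above can be sketched in a few lines. This is a behavioral model only; the function name `mccb_data_bus` and the representation of each Micro Controller as a (job ID, list of per-processor condition flags) pair are assumptions made for the example.

```python
def mccb_data_bus(controllers, bus_id, instruction):
    """Return (bus_bit, outcome) for one MCCB bus cycle.

    controllers: list of (job_id, flags), where flags[k] is True if
                 the k-th PCU processor of that Micro Controller met
                 the condition.
    bus_id:      job ID currently on the MCCB ID bus.
    instruction: "if any" or "if all".
    """
    bits = []
    for job_id, flags in controllers:
        if job_id != bus_id:
            bits.append(1)                        # not involved: present a "1"
        elif instruction == "if any":
            bits.append(0 if any(flags) else 1)   # "0" if some processor met it
        elif instruction == "if all":
            bits.append(1 if all(flags) else 0)   # "1" only if every processor met it
        else:
            raise ValueError("unknown instruction")
    bus = int(all(bits))                          # Q-input wired-AND
    outcome = (bus == 0) if instruction == "if any" else (bus == 1)
    return bus, outcome
```

Note the inverted encoding for "if any": because the bus computes an AND, a "0" on the bus is what signals that the condition held somewhere in the job.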

The hardware required to interface each Micro Controller to the MCCB is shown in Figure 5. The job ID is transmitted from a single Micro Controller to the MCCB ID bus by the tristate buffer only when that Micro Controller is given permission to use the MCCB. Then each Micro Controller uses its comparator to compare its job ID to the MCCB ID bus. In each case where they are the same a "1" is sent to the corresponding Micro Controller informing it that it may now use the MCCB data bus. If the Micro Controller is executing an "if" instruction and the output from its comparator is a "1," then the local result of the "if" instruction is applied to the MCCB data bus.

A major problem associated with any bus in a multiprocessing system is contention. The MCCB is allocated on a priority basis. A priority ring based on the physical addresses of the Micro Controllers is used. Micro Controller i+1 modulo Q has a higher priority than Micro Controller i, $0 \le i < Q$. The priority ring is broken by a Q bit shift register which contains exactly one "1." A "1" in the i-th bit of the shift register indicates that Micro Controller i-1 modulo Q has the highest priority, and therefore Micro Controller i has the lowest priority. After each bus cycle the shift register is circularly shifted one bit, such that if Micro Controller i had the lowest priority it is given the highest priority for the next bus cycle. A bus cycle is two or three clock cycles depending on how the microprogram is written, i.e., it is long enough for the MCCB to be used in an "if any" type of instruction. A priority ring circuit design and simulation is in [23].

The queue for the communication bus simply consists of a Q bit register which is loaded at the end of each bus cycle. A "1" in the i-th bit of the register indicates that Micro Controller i is presently waiting to use the communication bus. After a Micro Controller has finished using the bus it resets its corresponding bit to zero, thus removing itself from the queue. A block diagram of the MCCB controller is shown in Figure 6.
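The interaction of the priority ring and the queue register can be sketched as follows. This is a behavioral illustration under stated assumptions, not the circuit of [23]: the function names `grant` and `rotate` are hypothetical, and both registers are represented as Python lists of Q bits indexed by Micro Controller address.

```python
def grant(queue, ring):
    """Pick the waiting Micro Controller with the highest priority.

    queue[i] == 1 means Micro Controller i is waiting for the MCCB.
    ring holds exactly one 1; a 1 at index i means Micro Controller i
    has the lowest priority and Micro Controller (i - 1) mod Q the
    highest. Returns the granted address, or None if nobody waits.
    """
    Q = len(queue)
    if len(ring) != Q or sum(ring) != 1:
        raise ValueError("ring must be Q bits with exactly one 1")
    lowest = ring.index(1)
    # Scan from the highest priority (lowest - 1) down to the lowest.
    for k in range(1, Q + 1):
        candidate = (lowest - k) % Q
        if queue[candidate]:
            return candidate
    return None


def rotate(ring):
    """Circular shift by one bit after a bus cycle, so the Micro
    Controller that had the lowest priority now has the highest."""
    return [ring[-1]] + ring[:-1]
```

Rotating after every bus cycle keeps any one Micro Controller from being starved: each address is at the top of the ring once every Q cycles.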


Figure 6: MCCB Controller.

D. Communications with the PCU processors

The processors used in the PCU are to be constructed using user microprogrammable "bit-slice" components, now readily available from manufacturers such as Texas Instruments [28], Advanced Micro Devices [1], and Intel [8]. Bit-slice components are designed such that several sets of these components can be combined to form processors of arbitrary word length. Bit-slice processor components typically include a computational unit, a sequencer, and special hardware to allow such features as pipelining and lookahead addition. The computational unit contains the processor's registers, the mechanism for register transfers, and the mechanism for arithmetic computation. The sequencer controls microprogram execution by calculating the next execution address. The choice of bit-slice processors over single chip microprocessors is due for the most part to the advantages of speed and versatility seen in user microprogrammable bit-slice processors.

As they presently exist, bit-slice microprocessors can be an order of magnitude faster in terms of throughput than single chip microprocessors [12]. One reason for the bit-slice processor's speed is the technology from which the chips are made. Schottky bipolar transistors are utilized as opposed to slower forms of logic often used in single chip microprocessors. Another reason is that bit-slice processors are microprogrammable by the user and as such can support an instruction set that is custom fit to a particular application. In addition, some bit-slice processors have arithmetic capabilities not available on any of the presently existing single chip microprocessors.

The unique structure of PASM dictates that the processors used in the PCU be unique themselves. Since bit-slice processors are microprogrammable by the user they can be made to function in some special ways. The term micro-function refers to the set of tasks carried out by applying one control word to the computational unit. Typical bit-slice processors, such as the Texas Instruments SN74S481 or the Intel 3000, offer a set of commonly used micro-functions as the only micro-functions available to the user. Micro-functions can be represented by fewer bits than the actual control word if the set of available micro-functions is limited. For example, the Texas Instruments SN74S481 has an eleven-bit function word and a twenty-four-bit control word. The micro-functions are translated into control words by a programmed logic array. The microprogram of the processor can then be stored efficiently, the number of interchip connections is kept at a minimum, and the job of writing microcode can be compared with that of writing macrocode. The size of the microprogram store and the number of interchip connections become especially important when considering the construction of a large array of processors. The cost of limiting the set of available micro-functions is that of limiting overall versatility. However, most bit-slice processors offer a set of micro-functions complete enough for almost all practical purposes. If the micro-functions provided are not sufficient for PASM, then a customized encoding of control words into micro-functions may be developed.

When PASM is operating in SIMD mode many processors are executing the same instruction stream in a synchronous fashion. If the Micro Controllers handle instruction decoding and sequencing for the sets of PCU processors under their control, as opposed to letting each PCU processor handle its own instruction decoding and sequencing, the number of duplicate control memory stores and sequencing hardware chips is reduced by a factor of N/Q.

For a set of PCU processors to operate in MIMD mode, the above scheme is not sufficient. The reason for this stems from the fact that each PCU processor in this set may execute a different instruction at the same time. To allow for MIMD mode the above scheme must be modified so that a subset of the PCU processors is capable of handling its own instruction decoding and sequencing while in MIMD mode and of allowing Micro Controllers to do these tasks while in SIMD mode. The size of such a set of processors would be dictated by the intended application of the system in question. However, it may be reasonable to expect that the size of such a set could be much smaller than the overall size of the PCU without significantly hindering computational capability.

E. Enabling and disabling PCU processors

In SIMD mode all of the active PCU processors will execute instructions broadcast to them by their Micro Controller. A masking scheme is a method for determining which PCU processors will be active at a given point in time. An SIMD machine may have several different masking schemes. The masking scheme provides the system user with a device to enable some processors and disable others.

The general masking scheme uses an N-bit vector to determine which PCU processors to activate. Processor number i will be active if the i-th bit of the vector is a 1, for $0 \le i < N$ (where the low-order bit is the 0-th). For example, if N = 8 and the bit vector is 00101011, then only processors 0, 1, 3, and 5 will be active. A general mask can activate any set of the processors. These masks are specified in the SIMD program, and are part of the instruction stream broadcast by the control unit. A mask instruction is executed whenever a change in the active status of the processors is required. The Illiac IV, which has 64 processors and 64-bit words, uses general masks. When N is larger, say 1024, a scheme such as this, where the mask size is N bits, becomes less appealing in terms of the difficulty in constructing and storing each 1024-bit mask.
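The N = 8 example can be checked with a short Python sketch. The function name and the string representation of the mask are illustrative assumptions; the only convention taken from the text is that the low-order (rightmost) bit is bit 0.

```python
def active_processors(mask_bits: str) -> list[int]:
    """Return the numbers of the processors enabled by a general mask.

    mask_bits is written high-order bit first, as in the text, so the
    rightmost character corresponds to processor 0.
    """
    n = len(mask_bits)
    return [i for i in range(n) if mask_bits[n - 1 - i] == "1"]


print(active_processors("00101011"))  # [0, 1, 3, 5]
```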

One way the general masking scheme can be implemented is by a hardware decoder in each Micro Controller. To execute a mask instruction, the Micro Controllers do not broadcast the entire mask to each processor, but only transfer the one bit of the mask that pertains to that processor. Assume the mask is being used in a partition of size $2^p = MN/Q$, where $n-q \le p \le n$. Because the physical addresses of all of the PCU processors in a partition must agree in their n-p low-order bit positions, the system compiler must rearrange the programmer's logical general mask to form a physical general mask. This physical mask will be in a form which can be accessed more readily by the Micro Controllers than can the logical mask. The PCU processor whose logical address is Mj + i will be the j-th processor controlled by the i-th logical Micro Controller, $0 \le i < M$, $0 \le j < N/Q$. The logical mask bit Mj + i is moved to the physical mask position (N/Q)i + j. Then the Micro Controller whose logical address is i will load its N/Q-bit mask register with bits i*N/Q through (i+1)*N/Q-1 of the physical mask (the translation of logical Micro Controller addresses to physical addresses was described in VII.A). This method of loading will send the i-th bit of the logical general mask to the i-th logical PCU processor.
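The rearrangement of a logical general mask can be sketched in Python. This is an illustration under stated assumptions, not the PASM compiler: the function names, the list representation (index a holds the mask bit for address a), and the small example values are hypothetical. Positions are expressed in terms of logical Micro Controller numbers, and the translation of logical to physical Micro Controller addresses is left out.

```python
def logical_to_physical_mask(logical: list[int], m: int, group: int) -> list[int]:
    """Rearrange a logical general mask for a partition of m Micro
    Controllers, each controlling `group` (= N/Q) PCU processors.

    Logical mask bit m*j + i moves to position group*i + j.
    """
    if len(logical) != m * group:
        raise ValueError("mask length must equal the partition size")
    physical = [0] * (m * group)
    for i in range(m):          # logical Micro Controller number
        for j in range(group):  # j-th processor in that controller's group
            physical[group * i + j] = logical[m * j + i]
    return physical


def mask_register(physical: list[int], i: int, group: int) -> list[int]:
    """Bits i*group through (i+1)*group - 1, loaded by logical controller i."""
    return physical[i * group:(i + 1) * group]


# Hypothetical partition: M = 2 controllers with N/Q = 4 processors each.
logical = [1, 0, 1, 1, 0, 0, 0, 1]   # bit a enables logical processor a
physical = logical_to_physical_mask(logical, m=2, group=4)
print(mask_register(physical, 0, 4))  # bits for logical addresses 0, 2, 4, 6
print(mask_register(physical, 1, 4))  # bits for logical addresses 1, 3, 5, 7
```

In this example controller 0 receives the bits for logical addresses 0, 2, 4, and 6, so each register holds exactly the enable bits of the processors that controller drives.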

The PE address masking scheme uses an n-position mask to specify which of the N PCU processors are to be activated, each position of the mask corresponding to a bit position in the logical addresses of the processors. Each position of the mask will contain either a 0, 1, or X ("don't care") and the only processors that will be active are those whose address matches the mask: 0 matches 0, 1 matches 1, and either 0 or 1 matches X. Superscripts are repetition factors; square brackets denote a mask. For example, $[X^{n-1}0]$ activates all even numbered processors.
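The matching rule can be sketched in Python as follows. The function names and the use of a string of the characters 0, 1, and X (high-order position first) are assumptions made for illustration; repetition factors are written out in full, so $[X^{2}0]$ becomes "XX0".

```python
def matches(address: int, mask: str) -> bool:
    """True if every mask position agrees with the address bit: 0 matches 0,
    1 matches 1, and X matches either value."""
    bits = format(address, f"0{len(mask)}b")  # address as n binary digits
    return all(m == "X" or m == b for m, b in zip(mask, bits))


def activated(mask: str) -> list[int]:
    """Addresses of all processors enabled by an n-position PE address mask."""
    return [a for a in range(2 ** len(mask)) if matches(a, mask)]


print(activated("XX0"))  # n = 3: the even numbered processors [0, 2, 4, 6]
```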

In [18] PE address masks are used to write SIMD machine algorithms. The way in which PE address masks interact with various interconnection networks is analyzed in [17]. Other properties of these masks are discussed in [20,23].

Like general masks, PE address masks are specified in the SIMD program. PE address masks are more restricted than general masks, in that a general mask can activate any arbitrary set of processors and a PE address mask cannot. However, for $N \gg 64$, general masks are impractical in terms of storage requirements and ease of programming, and so system architects must consider alternatives.

A negative PE address mask is the same as a regular PE address mask, except it activates all those processors which do not match the mask. To distinguish negative PE address masks they are prefixed with a minus sign. This type of mask, introduced in [20], can activate sets of processors a single regular PE address mask cannot; e.g., all processors except for number 0.

To ease decoding of the masks two bits are used to represent each mask position. Figure 7 shows the mask word format for each Micro Controller, when N = 1024. The mask word consists of 2n+1 = 21 bits, which allows masks having up to n = 10 positions and a sign bit to be specified. S is the sign bit for the mask. Of the remaining 20 bits, the 2(n-q) = 12 high-order bits pertain to the PCU processors in a Micro Controller group, while the low-order 2q = 8 bits pertain to the Micro Controller addresses. The entire mask word is a physical mask and is used with the physical addresses of the PCU processors and the Micro Controllers. A logical mask is a subset of the physical mask and consists of only that part of the mask needed to control the processors in a partition. The Mask Decoder of Figure 8 transforms the 21-bit mask word into a 64-bit general mask vector, for N = 1024 and Q = 16. A Mask Decoder is part of each Micro Controller. Each logical PE address mask is treated as the high-order portion of the physical mask. The portion of the physical mask not considered part of the logical mask is initialized with "X"s and left unaltered by program execution. The sign bit of the logical and physical masks is the same.

Figure 7: PE address mask binary encoding, for N = 1024.
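What a Mask Decoder computes can be sketched in Python for N = 1024 and Q = 16. This is a behavioral illustration only: the function names are hypothetical, the mask is given as a string of ten 0/1/X symbols plus a separate sign flag, and the two-bit binary encoding of each position is deliberately not modeled because the text above does not spell it out. A negative mask is handled by complementing the vector.

```python
N, Q = 1024, 16
n, q = 10, 4        # N = 2**n and Q = 2**q
GROUP = N // Q      # 64 PCU processors per Micro Controller


def position_match(mask: str, value: int) -> bool:
    """Compare mask symbols (high-order first) with the bits of value."""
    bits = format(value, f"0{len(mask)}b")
    return all(m == "X" or m == b for m, b in zip(mask, bits))


def decode(mask: str, negative: bool, controller: int) -> list[int]:
    """64-entry enable vector for the Micro Controller with the given
    physical address; entry j is the enable bit of its j-th processor."""
    if len(mask) != n:
        raise ValueError("a physical mask has n positions")
    group_part = mask[: n - q]        # high-order positions: processor in group
    controller_part = mask[n - q:]    # low-order positions: controller address
    selected = position_match(controller_part, controller)
    vector = [int(selected and position_match(group_part, j))
              for j in range(GROUP)]
    if negative:                      # sign bit set: enable the non-matching ones
        vector = [1 - e for e in vector]
    return vector


even_mask = "X" * 9 + "0"             # all even physical addresses
print(sum(decode(even_mask, False, 0)))  # controller 0: all 64 enabled
print(sum(decode(even_mask, False, 1)))  # controller 1: none enabled
```

Because the low-order q positions are compared with the controller's own address, a controller whose address does not match produces an all-zero vector for a regular mask and an all-one vector for a negative mask.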

Figure 8: PE address Mask Decoder for the Micro Controller whose physical address is 0.

Figure 9: Masking portion of Micro Controller i. "e" is the PCU processor enable bit.

The system of Figure 9 allows 64 bits of an arbitrary general mask vector or the output of the PE address Mask Decoder to be sent to the Mask Vector Register of each Micro Controller. As a compromise between the flexibility of general mask vectors and the conciseness of PE address masks, the Micro Controller is allowed to fetch the 64-bit vector from the Mask Vector Register, perform various logical operations on the vector, and then reload it. A logical OR of two (or more) vectors generated by PE address masks is equivalent to taking the union of the sets of processors activated by the masks. A logical AND is equivalent to the intersection. The complement operation can be used to implement negative PE address masks instead of using the exclusive-or gates shown in Figure 8. Further details are in [23].
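These logical operations can be sketched in Python by holding each 64-bit mask vector in an integer, with bit j standing for the j-th processor of the group. The names below and the two example vectors are illustrative assumptions, not part of the PASM design.

```python
WIDTH = 64
ALL_ONES = (1 << WIDTH) - 1


def union(a: int, b: int) -> int:
    """Logical OR: processors enabled by either vector."""
    return a | b


def intersection(a: int, b: int) -> int:
    """Logical AND: processors enabled by both vectors."""
    return a & b


def complement(a: int) -> int:
    """Bitwise complement limited to 64 bits, as for a negative mask."""
    return ~a & ALL_ONES


def count(vector: int) -> int:
    """Number of enabled processors in a vector."""
    return bin(vector).count("1")


evens = sum(1 << j for j in range(0, WIDTH, 2))  # processors 0, 2, ..., 62
low_half = (1 << 32) - 1                         # processors 0 through 31

print(count(union(evens, low_half)))         # 48
print(count(intersection(evens, low_half)))  # 16
print(count(complement(1)))                  # 63: every processor except 0
```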

VIII. Conclusions

A partitionable SIMD/MIMD system, PASM, consisting of a System Control Unit, a Memory Management System, Micro Controllers, and a Parallel Computation Unit was described. The Parallel Computation Unit is composed of N microprocessors, N memory modules, and an interconnection network. Each of the Q Micro Controllers controls a group of N/Q PCU microprocessors. The Micro Controllers can be combined to form larger groups of microprocessors. The Memory Management System loads and unloads the N memory modules, where each module is composed of a pair of memory units. The System Control Unit coordinates the actions of the Memory Management System and the Micro Controllers.

An overview of PASM was presented, and then the Micro Controllers were discussed in more detail. The aspects of the Micro Controllers described were partitioning, communications with the System Control Unit, communications among Micro Controllers, communications with the PCU processors, and enabling and disabling the PCU processors.

There are many other problems to consider in the design of PASM, including: assemblers, compilers, operating systems, storage format for image data, interfacing of specialized input/output devices as a part of the Memory Management System, fault tolerance, choosing an implementation of the interconnection network, designing instruction sets and the exact structure of the microprocessors to be used as PCU processors and Micro Controllers, clock pulse synchronization of the Micro Controllers, the use of the Micro Controllers to orchestrate PCU processors in MIMD mode, and finding cost-effective values for N and Q. These tasks are currently being explored, using a variety of image processing problems as a basis for analyzing and evaluating design alternatives.

All of the problems involved in designing the software and hardware of PASM are strongly interrelated. It is the consideration of these interrelations and the coordinated design of the system hardware and software which will aid in producing a machine that is both a versatile and an efficient parallel computational tool for image processing and related fields.

References

[1] Advanced Micro Devices, The AM2900 Family Data Book (1976).

[2] G. Barnes, et al., "The Illiac IV computer," IEEE Trans. Comput., Vol. C-17 (Aug. 1968), pp. 746-757.

[3] K. E. Batcher, "The multidimensional access memory in STARAN," IEEE Trans. Comput., Vol. C-26 (Feb. 1977), pp. 174-177.

[4] K. E. Batcher, "The flip network in STARAN," 1976 Int'l. Conf. Parallel Processing (Aug. 1976), pp. 65-71.

[5] J. F. Bogdanowicz, H. J. Siegel, "A partitionable multi-microprogrammable-microprocessor system for image processing," 1978 IEEE Computer Society Workshop on Pattern Recognition and Artificial Intelligence (Apr. 1978), pp. 141-144.

[6] W. J. Bouknight, et al., "The Illiac IV system," Proc. of the IEEE, Vol. 60 (Apr. 1972), pp. 369-388.

[7] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comput., Vol. C-23 (Mar. 1974), pp. 309-318.

[8] Intel 3000 Data Sheets, Intel Corporation, Santa Clara, California 95051.

[9] S. I. Kartashev, S. P. Kartashev, "A microprocessor with modular control as a universal building block for complex computers," EUROMICRO 1977 (Oct. 1977), pp. 210-216.

[10] G. J. Lipovski, "On a varistructured array of microprocessors," IEEE Trans. Comput., Vol. C-26 (Feb. 1977), pp. 125-138.

[11] G. J. Lipovski, A. Tripathi, "A reconfigurable varistructure array processor," 1977 Int'l. Conf. Parallel Processing (Aug. 1977), pp. 165-174.

[12] E. Lowe, "A 16-bit microcomputer for missile guidance and control applications," Proc. of JACC (1977).

[13] G. J. Nutt, "Microprocessor implementation of a parallel processor," Fourth Annual Symp. Computer Architecture (Mar. 1977), pp. 147-152.

[14] M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE Trans. Comput., Vol. C-26 (May 1977), pp. 458-473.

[15] D. Rohrbacher, J. L. Potter, "Image processing with the Staran parallel computer," Computer, Vol. 10 (Aug. 1977), pp. 54-59.

[16] S. Ruben, et al., "Application of a parallel processing computer in LACIE," 1976 Int'l. Conf. Parallel Processing (Aug. 1976), pp. 24-32.

[17] H. J. Siegel, "Analysis techniques for SIMD machine interconnection networks and the effects of processor address masks," IEEE Trans. Comput., Vol. C-26 (Feb. 1977), pp. 153-161.

[18] H. J. Siegel, "Single instruction stream - multiple data stream machine interconnection network design," 1976 Int'l. Conf. Parallel Processing (Aug. 1976), pp. 273-282.

[19] H. J. Siegel, "The universality of various types of SIMD machine interconnection networks," Fourth Annual Symp. Computer Architecture (Mar. 1977), pp. 70-79.

[20] H. J. Siegel, "Controlling the active/inactive status of SIMD machine processors," 1977 Int'l. Conf. Parallel Processing (Aug. 1977), p. 183.

[21] H. J. Siegel, "Preliminary design of a versatile parallel image processing system," Third Biennial Conf. on Computing in Indiana (Apr. 1978), pp. 11-25.

[22] H. J. Siegel, P. T. Mueller, "The organization and language design of microprocessors for an SIMD/MIMD system," Second Rocky Mountain Symp. on Microcomputers: Systems, Software, Architecture (Aug. 1978).

[23] H. J. Siegel, P. T. Mueller, H. E. Smalley, Preliminary Design Alternatives for a Versatile Parallel Image Processor, School of Electrical Engineering, Purdue University, TR-EE 78-32 (June 1978), 69 pp.

[24] H. J. Siegel, S. D. Smith, "Study of multistage SIMD interconnection networks," Fifth Annual Symp. Computer Architecture (Apr. 1978), pp. 223-229.

[25] C. D. Stampoulous, "Parallel algorithms for joining two points by a straight line segment," IEEE Trans. Comput., Vol. C-23 (Jun. 1974), pp. 642-646.

[26] H. Sullivan, T. R. Bashkow, K. Klappholz, "A large scale homogeneous, fully distributed parallel machine," Fourth Annual Symp. Computer Architecture (Mar. 1977), pp. 105-124.

[27] R. J. Swan, S. H. Fuller, D. P. Siewiorek, "Cm*, a modular, multimicroprocessor," in A Collection of Papers on Cm*, technical report, CS Dept., Carnegie-Mellon (Feb. 1977).

[28] Bipolar Microcomputer Components Data Book, Texas Instruments Inc., Dallas, Texas.

[29] W. A. Wulf, C. G. Bell, "C.mmp - a multi-mini-processor," FJCC (Dec. 1972), pp. 765-777.
