
PARALLEL MEMORY SYSTEM FOR A PARTITIONABLE SIMD/MIMD MACHINE

Howard Jay Siegel, Frederick Kemmerer, Mark Washburn

Purdue University
School of Electrical Engineering
West Lafayette, IN 47907

Abstract — PASM is a large-scale partitionable SIMD/MIMD multimicroprocessor system being designed for image processing tasks. To improve machine throughput, a memory management system employing parallel secondary storage devices and double-buffered primary memories has been devised. The memory system is an intelligent one, using communicating microprocessors which are dedicated to handling data requests and file management. The memory system bus structure is organized to exploit much parallelism in transferring data from the secondary memories to the primary memories of virtual SIMD and MIMD machines.

I. Introduction

As a result of the microprocessor revolution, it is now feasible to build a dynamically reconfigurable large-scale multimicroprocessor system capable of performing image processing tasks more rapidly than previously possible. There are several ways to harness the parallel processing power of a multimicroprocessor system: SIMD, MSIMD, MIMD, and PSM.

An SIMD (single instruction stream - multiple data stream) machine [5] typically consists of a set of N processors, N memories, an interconnection network, and a control unit (e.g. Illiac IV [2]). The control unit broadcasts instructions to the processors and all active ("turned on") processors execute the same instruction at the same time. Each processor executes instructions using data taken from a memory to which only it is connected. The interconnection network allows interprocessor communication. An MSIMD (multiple-SIMD) system is a parallel processing system which can be structured as two or more independent SIMD machines (e.g. MAP [16]). An MIMD (multiple instruction stream - multiple data stream) machine [5] typically consists of N processors and N memories, where each processor may follow an independent instruction stream (e.g. C.mmp [38]). As with SIMD architectures, there is a multiple data stream and an interconnection network. A PSM (partitionable SIMD/MIMD) system [22] is a parallel processing

This research was sponsored in part by the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under Grant No. AFOSR 78-3581. The United States Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation hereon.

system which can be structured as two or more independent SIMD and/or MIMD machines (e.g. PASM [28]).

PASM, a particular PSM-type system for image processing and pattern recognition, is currently being designed at Purdue University [22]. Due to the low cost of microprocessors, computer system designers have been considering various multimicroprocessor architectures [e.g. 3, 9, 12, 13, 17, 34, 36]. The system described here was the first in the literature to combine the following two features:
1) it may be partitioned to operate as many independent SIMD and/or MIMD machines of varying sizes; and
2) a variety of problems in image processing and pattern recognition are being used to guide the design choices.

In the next section, a brief overview of PASM is presented. The sections following describe various aspects of the PASM memory system. The use of parallel secondary storage devices, double-buffered primary memories, and dedicated microprocessors for memory management are discussed.

II. PASM Overview

PASM, a partitionable SIMD/MIMD system [22, 28, 29], is a dynamically reconfigurable multimicroprocessor machine for image processing. It is a parallel processing system which can be structured as one or more independent SIMD and/or MIMD machines of varying sizes. A block diagram of PASM is shown in Figure 1.

The heart of the system is the ParallelComputation Unit, which contains N processors, N

Figure 1: Block diagram overview of PASM.


Figure 2: The Parallel Computation Unit.

memory modules, and the interconnection network. The Parallel Computation Unit processors are microprocessors that perform the actual SIMD and MIMD computations. The Parallel Computation Unit memory modules are used by the Parallel Computation Unit processors for data storage in SIMD mode and both data and instruction storage in MIMD mode. The interconnection network provides a means of communication among the Parallel Computation Unit processors and memory modules.

The Micro Controllers are a set of microprocessors which act as the control units for the Parallel Computation Unit processors in SIMD mode and orchestrate the activities of the Parallel Computation Unit processors in MIMD mode. Control Storage contains the programs for the Micro Controllers. The Memory Management System controls the loading and unloading of the Parallel Computation Unit memory modules from the Memory Storage System. The System Control Unit is a conventional machine, such as a PDP-11, and is responsible for the overall coordination of the activities of the other components of PASM.

The Parallel Computation Unit is organized as shown in Figure 2. A pair of memory units is used for each Parallel Computation Unit memory module so that data can be moved between one memory unit and secondary storage while the Parallel Computation Unit processor operates on data in the other memory unit. The Parallel Computation Unit processors, which are physically numbered (addressed) from 0 to N-1, where N = 2^n, communicate through the interconnection network. The interconnection network being considered is a variation of the data manipulator [4], a multistage implementation of the "PM2I" network [20, 21, 24, 32], called the Augmented Data Manipulator (ADM) [30]. Other possibilities are cube and shuffle-exchange type networks [10, 11, 17]. Any of these interconnection networks can

Figure 3: Organization of the Micro Controllers (MCs).

be partitioned into independent sub-networks of varying sizes, which are powers of two, if the physical addresses of the 2^p processors and memory modules in a partition have the same n-p low-order bits [30, 31].

The method used to provide multiple control units is shown in Figure 3. There are Q = 2^q Micro Controllers, physically addressed (numbered) from 0 to Q-1. Micro Controller i controls the N/Q Parallel Computation Unit processors whose low-order q physical address bits equal i. Each Micro Controller has a memory module which contains a pair of memory units, allowing memory loading and unloading and computations to be overlapped. A virtual SIMD machine of size RN/Q, where R = 2^r and 0 ≤ r ≤ q, is obtained by loading R Micro Controller memory modules with the same instructions simultaneously. Similarly, a virtual MIMD machine of size RN/Q is obtained by combining the efforts of the Parallel Computation Unit processors of R Micro Controllers. For either SIMD or MIMD mode, the physical addresses of these R Micro Controllers must have the same low-order q-r bits, since the physical addresses of all Parallel Computation Unit processors in a partition must agree in their low-order bit positions in order for the interconnection network to function properly.

Given a virtual machine of size RN/Q, the Parallel Computation Unit processors and memory modules for this partition have logical addresses (numbers) 0 to (RN/Q) - 1, where R = 2^r, 0 ≤ r ≤ q. Assuming that the Micro Controllers have been assigned as described above, then the logical number of a Parallel Computation Unit processor or memory module is the high-order r+n-q bits of the physical number. Recall that all of the physical addresses of the processors in a partition must have the same q-r low-order bits. For example, for N = 1024, Q = 16, and R = 4, one allowable choice of Parallel Computation Unit processors to form a partition of size RN/Q is those whose physical addresses are 3, 7, 11, 15,...,1023. The high-order r+n-q = 8 bits of these 10-bit physical addresses are 0, 1, 2, 3,...,255, respectively. The value of the low-order q-r = 2 bits of all the physical processor addresses is equal to three.

Similarly, the Micro Controllers assigned to the partition are logically numbered (addressed) from 0 to R-1. For R > 1, the logical number of a Micro Controller is the high-order r bits of its physical number. Recall that all of the physical addresses of the Micro Controllers in a partition must agree in the low-order q-r bits. For R = 1, there is only one Micro Controller and it is considered logical number 0. For example, if N = 1024, Q = 16, and R = 4, one allowable choice of four Micro Controllers is those whose physical addresses are 3, 7, 11, and 15. The high-order r = 2 bits of these four-bit physical addresses are 0, 1, 2, and 3, respectively. The value of the low-order q-r = 2 bits of all the physical Micro Controller addresses is equal to three.
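The addressing rules above can be sketched in a few lines of code (an illustrative reconstruction only; the function names are ours, not part of PASM, and the example values follow the N = 1024, Q = 16, R = 4 case in the text):

```python
def pcu_logical_number(phys, n, q, r):
    """Logical number of a PCU processor or memory module:
    the high-order r+n-q bits of its n-bit physical number."""
    return phys >> (q - r)

def mc_logical_number(phys, q, r):
    """Logical number of a Micro Controller (R > 1):
    the high-order r bits of its q-bit physical number."""
    return phys >> (q - r)

def partition_members(low_bits, n, q, r):
    """Physical PCU addresses forming the partition whose
    low-order q-r bits all equal low_bits."""
    step = 1 << (q - r)
    return [p for p in range(1 << n) if p % step == low_bits]

n, q, r = 10, 4, 2                       # N = 1024, Q = 16, R = 4
members = partition_members(3, n, q, r)  # 3, 7, 11, ..., 1023
print(members[:4])                       # [3, 7, 11, 15]
print(pcu_logical_number(3, n, q, r))    # 0
print(pcu_logical_number(1023, n, q, r)) # 255
```

Dropping the shared low-order q-r bits is all that is needed, since every member of the partition agrees in those bits.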

This brief overview of PASM is provided as background for the following sections. More details about PASM and partitionable interconnection networks can be found in [22-31].

The Memory Management System in PASM will have its own intelligence and will use the parallel secondary storage devices of the Memory Storage System. As guidelines for design purposes, it is assumed that N and Q are at least 1024 and 16, respectively. (Systems with on the order of thousands of microprocessors have been proposed [17, 36].) Giving the Memory Management System its own intelligence will help prevent the System Control Unit from being overburdened. The parallel secondary storage devices will allow fast loading and unloading of the N double-buffered Parallel Computation Unit memory modules and will provide storage for system image and picture data and MIMD programs. The Memory Management System and Memory Storage System are described further in the following sections.

III. PASM Memory Storage System

Secondary storage for PASM's Parallel Computation Unit memory modules is provided by the Memory Storage System. The Memory Storage System will consist of N/Q independent Memory Storage units, where N is the number of Parallel Computation Unit memory modules and Q is the number of Micro Controllers in PASM. The Memory Storage units will be numbered from 0 to (N/Q)-1. Each Memory Storage unit is connected to Q Parallel Computation Unit memory units. For 0 ≤ i < N/Q, Memory Storage unit i is connected to those Parallel Computation Unit memory modules whose physical addresses are of the form:

(Q * i) + k, 0 ≤ k < Q.

Recall that, for 0 ≤ k < Q, Micro Controller k is connected to those Parallel Computation Unit processors whose physical addresses are of the form:

(Q * j) + k, 0 ≤ j < N/Q.

Figure 4: Organization of the Memory Storage System for N = 32 and Q = 4. MSSU is Memory Storage System unit. MC is Micro Controller. PCU PE is Parallel Computation Unit Processing Element (processor - memory module pair).

Thus, Memory Storage unit i is connected to the ith Parallel Computation Unit processor/memory module pair of each Micro Controller. This is shown for N = 32 and Q = 4 in Figure 4.

The two main advantages of this approach for a partition of size N/Q are that (1) all of the Parallel Computation Unit memory modules can be loaded in parallel and (2) the data is directly available no matter which partition (Micro Controller group) is chosen. This is done by storing the data for a task which is to be loaded into the ith Parallel Computation Unit memory module of the virtual machine of size N/Q in Memory Storage unit i, 0 ≤ i < N/Q. Memory Storage unit i is connected to the ith Parallel Computation Unit memory module in each Micro Controller group (i.e., Parallel Computation Unit memory modules Q * i, (Q * i) + 1, (Q * i) + 2,...). Thus, no matter which Micro Controller


group of N/Q Parallel Computation Unit processors is chosen, the data from the ith Memory Storage unit can be loaded into the ith Parallel Computation Unit memory module of the virtual machine, for all i, 0 ≤ i < N/Q, simultaneously.

For example, in Figure 4, if the partition of size N/Q = 8 chosen consists of the Parallel Computation Unit processors connected to Micro Controller 2, then Memory Storage unit 0 would load Parallel Computation Unit memory module 2, unit 1 would load module 6, unit 2 would load module 10, etc. If instead Micro Controller 3's Parallel Computation Unit processors were chosen, Memory Storage unit 0 would load Parallel Computation Unit memory module 3, unit 1 would load module 7, unit 2 would load module 11, etc.
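The wiring rule behind this example can be sketched as follows (a small illustrative model for the N = 32, Q = 4 case of Figure 4; the function names are ours, not the paper's):

```python
N, Q = 32, 4

def msu_connections(i, Q):
    """Memory Storage unit i is wired to PCU memory modules
    Q*i, Q*i + 1, ..., Q*i + (Q-1): the ith module of each
    Micro Controller group."""
    return [Q * i + k for k in range(Q)]

def module_loaded(i, mc, Q):
    """For the size-N/Q partition built from Micro Controller mc,
    MSU i loads the one module of mc's group it is wired to."""
    return Q * i + mc

print(msu_connections(0, Q))                             # [0, 1, 2, 3]
print([module_loaded(i, 2, Q) for i in range(N // Q)])   # [2, 6, 10, ..., 30]
print([module_loaded(i, 3, Q) for i in range(N // Q)])   # [3, 7, 11, ..., 31]
```

Because each MSU reaches exactly one module in every Micro Controller group, the same stored data serves any choice of group, which is advantage (2) above.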

Thus, for virtual machines of size N/Q, this secondary storage scheme allows all N/Q Parallel Computation Unit memory modules to be loaded in one parallel block transfer. This same approach can be taken if only (N/Q)/2^d distinct Memory Storage System units are available, where 0 ≤ d ≤ n-q. In this case, however, 2^d parallel block loads would be required instead of just one.

Consider the situation where a virtual machine of size RN/Q is desired, 1 ≤ R ≤ Q, and there are N/Q Memory Storage System units. In general, a task needing RN/Q Parallel Computation Unit processors, logically numbered 0 to RN/Q-1, would require R parallel block loads if the data for the Parallel Computation Unit memory module whose high-order n-q logical address bits equal i is loaded into Memory Storage unit i. This is true no matter which group of R Micro Controllers (which agree in their low-order q-r address bits) is chosen.

For example, consider Figure 4, where N = 32 and Q = 4. Assume that a virtual machine of size 16 is desired. The data for the Parallel Computation Unit memory modules whose logical addresses are 0 and 1 is loaded into Memory Storage unit 0, for memory modules 2 and 3 into unit 1, for memory modules 4 and 5 into unit 2, etc. Assume the partition of size 16 is chosen to consist of the Parallel Computation Unit processors connected to Micro Controllers 0 and 2 (i.e., all even physically numbered processors). Then the Memory Storage System units first load Parallel Computation Unit memory modules physically addressed 0, 4, 8, 12, 16, 20, 24, and 28 (simultaneously), and then load memory modules 2, 6, 10, 14, 18, 22, 26, and 30 (simultaneously). As explained in Section II, given this assignment of Micro Controllers, the Parallel Computation Unit memory module whose physical address is 2 * i has logical address i, 0 ≤ i < 16. Assume instead that the Parallel Computation Unit processors and memory modules associated with Micro Controllers 1 and 3 are chosen. First memory modules physically addressed 1, 5, 9, 13, 17, 21, 25, and 29 are loaded simultaneously, and then modules 3, 7, 11, 15, 19, 23, 27, and 31 are loaded simultaneously. In this case, the Parallel Computation Unit memory module whose physical address is (2 * i) + 1 has logical address i, 0 ≤ i < 16. No matter which pair of Micro Controllers is chosen, only two parallel block loads are needed.

Thus, for a virtual machine of size RN/Q, this secondary storage scheme allows all RN/Q Parallel Computation Unit memory modules to be loaded in R parallel block transfers, 1 ≤ R ≤ Q. As stated above for the special case where R = 1, the same approach can be taken for R > 1 if only (N/Q)/2^d distinct Memory Storage System units are available. In this situation, however, R * 2^d parallel block loads would be required instead of just R.
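The R-pass schedule from the worked example can be reconstructed as follows (our own sketch, not the paper's; one pass per chosen Micro Controller, with Memory Storage unit i loading module Q*i + mc in the pass for Micro Controller mc):

```python
def load_schedule(mcs, N, Q):
    """Return, per parallel block transfer, the list of physical PCU
    memory modules loaded simultaneously (one entry per Memory
    Storage unit) for a partition built from Micro Controllers mcs."""
    return [[Q * i + mc for i in range(N // Q)] for mc in sorted(mcs)]

# The size-16 example from the text: N = 32, Q = 4, Micro Controllers 0 and 2.
passes = load_schedule({0, 2}, 32, 4)
print(passes[0])   # [0, 4, 8, 12, 16, 20, 24, 28]
print(passes[1])   # [2, 6, 10, 14, 18, 22, 26, 30]
```

The number of passes equals the number of chosen Micro Controllers, R, matching the general claim above.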

The actual devices that will be used as Memory Storage System units will depend upon the speed requirements of the rest of PASM, cost constraints, and the state of the art of storage technology at implementation time. Possibilities to be investigated include disks, bubble memories, and CCDs.

IV. Local Variable Storage in PASM

The PASM Memory Management System makes use of the double-buffered arrangement of the Parallel Computation Unit memory modules to enhance system throughput. The scheduler, using information from the System Control Unit such as the number of Parallel Computation Unit processors needed and the maximum allowable run time, will sequence tasks waiting to execute [29]. Typically, all of the data for a task will be loaded into the appropriate Parallel Computation Unit memory units before execution begins. Then, while a Parallel Computation Unit processor is using one of its memory units, the Memory Management System can be loading the other unit for the next task. When the task currently executing completes, the Parallel Computation Unit processor can switch to its other memory unit for doing the next task.
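This double-buffering discipline can be sketched as a small model (purely illustrative; the class and method names are ours and stand in for hardware, not software, in the actual design):

```python
class DoubleBufferedModule:
    """Model of one PCU memory module: a pair of memory units, with
    the processor using one while the other is loaded."""

    def __init__(self):
        self.units = [None, None]   # the pair of memory units
        self.active = 0             # index of the unit in use by the processor

    def load_inactive(self, task_data):
        # The Memory Management System fills the unit the
        # processor is NOT currently using (overlapped with compute).
        self.units[1 - self.active] = task_data

    def switch(self):
        # On task completion the processor flips to the other unit.
        self.active = 1 - self.active
        return self.units[self.active]

m = DoubleBufferedModule()
m.units[0] = "task 0 data"          # processor computing on unit 0
m.load_inactive("task 1 data")      # MMS loads unit 1 meanwhile
print(m.switch())                   # task 1 data
```

The point of the model is only the overlap: `load_inactive` and the processor's use of the active unit can proceed concurrently, so no compute time is lost to loading.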

Based on image processing and pattern recognition tasks which have been examined, the following conclusion has been reached. Due to the use of double-buffering, the potentially large Parallel Computation Unit memory modules, and the special purpose design of PASM, the time sharing of the Parallel Computation Unit processors and the use of conventional paging is not desirable.

There may be some cases where all of the data will not fit into the Parallel Computation Unit memory space allocated. Assume a memory frame is the amount of space used in a Parallel Computation Unit memory unit for the storage of data from secondary storage for a particular task. There are tasks where many memory frames are to be processed by the same program (e.g., maximum likelihood classification of satellite data [35]). The double-buffered Parallel Computation Unit memory modules can be used so that as soon as the data in one memory unit is processed, the Parallel Computation Unit processor can switch to the other unit and continue executing the same program. When the Parallel Computation Unit processor is ready to switch memory units, it signals the Memory Management System that it has finished using the data in the memory unit to which it is currently connected. Hardware to provide this signaling capability can be provided in different ways, such as using interrupt lines from the Parallel Computation Unit processors or by using logic to check the address lines between


the Parallel Computation Unit processor and its memory modules for a special address code. After the appropriate tests to ensure that the new memory frame is available [29], the processor switches memory units. The Memory Management System can then load the "finished" memory unit with the next memory frame or next task. Such a scheme, however, requires some mechanism which can move variable length portions of programs or data sets (i.e., local data) stored in one unit of a memory module to the other unit when the associated processor switches to access the next memory frame.

Three hardware methods are considered for implementing local variable storage. Each would be used only when multiple memory frames are to be processed. The first method consists of a separate local memory allocated to each Parallel Computation Unit processor for the purpose of storing local variables. This local memory would be in addition to the processor's memory module. Such a local memory would not be affected by the changing connections of the memory units associated with it. The second method would consist of splitting the local variable storage, using a variable length portion of each memory unit as local variable storage. This scheme would require w/2 words of storage in each memory unit to implement w words of local variable storage. This specially allocated space in the memory units would be protected by hardware when the associated Parallel Computation Unit processor changes memory units. The third method stores local variables in the memory units in much the same way as method two, but in this case w words are required in each memory unit, i.e., 2w words per Parallel Computation Unit memory module, for w words of local variable storage. This scheme preserves local variable storage by maintaining a current copy of the local variables in both memory units associated with a given Parallel Computation Unit processor.

Of the three methods described above, method one is the least flexible since it requires a fixed amount of memory to be dedicated to local variable storage at all times. This method may tend to utilize inefficiently the special local variable memories it requires, since these memories will have to be large enough to handle tasks which may require amounts of local variable storage many times greater than that of a typical job. For example, a task may require that a portion of a reference image be stored within the local variable storage space. Such a task might be executed infrequently but would require a relatively large amount of local variable storage space. Other tasks run by PASM might be executed far more frequently but would require far less local variable storage space than the reference image example above. The result is that while the tasks requiring a small portion of the available local variable storage space are being run, the bulk of the available local variable storage space is not utilized. Furthermore, if a task requires more local variable storage than expected (i.e., more local variable storage than the fixed size dedicated memory has space for), a problem arises which will require additional hardware and/or software overhead to solve.

The second method described above makes the most efficient use of the memory space available for local variable storage in that for w words of local variable storage required, only w words of actual memory space are used. Since w would be variable, only the amount of local variable storage space required by a given task would need to be allocated to the task. This method, however, has several inherent disadvantages. First, when a Parallel Computation Unit processor executing a given task begins processing the last memory frame associated with the task, the Memory Management System will normally load the inactive memory unit with data for the next task to be run. If the local variable storage system is in use, however, the next task cannot be loaded into the memory unit since the w/2 words of local variable storage in the inactive memory unit must be preserved until the current task is complete. A second disadvantage of this method is that the Parallel Computation Unit processor addresses which access local variable storage stored in the inactive memory unit must be translated to properly address the local variable storage in this memory unit. Such address translation is likely to require additional hardware and may cause additional delay in address decoding.

The third method described above would make less than optimal use of the space allocated for local variable storage in the Parallel Computation Unit memory module (2w words of the memory module are needed for w words of local variable storage), but it does not require the address translation of method two and provides much more flexibility than method one. It also eliminates the problems encountered in method two when a new task is loaded into a memory unit which contains local variable storage associated with a previous task. This method maintains a copy of local variables in both memory units associated with a given Parallel Computation Unit processor so that switching memory units does not alter the local variable storage associated with the processor. The implementation of variable size local variable storage for this method is simpler and more straightforward than that of method two above, since the total address space for a single processor is fixed at the size of a single memory unit. In method two, the total address space would be the fixed size of a memory unit plus w/2. More image processing and pattern recognition algorithms suitable for implementation on PASM need to be studied to determine if the efficiency gained by optimal utilization of memory space in method two will be significant enough to offset the problems associated with this method. Currently, method three appears to be the most promising.

One possible hardware arrangement to implement method three is described below. The arrangement makes use of two characteristics of the PASM memory access requirements:
1) secondary memory will not be able to load a given memory unit at the maximum rate it can accept data, and
2) Parallel Computation Unit processors will not often be able (or desire) to write to memory on successive memory cycles.
Because of these two characteristics, Parallel Computation Unit processor stores to local variable storage locations in an active memory unit can be trapped by a bus interface register and stored in the inactive memory unit by stealing a cycle on the secondary memory bus. In essence, this technique makes use of the conventional store-through concept as described in [7, 14].

An exception to the second characteristic mentioned above is multiple precision data. If 16-bit words are assumed, then for higher precision it may be desirable to use two or four words as a group. However, a simple buffering scheme can handle this possibility.

The amount of memory allocated as local storage is determined by the contents of a k-bit base register. This register may be altered by the Memory Management System. If 2^m locations are available in each memory unit for Parallel Computation Unit processor use, local storage can be allocated in blocks of 2^(m-k) words. B blocks, 1 ≤ B ≤ 2^k, can be allocated for local storage by storing B in the base register. This has the effect of allocating all memory locations from 0 to B * 2^(m-k) - 1 as local storage. When a processor writes to a local variable location, a k-bit block comparator causes the memory address and data being written to be trapped by a bus interface register. A cycle request flip-flop is set to indicate to the logic which controls the buses associated with the Parallel Computation Unit memory module that a cycle is needed on the secondary memory bus. When the cycle is granted, the flip-flop is reset and the data in the bus interface register is gated into the inactive memory unit. In this way, the space allocated for local variable storage remains updated in both memory units at all times. It is assumed that the bus interface register will have maximum priority for secondary memory bus usage, since this would prevent the processors in the Parallel Computation Unit from having to wait to write to a location designated for use as local storage.
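The store-through behavior of this arrangement can be modeled in a few lines (a software sketch of the hardware mechanism, under assumed parameters m = 8, k = 3; the class name and sizes are ours):

```python
class MirroredMemoryModule:
    """Model of method three: writes whose address falls in the
    local-storage region [0, B * 2**(m-k)) are copied to the
    inactive memory unit, keeping both copies identical."""

    def __init__(self, m_bits=8, k_bits=3, blocks=2):
        self.block = 1 << (m_bits - k_bits)   # words per block, 2**(m-k)
        self.limit = blocks * self.block      # B * 2**(m-k) mirrored words
        size = 1 << m_bits
        self.units = [[0] * size, [0] * size] # the pair of memory units
        self.active = 0

    def write(self, addr, value):
        self.units[self.active][addr] = value
        if addr < self.limit:
            # Block comparator match: the bus interface register
            # steals a secondary-bus cycle to update the inactive unit.
            self.units[1 - self.active][addr] = value

mm = MirroredMemoryModule()
mm.write(10, 42)    # local variable: mirrored into both units
mm.write(200, 7)    # ordinary data: written to the active unit only
print(mm.units[0][10], mm.units[1][10])     # 42 42
print(mm.units[0][200], mm.units[1][200])   # 7 0
```

With blocks = 2 and a block size of 2^(8-3) = 32 words, addresses below 64 are mirrored, so switching the active unit never loses local variables.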

The method described above is applicable to any system which allows its processing tasks to utilize several separate memories and which requires that identical copies of variable amounts of certain data be maintained in all memories so used.

V. Altering Loading Sequences

To further increase the flexibility of PASM, a task may alter the sequence of data processed by it during execution. As an example, consider a task which is attempting to identify certain features within a series of images. The task might examine a visible spectrum copy of an image and, based on features identified within the image, choose to examine an infrared spectrum copy of the same image. Rather than burden the System Control Unit to perform data loading sequence alterations, the task is allowed to communicate directly with the Memory Management System.

In the case of an SIMD task, the associated Micro Controller(s) determines if changes are required in the data loading sequence for the task. If so, a Micro Controller specifies the nature of the changes and communicates them to the Memory Management System without involving the System Control Unit. Each Micro Controller in the PASM system has the capability to generate loading sequence changes. For tasks which require R Micro Controllers (1 ≤ R ≤ Q), logically numbered 0 to R-1, control instructions exist so that logical Micro Controller 0 will handle loading sequence changes. Micro Controller 0 uses logical Parallel Computation Unit processor number 0 of the virtual machine to establish a control information list in logical Parallel Computation Unit memory module 0. (There are Q Parallel Computation Unit processors which can possibly be logically numbered 0 in a virtual machine; they are those Parallel Computation Unit processors which are physically numbered 0, 1, 2, ..., Q-1.) This list specifies in a concise fashion the loading sequence alterations required and includes information such as the IDs of the data files to be loaded, the Parallel Computation Unit memory modules which are to receive the data, and the locations within the Parallel Computation Unit memory modules where the data is to be loaded. The Micro Controller initiates the transfer of this list to the Memory Management System by using logical Parallel Computation Unit processor 0 to write a pointer to the list into the highest addressable memory location of its memory module. Through the use of a simple address comparison, the write into this memory location generates an interrupt to the Memory Management System. The Memory Management System recognizes the interrupt as a request for a loading sequence change and determines which Micro Controller is making the request. The Memory Management System uses the list of control information (via the pointer provided) to determine the loading sequence changes required.
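The pointer-write protocol above can be sketched as follows. The addresses, record layout, and file ID are hypothetical, and the real trigger is a hardware address comparison rather than Python code; the sketch only shows the flow: build a list, write a pointer to it at the interrupt-generating location, and let the Memory Management System follow the pointer.

```python
INTERRUPT_LOCATION = 2**16 - 1   # highest addressable location (assumed 64K module)

memory_module_0 = {}             # stands in for logical PCU memory module 0
pending_interrupts = []          # stands in for the interrupt to the MMS

def write(addr, value):
    """A write; the address comparison mimics the interrupt-generation hardware."""
    memory_module_0[addr] = value
    if addr == INTERRUPT_LOCATION:
        pending_interrupts.append(value)   # interrupt the MMS with the pointer

# Micro Controller 0, via logical processor 0, establishes the control list.
control_list = [
    {"file_id": "IR_IMAGE_7",    # hypothetical file ID
     "dest_module": 3,           # PCU memory module to receive the data
     "dest_addr": 0x0400},       # location within that module
]
write(0x1000, control_list)          # store the list at an assumed address
write(INTERRUPT_LOCATION, 0x1000)    # pointer write generates the interrupt

# Memory Management System side: follow the pointer to the control list.
ptr = pending_interrupts.pop()
requests = memory_module_0[ptr]
```

The Memory Management System then uses `requests` to schedule the altered loading sequence.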

An alternative method of interrupt generation is to use an interrupt line from each of the Q possible logical Parallel Computation Unit processor 0's to the Memory Management System. The method selected for interrupt generation will depend upon the interrupt capabilities of the microprocessor used in the Parallel Computation Unit. While loading sequence control information could be passed directly from the Micro Controllers to the Memory Management System, the length of the connections required may make implementation more difficult and costly.

One hardware scheme which can transfer the control information list from a Parallel Computation Unit memory module to the Memory Management System is shown in Figure 5. The hardware system shown is based on having the Memory Management System coordinate the recognition of Micro Controller interrupts and the associated transfers of control information lists from the Parallel Computation Unit memory modules. The interrupt recognition portion is handled by the Parallel Computation Unit processor Interrupt Control Logic, while the transfer of control information lists is handled by the Parallel Computation Unit memory module Access Control Logic.

Consider the following example in a virtual machine whose processor logically numbered 0 is physically numbered i. Suppose processor i establishes a control information list in one of


Figure 5: Hardware scheme for dynamically altering the loading sequence of the memory modules.

its memory units and writes a pointer to the list into its corresponding interrupt generation location. The memory write to the interrupt generation location is signaled to the Interrupt Control by a pulse on the Interrupt Request Line corresponding to processor i. This pulse causes the Interrupt Control to signal the Memory Management System that processor i has generated an interrupt to the Memory Management System. The Memory Management System then uses the Access Control to read the interrupt generation location in the Parallel Computation Unit memory module to obtain the pointer to the control information list. The control information list is then read from the Parallel Computation Unit memory module by the Memory Management System. Finally, the Memory Management System signals the Interrupt Control to generate a pulse on the Interrupt Accepted Line to processor i.
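The request/accept sequence can be modeled as a toy state machine. The addresses and the single interrupting processor below are assumptions; in PASM the Interrupt Control and Access Control are hardware logic, not software.

```python
class InterruptControl:
    """Toy model of the Interrupt Request Lines into the Interrupt Control."""
    def __init__(self):
        self.requests = []            # queued pulses, one per interrupting processor

    def pulse_request(self, proc):
        self.requests.append(proc)

class MemoryModule:
    def __init__(self):
        self.cells = {}               # sparse memory: address -> contents

def mms_service(ictl, modules, interrupt_loc):
    """Memory Management System servicing one interrupt."""
    proc = ictl.requests.pop(0)            # which processor pulsed its line
    mod = modules[proc]
    ptr = mod.cells[interrupt_loc]         # Access Control reads the pointer
    ctl_list = mod.cells[ptr]              # then reads the control information list
    return proc, ctl_list                  # MMS then pulses proc's Accepted Line

modules = {3: MemoryModule()}              # processor i = 3 (hypothetical)
modules[3].cells[0x2000] = ["load IR copy into module 3"]
modules[3].cells[0xFFFF] = 0x2000          # pointer at the interrupt location
ictl = InterruptControl()
ictl.pulse_request(3)                      # the memory write triggers the pulse
proc, req = mms_service(ictl, modules, 0xFFFF)
```

The returned pair tells the Memory Management System who interrupted and what loading changes were requested.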

The same hardware arrangement described for SIMD tasks is used for MIMD tasks. With each group of N/Q MIMD processors, there is associated a memory supervisor, which is logical processor 0 within the group. The memory supervisor possesses the hardware for Memory Management System interrupt generation and loading sequence alterations, using the same arrangement described for SIMD mode. All processors associated with a given memory supervisor make requests for loading sequence changes through the memory supervisor, without involving the Micro Controllers or System Control Unit. This reduces System Control Unit contention problems, as mentioned above, and helps prevent the Micro Controller(s) orchestrating the virtual MIMD machine from becoming overburdened.

The scheme described here is well suited to parallel computer systems which execute multiple parallel tasks, since it can easily keep track of and arbitrate multiple requests for data loading sequence alterations. This technique also makes efficient use of the multiple memory arrangement of PASM by using the hardware structure of the memory system to provide for communication of loading sequence alteration information from the memory system to the controller which loads data into the memory system.

VI. PASM Memory Management System

Tasks for which the Memory Management System is responsible include file system maintenance, scheduling of Parallel Computation Unit memory module loading and unloading, and Memory Storage System bus control. A set of microprocessors is dedicated to performing the Memory Management System tasks in a distributed fashion, i.e., one processor will handle Memory Storage System bus control, one will handle the scheduling tasks, etc. This distributed processing approach is chosen in order to provide the Memory Management System with a large amount of processing power at low cost. In addition, dedicating specific microprocessors to certain tasks simplifies both the hardware and software required to perform each task.

Figure 6: Distributed Memory Management System.

The basic architecture of the Memory Management System is shown in Figure 6. The Memory Management System consists of a master processor which coordinates the concurrent tasks executed by the slave processors, a shared memory for storage of data required by more than one processor, a local ROM and RAM for each processor for storage of code and local data respectively, and an interface to the shared memory for each processor.

A shared memory approach is used to allow the processors to communicate with each other and to share data. This approach is planned due to the need to share relatively large quantities of data such as file tables and task queues. As an example, consider a queue of pending Memory Storage System to Parallel Computation Unit memory module data transfer operations. This queue


would need to be accessible both to the processor in charge of the Memory Storage System bus system and to the processor in charge of scheduling such transfers.

To reduce contention for the shared memory, each processor uses a local ROM and RAM for storage of code and local data. In addition, the shared memory may be interleaved [14] to further reduce contention. The degree of interleaving desirable may be determined by simulation studies or queuing theory analysis [6] of the Memory Management System.
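Low-order interleaving of the shared memory can be sketched as follows; the module count is an assumption chosen for illustration, and the actual degree of interleaving is what the simulation or queuing studies would determine.

```python
NUM_MODULES = 4                  # assumed degree of interleaving

def module_of(addr):
    """Low-order interleaving: consecutive addresses map to different
    modules, so processors touching a contiguous shared region tend
    to hit distinct modules and avoid contending for the same one."""
    return addr % NUM_MODULES

# Four consecutive shared-memory accesses land in four distinct modules.
mapping = [module_of(a) for a in range(4)]
```

Here `mapping` is `[0, 1, 2, 3]`: each of the four accesses can proceed in parallel.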

The processors within the Memory Management System may be implemented using commercially available fixed instruction set microprocessors. The new generation of 16-bit processors [15, 18, 19, 33] is particularly attractive since many provide special hardware for operations such as locked increment-and-test, memory protection and management, and problem/supervisor state switching. Features such as these would considerably simplify the hardware and software design of the Memory Management System. An alternative to the 16-bit processors is the less expensive 8-bit microprocessors currently available [8, 37]. The choice of a processor type will be governed by the amount of processing required to perform the tasks associated with the Memory Management System and the cost trade-offs involved.

The division of tasks chosen is based on the main functions which the Memory Management System must perform. The functions to be performed include:
1) communication with the System Control Unit and generating slave tasks based on Parallel Computation Unit memory module load/unload requests from the System Control Unit,
2) interrupt handling and generating slave tasks for data loading sequence changes requested by the Parallel Computation Unit processors physically numbered 0 to Q-1 (see previous section),
3) scheduling of Memory Storage System data transfers,
4) control of input/output operations involving peripheral devices and the Memory Storage System,
5) control and maintenance of the Memory Management System file directory information and the creation and deletion of data files, and
6) control of the Memory Storage System bus system.

Most Memory Management System operations will be initiated by the System Control Unit, since it will be responsible for coordinating the operation of the PASM system. For this reason, the master processor is chosen to communicate with the System Control Unit and to perform the task spawning operations associated with System Control Unit requests.

Parallel Computation Unit processor interrupt handling is assigned to one slave processor. This slave sends requests for Parallel Computation Unit memory module data loading sequence changes to the master processor.

Scheduling of all Memory Management System operations involving data transfers using the Memory Storage System bus system is assigned to another slave processor. One slave processor is devoted solely to performing scheduling operations since the scheduling of data transfers will be complex and time consuming if near optimal operation of this system is to be realized.

Another slave is devoted to handling input/output between the Memory Storage System and peripheral devices such as magnetic tape units and color video displays. This slave would handle any communications with the peripheral devices and schedule access to the Memory Storage units.

The control and maintenance of the Memory Management System file system is assigned to one or more slave processors. To understand why multiple slave processors may be required, consider the configuration of the Memory Storage System. It will consist of N/Q secondary storage devices which operate in a parallel fashion. The secondary storage devices will be required to locate and transfer data files based on file IDs presented to the Memory Management System. For the suggested values of N=1024 and Q=16, a total of 64 secondary storage devices may be involved in transferring data files at any given time. It is apparent that the file location operations associated with this many devices will exceed the processing capabilities of one slave processor. The exact number of slave processors to be devoted to file directory maintenance will be determined by simulation and/or queuing theory analyses of the Memory Storage System. Another possibility is to assign a microprocessor to each Memory Storage System unit for file directory maintenance (e.g., intelligent disks), and have a single slave coordinate this activity.
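The sizing argument follows directly from the suggested parameters:

```python
# Device count for the suggested PASM parameters given in the text.
N, Q = 1024, 16      # Parallel Computation Unit processors and Micro Controllers
devices = N // Q     # secondary storage devices operating in parallel
print(devices)       # 64 devices may be transferring files at any given time
```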

A slave processor is devoted to performing the operations associated with the configuration and control of the Memory Storage System bus system. This would involve setting the control signals needed to connect each Memory Storage System unit to the appropriate Parallel Computation Unit memory module.
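One possible summary of the master/slave division described in this section, sketched in Python with hypothetical processor labels (the actual assignment, including how many slaves share the file directory work, is fixed by the design studies mentioned above):

```python
# Hypothetical mapping of the six MMS functions onto dedicated processors.
TASK_ASSIGNMENT = {
    "system_control_unit_requests": "master",   # task spawning for SCU requests
    "pcu_interrupt_handling":       "slave_1",  # loading sequence change requests
    "transfer_scheduling":          "slave_2",  # Memory Storage System transfers
    "peripheral_io":                "slave_3",  # tape units, video displays
    "file_directory":               "slave_4",  # possibly several slaves in practice
    "storage_bus_control":          "slave_5",  # MSS bus configuration signals
}

def dispatch(function):
    """Route a request to the processor dedicated to that function."""
    return TASK_ASSIGNMENT[function]
```

For example, a request to schedule a data transfer goes to the scheduling slave rather than burdening the master.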

The hardware structure of the Memory Management System is such that additional slave processors may be added to perform tasks that are not considered to be part of the Memory Management System processing load at this time. In an actual prototype Memory Management System, interfaces for additional slave processors would be provided to facilitate system expansion and the incorporation of new features into the Memory Management System.

VII. Conclusions

An overview of PASM, a partitionable SIMD/MIMD system for image processing and pattern recognition being designed at Purdue University, was given. To improve the throughput of this large-scale dynamically reconfigurable multimicroprocessor system, a highly parallel memory system was described. The memory system uses double-buffered primary memories, parallel secondary memories, and a set of dedicated microprocessors. The organization of this memory system was presented and its advantages were discussed.

Acknowledgements

The authors thank L. J. Siegel, R. J. McMillen, and P. T. Mueller, Jr., for their comments and suggestions.


References

[1] G. Barnes, et al., "The Illiac IV computer," IEEE Trans. Comp., Vol. C-17, No. 8, Aug. 1968, pp. 746-757.
[2] W. J. Bouknight, et al., "The Illiac IV system," Proc. IEEE, Vol. 60, Apr. 1972, pp. 369-388.
[3] F. Briggs, K. S. Fu, K. Hwang, and J. Patel, "PM4 - a reconfigurable multimicroprocessor system for pattern recognition and image processing," Nat'l. Comp. Conf., June 1979, pp. 255-265.
[4] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comp., Vol. C-23, No. 3, Mar. 1974, pp. 309-318.
[5] M. J. Flynn, "Very high-speed computing systems," Proc. IEEE, Vol. 54, Dec. 1966, pp. 1901-1909.
[6] F. T. Fung and H. C. Torng, "Analysis of memory conflicts in a multiple microprocessor system," IEEE Trans. Comp., Vol. C-28, No. 1, Jan. 1979, pp. 28-37.
[7] J. Hayes, Computer Architecture and Organization, McGraw-Hill, New York, 1978.
[8] Intel, Component Data Catalog, Intel Corp., 1978.
[9] J. Keng and K. S. Fu, "A special computer architecture for image processing," 1978 IEEE Comput. Soc. Conf. Pattern Recognition and Image Processing, June 1978, pp. 287-290.
[10] T. Lang and H. S. Stone, "A shuffle-exchange network with simplified control," IEEE Trans. Comp., Vol. C-25, No. 1, Jan. 1976, pp. 55-65.
[11] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comp., Vol. C-24, No. 12, Dec. 1975, pp. 1145-1155.
[12] G. J. Lipovski, "On a varistructured array of microprocessors," IEEE Trans. Comp., Vol. C-26, No. 2, Feb. 1977, pp. 125-138.
[13] G. J. Lipovski and A. Tripathi, "A reconfigurable varistructure array processor," 1977 Int'l. Conf. Parallel Processing, Aug. 1977, pp. 165-174.
[14] R. E. Matick, "Memory and storage," in Introduction to Computer Architecture, H. S. Stone, editor, Science Research Associates, Chicago, Illinois, 1978.
[15] S. P. Morse, W. B. Pohlman, and B. W. Ravenel, "The Intel 8086 microprocessor: a 16-bit evolution of the 8080," Computer, Vol. 11, No. 6, June 1978, pp. 18-27.
[16] G. J. Nutt, "Microprocessor implementation of a parallel processor," 4th Annual Symp. Comp. Arch., Mar. 1977, pp. 147-152.
[17] M. C. Pease, "The indirect binary n-cube multiprocessor array," IEEE Trans. Comp., Vol. C-26, No. 5, May 1977, pp. 458-473.
[18] B. L. Peuto, "Architecture of a new microprocessor," Computer, Vol. 12, No. 2, Feb. 1979, pp. 10-21.
[19] M. Shima, "Two versions of 16-bit chip span microprocessor, microcomputer needs," Electronics, Vol. 51, No. 26, Dec. 1978, pp. 81-88.
[20] H. J. Siegel, "Single instruction stream - multiple data stream machine interconnection network design," 1976 Int'l. Conf. Parallel Processing, Aug. 1976, pp. 273-282.
[21] H. J. Siegel, "Analysis techniques for SIMD machine interconnection networks and the effects of processor address masks," IEEE Trans. Comp., Vol. C-26, No. 2, Feb. 1977, pp. 153-161.
[22] H. J. Siegel, "Preliminary design of a versatile parallel image processing system," 3rd Biennial Conf. on Computing in Indiana, Apr. 1978, pp. 11-25.
[23] H. J. Siegel, "Partitionable SIMD computer system interconnection network universality," 16th Annual Allerton Conf. on Communication, Control, and Computing, Oct. 1978, pp. 586-595.
[24] H. J. Siegel, "Interconnection networks for SIMD machines," Computer, Vol. 12, No. 6, June 1979, pp. 57-65.
[25] H. J. Siegel, "Partitioning permutation networks: the underlying theory," 1979 Int'l. Conf. Parallel Processing, Aug. 1979.
[26] H. J. Siegel, R. J. McMillen, and P. T. Mueller, Jr., "A survey of interconnection methods for reconfigurable parallel processing systems," Nat'l. Comp. Conf., June 1979, pp. 529-542.
[27] H. J. Siegel and P. T. Mueller, Jr., "The organization and language design of microprocessors for an SIMD/MIMD system," 2nd Rocky Mt. Symp. on Microcomputers, Aug. 1978, pp. 311-340.
[28] H. J. Siegel, P. T. Mueller, Jr., and H. E. Smalley, Jr., "Control of a partitionable multimicroprocessor system," 1978 Int'l. Conf. Parallel Processing, Aug. 1978, pp. 9-17.
[29] H. J. Siegel, L. J. Siegel, R. J. McMillen, P. T. Mueller, Jr., and S. D. Smith, "An SIMD/MIMD multimicroprocessor system for image processing and pattern recognition," 1979 IEEE Comp. Soc. Conf. Pattern Recognition and Image Processing, Aug. 1979.
[30] H. J. Siegel and S. D. Smith, "Study of multistage SIMD interconnection networks," 5th Annual Symp. Comp. Arch., Apr. 1978, pp. 223-229.
[31] S. D. Smith and H. J. Siegel, "Recirculating, pipelined, and multistage SIMD interconnection networks," 1978 Int'l. Conf. Parallel Processing, Aug. 1978, pp. 206-214.
[32] S. D. Smith and H. J. Siegel, "An emulator network for SIMD machine interconnection networks," 6th Int'l. Symp. Comp. Arch., Apr. 1979.
[33] E. Stritter and T. Gunter, "A microprocessor architecture for a changing world: the Motorola 68000," Computer, Vol. 12, No. 2, Feb. 1979, pp. 43-52.
[34] R. J. Swan, S. H. Fuller, and D. P. Siewiorek, "Cm*: a modular, multi-microprocessor," Nat'l. Comp. Conf., June 1977, pp. 645-655.
[35] P. H. Swain, H. J. Siegel, and B. W. Smith, "A method for classifying multispectral remote sensing data using context," Symp. on Machine Processing of Remote Sensing Data, June 1979, pp. 343-353.
[36] H. Sullivan, T. R. Bashkow, and K. Klappholz, "A large scale, homogeneous, fully distributed parallel machine," 4th Annual Symp. Comp. Arch., Mar. 1977, pp. 105-124.
[37] J. F. Wakerly, "Intel MCS-48 microcomputer family: a critique," Computer, Vol. 12, No. 2, Feb. 1979, pp. 22-31.
[38] W. A. Wulf and C. G. Bell, "C.mmp - a multi-miniprocessor," Proc. FJCC, Dec. 1972, pp. 765-777.