
Benchmarking a massively parallel

computer by a synthetic aperture radar

processing algorithm

Y. Luo, I.G. Cumming, M.J. Yedlin

Canada

ABSTRACT

Synthetic Aperture Radar (SAR) processing is a very computationally intensive signal processing application for a Massively Parallel Computer (MPP). It involves a series of complex array manipulations, such as FFTs, complex array multiplications, transpositions, and space-variant convolutions, which are quite common in two-dimensional computing or signal processing applications. Optimally implementing a SAR processing algorithm on an MPP such as the Connection Machine CM-2 or CM-200 is a suitable way to study the architecture and to test the sustained performance of an MPP. In this paper, it is shown how the SAR algorithm can be mapped onto the parallel architecture of the CM and how suitable the CM architecture and CM Fortran features are for the implementation of the SAR algorithm. The computing speeds of an optimized CM Fortran SAR processing benchmark program are also investigated and discussed. The results show that 1/4 of the real-time processing speed (nearly 1 Gflops) for the satellite SAR application can be achieved on the CM-2 or CM-200.

Transactions on Information and Communications Technologies vol 3, © 1993 WIT Press, www.witpress.com, ISSN 1743-3517


394 Applications of Supercomputers in Engineering

INTRODUCTION

Synthetic Aperture Radar (SAR) signal processing is a very computationally intensive application for a Massively Parallel Computer (MPP). In order to efficiently implement a SAR processing algorithm on an MPP such as the Connection Machine, some CM architecture and CM Fortran features related to SAR signal processing are discussed and analyzed in the following sections. A detailed mapping procedure from the SAR processing algorithm to the CM architecture is also presented. This implementation shows that a processing speed significantly higher than those of common custom-hardware SAR processors and some other supercomputers can be achieved on the Connection Machine.

CM ARCHITECTURE

The Connection Machine system (manufactured by Thinking Machines Corp., Cambridge, Mass.) consists of a parallel processing unit containing up to 64K data processors, a front-end computer (SUN SPARC or VAX), an I/O system and other peripherals. The central element in the system is the parallel processing unit, which contains [1]:

• from 2K to 64K data processors

• one or multiple sequencers that control the data processors

• an interprocessor communication network

• I/O controller and framebuffer modules

Slicewise Processing Nodes

All CM-2s or CM-200s are organized into processing nodes that consist of 32 bit-serial processors with their associated memory, a 64-bit floating-point accelerator, and interfaces for interprocessor communication. There are two execution modes for CM programs. Programs compiled for the Paris mode use the bit-serial CM processors as the basic physical processing-plus-memory units. Programs compiled for the Slicewise mode use the processing nodes as the basic physical units.

The slicewise mode can make better use of the performance potential of the vector-processing abilities of the 64-bit floating-point accelerator. When invoked in slicewise mode, the compiler views the CM as a set of vector processors.

Interprocessor Communication Modes

The CM systems provide three forms of interprocessor communication:


• General communication: a hypercube message-passing router physically connects the addresses differing by one bit. Within a certain range, the hypercube communication of the CM is almost independent of the communication distance [2];

• NEWS grid communication: nearest-neighbor grid-based hardware wires each processor to its four nearest neighbors in a 2-D Cartesian grid. It is particularly suitable for communication between two addresses with indices differing by 1, but this advantage diminishes as the communication distance increases [3];

• Global communication: this involves cumulative and reduction computations along array dimensions, such as MAXVAL, SPREAD and global scanning. It is much faster than general communication on the CM, although not as fast as NEWS.

CM Parallel I/O System

Every CM system has from 2 to 16 channels available for data and/or image I/O. In particular, the CM I/O system also offers a mass data storage device called the DataVault, which has 30 to 60 GBytes of storage in its basic configuration and is capable of transferring data at a sustained rate above 25 MBytes per second. Loading CM memory with data for processing from the DataVault can therefore be very fast. The DataVault also has independent I/O interfaces to other peripherals such as tape drives. Thus, the CM need not participate in tape-to-DataVault transfers, leaving it free to perform other operations while these transfers are going on. This feature is particularly useful for SAR data downloading and processing.

CM FORTRAN FEATURES

Here we briefly introduce some important CM Fortran features described in [4].

Fortran 90 Style

The CM Fortran language is an implementation of Fortran 77 supplemented with array processing extensions from ANSI Fortran 90. The essence of the Fortran 90 array processing feature is that an array object can be referenced by name in an expression or passed as an argument to an intrinsic function, and the operation is performed on every element of the array. This feature maps naturally onto the parallel architecture of the CM system, which brings thousands of processors to bear on large arrays, processing all the elements in unison.
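The array-expression semantics can be sketched in Python (a toy illustration, not TMC code; the class name and its methods are invented for this sketch):

```python
# Sketch: Fortran 90 array syntax such as C = A + B applies the operation
# to every element; a data-parallel machine maps each element to a
# (virtual) processor. A plain Python class mimics that semantics.

class DataParallelArray:
    """Toy stand-in for a CM array: one value per (virtual) processor."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Elementwise, as if each processor added its own pair in unison.
        return DataParallelArray(a + b for a, b in zip(self.values, other.values))

    def __mul__(self, other):
        return DataParallelArray(a * b for a, b in zip(self.values, other.values))

a = DataParallelArray([1, 2, 3, 4])
b = DataParallelArray([10, 20, 30, 40])
c = a + b          # the whole-array expression; no explicit DO loop
print(c.values)    # [11, 22, 33, 44]
```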

Virtual Processing


Since many data sets are larger than even the largest CM, the system uses a virtual processing mechanism, whereby each physical processor simulates some number of virtual processors by subdividing its memory, to ensure that a virtual processor is assigned to each array element. Each physical processor executes its instructions as many times as there are virtual processors assigned to it, a process called VP looping. Array objects are allocated in CM memory in VP (Virtual Processor) sets. The CM Fortran compiler creates VP sets and maps CM Fortran arrays onto them. The VPs in a VP set are logically configured into an n-dimensional rectangular grid. More relevant to the physical memory layout is the ordering of VPs along a grid axis, to suit the requirement that VPs with adjoining addresses be physically connected. The CM Fortran compiler provides the LAYOUT compiler directive to express the ordering and weights of VP axes. The directive determines the mapping of arrays to physical processors and their intercommunication modes. There are three forms of axis ordering: NEWS, SEND and SERIAL. NEWS and SEND ordering map a VP-set grid axis in NEWS addressing and router addressing, respectively. SERIAL ordering makes all elements along the axis local (within one processing node). The weights of VP axes keep the VPs along the highest-weighted axis as local as possible (allocated into one node), so the fastest communication can be achieved along this axis.
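The VP-looping cost is simple arithmetic; a hedged sketch (the function name and example sizes are ours, not from the paper):

```python
import math

def vp_looping_iterations(array_elements, physical_processors):
    """Each physical processor simulates ceil(N/P) virtual processors by
    subdividing its memory, and repeats each instruction that many times."""
    return math.ceil(array_elements / physical_processors)

# A 2048 x 4096 array spread over an 8K-processor machine:
print(vp_looping_iterations(2048 * 4096, 8192))   # 1024
```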

Array Layout

The CM Fortran compiler lays out arrays in CM memory in two different ways:

• Without the LAYOUT directive (canonical layout): all grid axes are NEWS-ordered and all are assigned the weight 1. Arrays of the same shape are placed in the same VP set and their corresponding elements are placed in the same VP. Thus elementary operations between the arrays require no interprocessor communication. The canonical layout optimizes performance in the same way that the virtual processor mechanism does: by spreading array elements across the whole physical machine to maximize Processing Element (PE) use and minimize VP looping;

• With the LAYOUT directive (non-canonical layout): this provides the user more flexible ways to specify the axis ordering and weights of the VP set. It also supports serial ordering, which allocates the serial dimensions within the memory of the VPs. Inter-element operations along a serial dimension involve only memory indexing rather than interprocessor communication, which is usually much slower.

Elemental Code Block, Start-up Overhead and Unwinding Serial Loop


The slicewise compiler excels when it deals with large blocks of elemental operations on one or more conformable CM arrays. Since these blocks execute entirely within the respective processing nodes, they enable the compiler to make best use of the FPU registers and pipelined vector processing. Every elemental code block or communication operation involves some start-up overhead: the time required for the CM to receive addresses and data from the front end. So the larger the elemental block, the more iterations in the VP looping, and thus the greater the time over which to amortize the overhead. However, the maximum number of parallel addresses in one elemental code block is limited by the CM sequencer address register. Consequently, unwinding the serial loops can further improve performance on serial dimensions, because this increases the elemental code block size and thus helps amortize the start-up overhead.
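The amortization argument can be made concrete with a toy cost model (the overhead and per-element figures below are illustrative, not measured):

```python
def block_time_us(elements, per_element_us, startup_us):
    """Toy cost model: each elemental code block pays a fixed start-up
    overhead once, then processes its elements."""
    return startup_us + elements * per_element_us

# 1000 elements as one block vs. ten blocks of 100 (times in microseconds):
one_block = block_time_us(1000, 1, 500)        # 1500
ten_blocks = 10 * block_time_us(100, 1, 500)   # 6000
print(one_block, ten_blocks)   # larger blocks amortize the overhead
```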

CM Fortran Utilities

CM Fortran provides many subroutine utilities to access certain CM hardware features not accounted for by either Fortran 90 or CM Fortran proper, and to achieve better performance through low-level operations than the current CM Fortran compiler can deliver.

• CMF_FE_ARRAY_TO_CM and CMF_FE_ARRAY_FROM_CM transfer the contents of a front-end array to and from a CM array. These subroutines transfer the array contents as fast as the connection between the front end and the CM allows. They are particularly useful for non-parallel I/O operations such as parameter I/O;

• Lookup table utilities: these subroutines use certain indirect addressing hardware associated with each processing node rather than the router. Therefore, they work substantially faster. They are limited by the available memory in a single processing node, as a complete copy of the table is placed in each node's memory. These utilities are very useful for referencing interpolator coefficients and looking up indices;

• Vector-valued subscripts on serial axes: by taking advantage of the CM's gather-scatter hardware, these utilities work much faster than the equivalent CM Fortran code. They provide a very efficient way to index the serial dimensions;

• Parallel prefix or recurrence subroutines: the CM has special hardware and microcode that perform parallel prefix operations along one axis of an array. These subroutines are therefore much faster than the equivalent CM Fortran implementations, which currently cannot be compiled into efficient parallel operations.

CMSSL


The Connection Machine Scientific Software Library (CMSSL) provides data parallel implementations of many familiar numerical routines used in signal processing, the solution of partial differential equations, optimization and statistical analysis. Here we only address the FFT utility, which is the kernel operation in SAR processing.

The CMSSL FFT utility provides many flexible and convenient ways to specify the FFT operations. Different operations (forward, inverse or no-operation) can be done simultaneously along different dimensions. This feature makes better use of the parallel architecture of the CM system. Different bit-order addressing can also be used for controlling the input and output indexing. This feature matches the Cooley-Tukey algorithm's bit-reversed ordering in the transform domain and can avoid some unnecessary interprocessor communication.

As the kernel operation in the Cooley-Tukey FFT algorithm, the butterfly operation always operates between data at a distance of 2^k, a power of two. On the CM, the most efficient communication method for this is the hypercube router. Therefore the efficient layout for FFT arrays is SEND or SERIAL ordering.
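The power-of-two butterfly distances can be seen in a minimal radix-2 FFT sketch (plain Python, decimation in time; a generic Cooley-Tukey implementation, not the CMSSL routine):

```python
import cmath

def fft_radix2(x):
    """Iterative in-place Cooley-Tukey FFT. Butterflies at each stage
    combine elements 2**s apart, which is why hypercube (SEND-ordered)
    addressing suits the communication pattern."""
    n = len(x)
    assert (n & (n - 1)) == 0 and n > 0, "length must be a power of two"
    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: the stride doubles each stage (2, 4, ..., n).
    length = 2
    while length <= n:
        w_len = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(length // 2):
                u = x[start + k]
                v = x[start + k + length // 2] * w
                x[start + k] = u + v
                x[start + k + length // 2] = u - v
                w *= w_len
        length *= 2
    return x
```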

MAPPING OF SAR PROCESSING ONTO CM USING CM-FORTRAN

In satellite Synthetic Aperture Radar (SAR), lines of received radar energy consisting of approximately 8000 complex samples are collected, typically at the rate of 1400 lines per second. Data is stored in a large two-dimensional array, in which each line is placed in the range direction, and successive lines are built up in the azimuth direction. In order to form the SAR image, large two-dimensional convolutional operators have to be applied. These are usually applied with Fast Fourier Transforms (FFTs). In addition, a space-variant interpolation has to be performed in the range direction to correct for the coupling of received energy between the range and azimuth directions (this is referred to as range cell migration correction, or RCMC).

With this algorithm, SAR processing requires approximately 4000 million operations per second, and so is a suitable application for testing the performance limits of modern parallel computers. Because only 5 to 15 minutes of data are collected every 100 minutes (one satellite orbit), a processor operating at 1/10 of real time is suitable for many ground station applications. In the remainder of the paper, we show how a typical SAR processing algorithm is mapped onto a CM machine, using a parallel version of the Fortran language.
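The 4000 million operations per second figure can be sanity-checked with back-of-envelope arithmetic (the 5N log2 N per-FFT operation count and the "other operations" allowance are our assumptions, not the paper's accounting):

```python
from math import log2

lines_per_sec = 1400
samples_per_line = 8000
sample_rate = lines_per_sec * samples_per_line     # 11.2e6 complex samples/s

nr, na = 4096, 2048
# Assume ~5*log2(N) real ops per sample per FFT pass: range FFT + IFFT
# and azimuth FFT + IFFT, plus a rough allowance for the filter
# multiplications, RCMC interpolation and detection.
fft_ops_per_sample = 5 * (2 * log2(nr) + 2 * log2(na))
other_ops_per_sample = 100                         # rough guess
total_mops = sample_rate * (fft_ops_per_sample + other_ops_per_sample) / 1e6
print(round(total_mops))   # a few thousand Mops/s, the order the paper quotes
```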

SAR processing is done on two-dimensional data blocks, where the block sizes are chosen to be powers of two, with the size chosen from an arithmetic/memory trade-off. In each block, there are Nr points in the range direction (typically 4096) and Na points in the azimuth direction (typically


Figure 1: The SAR Processing Flow Chart. [The Nr-by-Na input array passes through an Nr-point FFT along each of the Na range lines; Na Nr-point filter multiplications; an Nr-point IFFT along each of the Na lines combined with an Na-point azimuth FFT; Nr Na-point multiplications with the azimuth phase functions; an Na-point IFFT at each of the Nr azimuth pulses; and detection, producing the Nr-by-Na output array.]

2048). For the purposes of the benchmark test, the SAR processing steps are implemented as in Figure 1.

Because the Connection Machine (CM-2 or CM-200) is a SIMD (Single Instruction Multiple Data) massively parallel computer, it is impossible to pipeline the different parts of SAR processing as in conventional signal processors. Therefore, I/O operations, data conversions, range compression, RCMC, azimuth compression, magnitude detection and averaging have to be executed step by step on all the processors. Even though it is possible to use the different sections (sequencers) independently, the efficiency might not be higher. As mentioned before, the CM distributed memory is always associated with the processing nodes and VP sets, so every element of an array has a distributed memory location. Under this circumstance, the corner-turning operation is unnecessary, as it just uses a different addressing scheme on the available distributed memory. In fact, in this benchmark program, the range inverse FFT and the azimuth forward FFT are performed at the same time. This is also one of the advantages of the CM.

Generally, the only way to increase the processing speed is to reduce the execution time of each step as much as possible. Several ways to achieve better performance have been investigated, as outlined below:

Layout and FFT Bit Order

1. Lay out the first dimension (the range dimension; in Fortran, the


left-most index changes fastest) serially. This means all the operations along the range dimension can be executed within the processing node and without interprocessor communication. This layout also provides the possibility of using CMF_AREF_1D for indexing on the serial axis, so that a faster circular shift for integer RCMC can be implemented. This layout is memory-costly: it saves many communications but costs a large amount of VP looping for any operation along the range dimension;

2. Lay out the second dimension (azimuth dimension) of all the parallel arrays with SEND ordering. This is necessary for the FFT because the hypercube router communication is more suitable for the butterfly operation. Even one NEWS-ordered axis will increase running time by approximately 30% to 150%. Laying out all the array parallel axes in SEND order avoids unnecessary interprocessor communication caused by operations between mixed-ordered arrays. Fortunately, in this algorithm, we do not need any operations based on the NEWS grid, and elementary arithmetic operations are independent of axis ordering because they are executed entirely within the respective processing nodes;

3. Specify different bit orders for the FFT input and output along each active axis, but the same bit order for both input and output along each non-activated axis. This is based on the bit-reverse nature of the Cooley-Tukey FFT algorithm and avoids unnecessary bit-reversing manipulations, which usually need a large number of interprocessor router communications. To match the bit-reversed property in the frequency domain, all the operations in the frequency domain should be done in bit-reversed order. For example, the range and azimuth matched filters and the range migration should all be computed in bit-reversed order. This requires the bit-reversed indices to be pre-computed.
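Precomputing the bit-reversed indices is straightforward; a generic sketch (not the CM Fortran code):

```python
def bit_reversed_indices(n):
    """Indices of an n-point array in bit-reversed order (n a power of two).
    Frequency-domain factors (matched filters, RCMC) can be precomputed in
    this order so no reordering communication is needed after the FFT."""
    bits = n.bit_length() - 1
    # Reverse the bits of each index by reversing its binary string.
    return [int(format(i, f'0{bits}b')[::-1], 2) for i in range(n)]

print(bit_reversed_indices(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```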

Two-Dimensional Range Matched Filter

Replicate the range matched filter along the azimuth dimension so that it can directly multiply the 2-D signal array. This saves many communications, as it involves no interprocessor communication if the two conformable arrays are laid out properly.

Lookup Table

Use the lookup table utility for looking up interpolator coefficients and the RCMC shift indices. This can substantially improve performance by using in-node hardware instead of vector-valued subscripts, which involve router communication.


Integer Circular Shift

Use the CMF_AREF_1D utility to perform the integer circular shift along the serially-ordered range dimension. Even though there is an intrinsic function CSHIFT in CM Fortran, designed for circular shifting along one axis, it only works efficiently along a NEWS-ordered axis (SEND ordering is used here because of the FFT requirement) and for shift amounts equal to a power of 2. Thus, the CMF_AREF_1D utility is faster.
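The per-line integer shift amounts to in-node indexed addressing; in Python terms (an illustration of the operation, not of CMF_AREF_1D's interface):

```python
def rcmc_integer_shift(range_line, shift):
    """Circular shift of one range line by an integer number of cells.
    Along a SERIAL-ordered axis this is pure in-node memory indexing."""
    n = len(range_line)
    s = shift % n
    return range_line[-s:] + range_line[:-s] if s else list(range_line)

print(rcmc_integer_shift([1, 2, 3, 4, 5], 2))   # [4, 5, 1, 2, 3]
```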

Serial Loop

Unwind the serial loop of the RCMC convolution so as to increase the size of the elemental code block and to amortize the start-up overhead.

Vector Summation

Use the CMF_SCAN_ADD parallel prefix utility for the 4-point summation. The advantage of using this utility is that it not only achieves higher performance but also yields useful by-products. For example, in this algorithm, four different summed or averaged results can be obtained simultaneously by calling the utility once: the 4-point average, 3/4 of the 3-point average, 1/2 of the 2-point average and 1/4 of the original values.
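The by-product property follows from the definition of an inclusive prefix sum (a generic sketch of the scan pattern, not the CMF_SCAN_ADD interface):

```python
def scan_add(a):
    """Inclusive prefix sum (the parallel-prefix 'scan add' pattern)."""
    out, total = [], 0
    for v in a:
        total += v
        out.append(total)
    return out

# One scan yields k-point sliding sums for any k as by-products:
# sum(a[i-k+1 : i+1]) == s[i] - s[i-k]
a = [1, 2, 3, 4, 5, 6, 7, 8]
s = scan_add(a)
four_point = [s[i] - (s[i - 4] if i >= 4 else 0) for i in range(3, len(a))]
print(four_point)   # [10, 14, 18, 22, 26]
```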

Other Considerations

• Use the sin/cos functions instead of the complex exponential function for computing the azimuth matched filter on the fly. This usually saves some execution time.

• Use the SQRT function instead of the complex ABS function to detect the magnitude of the compressed signal.

These two optimizations are based simply on performance measurements rather than on CM architecture requirements, as neither involves communication. It is also possible to use a sin/cos lookup table instead of directly calling the intrinsic functions, but that would require complicated index computations (mod 2π, converting negative values to positive indices), which might cause interprocessor communication. According to actual simulations, it is better to use the sin/cos intrinsic functions directly.
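The two formulations are mathematically identical, which is what makes the substitution safe (Python sketch; the relative speeds on the CM are the paper's measurement and are not reproduced here):

```python
import cmath
import math

phase = 0.7385   # an arbitrary azimuth phase value

via_exp = cmath.exp(1j * phase)                         # complex exponential
via_sincos = complex(math.cos(phase), math.sin(phase))  # two real calls

print(abs(via_exp - via_sincos))   # essentially zero
```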

Parallel I/O

1. Use Fortran file I/O to input small arrays such as parameters and interpolator coefficients, and then use the CMF_FE_ARRAY_TO_CM utility to transfer them (or some of them) to the CM.

2. Use the CM parallel file system (DataVault) to input and output large CM arrays such as the raw signal data, the range matched filter and the final processed image data. The CM parallel file system can not only


transfer large data blocks at high speed directly between CM memory and the DataVault, but also keep the CM array layout information in the CM files.

PERFORMANCE DISCUSSION

Timing Results on CM-2

The following are the timing results of executing the benchmarked SAR processing algorithm on an 8K CM-2. Different signal array sizes have been tried so as to determine the optimal processing array size for the 8K CM-2. The performance of the various processing steps is also investigated to determine the general processing requirement profile.

Here, the CM elapsed time is the wall clock time on the front end. All performance timings are in seconds and, where appropriate, are scaled to MFLOPS. Because the front end runs in UNIX time-sharing mode, the elapsed time usually needs to be measured several times. The elapsed time is never less than the CM busy time; its ratio to the CM busy time reflects one aspect of the program's parallel efficiency.

Figure 2: Processing Time Profile for the CM-2. [Bar chart of CM elapsed time and CM busy time for the 2K*4K array, broken down into range FFT & multiplication, range IFFT & azimuth FFT, and 4-point summation.]

From Table 1, one can easily see that the azimuth IFFT (parallel dimension) performance is particularly poor when the azimuth length of the array is 512. That is because the minimum parallel array extent of an 8K slicewise-mode CM is 1K (8K/32*4). Thus a 512-element azimuth extent causes zero-padding along this axis (the range dimension is laid out locally), thereby decreasing the execution efficiency (some processors are idle).
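The 1K minimum extent quoted above follows from the slicewise geometry; as arithmetic (the function name is ours):

```python
def padded_parallel_extent(extent, physical_processors, vector_length=4):
    """Slicewise mode: processing nodes = processors/32, and parallel
    extents are padded to a multiple of nodes * FPU vector length."""
    quantum = (physical_processors // 32) * vector_length
    return -(-extent // quantum) * quantum   # round up to a multiple

# On an 8K machine the quantum is (8192/32)*4 = 1024, so a 512-line
# azimuth extent is padded to 1024 and half the VPs sit idle:
padded = padded_parallel_extent(512, 8192)
print(padded, 512 / padded)   # 1024 0.5
```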


Table 1: Timing Results on 8K processor CM-2

                          Range FFT  Rg IFFT   RCMC     RCMC     Convo-  Azimuth  Total
                          & filter   & azim.   integer  fract.   lution  compr.   on-line
                          multip.    FFT       shift    shift                     ops
CM-200 elapsed time (s)   0.3808     1.5620    0.3376   0.6220   0.2632  1.969    4.534
  (scaled to MFLOPS)      (1454)     (618)     -        -        (1020)  -        -
CM-200 CM busy time (s)   0.3774     1.5580    0.2753   0.5590   0.2627  1.965    4.459
  (scaled to MFLOPS)      (1467)     (619)     -        -        (1022)  -        -
CM-2 elapsed time (s)     0.6433     2.478     0.5446   1.031    0.4492  3.295    7.447
  (scaled to MFLOPS)      (861)      (389)     -        -        (598)   -        -
CM-2 CM busy time (s)     0.5979     2.4360    0.4395   0.9231   0.4469  3.218    7.175
  (scaled to MFLOPS)      (926)      (396)     -        -        (601)   -        -
Improvement factor        1.58       1.56      1.60     1.65     1.70    1.64     1.61
(CM busy time ratio)


The performance results in Table 1 are summarized in Figures 2 and 3. Figure 2 clearly shows that the range and azimuth compressions (FFTs and complex multiplications, including the generation of the azimuth matched filter coefficients) play the major roles in the whole procedure. Figure 3 compares the CM-2's performance at three different array sizes for the main processing steps of the algorithm.

Figure 3: Performance vs. Signal Array Size for the CM-2. [CM processing speed for three array sizes, A: 512*1024, B: 1024*2048 and C: 2048*4096, plotted for range FFT & filter multiplication, RCMC fractional shift, and range IFFT & azimuth FFT.]

Comparison Between the CM-2 and CM-200 Performance

The following table (Table 2) lists the CM-2 and CM-200 performance comparisons for the various steps of the SAR processing. An improvement factor can be easily obtained from the comparisons.

According to the improvement factors in Table 2, a reliable estimate of the CM-200 performance for the benchmark program is 1.6 times the CM-2 performance, i.e. 4.90 seconds CM busy time and 4.97 seconds CM elapsed time. The improvement factor from the CM-200's clock frequency alone is 1.5, and there are some other improvements such as instruction execution speed and off-chip communication speed. Therefore a total performance improvement factor of 1.6 is a reliable estimate.

Notes: Here CM-2 and CM-200 refer to the following CM system configurations:

1. CM-2 Configuration:

• Front End: SUN SPARC


Table 2: Comparison of CM-200 With CM-2 Execution Time

                          Range FFT  Rg IFFT   RCMC     RCMC     Convo-  Azimuth  Total
                          & filter   & azim.   integer  fract.   lution  compr.   on-line
                          multip.    FFT       shift    shift                     ops
CM-200 elapsed time (s)   0.3808     1.5620    0.3376   0.6220   0.2632  1.969    4.534
  (scaled to MFLOPS)      (1454)     (618)     -        -        (1020)  -        -
CM-200 CM busy time (s)   0.3774     1.5580    0.2753   0.5590   0.2627  1.965    4.459
  (scaled to MFLOPS)      (1467)     (619)     -        -        (1022)  -        -
CM-2 elapsed time (s)     0.6433     2.478     0.5446   1.031    0.4492  3.295    7.447
  (scaled to MFLOPS)      (861)      (389)     -        -        (598)   -        -
CM-2 CM busy time (s)     0.5979     2.4360    0.4395   0.9231   0.4469  3.218    7.175
  (scaled to MFLOPS)      (926)      (396)     -        -        (601)   -        -
Improvement factor        1.58       1.56      1.60     1.65     1.70    1.64     1.61
(CM busy time ratio)


• Physical Size: 4K+4K CM-2 (2 sequencers)

• Clock Speed: 7 MHz

• CM Memory Size: 256 Kbits/processor

• FPU Configuration: WTL3164, 64-bit precision

2. CM-200 Configuration:

• Front End: SUN SPARC

• Physical Size: 4K+4K CM-200

• Clock Speed: 10.50 MHz

• CM Memory Size: 1024 Kbits/processor

• FPU: 64-bit precision

CONCLUSIONS AND SUGGESTIONS

Connection Machine Sizing for 1/10th Real Time SAR Processing

According to the execution timing results on the CM-2 and the estimated CM-200 performance, the full-size Na×Nr = 2048×4096 SAR processing benchmark needs about 7.95 seconds elapsed time on the CM-2 and about 4.97 seconds elapsed time on the CM-200 for all on-line processing, excluding I/O and data format conversion. Since the CM DataVault can transfer data at a sustained rate above 25 MBytes/sec, the total of approximately 20 MBytes of data I/O for one aperture of signal processing should take less than one second. The I/O data conversions (8-bit integer to 32-bit float and 32-bit float to 16-bit integer) need only a very small amount of time because no communication is involved.
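The I/O budget above reduces to one division (the figures come from the text; the one-second conversion allowance is our rounding, not a measurement):

```python
data_mbytes = 20             # one aperture of data in and out
datavault_rate = 25          # sustained MBytes/s
io_seconds = data_mbytes / datavault_rate
print(io_seconds)            # 0.8 -- "less than one second"

cm200_online_seconds = 4.97  # estimated CM-200 elapsed time, on-line steps
total = cm200_online_seconds + io_seconds + 1.0   # + conversion allowance
print(total)                 # comfortably inside the ~7 s conservative estimate
```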

A conservative estimate of the total run time (including I/O and data conversion) of the full-size benchmark on the CM-200 is therefore about 7 seconds, i.e. less than 1/10th real time. Accordingly, the following CM system configuration is recommended for 1/10th real time satellite SAR processing:

1. 8K processor CM-200

2. 10.5 MHz or higher clock frequency

3. 1 Mbit or larger memory per processor

4. 64-bit floating-point unit

5. 30 GBytes or more Data Vault

6. fully-configured CM I/O controllers
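The run-time budget behind this recommendation can be checked with a short calculation (figures taken from the text; the one-second I/O figure in the text is an upper bound):

```python
# Estimated total run time for the full-size 2048x4096 benchmark on the CM-200
compute_s = 4.97                    # on-line processing, elapsed time (from the text)
io_bytes = 20e6                     # ~20 MBytes of data I/O per aperture
datavault_rate = 25e6               # sustained Data Vault rate, bytes/sec (>25 MB/s)

io_s = io_bytes / datavault_rate    # transfer time, under one second
total_s = compute_s + io_s          # data-format conversion assumed negligible
print(round(io_s, 2), round(total_s, 2))  # 0.8 5.77
```

The result is comfortably under the conservative 7-second estimate, i.e. below 1/10th of real time for the satellite SAR aperture.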


Under the current layout of the working arrays (i.e. with a serial range dimension), the largest CM processing unit size for the full-size 2K*4K array is 16K processors, because of the limitation of the VP-set mapping. Because the first dimension of the working arrays is laid out locally, the parallel dimension only spreads out along the azimuth direction (maximum size 2K). Under slicewise mode, the number of processing elements is equal to the number of physical processors divided by 32. Since the FPU's vector length is 4, the sizes of VP sets must be a multiple of 4 times the number of processing elements. Any array (along its non-serial axes) whose size does not meet this requirement is padded with zeros to the nearest effective size. This zero-padding deactivates some virtual processors and thereby decreases the execution efficiency. Hence, the minimum parallel dimension size for 16K processors is 16K/32*4 = 2K. If the same layouts are used on a larger CM (more than 16K processors), a performance improvement for 2K*4K array processing cannot be guaranteed.

The improvement between the 16K CM-200 and the 8K CM-200 can only be achieved by the parallel operations along the azimuth dimension, because all the operations along the range axis are executed serially within one processing node. A better way to improve performance might be to increase the working clock frequency, since that would reduce the cost of all the parallel communications and serial operations alike, providing a linear scaling of the CM's performance.
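The slicewise VP-set sizing rule described above can be sketched as a small calculation (the function name and the second example's machine size are illustrative, not from the paper):

```python
def padded_vp_size(parallel_dim, physical_procs, slice_width=32, vector_len=4):
    """Round a parallel dimension up to the nearest multiple of
    vector_len * processing_elements, as slicewise VP sets require."""
    pes = physical_procs // slice_width             # processing elements = processors / 32
    quantum = vector_len * pes                      # VP-set size must be a multiple of this
    padded = -(-parallel_dim // quantum) * quantum  # ceiling to the next multiple
    efficiency = parallel_dim / padded              # fraction of virtual processors kept busy
    return padded, efficiency

# Minimum parallel dimension for a 16K-processor machine: 16K/32*4 = 2K
print(padded_vp_size(2048, 16384))  # (2048, 1.0) -- the 2K azimuth axis fits exactly
print(padded_vp_size(2048, 32768))  # (4096, 0.5) -- on a 32K machine, half the VPs idle
```

The second case illustrates why a machine larger than 16K processors cannot be guaranteed to improve 2K*4K array performance under this layout: the azimuth dimension would be zero-padded and half the virtual processors deactivated.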

Independent Use of CM Sequencers

Even though the CM is generally a SIMD system, it is possible to use different sections of the CM system independently, because a large CM processing unit is usually divided into several sections, each controlled by an independent sequencer.

This capability of the CM could be useful for Doppler parameter estimation in SAR processing. A CM system larger than 16K processors can be used as several sections, so that Doppler parameter computation and other auxiliary processing run simultaneously with the main signal processing work. The synchronization between these independent sequencers is determined by the time-sharing multi-process scheduling on the UNIX front end, or on multiple front ends through NEXUS, a sequencer combiner or multiplexer.

Acknowledgments: The authors would like to thank Dr. J. Claerbout, head of the Stanford Exploration Project, and Thinking Machines Corp. for help in accessing the Connection Machines. The assistance of Dr. Biondo Biondi of TMC is particularly appreciated.


