On parallel scan-conversion algorithms for transputer networks

Journal of Microcomputer Applications (1990) 13, 43-55

H. E. Bez and L. Parks

Department of Computer Studies, University of Technology, Loughborough, Leicestershire LE11 3TU, U.K.

Two fundamental scan line procedures are considered from the point of view of parallel implementation on transputer networks. The paper describes algorithms for the scan conversion of polygons, and the related hidden surface elimination problem for three-dimensional models with polygonal data structures. In each case parallel designs have been implemented and timed in various configurations against a functionally equivalent, but not necessarily optimized, single processor code. The results presented demonstrate that scan line algorithms can be efficiently implemented on suitably configured networks of transputers.

1. Introduction

Computer graphics is widely recognized as an ideal problem domain for parallel computing [1-3]. Many end-users of systems with graphical interfaces are demanding increasing sophistication and, by implication, a much higher performance from the graphics component. Examples, for which high quality fully rendered images are required on the fly, include computer-aided engineering design, flight simulation, and animation. One solution is to design special hardware accelerators to provide the required performance [4,5]. This approach, however, tends to be inflexible and is not readily updated, generalized or reconfigured. Alternatively, software solutions are feasible using reconfigurable general purpose parallel processing systems which meet the computational demands of these and similar applications. The software approach, to which we address ourselves in this paper, more readily lends itself to rapid prototyping and optimization.

Until recently, however, readily usable and affordable parallel processing systems were not widely available; but the advent of the INMOS transputer, and other similar microcomputer chips, has led to a proliferation of relatively inexpensive workstations with a reconfigurable parallel processing sub-system. These novel systems provide an opportunity to design and compare parallel algorithms for a wide variety of application areas, including computer graphics. Variations in the structure of algorithms can be investigated to optimize performance for different transputer configurations, matching algorithm design and network topology. In this paper we describe some experimental work on the parallel implementation, on a transputer-based workstation, of some graphical procedures that are fundamental to many computer graphics applications, and to the implementation of general purpose graphics systems, such as GKS. Many problems in this domain are ideally suited to coarse-grain parallel implementation and the transputer is therefore a compute node of the appropriate type for our purposes.

0745-7138/90/010043 + 13 $03.00/0 © 1990 Academic Press Limited

Figure 1. Overview of the transputer system.

The problems chosen, i.e. the scan conversion of polygons and the elimination of hidden surfaces, direct the attack at the potential output bottleneck of many graphics applications.

The loan of a transputer system, based on the INMOS B004 and B003 evaluation boards hosted by a PC AT clone, from the RAL/DTI transputer initiative has enabled us to implement and test our algorithms in OCCAM 2. An overview of the architecture, which is an example of a MIMD parallel processing machine without shared memory, is shown in Figure 1.

Two standard configurations for MIMD parallel processing systems, to exploit different types of parallelism, are processor pipelines and processor farms. A processor pipeline is an ordered set of processors in which the output of each processor is the input of its successor, and concurrency is achieved by allowing a number of sub-tasks to be in various stages of execution at the same time. A processor farm consists of a network of slave, or worker, processors linked to a master or farm processor which is in turn linked to a host machine or a further network. Each of the slave processors performs the same task and, normally, data packets (i.e. items for one execution of the task) are sent one at a time to the network by the master processor. The master may also receive output from the slave processors and deliver it to the required destination. Alternatively, in cases where this might lead to a bottleneck and impair efficiency, output data may be routed more directly to its destination; this is often the case for transputer representations of processor farms where, due to the limited number of communication links available, it is necessary to construct the farm in a way that requires all but one of the slave processors not to be linked directly to the master processor.

Figure 2. A pipeline of processors.

Figure 3. A processor farm architecture.

It is to be noted that if there are to be more than three worker processors on a transputer-based farm implementation then there will be insufficient transputer links available to map the network shown in Figure 3 directly onto the hardware. The usual solution to this [6] is to map the processor farm onto the hardware topology illustrated in Figure 4, where it can be seen that the physical connectivity is similar to that of a pipeline of processors; the software configuration is, however, quite different. The mapping uses two transputer links per worker processor, allowing the remaining two to be used, for example, for routing computed data to its destination.

Throughout the paper we have based our efficiency calculations on the time required to run the algorithms on a multi-transputer network, relative to the time required to run the same algorithm on a single transputer. We are thus measuring how efficiently the algorithms distribute, rather than estimating their performance against a best serial solution.

We define the speedup S(A) and efficiency E(A) of a distributed algorithm A, executing on N processors, to be

$S(A) = T_1 / T_N$ and $E(A) = 100 \cdot S(A) / N$

where $T_q$ denotes the time taken for A to execute on a network consisting of q transputers in some configuration.
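As a quick check of these definitions, the following minimal Python sketch evaluates them for the one- and two-processor timings of the smaller test polygon reported in Table 1 (480 and 267 time units). The function names are illustrative and do not come from the paper.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup S(A) = T_1 / T_N."""
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency E(A) = 100 * S(A) / N, expressed as a percentage."""
    return 100.0 * speedup(t1, tn) / n


# Timings taken from Table 1: one processor and two processors.
s = speedup(480, 267)
e = efficiency(480, 267, 2)
print(f"S = {s:.2f}, E = {e:.0f}%")
```

The values agree with the corresponding Table 1 entries after rounding.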

Figure 4. Transputer realization of a processor farm.


In general, if high efficiency values are to be achieved, it is necessary to both balance the work loads between the processors and to maximize the compute to communication ratio for the network.

In the second section of the paper we discuss our parallel solution to the problem of filling the interior of a region defined by a polygonal boundary. The third section describes the extension of the polygon fill primitive to a hidden surface elimination algorithm by incorporating depth processing.

2. Scan converting polygons

Initially two approaches to the scan conversion of polygons were considered; the simple scan line algorithm using scan line coherence, and the approach that incorporates both scan line and edge coherence. Serial algorithms for both techniques are widely documented [7]. Other workers [8] have considered the same problem for a shared bus, coarse-grained parallel processing architecture with global memory. They have sought to compare alternative parallelizations of standard algorithms whereas our algorithm is based on a strategy designed, ab initio, for parallel execution. We compare our approach and results with those of Ghosal and Patnaik [8]. Our method is shown to extend, in a natural way, to a full hidden surface elimination algorithm.

The algorithm assigns particular scan lines to particular processors according to the number of worker processors available. If two workers are available, then for each polygon one processor is responsible for the even numbered scan lines that intersect the polygon, and the other for the odd numbered ones. In general, if there are N worker processors available then each one is responsible for processing every Nth scan line within the polygon extent, and the mapping chosen processes those scan lines with numbers having remainder r on division by N on processor number r + 1. The pth processor, where $1 \leq p \leq N$, computes polygon intersection data for those scan line numbers, y, that satisfy the congruence

$(p - 1) \equiv y \pmod{N}$

as shown in Figure 5.

This distribution of the computation, when used in conjunction with the broadcasting of each polygon to every worker processor (in the way described below), is preferable to partitioning the screen into contiguous regions, as polygon coherence provides automatic load balancing between the worker processors. It is clear that any number of processors, up to a maximum determined by the total number of scan lines of the graphics device, may be employed according to the number of nodes available, or to meet the required performance. The method is similar to static interleafing as described by Hu and Foley [9].
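The interleaved assignment can be illustrated with a short Python sketch. It assumes integer scan line numbers and 1-based processor numbers as in the text; the function names and the example extent (scan lines 2 to 80) are illustrative only.

```python
def worker_for_scan_line(y: int, n_workers: int) -> int:
    """Return the 1-based processor number p with (p - 1) congruent to y mod N."""
    return (y % n_workers) + 1


def scan_lines_for_worker(p: int, n_workers: int, y_min: int, y_max: int) -> list:
    """Scan lines in the inclusive extent [y_min, y_max] owned by processor p."""
    return [y for y in range(y_min, y_max + 1) if y % n_workers == p - 1]


# With four workers, a polygon extent of 79 scan lines is shared almost evenly.
counts = [len(scan_lines_for_worker(p, 4, 2, 80)) for p in range(1, 5)]
print(counts)  # [20, 19, 20, 20]
```

Because every worker receives every Nth line of each polygon, the per-worker counts differ by at most one whatever the position of the polygon on the screen, which is the load balancing property described above.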

Pairs of vectors $\{[x(j), y(j)], [x(j+1), y(j+1)]\}$, representing polygon edges, are processed on each transputer using the incremental formulae

$x_{i+1} = x_i + N/m$ and $y_{i+1} = y_i + N$,

where N is the number of processors and m is the slope of the edge, to determine the intersections that its scan lines have with the edge, see Figure 5. It is to be noted that the formulae use real arithmetic. This could, of course, be replaced by an integer method akin to Bresenham's algorithm for scan converting vectors [7].

Figure 5. The distribution of scan lines to processors. If N = 4 then the point $(x_{i+1}, y_{i+1})$ will be computed subsequent to $(x_i, y_i)$ on the same processor, i.e. processor number $1 + y_i \bmod 4$.

The question of initializing or seeding this incremental procedure on each processor and on each polygon edge arises. To do this it is necessary to find, for the pth processor, the scan line of least index that intersects the edge being processed. If $\{[x(j), y(j)], [x(j+1), y(j+1)]\}$ denotes a polygon edge, and $y(j+1) > y(j)$, then the y-value of the seed for processor number p is given by

$y(j) - y(j) \bmod N + (p - 1)$ if $(p - 1) \geq y(j) \bmod N$

and

$y(j) - y(j) \bmod N + (N + p - 1)$ if $(p - 1) < y(j) \bmod N$

from which the corresponding x value may be determined. For cases where $y(j+1) < y(j)$, the roles of the end points are reversed and the same formulae used to determine the seed. In this way the processing of each edge begins at the vertex with the lesser y-coordinate.
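The seeding rule and the incremental step can be combined in a small Python sketch. It is not the authors' OCCAM code: it assumes integer vertex y-coordinates, skips horizontal edges, and excludes the upper end point of each edge, which is one common way of handling vertex singularities that the paper leaves to "standard methods". The names seed_y and edge_intersections are illustrative.

```python
def seed_y(y_low: int, p: int, n: int) -> int:
    """Least scan line y >= y_low with (p - 1) congruent to y mod n."""
    r = y_low % n
    if p - 1 >= r:
        return y_low - r + (p - 1)
    return y_low - r + (n + p - 1)


def edge_intersections(x0, y0, x1, y1, p, n):
    """Intersections of one edge with the scan lines owned by processor p."""
    if y0 == y1:
        return []  # horizontal edge: singular case, not treated here
    if y1 < y0:
        # Reverse the end points so processing starts at the lower vertex.
        x0, y0, x1, y1 = x1, y1, x0, y0
    inv_slope = (x1 - x0) / (y1 - y0)  # 1/m
    y = seed_y(y0, p, n)
    x = x0 + (y - y0) * inv_slope      # x value of the seed
    dx = n * inv_slope                 # x increment N/m per step
    points = []
    while y < y1:                      # assumption: upper end point excluded
        points.append((x, y))
        x += dx                        # x_{i+1} = x_i + N/m
        y += n                         # y_{i+1} = y_i + N
    return points


print(edge_intersections(0, 0, 10, 10, p=2, n=4))
```

For the edge from (0, 0) to (10, 10) with four workers, processor 2 owns scan lines 1, 5 and 9 and so steps through exactly those three intersections.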

The intersections computed in this way are then sorted, by increasing x co-ordinate into the existing list of intersections for the polygon to which the edge belongs, i.e. the one currently being scan converted. When all the edges of a polygon have been processed in this way, pairs of points defining coherent horizontal pixel runs are available to the display system for visualization.
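A minimal Python sketch of this sort-and-pair step follows. It assumes each intersection is an (x, y) pair, ignores the "type" field carried in the pseudo-code, and assumes singular cases have already been resolved so that every scan line holds an even number of intersections. The name build_runs is illustrative.

```python
from bisect import insort
from collections import defaultdict


def build_runs(intersections, poly_id):
    """Group intersections by scan line, keep them sorted on x, and pair them."""
    per_line = defaultdict(list)
    for x, y in intersections:
        insort(per_line[y], x)  # insertion into the already sorted list
    runs = {}
    for y, xs in per_line.items():
        # Consecutive pairs (x1, x2) bound horizontal pixel runs inside the polygon.
        runs[y] = [(xs[i], xs[i + 1], poly_id) for i in range(0, len(xs) - 1, 2)]
    return runs


pts = [(9.0, 5), (1.0, 5), (4.0, 5), (6.0, 5), (3.0, 1), (2.0, 1)]
print(build_runs(pts, "P"))
```

Each resulting entry has the form (x1, x2, poly_id) with x1 < x2, matching the list members described in Figure 6.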

The pseudo-code procedure SC_2, shown in Figure 6, describes the main features of our algorithm for scan converting polygons; the OCCAM equivalent of SC_2 executes on each worker processor as shown in Figure 7. As can be seen from the pseudo-code, the process SC_2 contains no pseudo-parallel constructs. This is also the case for the processes throughput and output; however the three processes execute in pseudo-parallel fashion on a single processor, i.e. each compute node of the network executes an instance of a parameterized PROC, denoted Q(), having the form

    PAR
      throughput
      SC_2
      output

The process for the pth processor can be written Q(p, ...), where it is to be noted that the chief difference between the processes Q(p, ...) and Q(p', ...) is in the set of scan lines processed by each (as determined by the congruence $(p - 1) \equiv y \pmod{N}$, discussed earlier in the paper). The distributed algorithm, on an N processor network, may therefore be expressed as

    PLACED PAR p = 1 FOR N
      Q(p, ...)

The treatment of singularities, relating to the special cases where scan lines are parallel to polygon sides or intersect polygon vertices, is by standard methods and is omitted from the pseudo-code for clarity.

    PROC SC_2 /* two-dimensional polygon scan conversion procedure */
    WHILE polygons to scan convert DO
    BEGIN
        get a polygon Π = {r_1, ..., r_k} from throughput
        FOR each polygon edge E ∈ {(r_1, r_2), ..., (r_k, r_1)} DO
        BEGIN
            (i) find the first scan line that intersects the edge E and evaluate the x-co-ordinate of the intersection point
            (ii) step along the edge E incrementing x appropriately for each scan line intersection
            (iii) sort into the existing list of intersections (for this line) for the polygon being processed (x value, type) sorted on increasing x value
        END
        FOR each scan line DO
        BEGIN
            pair the computed data points for the polygon Π on their x values and add these pairs to the list for the whole polygon set being processed; note that at this point they are just added to the end of the list and not sorted into it. Each member of the list, for a given scan line, has the form (x1, x2, poly_id) where x1 < x2
        END
    END

Figure 6. Main steps of the two-dimensional polygon scan converter.

The timing tests, shown in Table 1 below, are those obtained for scan converting the polygon Π of Figure 8, and a polygon 0 which may be obtained from Π by multiplying its vertex co-ordinates by two.


Figure 7. Slave processor for polygon scan conversion. Link labels in the figure: from Previous, to Next, to Previous, from Next, and computed data to destination.

Table 1. Timed execution of the scan conversion algorithm for polygons Π and 0

    No. of            Time             S               E
    processors      Π      0       Π      0        Π      0
    1              480   1322
    2              267    686     1.8    1.93     90%    97%
    3              194    478     2.47   2.77     82%    92%
    4              163    372     2.94   3.55     74%    90%

Figure 8. The test polygon Π. Vertex co-ordinates labelled in the figure: (2, 38), (22, 80), (46, 52), (56, 52), (26, 30), (40, 30), (52, 30), (22, 2), (46, 2), (58, 2), (66, 30).


As the figures in Table 1 show, the performance of the polygon scan converter is problem dependent and the question arises as to what is a suitable polygon, or set of polygons, on which algorithms can be benchmarked.

The larger the polygon the more computation each processor in the network has to do without any increase in the communication overhead, and consequently, the better the performance. The algorithm is very efficient on the larger polygon 0 but also performs well on the smaller, and perhaps more typical, polygon Π.

3. The hidden surface elimination algorithm

Most of the widely known hidden surface elimination techniques [7] are amenable to some form of parallel implementation. For example the ray tracing method has been efficiently implemented on a transputer-based system configured as a processor farm [6]. Our hidden surface algorithm is scan line based and has been developed from the polygon scan conversion algorithm described in the previous section. It incorporates z-depth sorting on line segments, belonging to polygon interiors, in a manner analogous to the painter's algorithm. An outline is given in the pseudo-code algorithm shown in Figure 9 and further details are contained in the appendix; hse executes after the process SC_3 has terminated. The SC_3 procedure is a simple extension of SC_2, incorporating three dimensional processing by incrementing the z-coordinate in addition to the x-coordinate of each intersection point. The compute node for the complete algorithm is shown in Figure 10.

    PROC hse
    FOR each scan line DO
    BEGIN
        (i) sort the vector segments on their largest z-coordinate to form a list L
        (ii) re-sort the list L to resolve problems that may occur when vectors' z-extents overlap
        (iii) output the modified list L to be written into refresh buffer in order
    END

Figure 9. Main steps of the hidden detail removal algorithm.
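Step (i) can be sketched in Python as follows. The sketch assumes each segment is a tuple (x1, x2, z1, z2, poly_id), as in the Appendix, and that a larger z means further from the viewer, so that the farthest segment heads the list, by analogy with the painter's algorithm. The sort direction is an inference from that analogy rather than something the pseudo-code states, and the function name is illustrative.

```python
def initial_depth_sort(segments):
    """Order segments on their largest z-coordinate, largest (assumed farthest) first."""
    return sorted(segments, key=lambda s: max(s[2], s[3]), reverse=True)


segments = [
    (0.0, 5.0, 1.0, 2.0, "near"),
    (2.0, 8.0, 6.0, 9.0, "far"),
    (1.0, 4.0, 3.0, 5.0, "middle"),
]
L = initial_depth_sort(segments)
print([s[4] for s in L])  # ['far', 'middle', 'near']
```

This ordering is only provisional; segments whose z-extents overlap still need the re-sort of step (ii).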

To resolve the ambiguities alluded to in step (ii) of hse, it is necessary to carry out some z-depth tests to ensure that the vector segments are scan converted in the correct order. If V denotes the vector at the start of the sorted list L, then before the position of V in the list L is confirmed, it must be tested against each vector U in the list L whose z-extent overlaps the z-extent of V.

This test is a sequence of up to three sub-tests, performed in order of increasing complexity. As soon as one of the sub-tests succeeds the position of V, relative to a vector U having overlapping z-extent with V, on L is confirmed. If all three tests fail the positions of V and U in the list L must be swapped.

(1) The vectors' x-extents do not overlap; hence V must be scan converted before U, and be in its correct position relative to U in the list L.

(2) V (tested line) is wholly on that side of the line containing U (test line) which is further from the viewer; hence V must be scan converted before U, and be in its correct position relative to U in the list L.

(3) U (tested line) is wholly on that side of the line of V (test line) which is nearer to the viewer; hence V must be scan converted before U and be in its correct position, relative to U, in the list L.

Figure 10. Slave processor for hidden surface elimination. Link labels in the figure: from Previous, to Next, to Previous, from Next, and computed data to destination.

A pseudo-code representation of the re-sort procedure is given in the Appendix to the paper. The code uses a left-handed co-ordinate system, as is conventional for image-space computations in computer graphics.

We ran the algorithm for a number of worst-case polygon data sets, i.e. those for which the initial sorting process produces a list ordering that is completely inverted to that which is required. These cases provide the highest achievable efficiency values since they require the maximum amount of sorting. The results for a typical worst case are shown in Table 2, where it can be seen that the increased degree of computation involved, over the two-dimensional case, has produced significantly improved efficiency coefficients for the distributed codes. Although the algorithm produces super-linear performance with two processors for these problems, it should be stressed that for a more typical problem the speed-up will be sublinear.

Table 2. Typical worst-case performance for the hse algorithm

    No. of processors    Time     S       E
    1                    8050
    2                    3922    2.05    103%
    3                    2713    2.97     99%
    4                    2116    3.8      95%


4. Conclusions

The results presented in the paper show that scan-line algorithms for polygon filling and hidden surface elimination can perform well on networks of transputers. They exhibit good speed-up and efficiency coefficients for small numbers of processors. It is to be noted that the algorithms described are also suited to implementation on a shared memory parallel processing system, of the type considered by Ghosal and Patnaik [8]. Their two-dimensional polygon fill methods also appear to perform well, showing comparable speedup figures to ours on similar polygons. However it does seem that their theoretical performance estimates, and therefore presumably also their experimental results, exclude consideration of the computation overhead involved in the construction of the edge table (which uses a parallel bucket sort), and this makes direct comparison difficult. They do not consider the three dimensional case in the paper cited.

The current version of the scan-line algorithm for hidden surface elimination, whilst taking full account of coherence in the x, y and z directions during the initial phase of processing (in SC_3), does not use z-coherence in the re-sort procedure; i.e. re-sort makes no use of the fact that data items (i.e. segment lists) for scan lines that are 'close together' are likely to require re-sorting in precisely the same way. It should be possible to develop hse to incorporate this depth coherence into the final sorting, enabling scan lines to be re-sorted in groups and thus reducing the total time required to process a model. However, such modifications are unlikely to significantly alter the speedup and efficiency values for the distributed codes. In addition the use of integer-only methods, such as Bresenham's [7], in the processing of polygon edges will improve the throughput of the code if not the relative performance of the distributed forms.

Acknowledgements

We wish to thank the Rutherford Appleton Laboratory and the Department of Trade and Industry for the provision of both a transputer system and a training course under the transputer initiative, to support the work described in this paper. In addition we would like to thank Professor D. J. Evans, of the Department of Computer Studies at Loughborough, for allowing us access to an alternative transputer system, also on loan under the transputer initiative, for some of the development work.

References

1. A. Glassner & H. Fuchs 1985. Hardware enhancements for raster graphics. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 631-658.

2. P. M. Dew, J. Dodsworth & D. T. Morris 1985. Systolic array architectures for high performance CAD/CAM workstations. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 659-694.

3. A. C. Kilgour 1985. Parallel architectures for high performance graphics systems. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 695-703.

4. G. Abram & H. Fuchs 1984. VLSI Architectures for computer graphics. Proc. NATO studies. Berlin: Springer-Verlag.

5. A. Thomas 1987. Specialised hardware for computer graphics. In Techniques for Computer Graphics. Berlin: Springer-Verlag.

6. J. Packer 1987. Exploiting concurrency: a ray tracing example. INMOS Technical Note 7.

7. J. D. Foley & A. Van Dam 1982. Fundamentals of Interactive Computer Graphics. Reading, MA: Addison-Wesley.

8. D. Ghosal & L. M. Patnaik 1986. Parallel polygon scan conversion algorithms: performance evaluation on a shared bus architecture. Computers and Graphics, 10, 7-25.

9. M. Hu & J. D. Foley 1985. Parallel processing approaches to hidden-surface removal in image space. Computers and Graphics, 9, 303-317.

Helmut Bez received a first class degree in Mathematics in 1972 from the University of Wales, and MSc and DPhil degrees from Oxford University in 1973 and 1976 respectively. In 1976 he joined Rolls-Royce Aero Engines, and in 1980 was appointed to the academic staff at Loughborough University of Technology where he is a senior lecturer in the department of Computer Studies. His research interests include computer graphics, computer aided design, parallel processing, man-machine interfaces and mathematical methods. Publications include research papers in these areas and a successful textbook on mathematics for computer science.

Lesley Parks received her BSc(Econ) degree in 1969 from the University of London and then worked in education for several years. In 1986 she was awarded an MSc in computer science from the University of Newcastle and was appointed as a programmer in the Department of Computer Studies at Loughborough University of Technology. Her research interests include educational software and parallel processing.

Appendix

This appendix contains pseudo-codes for the SC_3 procedure and the re-sort step of the hse algorithm presented in the paper; pseudo-codes for the depth checking primitives of re-sort are also included.

List elements U and V, for a given scan line, have the form

V = [xv(1), xv(2), zv(1), zv(2), poly_id], xv(1) < xv(2)

and

U = [xu(1), xu(2), zu(1), zu(2), poly_id], xu(1) < xu(2)

    PROC SC_3 /* three-dimensional polygon scan conversion procedure */
    WHILE polygons to scan convert DO
    BEGIN
        get a polygon Π = {r_1, ..., r_k} from throughput
        FOR each polygon edge E ∈ {(r_1, r_2), ..., (r_k, r_1)} DO
        BEGIN
            (i) find the first scan line that intersects the edge E and evaluate the x- and z-coordinates of the intersection point
            (ii) step along the edge E incrementing x and z appropriately for each scan line intersection
            (iii) sort into the existing list of intersections (for this line) for the polygon being processed (x value, z value, type) sorted on increasing x value
        END
        FOR each scan line intersecting Π DO
        BEGIN
            pair the computed data points for the polygon Π on their x values and add these pairs to the list for the whole polygon set being processed; note that at this point they are just added to the end of the list and not sorted into it. Each member of the list, for a given scan line, has the form (x1, x2, z1, z2, poly_id) where x1 < x2
        END
    END

    PROC re-sort (L);
    /* sorting procedure for z-depth processing, L is the input list with pointer array NL */
    L_pointer ← L_head
    REPEAT
        V ← L(L_pointer)
        /* determine the sublist of vectors, zlist with pointer array NZ, having overlapping z-extent with V */
        compute zlist (V)
        z_pointer ← zlist_head
        swapped ← true
        REPEAT
            IF not (swapped) THEN
                z_pointer ← NZ(z_pointer)
            U ← zlist(z_pointer)
            swapped ← true
            IF No_Overlap_of_X(V, U) THEN
                swapped ← false
            ELSE
            BEGIN
                IF V_Behind_U(V, U) THEN
                    swapped ← false
                ELSE
                BEGIN
                    IF U_Infront_V(V, U) THEN
                        swapped ← false
                END
            END
            IF swapped THEN
            BEGIN
                swap V and U in L
                V ← L(L_pointer)
                compute zlist (V) /* replace V by U in existing zlist */
                z_pointer ← zlist_head
            END
        UNTIL (NZ(z_pointer) = Null)
        L_pointer ← NL(L_pointer)
    UNTIL (NL(L_pointer) = Null)

The pseudo-codes for the depth checking functions of re-sort now follow.

    FUNCTION No_Overlap_of_X(V, U): Boolean
    BEGIN
        IF ((xv(2) < xu(1)) OR (xu(2) < xv(1))) THEN
            No_Overlap_of_X ← true
        ELSE
            No_Overlap_of_X ← false
    END

    FUNCTION V_Behind_U(V, U): Boolean
    /* V tested line (finite), U test line (infinite) */
    BEGIN
        compute_test_line_constants (U, A(U), B(U), C(U))
        IF ((A(U)xv(1) + B(U)zv(1) + C(U) < 0) AND (A(U)xv(2) + B(U)zv(2) + C(U) < 0)) THEN
            V_Behind_U ← true
        ELSE
            V_Behind_U ← false
    END

    FUNCTION U_Infront_V(V, U): Boolean
    /* V test line (infinite), U tested line (finite) */
    BEGIN
        compute_test_line_constants (V, A(V), B(V), C(V))
        IF ((A(V)xu(1) + B(V)zu(1) + C(V) > 0) AND (A(V)xu(2) + B(V)zu(2) + C(V) > 0)) THEN
            U_Infront_V ← true
        ELSE
            U_Infront_V ← false
    END

    PROC compute_test_line_constants (L, A(L), B(L), C(L))
    BEGIN
        A(L) ← zL(2) - zL(1)
        B(L) ← xL(1) - xL(2)
        C(L) ← -zL(1) B(L) - xL(1) A(L)
    END
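The depth checking primitives translate almost directly into Python. The sketch below is an illustration rather than the authors' code: the Segment type, the lower-case function names and the combined order_confirmed helper are assumptions introduced here, and the sign conventions rely on x1 < x2 and on z increasing away from the viewer, as in the left-handed system used by the paper.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    x1: float
    x2: float
    z1: float
    z2: float
    poly_id: int


def line_constants(s):
    """Constants of the line A*x + B*z + C = 0 through the end points of s."""
    a = s.z2 - s.z1
    b = s.x1 - s.x2
    c = -s.z1 * b - s.x1 * a
    return a, b, c


def no_overlap_of_x(v, u):
    """Sub-test 1: the x-extents of v and u are disjoint."""
    return v.x2 < u.x1 or u.x2 < v.x1


def v_behind_u(v, u):
    """Sub-test 2: both end points of v lie on the far side of the line of u."""
    a, b, c = line_constants(u)
    return a * v.x1 + b * v.z1 + c < 0 and a * v.x2 + b * v.z2 + c < 0


def u_infront_v(v, u):
    """Sub-test 3: both end points of u lie on the near side of the line of v."""
    a, b, c = line_constants(v)
    return a * u.x1 + b * u.z1 + c > 0 and a * u.x2 + b * u.z2 + c > 0


def order_confirmed(v, u):
    """True if v may stay ahead of u in the list; False means they must be swapped."""
    return no_overlap_of_x(v, u) or v_behind_u(v, u) or u_infront_v(v, u)


u = Segment(0, 10, 5, 5, 1)       # a segment at constant depth z = 5
far = Segment(2, 8, 7, 9, 2)      # entirely deeper than u
near = Segment(2, 8, 1, 3, 3)     # entirely nearer than u
print(order_confirmed(far, u), order_confirmed(near, u))  # True False
```

Short-circuit evaluation of `or` performs the sub-tests in order of increasing cost and stops at the first success, mirroring the nested IF structure of re-sort.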
