Computer ray tracing speeds

Computer ray tracing speeds

Paul Robb and Barbara Pawlowski

The results of measuring the ray trace speed and compilation speed of thirty-nine computers in fifty-sevenconfigurations, ranging from personal computers to super computers, are described. A correlation of raytrace speed has been made with the LINPACK benchmark which allows the ray trace speed to be estimatedusing LINPACK performance data. The results indicate that the latest generation of workstations, usingCPUs based on RISC (Reduced Instruction Set Computer) technology, are as fast or faster than mainframecomputers in compute-bound situations.

I. IntroductionFor the past several years, we have had a project at

the Lockheed Palo Alto Research Laboratories to mea-sure the performance of new computing systems asthey have been introduced. While most of our interesthas been on personal computers and workstations, wehave also tested a number of main frames, minicom-puters, and minisupercomputers.

The testing program used two benchmarks. Thefirst of these was a proprietary ray trace program simi-lar to one developed by Optical Research Associates(ORA), who published the results for fifty-nine com-puting systems in 1980.1 This program traces a singleray 4560 times through an optical system comprised ofeighteen spherical surfaces and a flat image plane.The program does not handle tilted or decenteredsurfaces; ray aiming is not done, nor is the optical pathlength calculated. The ray tracing speeds reported bythis benchmark, while valid for comparing the speed ofone computer with another, are faster than one wouldachieve with a production ray trace program having amore general capability.

The second benchmark, which was not run on all ofthe systems, was a test of the compilation speed of thecomputer. This was intended to measure the relativeI/O speed of the computer, since compilation is an I/Ointensive operation, to determine its suitability as asoftware development engineering workstation. Thesubroutine used for this test was a FORTRAN subrou-

The authors are with Lockheed Palo Alto Research Laboratories,3251 Hanover Street, Palo Alto, California 94304.

Received 6 February 1989.0003-6935/90/131933-00$02.00/0.© 1990 Optical Society of America.

tine 1300 lines in length which decodes a lens inputrunstream.

The computation tests were made with a minimumof 64 bits of precision. For most of the computerstested, this meant that the calculations were per-formed in double precision. An exception to the rulewas the Cray, which has a 64 bit word length. Thecalculations were performed in single precision on thismachine.

Several of the systems tested have a parallel, orvector, computing capability. In the presentation anddiscussion of our results, we use the term parallel pro-cessor when referring to these systems. The ray trac-ing results reported in this paper are applicable toserial, or scalar, calculations only.

II. LINPACK ComparisonWhile there is no substitute for testing one's own

problem on a computer being considered for purchase,it is nonetheless useful to have as a guide a comparisonof the speed of the ray tracing benchmark with abenchmark in more common use. The LINPACKbenchmark has evolved into one of the most widelyused single benchmarks to measure relative perform-ance in scientific and engineering environments. Thebenchmark is considered a good one because it mea-sures the overall performance of the hardware andcompiler in a straightforward manner, on a computa-tion typical of many types of engineering and scientificwork. LINPACK is a linear equations package thatparticularly emphasizes floating point addition andmultiplication. The LINPACK benchmark used in thispaper measures the time required to solve a 100 X 100system of linear equations. The code is vectorizableand systems having a parallel or pipeline capability areable to take advantage of it. These computers pro-duce LINPACK performance ratings which are far inexcess of what a user would actually get when tracingrays serially.

1 May 1990 / Vol. 29, No. 13 / APPLIED OPTICS 1933

Table 1. Results of Benchmark Study Presented with Computing Systems In Alphabetical Order

Computing System Processor Clock Rate or Ray Trace Ray Surfaces Compilation LINPACKtype Cycle Time (Seconds) per second (Seconds) MFlops

Alliant FX-80Alliant FX8-8Apollo DN4000, 68020 & 68881Ardent Titan-lAT&T 6300, 8086 CPUAT&T 6300, 8086/8087Convex C1Convex C2Cray X-MP, CFT COSCray X-MP, CFT77 COSCray X-MP, CF77 UNICOSFPS Model 500 (Scalar Processor)IBM clone (Everex) 80386/Weitek 1167 (Microway NDP-386)IBM clone 80386/80387 (Microway NDP-386 32 bit compiler)IBM clone (Everex) 80386/80387 (Lahey compiler)IBM clone (MSE) 80386/80387 (Microsoft 4.0 compiler)IBM clone (MSE) 80386/80387 (Professional compiler)IBM clone (Everex) 80386/80387 (Lahey compiler)IBM PC AT, 80286/80287IBM PC/XT, 8088/8087 (Microsoft 4.0 compiler)IBM PC, 8088 & 80286IBM PC, 8088Mac 11/68881 (Absoft Fortran 2.3)Mac (Absoft Fortran 2.2)Mac IIX/68881 (Absoft Fortran 2.3)Mac Plus (Absoft Fortran 2.2)Mac SE (Absoft Fortran 2.2)Mac SE/68881 Prodigy Accelerator (Absoft Fortran 2.3)Mac SE Prodigy Accelerator (Absoft Fortran 2.2)Mac SE/68881 Radius Accelerator (Absoft Fortran 2.3)Mac SE, Radius Accelerator (Absoft Fortran 2.2)MIPS M2000Multiflow Trace 7Silicon Graphics 4D/25 (Personal Iris)Silicon Graphics 4D/20 (Personal Iris)Silicon Graphics 4D/22OGTX(1 processor)Silicon Graphics 4D/70GSilicon Graphics 4D/80GTStellar (FPS Model 300 FTN)Sun2Sun 3/160,68020 & 68881Sun 3/160, 68020 & 68881 & FPASun 3/160, 68020 OnlySun 4/110, TI FPASun 4/110, Weitek FPASun 4/260 SMD, Weitek FPASun 4/280, TI FPASun 4/330, TI FPASun SPARC Station 1, Weitek FPATektronix 4336, 68020 & 68881UNiSYS 1100/84UNiSYS 1100/92VAX ll/780DECstation 3100VAX 6210VAX 8600VAX 8650

parallelparallelscalarparallelscalarscalarparallelparallelparallelparallelparallelscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarparallelscalarscalarscalarscalarscalarparallelscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalar

170 nanosec170 nanosec25 mhz16.67 mhz8 mhz8 mhz55 nanosec40 nanosec8.5 nanosec8.5 nanosec8.5 nanosec30 nanosec20 mhz25 mhz20 mhz20 mhz20 mhz6.7 mhz8 mhz4.77 mhz8 mhz8 mhz15.67 mhz15.67 mhz15.67 mhz7.83 mhz7.83 mhz16.67 mhz16.67 mhz16.67 mhz16.67 mhz25 mhz130 nanosec20 mhz12.5 mhz25 mhz12.5 mhz16.67 mhz30 mhz10 mhz16.67 mhz16.67 mhz16.67 mhz14.28 mhz14.28 mhz16.67 mhz16.67 mhz25 mhz20 mhz20 mhz50 nanosec15 nanosec

16.67 mhz

45 nanosec45 nanosec

1.902.00

13.002.45

1176.40124.60

5.403.100.530.400.400.779.10

13.4015.2720.0028.0045.97

135.00208.00657.79

2561.0033.00

270.0025.00

1317.001152.98

32.00263.00

33.00272.00

1.503.202.093.431.503.002.503.83

573.0021.70

7.00201.00

3.9010.007.702.802.104.20

14.7014.00

2.4020.00

2.308.647.504.90

4560043320

666535363

74695

1604427948

163472216600216600112519

95216466567443323094188564241713234

2625321

34666675

2708329

2625319

5776027075414552525957760288803465622621

1513993

12377431

222158664

11252309434125720629

58946189

361004332

37670100281155217682

16.7 8.500017.2 7.600042.0 0.1400

6.5000

34.7 3.00008.6 10.00001.0

46.00001.5

15.8 8.5000

21.025.0

21.0

21.044.021.097.086.024.024.030.528.5

0.0091

0.0038

0.0760

4.8 3.6000113.0 6.0000

5.9 1.60001.00004.0000

12.5 1.10001.5000

34.0 0.101037.0 0.400037.0

1.200018.5 0.860013.0 1.1000

1.60002.60001.4000

46.00.3800

11.5 1.80000.1400

3.5 1.60000.4600

14.0 0.48009.2 0.7000

It should be noted that the LINPACK benchmark isso large that the data cache on most CPU boards isineffective. Consequently, this benchmark is domi-nated by memory system performance for single/dou-ble precision operands. This is appropriate for mostmodern optical design computer programs, which re-quire that large regions of memory be set aside to holdthe lens prescription.

An important difference between the ray tracebenchmark program and the LINPACK benchmark is

that LINPACK performs only the basic arithmetic oper-tions (addition, subtraction, multiplication, and divi-sion), while the ray trace program also calculatessquare roots. Occasionally computers will be foundthat trace rays much more slowly than LINPACK per-formance ratings would indicate. We have found thatthe difference is due to the way the machine handlesthe square root function; it is particularly pronouncedfor computers that do not implement a math coproces-sor.

1934 APPLIED OPTICS / Vol. 29, No. 13 / 1 May 1990

2

Table II. Results of Benchmark Study Presented In Order of Increasing Ray Speed

Computing System Processor Clock Rate or Ray Trace Ray Surfaces Compilation LINPACKtype Cycle Time (Seconds) per second (Seconds) MFlops

IBM PC, 8088Mac Plus (Absoft Fortran 2.2)AT&T 6300, 8086 CPUMac SE (Absoft Fortran 2.2)IBM PC, 8088 & 80286Sun 2Mac SE, Radius Accelerator (Absoft Fortran 2.2)Mac II (Absoft Fortran 2.2)Mac SE Prodigy Accelerator (Absoft Fortran 2.2)IBM PC/XT, 8088/8087 (Microsoft 4.0 compiler)Sun 3/160, 68020 OnlyIBM PC AT, 80286/80287AT&T 6300, 8086/8087IBM clone (Everex) 80386/80387 (Lahey compiler)Mac 11/68881 (Absoft Fortran 2.3)Mac SE/68881 Radius Accelerator (Absoft Fortran 2.3)Mac SE/68881 Prodigy Accelerator (Absoft Fortran 2.3)IBM clone (MSE) 80386/80387 (Professional compiler)Mac IX/68881 (Absoft Fortran 2.3)Sun 3/160,68020 & 68881IBM clone (MSE) 80386/80387 (Microsoft 4.0 compiler)VAX 11/780IBM clone (Everex) 80386/80387 (Lahey compiler)Tektronix 4336, 68020 & 68881UNiSYS 1100/84IBM clone 80386/80387 (Microway NDP-386 32 bit compiler)Apollo DN4000, 68020 & 68881Sun 4/110, Weitek FPAIBM clone (Everex) 80386/Weitek 1167 (Microway NDP-386)VAX 6210Sun 4/260 SMD, Weitek FPAVAX 8600Sun 3/160, 68020 & 68881 & FPAConvex C1VAX 8650Sun SPARC Station 1, Weitek FPASun 4/110, TI FPAStellar (FPS Model 300 FTN)Silicon Graphics 4D/20 (Personal Iris)Multiflow Trace 7Convex C2Silicon Graphics 4D/70GSun 4/280, TI FPASilicon Graphics 4D/80GTArdent Titan-IUNiSYS 1100/92DECstation 3100Sun 4/330, TI FPASilicon Graphics 4D/25 (Personal Iris)Alliant FX8-8Alliant FX-80MIPS M2000Silicon Graphics 4D/220GTX(1 processor)FPS Model 500 (Scalar Processor)Cray X-MP, CFT COSCray X-MP, CFT77 COSCray X-MP, CFT77 UNICOS

scalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarscalarparallelscalarscalarscalarparallelscalarparallelparallelscalarscalarscalarparallelscalarscalarscalarscalarparallelparallelscalarscalarscalarparallelparallelparallel

8 mhz7.83 mhz8 mhz7.83 mhz8 mhz10 mhz16.67 mhz15.67 mhz16.67 mhz4.77 mhz16.67 mhz8 mhz8 mhz6.7 mhz15.67 mhz16.67 mhz16.67 mhz20 mhz15.67 mhz16.67 mhz20 mhz

20 mhz20 mhz50 nanosec25 mhz25 mhz14.28 mhz20 mhz

16.67 mhz45 nanosec16.67 mhz55 nanosec45 nanosec20 mhz14.28 mhz30 mhz12.5 mhz130 nanosec40 nanosec12.5 mhz16.67 mhz16.67 mhz16.67 mhz15 nanosec16.67 mhz25 mhz20 mhz170 nanosec170 nanosec25 mhz25 mhz30 nanosec8.5 nanosec8.5 nanosec8.5 nanosec

2561.001317.001176.401152.98657.79573.00272.00270.00263.00208.00201.00135.00124.6045.9733.0033.0032.0028.0025.0021.7020.0020.0015.2714.7014.0013.4013.0010.00

9.108.647.707.507.005.404.904.203.903.833.433.203.103.002.802.502.452.402.302.102.092.001.901.501.500.770.530.400.40

34667475

132151319321329417431642695

1885262526252708309434663993433243325674589461896466666586649521

100281125211552123771604417682206292221522621252592707527948288803094334656353633610037670412574145543320456005776057760

112519163472216600216600

97.0

86.0

28.544.024.0

37.0

21.021.030.524.0

21.034.025.0

21.046.0

0.0038

0.0091

0.0760

0.1010

0.1400

0.3800

42.0 0.140018.5 0.8600

0.460013.0 1.100014.0 0.480037.0 0.400034.7 3.0000

9.2 0.70001.40001.2000

1.0000313.0 6.0000

8.6 10.000012.5 1.1000

1.60001.50006.5000

11.5 1.80003.5 1.6000

2.60005.9 1.6000

17.2 7.600016.7 8.50004.8 3.6000

4.000015.8 8.50001.0

46.00001.5

LINPACK results are measured in MFlops, millionsof floating point operations per second.

Ill. ResultsThe results of the tests will be found in Tables I and

II. Table I presents the results with the computingsystems displayed in alphabetic order, for ease of use.Table II presents the same data, but the results aresorted in the order of increasing ray trace speed. Theresults of the compilation time tests are also displayed.

Of the fifty-seven configurations we tested, LIN-PACK results have been published2 for thirty-three ofthem. These are also included in Tables I and II.Upon analyzing the data, it was apparent that therewas an excellent correlation between ray trace per-formance and LINPACK MFlop rate, provided that themachines were separated into the two groups: serialmachines and parallel machines.

In Fig. 1, ray trace speed is plotted as a function ofLINPACK MFlop rate for twenty-six serial machines.


FPS Model 500 off scale(8.5 MFLOPS)

z0° 40000

r.l

W 30000

< 20000

SUN 3/160/68881/FPA

10000

Apollo DN4000

SUN 3/160/6881 -

0'

250000 -|

9z00

U

00

200000'

150000'

100000'

50000'

0'

LINPACK MFLOPS

0 10 20 30 40LINPACK MFLOPS

The ray trace speed of amated from the formula:

serial computer may be esti-

ray trace speed = 26,250(LINPACK MFlop Speed)

for MFlop rates up to 1.1 and

Fig. 1. Ray trace speed as a func-tion of LINPACK MFlop rate forthe serial processor systems eval-

uated.

Fig. 2. Ray trace speed as a func-tion of LINPACK MFlop rate for

50 the parallel processor systemsevaluated.

ray trace speed = 11,300(LINPACK MFlop Speed) + 16,450

for MFlop rates between 1.1 and 10.For machines which have a pipeline or a parallel

processing capability, the ray trace speed is plotted as a


0 10 20 30 40

LINPACK MFLOPS

function of LINPACK MFlop rate on Fig. 2. The raytrace speed of a parallel computer tracing rays seriallymay be estimated from the formula:

ray trace speed = 4700(LINPACK MFlop Speed)

In Fig. 3, the results of Figs. 1 and 2 are combined toillustrate the ray trace performance difference be-tween the two types of computers. For the parallelprocessing systems the slope of the function is aboutone-fifth that of the serial machines; this is because thecompilers used on the parallel computers were able tovectorize the LINPACK program and ran this bench-mark at speeds significantly faster than would be pos-sible in a serial mode. Since the ray trace benchmarkis a purely serial program, automatic vectorization wasnot possible and the performance of these machines intracing rays is much less than the LINPACK MFlop ratewould suggest.

An example of this may be found by comparing theray trace speeds of the Convex C2 with the SiliconGraphics 4D/70G. Both of these computers trace raysat a speed of approximately 28,000 ray surfaces persecond; yet the Convex C2 is rated at 10 Mflops and theSilicon Graphics 4D/70G is rated at only 1.1 MFlops.In this example, the difference is a factor of 9:1 becausethe Convex C2 does well on the LINPACK test andpoorly in tracing rays.

In Fig. 4, the predicted ray trace speed is plotted as afunction of measured ray trace speed for all of theserial computers. In some cases the predicted raytrace speed does not correlate well with the measuredspeed. The most notable examples of this are the

50 Fig. 3. Scalar and parallel pro-cessor systems on the same plot as

Figs. 1 & 2.

early Sun 4 series of workstations, which have goodMFlop ratings but whose ray trace performance wassignificantly less than would have been expected. Thedifference was traced to Sun's implementation of thesquare root function. The early Sun 4 machines usedthe Weitek 1164/1165 Floating Point Accelerator,which is the same FPA used for the Sun 3/160. This islike putting a governor on the machine for problemswhich require square roots; thus, the Sun 4 traced raysat about the same speed as the Sun 3. During 1989Sun implemented the TI 8847 FPA which increasedthe ray trace speed by 2.75:1.

IV. The Effect of the Math CoprocessorAll workstation manufacturers make provision for a

math coprocessor, and some also supply an optionalFloating Point Accelerator. These are often com-bined on a single auxiliary chip. The use of theseauxiliary processors makes a dramatic difference in thespeed of the computer. Within the Sun 3/160 familyof workstations, which use the Motorola 68020 CPUand 68881 Math Coprocessor, we measured the speedof the ray trace program in three configurations, astaken from Table I and illustrated below:

Configuration Time Ray Surfaces per Second

68020 only 201 s68020 plus 68881 21.7 s68020 plus 68881 plus FPA 7.0 s

4313,993

12,377

The math coprocessor increases the speed by a factorof about 9:1, inclusion of the Floating Point Accelera-tor gives a total speed increase of nearly 30:1.


120000

100000

80000

60000 MIPS M2000 G 41

40000 - SG 4D/25 - * SUN 4/330 T1 FPADECstation 3100-

SG 4D/80- UNiSYS 1100/92SG4Dno- *SUN 4/280 TI FPA

SG 4D/2O - r SUN 4/11 0 T FPA

20000- / * *- SUN SPARCstation IAX 8600 VAX 8650

VAX m210 * SUN 4/260 Weitek FPAu SUN 4/110 Weitek FPA

* UNiSYS 1100/84

0 20000 40000 60000U .

RAY SURFACES PER SECOND -

The speed of the fully configured Sun 3/160 is thsame as a VAX 8600 and is three times as fast as a VA)11/780, for a small fraction of the cost.

V. Motorola 68020 vs Intel 80386Computers based on the Intel 80386 CPU have re

ceived wide publicity during the past year and thquestion of the performance of the 80386 vs the Motorola 68020 has come up. From Table II, the followindata are taken:

Configuration

Everex 80386 + Weitek 1167Sun 3/160 68020 + Weitek FPA

Ray SurfacesClock Rate per Second

20 MHz16.67 MHz

9,52112,377

The performance of the Intel 80386 CPU with theWeitek 1167 is about 30% less than a similarly config-ured Motorola 68020 CPU even though the clock rate isfaster.

VI. Reliability of the ResultsAn everpresent question when using a benchmark

program is the validity of extrpolating results obtainedfrom a 250 line FORTRAN program with one subroutinethat takes approximately 7800 words of memory to theperformance of a real world lens design program con-sisting of 115,000 lines of code in 550 subroutines thatrequires 1.5 million words of memory.

As a result of the benchmark study reported in thispaper, we selected the Silicon Graphics 4D/22OGTX

80000 100000 120000 Fig. 4. Measured ray trace speedplotted as a function of predicted

PREDICTED ray trace speed.

e workstation as the baseline computer for optical de-X sign at Lockheed and have ported our in-house optical

design program from the Unisys 1100/92 mainframecomputer to this machine. We calculated the expect-ed performance of several systems from the ray tracebenchmark data by comparing the 1100/92 benchmark

e results with those of the other systems. The predicted- performance compares with the measured perform-g ance as follows:

Computer System

Mac SE/68881 ProdigyAccelerator (Absoft 2.3)

Silicon Graphics 4D/70Silicon Graphics 4D/220

Computational test casePredicted Time Actual time

(s) (s)

621

5829

642

5324

A test case that exercises various computationalphases of the program was used. The predicted timesfor this test case are in excellent agreement with theactual times recorded on the three systems includingthe Silicon Graphics 4D/22OGTX workstation. Thisagreement indicates that the ray trace benchmark re-sults reliably predict the performance of the full lensdesign program.

VII. ConclusionsWe find that there is a good correlation between

speed of a computer in executing the LINPACK bench-mark and in tracing rays with the ray trace benchmark.


.4

�DI-U

9:1Z0UWW9W0.WWUTX,19W

9

Given a LINPACK performance number and knowledgeof whether the machine is a serial or parallel computer,it is possible to estimate the ray trace speed of a com-puter using the equations above.

The most recent generation of workstations, usingCPUs based on RISC technology, are faster than mini-computers and are as fast or even faster than main-frame computers in compute-bound situations. Com-puters that use the MIPS RISC CPUs can be as muchas 40% faster in both LINPACK and ray tracing speedsthan otherwise equivalent computers using theSPARC CPU design developed by Sun Microsystems.The performance of these workstations, notably theMIPS M2000 and the Silicon Graphics 4D series, istruly remarkable.

It is serious error to compare the performance ofserial computers with parallel or pipeline computersbased on LINPACK MFlop results. This differencemay be seen from the values of the slopes in Fig. 3,which differ by a factor of about 5:1. This means thatthe selection of a parallel or pipeline computer, basedon comparing MFlop ratings with a serial machine, todo work which is primarily serial in nature will result inthe selection of a computer whose performance is inad-equate by a factor of as much as 5:1.

This is to be expected since the design of the parallelor pipeline machines has been optimized to processproblems which are parallel in nature, and serial ma-chines are optimized for serial calculations.

The usual practice when implementing a large opti-cal design program on a shared computer is to eitheruse a virtual memory machine or segment the programif the machine has a fixed memory size. In both cases,

the high I/O speed of these computers makes it appearthat the entire program is in memory, available to theuser at any time.

While the workstations just mentioned are as fast orfaster than large shared systems in compute boundsituations, they cannot at this time compete in terms ofI/O speed. This means that the software designermust play to the strengths of the workstation, whichare CPU speed and memory: memory being cheap,the program should be held entirely in physical (notvirtual) memory, thereby avoiding an I/O bottleneck,and the user will receive the full benefit of the highspeed of these new CPUs.

The authors thank the numerous people who assist-ed us in putting together this collection.References1. D. E. Gustafson, "Dedicated Minicomputers in Optical Design,"

Proc. Soc. Photo-Opt. Instrum. Eng. 237,42-47 (1980). The raytracing speeds reported in this paper are higher than those report-ed by ORA by the ratio of 19/15.35 or about 24%. ORA adjuststhe speed downward to account for the fact that the test lens has aflat plate in front, the aperture stop is a dummy surface, and theimage surface is not traced through. We have taken the approachthat this test lens is representative of the type of problems en-countered in the world of optical design and have not made anycorrection for these factors. To obtain ray surfaces per secondvalues comparable to ORA's published results, the ray surfacespeeds presented in Tables I and II need to be multiplied by afactor of 15.35/19. The elapsed times obtained by ORA areidentical to ours.

2. J. J. Dongarra, "Performance of Various Computers Using Stan-dard Linear Equations Software in a Fortran Environment,"Argonne National Laboratory Technical Memorandum No. 23(The Argonne National Laboratory, Argonne, Ill, 1988).


Documents

Computer ray tracing speeds