
The Design and Implementation of

a Hardware Accelerated

Raytracer Using the TM3a

FPGA Prototyping System

By

J. Fender

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

BACHELOR OF APPLIED SCIENCE

DIVISION OF ENGINEERING SCIENCE

FACULTY OF APPLIED SCIENCE AND ENGINEERING
UNIVERSITY OF TORONTO

Supervisor: J. Rose

March 2002


The computational complexity of raytracing is such that as scene complexity grows it will eventually outperform raster graphics methods. Currently raytracing is still too slow, as the algorithm has a large constant associated with a given computation step. This thesis will present an architecture, and implementation, of a raytracing processor that is designed to minimize this constant and to provide insight into the possibility of future real time implementations.

This raytracing processor consists of a high-speed barycentric ray triangle intersection core that can easily outperform a software implementation, and a hierarchical controller unit. The hierarchical controller is able to traverse a tree of bounding boxes in such a way as to maximize memory bandwidth utilization, through pipelined reads, to make significant speed gains. The net result is a circuit that is able to outperform software implementations by a wide margin, while running at only a twentieth of the clock speed.

The resulting system is able to beat current software implementations by an order of magnitude or more but is still too slow to be considered for any real time implementations.


Acknowledgements

I would like to thank Marcus van Ierssel and David Galloway for building the Transmogrifier 3a development system and for helping to solve the many development system issues that I ran across. I would also like to acknowledge David Auclair, not just for his input into the project but for his interest. He forced me to try to organize the jumble of information in my head into clear-cut explanations; for acting as a sounding board, I am grateful. Next I would like to thank Professor Jonathan Rose for his advice, and the freedom he allowed me during development. And finally I would like to thank my girlfriend of four years, Lisa Scarfo, for putting up with the occasional late night in the lab and for providing the much needed escapes from work.


Table of Contents

Acknowledgements
Table of Contents
List of Symbols
List of Figures
List of Tables

1 Introduction
    1.1 Project Overview
    1.2 Thesis Overview

2 Background
    2.1 3D Rendering
    2.2 The Graphics Pipeline
    2.3 Scan Line Renderer
        2.3.1 Overview
        2.3.2 Computational Complexity
    2.4 Raytracing
        2.4.1 Overview
        2.4.2 Ray Object Intersection Test
        2.4.3 Barycentric Ray Triangle Intersection Test
        2.4.4 Hierarchical Acceleration Methods
        2.4.5 Computational Complexity
    2.5 Transmogrifier 3a

3 Ray Triangle Intersection Unit
    3.1 Overview
    3.2 Barycentric Ray Triangle Intersection Unit
        3.2.1 Overview
        3.2.2 Numeric Format
        3.2.3 Design Decisions
        3.2.4 Barycentric Extension to Parallelograms
    3.3 Memory Interface
        3.3.1 Overview
        3.3.2 Design Decisions
    3.4 Nearest Comparison Units
        3.4.1 Overview

4 Hierarchy Controller
    4.1 Overview
    4.2 Bounding Hierarchy Controller
        4.2.1 Overview
        4.2.2 Barycentric Ray Triangle Intersection Unit Utilization
        4.2.3 State Machine
        4.2.4 Required Memories
        4.2.5 Parallel Ray Processing Issue
        4.2.6 Scalability Considerations
    4.3 Sorted List
        4.3.1 Overview
        4.3.2 Size and Speed Issues
    4.4 List Handler
        4.4.1 Overview
        4.4.2 List Handler Operation

5 Results
    5.1 Overview
    5.2 Barycentric Ray Triangle Intersection Unit Performance
    5.3 Bounding Hierarchy Performance
        5.3.1 Overview
        5.3.2 Single Large Faceted Sphere Tests
        5.3.3 A Grid of Faceted Spheres
        5.3.4 Cache Incoherent Test Set

6 Conclusions and Future Work
    6.1 Conclusions
        6.1.1 Barycentric Ray Triangle Intersection Unit
        6.1.2 Hierarchy Controller
        6.1.3 General Discussion
    6.2 Future Work
    6.3 Closing Words

7 References

Appendix A: Bound Controller State Machine
Appendix B: VHDL Code
Appendix C: C Code
Appendix D: Brute Force Test Images


List of Symbols

Cavg          Average coverage of a projected object (percentage of image size)
(Sw, Sh)      Width and height of an image
N             Number of objects in a scene
O             3D Point: Origin of a Ray
D             3D Normalized Vector: Direction of a Ray
V0, V1, V2    3D Point: Triangle Vertices
(u, v)        Barycentric Coordinate
E1, E2        3D Vector: Triangle Edges


List of Figures

Figure 1: Perspective Transform
Figure 2: Light propagation through a 3D world
Figure 3: A Shadow Ray
Figure 4: Bounding Object Hierarchy
Figure 5: Bounding Object Sorting Problem
Figure 6: TM3a System Diagram
Figure 7: Ray Triangle Unit Overview
Figure 8: Barycentric Ray Triangle Intersection Unit Pipeline
Figure 9: Bounding Hierarchy Structure
Figure 10: Bounding Hierarchy Controller System Overview
Figure 11: Restriction on Parallel Rays
Figure 12: Single Sphere Results (Expanded View)
Figure 13: Single Sphere Test Set Results
Figure 14: Grid of Sphere Test Set Results
Figure 15: Grid of Sphere Test Set Results (Expanded View)
Figure 16: Alternative Raytracing Architecture


List of Tables

Table 1: Input Data Widths
Table 2: Pseudo Floating Point Scaling Factors
Table 3: Memory Widths
Table 4: Brute Force Performance Numbers
Table 5: One Sphere Test Set Results
Table 6: Grid of Sphere Test Set Results
Table 7: Cache Incoherent Test Set Results


1 Introduction

1.1 Project Overview

It has long been known that of the two major ways to render a three-dimensional

computer image, raytracing and raster methods, it is raytracing that has the lower

computational complexity. Unfortunately, it turns out that even though raytracing has a

complexity advantage it suffers from a very large computational step constant. This large

constant causes raster graphics to always outperform raytracing in real-world

implementations. Eventually, if scene sizes continue to grow, the lower computational

complexity will dominate over the larger constant and raytracing will win out.

The purpose of this project is not to solve the step constant problem or to compete

with raster graphics but, instead, to assess what is possible with current technology and to

determine what hardware factors limit raytracing. The final goal of this thesis is to design

a hardware raytracing processor that can outperform a corresponding software

implementation. This will be accomplished by using the parallel nature of the raytracing

algorithm and by optimizing the processor's memory interface by accounting for the

deterministic nature of raytracing’s memory accesses.

The design will be such that it can easily be implemented in existing

reconfigurable hardware and tested at a low clock rate. It is hoped that under these

constraints the raytracing processor can still outperform a software implementation while

the complexity remains manageable. To make the comparison fair the hardware raytracer

will employ a spatial hierarchy that accelerates rendering, as any software implementation

would. This will allow a direct comparison to current state of the art software

implementations that use such acceleration techniques.


1.2 Thesis Overview

This thesis is divided into seven chapters. The first provides a brief introduction

to the raytracing project's goals and methods. The second chapter introduces some basic

background knowledge that is required to understand this thesis. The next two chapters

describe the functional architecture chosen for the raytracing processor. Chapter three

describes the barycentric ray triangle intersection unit and chapter four describes the

hierarchical controllers. Chapter five details the various test cases designed to evaluate

the performance of the raytracing processor and chapter six discusses these results.

Finally, chapter seven lists the references used for this thesis.


Figure 1: Perspective Transform

2 Background

2.1 3D Rendering

3D rendering is the process of creating a 2D digital image from a mathematical

model of a 3D world. Fundamentally

this involves a, potentially nonlinear,

projection of the 3D data onto a 2D

image plane. Figure 1 shows how a

common projection method, the

perspective transform [1], can be used to

generate a 2D image.

This transform projects the 3D

world toward an eye point and onto an

image plane. If one ignores focal depth and binocular vision then this transform results in

an image that is indistinguishable, by the viewer, from the original scene. This results

from the fact that, under these two assumptions, human vision depends only on the

direction from which incident light arrives and not the distance of its source. This means

that there is no way to differentiate between the original 3D scene and the 2D projection,

as the perspective transform maintains the angular information of the scene.
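
As a minimal sketch of the perspective transform, assume the eye sits at the origin looking down the +z axis with the image plane at z = d. The function name `project` and the parameter `d` are illustrative and do not come from this thesis. Two points that lie in the same direction from the eye land on the same image-plane location, which is the angular property described above.

```python
def project(point, d=1.0):
    """Project a 3D point onto the image plane z = d, with the eye at the origin."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the eye (z > 0)")
    # Similar triangles: scale x and y by d / z.
    return (d * x / z, d * y / z)


near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # same direction from the eye, twice as far away
print(project(near), project(far))  # both print (0.25, 0.5)
```

The distance of each point is lost in the projection; only its direction from the eye survives.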

2.2 The Graphics Pipeline

In general it is a very difficult task to render an image using projection methods.

Current solutions address this problem by restricting the scene complexity and by

breaking the rendering problem into several separate stages known as the graphics

pipeline. For the purpose of this document the graphics pipeline will be defined as

consisting of the following stages: object transformation, potential visibility

determination, pixel visibility, and pixel colouring.

The first stage, object transformation, allows a 3D world to be specified for later

rendering. The next two stages, potential visibility determination and pixel visibility,

determine which objects are visible and then which objects are projected onto a given


pixel. Finally the last stage determines what colour a pixel will be by applying the

various surface properties of the objects that have been projected onto the pixel.

2.3 Scan Line Renderer

2.3.1 Overview

Scan line rendering is the current standard in 3D computer graphics. This method

solves the rendering problem by using an object centric model. That is, each object is

processed to find what pixels it covers, as opposed to determining what objects cover any

given pixel.

The first stage in a scan line renderer is to determine which objects are potentially

visible. This determination is an active area of research that has resulted in many

different algorithms. Several examples of these are kD-trees [2], object occluders [3], portals [4], and view-frustum culling [5], but there are many more. Once a set of potentially visible objects has been determined, it is passed on to the pixel visibility stage.

This next stage is responsible for determining what objects are visible for what

pixels. The scan line approach projects each object onto the viewing plane and

determines which pixels the object covers. To ensure that farther objects do not occlude nearer ones it is necessary to use depth sorting [6], or z-buffering, so that distant objects are not drawn over top of nearer ones. Once completed, this leaves

only the pixel colouring stage.

The pixel colouring stage is beyond the scope of this thesis but put simply, a pixel

colour is determined through modelling of object surface properties and lighting effects.

For a more detailed overview of a colouring architecture, either the OpenGL [7] or RenderMan [8] specifications can be consulted.


Figure 2: Light propagation through a 3D world

2.3.2 Computational Complexity

The computational complexity of a scan line renderer can be derived as follows:

If the average object covers Cavg percent of an image
And the image has a resolution of Sw by Sh
Then an average object covers Cavg x Sw x Sh pixels
If there are N objects in a scene
Then the rendering complexity is O(N x Cavg x Sw x Sh)

This leads to a complexity that is linear, O(N), in the number of objects in a scene as well as in the image resolution.

2.4 Raytracing

2.4.1 Overview

Raytracing is a rendering

method based on modelling the way

light rays propagate through a 3D

scene. Figure 2 shows how a light ray

can be reflected around a scene until it

eventually reaches the viewer’s eye.

Implementing this physical model

would be far too computationally

intensive as there are an infinite

number of possible paths between the light source and the viewer. To deal with this

problem raytracing attempts to find an approximate solution.

If one were to ignore the physical model of light propagation and instead think of

the process in reverse then the problem becomes simpler. By tracing the path a ray takes

from the eye through the image plane, it is possible to determine what object a ray

intersects. Once an intersection point is found, it is necessary to determine its colour.

This is performed using a two-step process.

First it is necessary to determine how much direct light falls on the object. This is

calculated by spawning a new ray from the point of intersection toward the light source.


Figure 3: A Shadow Ray

If this ray strikes an object, prior to the light source, then the object is shadowed,

otherwise the object is lit. Figure 3 shows how the cube is determined to be shadowed by

the plane through the use of a ray

spawned toward the light source from

the point of intersection. The second

step involves modelling the surface

properties of the object.

If the object’s surface is

reflective then it is necessary to

determine what colour this reflection

should be. By using the reverse light

tracing method it is easy to determine what direction a reflected light ray would have

arrived from. Knowing this, a new ray can be spawned in that direction and the process

repeated recursively to determine this new ray’s colour.

To summarize, the raytracing algorithm can be broken into five steps:

1. Generate a ray for each pixel in the image plane
2. Find the nearest object intersected by this ray
3. Generate a shadow ray for each light source to determine lighting
4. Apply the surface model and generate reflected rays as required
5. Add the surface's and the reflected rays' colours to determine the final pixel colour
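
The five steps can be sketched in a few dozen lines. This is an illustration only, not the design built in this thesis: it assumes spheres rather than triangles to keep the intersection code short, a single point light, greyscale colours, and a hypothetical hard-coded scene (`SPHERES`, `LIGHT`). All function names are made up for the example.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def mul(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return mul(a, 1.0 / math.sqrt(dot(a, a)))

# Hypothetical scene: (centre, radius, diffuse grey level, reflectivity)
SPHERES = [((0.0, 0.0, 5.0), 1.0, 0.8, 0.3),
           ((0.0, -101.0, 5.0), 100.0, 0.5, 0.0)]
LIGHT = (5.0, 5.0, 0.0)
EPS = 1e-6

def hit_sphere(o, d, centre, radius):
    """Distance along the normalized ray o + t*d to the sphere, or None."""
    oc = sub(o, centre)
    b = dot(oc, d)
    disc = b * b - (dot(oc, oc) - radius * radius)
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > EPS else None

def nearest(o, d):
    """Step 2: test every object and keep the closest hit."""
    best = None
    for s in SPHERES:
        t = hit_sphere(o, d, s[0], s[1])
        if t is not None and (best is None or t < best[0]):
            best = (t, s)
    return best

def trace(o, d, depth=0):
    hit = nearest(o, d)
    if hit is None:
        return 0.0  # background is black
    t, (centre, radius, diffuse, reflect) = hit
    p = add(o, mul(d, t))
    n = norm(sub(p, centre))
    p = add(p, mul(n, 1e-4))  # small offset to avoid self-intersection
    # Step 3: shadow ray toward the light source.
    to_light = sub(LIGHT, p)
    dist = math.sqrt(dot(to_light, to_light))
    ldir = mul(to_light, 1.0 / dist)
    blocker = nearest(p, ldir)
    lit = blocker is None or blocker[0] > dist
    colour = diffuse * max(dot(n, ldir), 0.0) if lit else 0.0
    # Step 4: reflected ray, traced recursively (depth-limited).
    if reflect > 0.0 and depth < 3:
        r = sub(d, mul(n, 2.0 * dot(d, n)))
        colour += reflect * trace(p, r, depth + 1)  # Step 5: sum contributions
    return colour

def render(width, height):
    """Step 1: one ray per pixel, from an eye at the origin through the plane z = 1."""
    image = []
    for j in range(height):
        row = []
        for i in range(width):
            x = (i + 0.5) / width - 0.5
            y = 0.5 - (j + 0.5) / height
            row.append(trace((0.0, 0.0, 0.0), norm((x, y, 1.0))))
        image.append(row)
    return image
```

Note that `nearest` is the brute-force search over every object; it is the part that a bounding hierarchy is meant to accelerate.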

2.4.2 Ray Object Intersection Test

Unlike raster graphics, where only polygons can be easily rendered, raytracing can

handle any form of an object. The only restriction is that there exists a solution to the

object-ray intersect equation. For simple objects such as polygons, conics, and low-order

polynomial patches, closed form solutions can easily be found. However for more

complicated objects there are no closed form solutions and slow root solving methods are

required. These iterative algorithms, and complicated objects, are not well suited for

hardware implementation so this thesis will deal with ray triangle intersections only. This

is acceptable as any object can be approximated by a mesh of triangles.


To perform the ray triangle intersection test there have been many different algorithms proposed. Of the different algorithms there are three major groups: those that use the plane equation [9], those that use 6D Plücker space [10], and those that use barycentric coordinates [11]. Descriptions of the first two methods can be found in their referenced articles, whereas the last method is described in the next section.

2.4.3 Barycentric Ray Triangle Intersection Test

Tomas Möller and Ben Trumbore introduced a simple algorithm for calculating the intersection point of a ray and a triangle in the 1997 paper titled Fast, Minimum Storage Ray/Triangle Intersection. Their results indicated that it was the fastest algorithm available that did not require precomputed values. It works as follows.

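Given a ray R(t) with origin O and direction D

$$R(t) = O + tD \tag{1}$$

and a triangle with vertices V0, V1, and V2, a point T(u,v) is defined to lie within the triangle if

$$T(u,v) = (1 - u - v)V_0 + uV_1 + vV_2 \tag{2}$$

where (u,v) are the barycentric coordinates, which must meet the following constraints

$$u \ge 0, \quad v \ge 0, \quad u + v \le 1 \tag{3}$$

Equating equations (1) and (2) and writing the result in matrix form gives

$$\begin{bmatrix} -D & E_1 & E_2 \end{bmatrix} \begin{bmatrix} t \\ u \\ v \end{bmatrix} = T, \quad \text{where} \quad E_1 = V_1 - V_0, \quad E_2 = V_2 - V_0, \quad T = O - V_0 \tag{4}$$

By using Cramer’s rule and factoring out common terms we find the solution

$$\begin{bmatrix} t \\ u \\ v \end{bmatrix} = \frac{1}{P \cdot E_1} \begin{bmatrix} Q \cdot E_2 \\ P \cdot T \\ Q \cdot D \end{bmatrix}, \quad \text{where} \quad P = D \times E_2, \quad Q = T \times E_1 \tag{5}$$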

If the resulting value of (u,v) meets the constraints given in equation (3) then the ray

intersects the triangle at a distance t along the ray.
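
A minimal floating point sketch of equation (5) follows. It is a software illustration, not the fixed point hardware pipeline described later; the function names are illustrative, and the checks for a near-zero determinant (ray parallel to the triangle) and for t < 0 (triangle behind the ray origin) are additions assumed here rather than part of the equations above.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(o, d, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) if the ray o + t*d hits triangle (v0, v1, v2), else None."""
    e1 = sub(v1, v0)          # E1 = V1 - V0
    e2 = sub(v2, v0)          # E2 = V2 - V0
    p = cross(d, e2)          # P = D x E2
    det = dot(p, e1)          # P . E1, the common denominator
    if abs(det) < eps:        # assumed guard: ray parallel to the triangle plane
        return None
    tvec = sub(o, v0)         # T = O - V0
    q = cross(tvec, e1)       # Q = T x E1
    inv = 1.0 / det
    t = dot(q, e2) * inv
    u = dot(p, tvec) * inv
    v = dot(q, d) * inv
    # Constraints of equation (3), plus the assumed t >= 0 check.
    if u < 0.0 or v < 0.0 or u + v > 1.0 or t < 0.0:
        return None
    return (t, u, v)


tri = ((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
print(intersect((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), *tri))  # (5.0, 0.25, 0.25)
print(intersect((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), *tri))    # None
```

The only division is by the single shared denominator P · E1.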


Figure 4: Bounding Object Hierarchy

Figure 5: Bounding Object Sorting Problem

2.4.4 Hierarchical Acceleration Methods

There is more to finding the nearest object intersected by a ray than just the ray-object intersection test. This test only deals with individual objects so a method that

extends this to a scene full of objects is required. The simplest approach to this problem

is to intersect a ray with every object in the world one by one. By keeping track of

distance to each hit object, the nearest intersected object can be found. This method

produces correct results but is very inefficient.

A better solution would be to have an algorithm that could cull a large

number of objects with only a few

intersection tests. The simplest of

these approaches is to use a bounding

object methodology. This system

involves placing an invisible object that

is easy to intersect with, around a more

complicated one. The end result is that

if a ray misses the bounding object then

it will also miss any objects contained within. By taking this process further, as shown in

figure 4, and placing the bounding objects, previously created, into larger bounding

objects then further performance increases can be achieved. Ideally a tree of bounding

objects can be created such that if a ray misses any node then its children need not be

tested. This allows for a formidable performance increase but there is one major

drawback of this algorithm. To ensure that objects are drawn correctly it is necessary to

perform an expensive sort operation.

Figure 5 shows why the

bounding object algorithm requires a

sort. The figure shows a sample scene

consisting of three bounding objects,

the boxes, that have been struck by a

ray. It is clear that checking only the


nearest bounding object will give the incorrect result that the ray hits no objects: the ray hits bounding box one but misses the object it contains. This means that it is necessary to sort the distances of the intersected bounding objects and test their children in depth order, from nearest to furthest. This allows the intersection test to terminate as soon as the first object ray intersection is found. The alternative would be to intersect the ray with every bounding box and keep only the nearest object intersection. This would work but would limit the performance increase provided by the bounding hierarchy; even with the cost of a sort, the sorted traversal is still faster.
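
The nearest-first traversal can be sketched as follows. This is an illustration under stated assumptions, not the hierarchy controller built in this thesis: it uses axis-aligned bounding boxes, a binary heap from Python's `heapq` in place of a sorted list, and a slightly conservative termination rule (stop when the best hit so far is no farther than the entry point of the next box). The `Node` class and all function names are hypothetical.

```python
import heapq
from itertools import count

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def ray_triangle(o, d, tri):
    """Barycentric test; returns the hit distance t or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(p, e1)
    if abs(det) < 1e-12:
        return None
    tv = sub(o, v0)
    q = cross(tv, e1)
    u, v, t = dot(p, tv) / det, dot(q, d) / det, dot(q, e2) / det
    return t if u >= 0 and v >= 0 and u + v <= 1 and t >= 0 else None

def ray_box(o, d, lo, hi):
    """Slab test: entry distance of the ray into the box, or None on a miss."""
    t_near, t_far = 0.0, float("inf")
    for k in range(3):
        if d[k] == 0.0:
            if not (lo[k] <= o[k] <= hi[k]):
                return None
            continue
        t0, t1 = (lo[k] - o[k]) / d[k], (hi[k] - o[k]) / d[k]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near

class Node:
    def __init__(self, lo, hi, children=(), triangles=()):
        self.lo, self.hi = lo, hi
        self.children, self.triangles = children, triangles

def nearest_hit(root, o, d):
    """Return (nearest t or None, number of triangle tests performed)."""
    best, tests, heap, tie = None, 0, [], count()
    entry = ray_box(o, d, root.lo, root.hi)
    if entry is not None:
        heapq.heappush(heap, (entry, next(tie), root))
    while heap:
        entry, _, node = heapq.heappop(heap)
        if best is not None and best <= entry:
            break  # every remaining box starts beyond the current hit
        for tri in node.triangles:
            tests += 1
            t = ray_triangle(o, d, tri)
            if t is not None and (best is None or t < best):
                best = t
        for child in node.children:
            e = ray_box(o, d, child.lo, child.hi)
            if e is not None:
                heapq.heappush(heap, (e, next(tie), child))
    return best, tests


# Three boxes along +z, as in Figure 5: the ray enters box 1 but misses its triangle.
box1 = Node((-1, -1, 1), (1, 1, 3), triangles=[((0.5, 0.5, 2), (0.9, 0.5, 2), (0.5, 0.9, 2))])
box2 = Node((-1, -1, 4), (3, 3, 6), triangles=[((-1, -1, 5), (3, -1, 5), (-1, 3, 5))])
box3 = Node((-1, -1, 7), (3, 3, 9), triangles=[((-1, -1, 8), (3, -1, 8), (-1, 3, 8))])
root = Node((-1, -1, 1), (3, 3, 9), children=[box1, box2, box3])
print(nearest_hit(root, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))  # (5.0, 2)
```

In this scene the triangle in box 3 is never tested, because the hit found in box 2 is nearer than the point at which the ray enters box 3.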

There have been a number of algorithms introduced that simplify the front to back

ordering by accounting for it within the tree. These methods include binary space

partition trees [12], which allow front-to-back traversal of a data set without sorting; octrees [13], a simplified BSP tree that allows faster traversal; and three-dimensional grids, which place objects within volume pixels such that a ray can traverse the grid front to back using line drawing methods. A good overview of three-dimensional grid methods can be found in [14]. Current research has found that a hybrid of these algorithms provides the

fastest software performance, but this area is still open to research.

2.4.5 Computational Complexity

The computational complexity of a raytracer that does not use a bounding hierarchy, and does not perform recursive raytracing, can be derived as follows:

If the image has a resolution of Sw by Sh
And there are N objects in a scene
Then the rendering complexity is O(N x Sw x Sh)

Like raster graphics a raytracer is linear in both the number of objects and the image

resolution. However, by comparing raytracing to raster graphics complexity, O(N x Cavg x

Sw x Sh), we see it is missing the Cavg term. This term is always less than one, and usually much less, since it represents the average fraction of the image covered by an object. This means that raster graphics will have an advantage. On top of this problem, the constant involved in raytracing is much larger than that of raster graphics, so at first it would seem that raytracing is far too slow to be useful. It is only once the bounding hierarchy is considered that the


Figure 6: TM3a System Diagram

situation improves greatly.

Due to the variations of bounding hierarchies and the complexity's dependence on the scene being rendered, there is no guaranteed worst-case performance increase from using a bounding hierarchy. This results from it always being possible to construct a degenerate hierarchy that would require O(N) intersection tests, where N is the number of objects, to resolve a ray.

However, on average, the complexity can be approximately written as O(logN x Sw x Sh).

This result places raytracing in a much more favourable light than the simplistic

algorithm. Unfortunately for the sizes of scenes currently rendered, N is still too small to

overcome the large differences in constants between raster graphics and raytracing

methods.
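
To make the crossover argument concrete, the sketch below compares the two cost expressions, N x Cavg x Sw x Sh for raster graphics and log N x Sw x Sh for hierarchical raytracing, each multiplied by a per-step constant. The image resolution cancels, so only Cavg and the ratio of the two constants matter. The values used (an average coverage of 0.1% and a raytracing step 100 times more expensive than a raster step) are invented purely for illustration and are not measurements from this thesis; the function name `crossover` and the choice of a base-2 logarithm are likewise assumptions.

```python
import math

def crossover(c_avg, k_ratio):
    """Smallest power-of-two N at which k_ratio * log2(N) < N * c_avg.

    c_avg   -- average fraction of the image covered by one object (0 < c_avg < 1)
    k_ratio -- raytracing step constant divided by raster step constant
    """
    n = 2
    while n * c_avg <= k_ratio * math.log2(n):
        n *= 2
    return n


# Hypothetical inputs, for illustration only.
n = crossover(c_avg=0.001, k_ratio=100.0)
print(n, "objects, i.e. 2 to the power", int(math.log2(n)))
```

With these made-up inputs the crossover lands in the millions of objects, which matches the qualitative point above: the larger the constant, the larger the scene must be before the logarithmic term wins.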

2.5 Transmogrifier 3a

The circuits described in this

thesis have all been implemented on a

prototyping system created at the

University of Toronto known as the

Transmogrifier-3a (TM3a). This

system uses large Field Programmable

Gate Arrays (FPGAs) that allow for

very complicated circuits to be tested at

speeds up to, and above, 50 MHz.

Figure 6 shows a simplified system

diagram of the development system

that describes only the components

used in this thesis.

The core of the TM3a consists of four 560 pin Virtex2000 FPGAs manufactured

by Xilinx. These four chips provide more than 150,000 four input lookup tables and

flip-flop pairs, as well as various specialized functional units such as shift registers and

internal memory. The FPGAs are fully interconnected to each other by six 98 bit buses,


and are also connected to the computer interface section by a low data rate nibble bus.

Additionally, each of the FPGAs has access to its own independent external SRAM modules. These modules are rated for 50 MHz, have a 64-bit wide data bus and contain a total of 2 MB of RAM, giving each FPGA a memory bandwidth of 64 bits x 50 MHz = 3.2 Gbit/s. These

hardware components allow for very large and complicated circuits to be easily tested,

but there is another side to the TM3a development system.

The hardware is supported by a software development flow that consists of both

custom software and state of the art commercial tools. The custom software routes

signals between the FPGAs and automatically generates circuitry that allows the FPGAs

to communicate with the computer interface. The commercial tools include Synplicity’s

Synplify Pro, which is used to synthesize circuits, and Xilinx’s place and route software, which

is used for physical layout.


Figure 7: Ray Triangle Unit Overview

3 Ray Triangle Intersection Unit

3.1 Overview

The first major functional

component of the raytracing processor

is the ray triangle intersection unit.

This component implements the pixel

visibility functionality of the graphics

pipeline through the use of raytracing

methodologies. That is, it takes the list

of objects in the scene and a ray

corresponding to a given pixel as input,

and returns the object that is visible for

the given pixel.

The implementation of this

functionality is divided into three

separate components: the barycentric

ray triangle intersection unit, the

memory interface, and the nearest

comparison unit, shown in figure 7.

The core unit is the barycentric ray

triangle intersection unit that

determines if a ray intersects a single

given object. The memory interface is

responsible for passing the proper objects to the intersection unit, and the nearest

comparison unit is responsible for tabulating the final results.

3.2 Barycentric Ray Triangle Intersection Unit

3.2.1 Overview

The barycentric ray triangle intersection unit is the workhorse of the raytracing

processor. Put simply it is a deeply pipelined unit that is capable of solving the


Figure 8: Barycentric Ray Triangle Intersection Unit Pipeline

barycentric intersection algorithm described in section 2.4.3. It has a maximum

throughput of one intersection test per clock cycle, a latency of 38 clock cycles, and a

maximum clock speed of 50MHz.
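
As a quick check of what these figures imply, assuming the pipeline is kept full and runs at its maximum clock rate:

$$\text{throughput} = 1\ \frac{\text{test}}{\text{cycle}} \times 50 \times 10^{6}\ \frac{\text{cycles}}{\text{s}} = 50 \times 10^{6}\ \text{tests/s}, \qquad \text{latency} = \frac{38\ \text{cycles}}{50 \times 10^{6}\ \text{cycles/s}} = 0.76\ \mu\text{s}$$

These are peak values; any stall in supplying triangles to the unit lowers the achieved throughput.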

Figure 8 shows the functional layout of the pipeline as well as the system's inputs

and output. The system takes a ray, specified by an origin and direction, a triangle,


Input            Width
Ray Origin       3 x 28 bits
Ray Direction    3 x 16 bits
Triangle Vertex  3 x 28 bits
Triangle Edge    3 x 16 bits

Table 1: Input Data Widths

specified by a vertex, edge one, edge two, and their associated scale factors, as well as

several configuration bits as input, and outputs a boolean hit flag, the triangle ID of the

processed triangle, and the parameters t, u, and v from the barycentric algorithm.

3.2.2 Numeric Format

A conventional computer-based implementation of the barycentric algorithm is

usually implemented using floating point arithmetic. For the purposes of this design a

floating point implementation would have taken up too much area and introduced a far

deeper pipeline. To avoid this, a hybrid system of fixed point and pseudo floating point

numbers is used. The input bit widths were determined by

working backwards from an internal constraint.

The constraint arose from the need to

ensure that the divide operations could process a single

bit per clock cycle. To meet this constraint, it was

necessary to restrict the largest input signals to the

divider to 64 bits or less. This constraint was then

back propagated through the arithmetic units to find

the required input widths. A summary of the resulting

bit widths is shown in Table 1.

These numbers allow for a usable 3D space of 28 bits for locating an object or

viewpoint but only 16 bits for defining the triangle’s edges. At first it would appear that

this is unacceptable since a typical 3D scene has a large dynamic range of object sizes.

There are often very large triangles that might make up a landscape, and very small

triangles, that might describe a facial feature, but this problem can be easily solved by

introducing a pseudo floating point format that exploits a simple property of the

barycentric algorithm.

Factor   Shift Bits   Scale Factor
00       0            1
01       4            1/16
10       8            1/256
11       12           1/4096

Table 2: Pseudo Floating Point Scaling Factors

The u and v parameters, described in equation (2), can be interpreted as the

intersection point of a ray with the plane that the triangle lies in, expressed using the

triangle’s edges as basis vectors. The constraints shown in equation (3) usually require u

and v to be less than one. However, the basis can be scaled to result in a larger effective

triangle. That is, if u and v are

required to be less than two, instead

of the usual one, then the triangle will

become twice as big. By allowing

each triangle’s edges to have a scaling

factor, it is possible to introduce a

pseudo floating point system that

requires only a right shift of the u and

v coordinates.

The chosen implementation was to add two additional bits to the description of

each edge. These bits, shown in Table 2, control how the basis vectors are scaled to

allow a large dynamic range of sizes. For example, a scaling factor of 4096 allows for

triangles as large as the available 3D space but restricts the sizing resolution to steps of

4096.
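The scaling scheme in Table 2 can be illustrated with a short sketch. This is not the thesis circuit: the function names are hypothetical, and plain Python integers stand in for the fixed point datapath. It only shows that rescaling u or v amounts to a right shift, and that a 16 bit edge scaled by 4096 still fits in the 28 bit coordinate space.

```python
# Two-bit scale codes from Table 2 mapped to the number of right-shift bits.
SHIFT_BITS = {0b00: 0, 0b01: 4, 0b10: 8, 0b11: 12}


def scale_coordinate(raw, code):
    """Rescale a raw u or v value by 1, 1/16, 1/256 or 1/4096.

    `raw` is a hypothetical non-negative integer fixed point value computed
    against the stored (unscaled) 16 bit edge.
    """
    return raw >> SHIFT_BITS[code]


def effective_edge_length(stored_edge, code):
    """Length that a stored 16 bit edge component represents once scaled up."""
    return stored_edge << SHIFT_BITS[code]
```

For instance, a stored edge component of 1 with code 11 represents a length of 4096, which is the sizing step mentioned above.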

3.2.3 Design Decisions

The implementation of the barycentric algorithm required several design decisions

other than just bit widths. It was necessary to determine whether or not to include the

divisions, the styles of arithmetic elements to use, and how to handle negative

intermediate values.

A custom implementation of a raytracing processor probably would not include

the divisions in the pipeline as they consume over half the total area and are not always

necessary. The divisions are only used to calculate u, v, and t, when there is a hit, so most

of the time they sit idle. In the case of the FPGA implementation, described in this thesis,

the divisions were placed in the pipeline regardless of this fact. It was found that there

was excess space on the FPGA that was going to waste, so the division units could be

added to the pipeline to simplify the control of the system. The next design decision

involved the actual implementation of these division units as well as the other arithmetic

components.

The choice of how to implement the arithmetic elements was important for two

reasons: first, to meet the clock rate requirements, and second, to minimize the area to

allow for possible parallel implementations. The first decision was to choose between

parallel and serial-based implementations. Since the algorithm requires division,

simplistic serial methods don’t apply. Instead a method known as on-line arithmetic [15]

was investigated. This method uses a redundant number set to process data serially using a

very fast clock rate. Although on-line arithmetic holds promise for a very fast clock as

well as low area, the serial nature reduces the throughput too much for it to be a viable

option for FPGA implementation. This left only the parallel implementation option.

Basic math functions, such as addition, subtraction, and multiplication, have

highly optimized solutions that are technology dependent when implemented in an FPGA.

Because of this dependence, the synthesis of these elements was left to the

Synplify Pro tool. This tool was able to map the simple math functions to specialized

blocks within the FPGA that perform better than any gate level design could, but it was

unable to synthesize the necessary dividers. The dividers had to be created by hand, so an

algorithm needed to be chosen.

The choice of what division algorithm to use depended on both speed and area.

Several approaches were examined including basic radix-2 and several higher radix [16]

implementations. Ideally a high radix solution would be best as it would limit the depth

of the pipeline but unfortunately it was found that these types of divisions do not map

well to FPGAs. This left only the radix-2 solution that solves division one bit per cycle,

and it is this solution that is used in the raytracing processor.
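As an illustration of a divider that resolves one quotient bit per step, the sketch below uses unsigned restoring division. The thesis does not state whether its radix-2 divider is restoring or non-restoring, so the restoring form, the function name, and the `width` parameter are assumptions made for the example.

```python
def radix2_divide(dividend, divisor, width):
    """Unsigned restoring division, one quotient bit per loop iteration.

    Each iteration corresponds to the work a radix-2 divider performs in one
    clock cycle (or one pipeline stage). `width` is the dividend bit width.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Trial subtraction: keep it only if the result is non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

A 64 bit dividend therefore needs 64 such steps, which is consistent with the divider contributing heavily to the pipeline depth.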

To extend the division algorithm to include signed division would have required

additional hardware on both the input and the output of a division unit. This could have

been implemented but would have increased the pipeline depth and complexity of the

circuit. To prevent this, the barycentric algorithm was modified slightly to restrict the

possible ray triangle intersections to only those requiring positive divisions.

Geometrically this is equivalent to defining a one-sided triangle, that is a triangle that is

visible only from one side and not the other.

3.2.4 Barycentric Extension to Parallelograms

The barycentric ray triangle intersection algorithm also admits one further

change that is useful in raytracing. It is possible to adjust the constraints that define a hit,

to result in the ability to intersect a ray and a parallelogram. Instead of using the

constraints shown in equation (3), if the constraints in equation (6) are used, then the result

is a ray parallelogram intersection.

$$u \le 1, \qquad v \le 1 \tag{6}$$

The config input to the barycentric unit allows for a triangle to be extended into a

parallelogram by ignoring the conditions in equation (3) and only using those in equation

(6).
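The effect of the config bit on the hit test can be sketched as follows. Equation (3) is not restated in this section, so the sketch assumes it has the usual barycentric form (u and v non-negative, u + v at most one); the function name and the floating point arguments are also illustrative only.

```python
def is_hit(u, v, parallelogram=False):
    """Return True when (u, v) lies inside the triangle or parallelogram.

    Assumes non-negative lower bounds on u and v in both modes; this is the
    conventional formulation, not something stated in this section.
    """
    if u < 0 or v < 0:
        return False
    if parallelogram:
        return u <= 1 and v <= 1      # constraints of equation (6)
    return u + v <= 1                 # assumed form of equation (3)
```

A point such as (0.75, 0.75) misses the triangle but hits the parallelogram spanned by the same two edges.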

3.3 Memory Interface

3.3.1 Overview

The barycentric ray triangle intersection unit described previously was only

capable of intersecting a ray with a single triangle. To add the ability to intersect a ray

with the entire world of triangles, it is necessary to read the world from memory and pass

these triangles one by one to the barycentric unit.

Put simply this unit will cycle through the entire list of triangles, stored in an

external memory, and pass them to the barycentric unit. From a more complicated view

the unit must also have the ability to communicate to the host computer to allow the

triangle data to be written to memory, and provide control signals to the rest of the

system.

Field          Bits
Vertex         3 x 28
Edges 1 & 2    6 x 16
Edge Scale     2 x 2
Config         1
======
Total:         185 bits

Table 3: Memory Widths

3.3.2 Design Decisions

The design of this unit was primarily guided by

memory bandwidth issues. Table 3 shows that for each

triangle it is necessary to read 185 bits of memory. Since

the TM3a has a 64 bit databus, this requires three read

cycles. This would seem to suggest that it is not possible to

keep the barycentric pipeline full as it has a throughput of

one triangle per cycle, but this is not the case. Instead of providing a new triangle each

cycle it is possible to provide a new ray each cycle. By processing three rays at the same

time it becomes possible for a single triangle to be checked against all three rays while the

next triangle is being loaded. Using this method 100% pipeline utilization is achieved.
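The arithmetic behind this argument can be checked with a small sketch. The utilization model is a simplifying assumption (one intersection test issued per ray while a triangle is being fetched), and the function names are hypothetical.

```python
import math

# Field widths from Table 3: vertex, two edges, two edge scales, config bit.
TRIANGLE_BITS = 3 * 28 + 6 * 16 + 2 * 2 + 1
BUS_BITS = 64  # width of the TM3a data bus


def reads_per_triangle(triangle_bits=TRIANGLE_BITS, bus_bits=BUS_BITS):
    """Number of bus read cycles needed to fetch one triangle."""
    return math.ceil(triangle_bits / bus_bits)


def pipeline_utilization(rays_in_flight, reads):
    """Fraction of cycles in which the pipeline receives a new test,
    assuming each loaded triangle is tested once against every ray."""
    return min(1.0, rays_in_flight / reads)
```

With one ray the pipeline would be busy only a third of the time; with three rays it receives a new test every cycle.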

3.4 Nearest Comparison Units

3.4.1 Overview

The previous two units allow a world to be intersected by a ray and output a

list of triangle identifications and the hit information of each. However this is not what is

required for solving the pixel visibility problem. This is only a list of triangles that could

be visible from the pixel if they are not occluded by a closer object. To determine which

triangle is visible out of the list of possible hits, it is necessary to find which intersection

is closest to the viewer, that is, the intersection with the lowest value of parameter t. It is

this process that the nearest comparison unit is responsible for.

The nearest comparison unit is designed to process the outputs from the

barycentric ray triangle intersection unit and dynamically keep track of the nearest

intersection point so far. This is accomplished quite simply by having an internal register

that latches the new intersection data if it is closer than the old data stored within. The

only twist is that there are three rays in flight through the pipeline at any given time

whose results must be kept separated. This is accomplished by having three nearest

comparison units that are enabled only when their ray’s result is available.
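The behaviour of the three nearest comparison units can be modelled in a few lines. This is a software sketch only; the class and function names, and the tuple layout of the stored result, are assumptions rather than the thesis design.

```python
class NearestComparisonUnit:
    """Keeps the closest hit seen so far for a single ray."""

    def __init__(self):
        self.best = None  # (t, triangle_id, u, v), or None if no hit yet

    def update(self, hit, t, triangle_id, u, v):
        # Latch the new result only if it is a hit and closer than the old one.
        if hit and (self.best is None or t < self.best[0]):
            self.best = (t, triangle_id, u, v)


def route_result(units, ray_index, hit, t, triangle_id, u, v):
    """Enable only the unit that belongs to the ray whose result arrived."""
    units[ray_index].update(hit, t, triangle_id, u, v)
```

A caller would create one unit per ray in flight, for example `units = [NearestComparisonUnit() for _ in range(3)]`, and read each unit's `best` field once all triangles have been processed.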

Through the combination of these three units, the memory interface, and the

barycentric ray triangle intersection unit, a complete system that takes three rays as input

and returns the triangles they strike first is implemented.

Figure 9: Bounding Hierarchy Structure

4 Hierarchy Controller

4.1 Overview

The barycentric ray triangle intersection unit does its job very fast but does not

have an advanced enough controller to benefit from the logarithmic nature of hierarchical

raytracing. This chapter describes the supporting circuitry that is required to implement

the potential visibility stage of the graphics pipeline through the use of an entirely

hardware implementation of a bounding hierarchy algorithm.

Unlike a software implementation of a spatial hierarchy that has complete

flexibility, a hardware implementation requires strict constraints to operate at maximum

speed. Of the various hierarchical methods available it was determined that a bounding

hierarchy could be implemented with minimum memory requirements and with a

relatively simple controller compared to BSP trees and other hierarchical methods. The

bounding hierarchy also benefits from being able to exploit the barycentric ray triangle

unit in its hierarchy traversal.

The bounding hierarchy is

further constrained by restricting a

bounding volume to be defined

by six arbitrary triangles or

parallelograms, by limiting the

hierarchical tree to a depth of

three, and by restricting the

maximum number of children

any node can have to eight.

Figure 9 shows a summary of this

structure. There can be up to

eight root nodes, and each of these can have up to eight children. These children can then

have up to eight children of their own, and finally a leaf node can have any number of

object triangles contained within it. This structure allows for a maximum of 512 leaf

nodes.

Figure 10: Bounding Hierarchy Controller System Overview

Implementing this algorithm requires an advanced controller, a memory buffer

that tracks which bounding nodes must be traversed, and a very fast sorting algorithm to

ensure the nodes are traversed in the correct order. An overview of the system that

implements these functions is shown in figure 10. The bound controllers provide the

control logic necessary, the list handlers track which bounding nodes need to be visited,

and the list sort unit ensures that the bounding nodes are traversed in the proper order. In

addition to the control units, the ray interface and result receiver units are necessary to

handle the inter-chip buses.

4.2 Bounding Hierarchy Controller

4.2.1 Overview

The brain of the completed system is contained within the bounding hierarchy

controller. This controller is responsible for receiving rays from a user circuit,

performing the necessary functions to calculate the resulting hit data, and then returning

the result to the user. The basic algorithm implemented by this controller is as follows:

1. Intersect the requested ray with the eight root bounding volumes.

2. Intersect the requested ray with the nearest hit bounding volume’s children and save the other hit nodes for traversal later. If no child nodes are hit then traverse up the tree processing the next nearest hit nodes.

3. Repeat step 2 until a leaf node is hit or there are no outstanding hit nodes.

4. Intersect the requested ray with the object triangles contained within the leaf node. If there is a hit then return it, if not traverse up the tree and continue from 2.

To keep this controller simple the functionality that tracks which nodes have been

hit and which node should be processed next is delegated to the list handler unit. This

means the controller’s portion of this algorithm is restricted to interpreting the bounding

hierarchy and sending the proper triangles to the barycentric ray triangle intersection unit.
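The four steps above can be sketched in software for a single ray. This is a sequential model under stated assumptions, not the hardware controller: nodes are plain dictionaries, and `hit_distance` and `intersect_triangles` are hypothetical callbacks standing in for the barycentric unit and nearest comparison unit.

```python
def traverse(roots, ray, hit_distance, intersect_triangles):
    """Front-to-back traversal of a bounding hierarchy for one ray.

    Each node is a dict with either "children" (list of nodes) or
    "triangles" (leaf payload). hit_distance(ray, node) returns the distance
    to the node's bounding volume, or None on a miss.
    intersect_triangles(ray, triangles) returns a hit record or None.
    """
    def sorted_hits(nodes):
        hits = [(hit_distance(ray, n), i, n) for i, n in enumerate(nodes)]
        hits = [h for h in hits if h[0] is not None]
        hits.sort(key=lambda h: (h[0], h[1]))
        return [h[2] for h in hits]

    # Each stack entry holds the outstanding hit nodes of one level,
    # nearest first (the role played by the list handler).
    stack = [sorted_hits(roots)]                     # step 1
    while stack:
        pending = stack[-1]
        if not pending:
            stack.pop()                              # nothing left: go up
            continue
        node = pending.pop(0)                        # nearest outstanding node
        if "triangles" in node:                      # step 4: leaf node
            result = intersect_triangles(ray, node["triangles"])
            if result is not None:
                return result
        else:                                        # step 2: descend
            stack.append(sorted_hits(node["children"]))
    return None
```

The stack of per-level lists mirrors the division of labour described here: the controller only decides what to send next, while the bookkeeping of outstanding nodes lives elsewhere.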

4.2.2 Barycentric Ray Triangle Intersection Unit Utilization

The algorithm described above is inherently sequential in nature. That is, it is

necessary to examine one entire level of the bounding hierarchy before a decision can be

made on which node to examine next. If this algorithm is directly implemented then the

deep pipeline in the barycentric ray triangle intersection unit would cause a large bubble

to form between each level of the hierarchy. To avoid this problem several controllers

can be used that share the barycentric ray triangle unit by interleaving their usage.

The first controller would transmit the required triangles to the barycentric ray

triangle unit while the second controller waits. Upon completion of the transmission the

first controller passes the ownership token to the second controller, which is then free to use

the barycentric ray triangle unit while the first controller’s triangles are being flushed

through the pipeline. Upon the second unit completing its transmission, the token is

returned to the first controller, which is then free to continue. This allows the barycentric

ray triangle unit to be nearly 100% utilized.

4.2.3 State Machine

Appendix A shows a simplified description of the controlling state machine.

There are three dominant phases of the controller. The first stage involves acquiring the

clear to send (CTS) signal from the other controller, if necessary. The second stage

traverses through the bounding hierarchy by sending the proper triangle identification

numbers to the barycentric ray triangle intersection unit. Several internal variables keep

track of which node the controller is working on and the three states: S_SEND, S_WAIT1,

and S_WAIT2 transmit the triangle identification numbers to the barycentric ray triangle

intersection unit.

The final stage is responsible for processing the object triangles contained within

a given leaf node. First the stage reads from an indirection table to determine where to

find the list of object triangles in memory and how many triangles there are. There is then

another set of three states: S_LEAFSEND, S_LEAFWAIT1, and S_LEAFWAIT2 that read

the triangle identifications from memory and transmit them to the barycentric ray triangle

intersection unit. If this stage finds a hit then the process is complete, otherwise the state

machine returns to the second stage to continue traversing the hierarchy.

4.2.4 Required Memories

The bound controllers require two different types of data. The first is the leaf

node indirection data. This data is stored in the internal FPGA memory. There are 512

entries, one for each leaf node in the bounding hierarchy, each describing how many object

triangles are contained within the leaf node and where in memory the list of triangles can

be found. The second type of data is the triangle lists. These lists are stored in an

external SRAM bank as the data is too large to fit internally. The lists consist of an array

of 16-bit triangle identification numbers.

Figure 11: Restriction on Parallel Rays

These memories cannot be directly accessed by the bound controllers because

conflicts could arise. Instead an intermediate memory controller handles all requests from

the bounding controllers. In addition to simplifying memory access this controller also

allows for the possibility of two controllers accessing the memory at the same time. This

feature is of no use for the currently implemented system but it allows for the possibility

for another instance of the barycentric ray triangle intersection unit to be controlled by

another set of bound controllers using the same hierarchy data.

4.2.5 Parallel Ray Processing Issue

The algorithm described above is for processing a single ray at a time, but the

barycentric ray triangle intersection unit is designed to process three rays in parallel. To

solve this problem three rays are traversed through the hierarchy such that if one ray

strikes a node then all three rays are intersected with that node. This is potentially

wasteful as there is a worst case degeneracy where the three rays traverse entirely

different paths through the hierarchy resulting in the pipeline effectively processing the

rays sequentially. However, this is not usually a problem as rays are often coherent, that is,

they travel in nearly the same direction through space, and therefore the hierarchy.

There is one case, however, where this algorithm could fail. Consider the

diagram shown in figure 11. There are

two rays which strike the same two

bounding boxes, but in different orders. If

the algorithm processes bounding box one

first and then box two, it is possible that the

algorithm will return the wrong intersection for ray B. If ray B intersects an object in

box one the algorithm will stop searching for other intersections even though a closer

intersection in box two might exist.

This problem is also not that significant in practice as rays are often coherent in

their directions, but to ensure the error does not occur a further constraint is required. The

easiest way to eliminate this case is to require that all three rays have the same

origin. This will ensure that all three rays will traverse through the bounding hierarchy in

the same direction.

4.2.6 Scalability Considerations

The bounding hierarchy has the benefit of being designed in such a way that any

latency in the barycentric ray triangle intersection unit can be accommodated. The token

passing methodology can scale to include an arbitrary number of controllers, instead of

just two. For example if the pipeline is so deep that both controllers are waiting for

results then a third controller could be added to use the idle time. Since the token passing

implementation involves a request line and a grant line between two controllers, it is

possible to have a chain of controllers that will constantly pass the token around the

circle.

4.3 Sorted List

4.3.1 Overview

Section 2.4.4 described the importance of traversing the bounding hierarchy in a

front to back manner to ensure the correct intersection is found. To facilitate this, it is

necessary to sort the resulting hit bounding volumes by distance from the view point.

There are a number of ways that a sort can be implemented but few are suitable because

their sequential nature would be too slow. The speed of the barycentric ray triangle

intersection unit results in a new element being generated every six cycles. This must

then be inserted into a sorted list with a maximum of eight elements. As such, any purely

sequential algorithm would require a worst case of 8 cycles, which is unacceptable.

The chosen solution is similar to a content-addressable memory in that the

sorting unit compares the new element with every element already in the list in parallel.

The decisions on which insertion point to use can be made in one cycle and the result

latched into its memory location at the same time. The result is a very fast sorting unit

that can easily handle the amount of data required.
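The parallel compare and insert idea can be modelled as follows. In hardware all comparisons would take place in the same cycle; the sketch expresses them as a list of independent comparisons. The function name, the `capacity` parameter, and the handling of equal keys are assumptions for illustration.

```python
def parallel_insert(sorted_keys, new_key, capacity=8):
    """Insert new_key into an ascending list using one comparison per entry.

    Every stored key is compared with the new key independently, so the
    insertion position is simply the number of stored keys that are not
    greater than the new key.
    """
    if len(sorted_keys) >= capacity:
        raise ValueError("sorted list is full")
    # One comparator per stored entry; results do not depend on each other.
    greater = [new_key < key for key in sorted_keys]
    position = greater.count(False)
    # Entries at and after `position` shift down by one slot.
    return sorted_keys[:position] + [new_key] + sorted_keys[position:]
```

Because no comparison depends on the outcome of another, the decision time does not grow with the number of entries, only the comparator count does.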

4.3.2 Size and Speed Issues

Although the sorting unit is fast, it comes at quite a cost. The sorting key is the

intersection point’s distance, which is 32 bits in length. This requires eight 32-bit

comparators, and eight registers to store the sorted list. This by itself consumes a lot of

area, but there are additional area requirements as well. Since three rays are processed in

parallel in the barycentric ray triangle intersection unit, the results of all three rays must

be stored in the sorting unit. This doesn’t require any additional comparators but does

increase the number of registers required substantially.

4.4 List Handler

4.4.1 Overview

Once all eight results for any given hierarchy level have been sorted by the

sorting unit it is necessary to store these for later retrieval by the bound controller. This

task is delegated to the list handler units. These units must keep track of the results of all

three internal levels of the bounding hierarchy and be able to return the next bounding

node to be processed when requested by the bound controller. This is accomplished

through two different phases.

The first phase writes the results from the sorting unit to a memory location based

on which level of the hierarchy the result is from. The second phase involves

determining the next node to process based on which of the three rays are still active.

For example if the bound controller requests the next node that is struck by either ray one

or ray two then the list handler must search its list for the first node that has been hit by

either ray.

4.4.2 List Handler Operation

The actual algorithm implemented within the list handler is quite simple. The

algorithm that replies to a request from the bounding controller can be conceptually

described as:

1. Search level two for the first bounding node that has not been processed and has been hit by one of the rays still active.

2. If no node exists search level one for the first node that meets this requirement.

3. If no node exists search level zero for the first node that meets this requirement.

4. If no node exists then the rays do not intersect any objects.

This algorithm is conceptual only because it is not necessary to perform the algorithm

after a request from the bounding controller. Instead the algorithm is performed during

available cycles and the result saved for later requests. This is possible because the

bounding controller is constantly informing the list handler unit which rays are still

active.
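The conceptual search performed by the list handler can be written out as a short sketch. The entry layout (a node identifier, a per-ray hit mask, and a processed flag) and the function name are assumptions made for the example; the thesis does not give the storage format here.

```python
def next_node(levels, active_mask):
    """Return (level, node) for the next bounding node to process, or None.

    levels[0], levels[1], levels[2] are lists of entries sorted nearest
    first. Each entry is a dict with keys "node", "hit_mask" (one bit per
    ray) and "processed". active_mask has one bit set per ray still active.
    """
    for level in (2, 1, 0):                    # deepest level first
        for entry in levels[level]:
            if not entry["processed"] and entry["hit_mask"] & active_mask:
                return level, entry["node"]
    return None                                # no ray can hit anything more
```

With rays one and two active, a call such as `next_node(levels, 0b011)` skips any node that was struck only by ray three.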

Implementation                 Intersection tests per second
Raytracing Processor           50,000,000
POVray 3.0 (ultraSPARC II)     ~2,500,000
POVray 3.0 (Athlon 1GHz)       ~4,000,000
Intel SSE (800 MHz)            36,000,000

Table 4: Brute Force Performance Numbers

5 Results

5.1 Overview

The performance of the raytracing processor will be examined in two parts. The

first will involve comparing the performance of the barycentric ray triangle intersection

unit to several state of the art software implementations. The second part will examine

the entire bounding hierarchy unit’s performance compared with a software

implementation running on several different computer architectures.

Since the raytracing processor does not have a defined interface, it was necessary

to create a test jig to provide the rays and to receive the results. The jig for both the

inputs and the outputs consists of a simple memory buffer that can be written to and read

from the SUN workstation connected to the prototyping system. This buffer is necessary

because the connection between the SUN and the prototyping system is such that the time

required to transmit the data is several orders of magnitude higher than the time to

process it. In order to achieve an accurate view of the performance of the individual units

this bottleneck was removed.

5.2 Barycentric Ray Triangle Intersection Unit Performance

The results for the barycentric

ray triangle intersection unit are quite

clear cut. Since the unit performs one

ray triangle intersection test every

clock cycle and runs at 50 MHz, the

total throughput is 50 M ray triangle

intersection tests per second. As a

comparison, several test scenes, shown in appendix D, were run through POVray 3.0 [17] on

an ultraSPARC II 450MHz machine, and an AMD Athlon 1GHz machine. The results

were highly dependent on the cache coherence of the rendered image and as such it was

difficult to find an exact number for intersection tests per second. Table 4 summarizes

these results as well as listing the results of an Intel SSE implementation presented in the

paper Interactive Rendering with Coherent Ray Tracing [18].

It should be noted that the result using the SSE method was derived by the author

by profiling the intersection test code at the cycle level. Although this can result in very

accurate timing information for a given test case, it does not take into account cache

issues. This means that in practice the performance of this method would likely be

substantially lower.

5.3 Bounding Hierarchy Performance

5.3.1 Overview

The performance of the completed bounding hierarchy unit is highly dependent on

the type of scene and the quality of the bounding hierarchy used. To try and gain an

accurate insight into the performance of this unit several synthetic test images will be

used that are designed to stress the system. The first two sets of tests are such that they

minimize the advantage that the bounding hierarchy provides the raytracing processor,

thereby stressing it. The other set of tests is designed to provide test images that offer

very little cache coherence. These types of tests are designed to stress the software

raytracers that are used for comparison.

5.3.2 Single Large Faceted Sphere Tests

The following test sets consist of a single sphere of a constant radius. The sphere

is contained within a fixed bounding hierarchy and only the number of triangles that

approximate the sphere is varied. This test is designed to increase the number of triangles

contained within a leaf node while maintaining a constant bounding hierarchy structure.

This will effectively eliminate the logarithmic benefit of the bound system.

Table 5 summarizes the results using triangle counts from 8 to 8196. The timing

results for the raytracing processor are derived from a cycle accurate count of the render

times. The performance numbers for the software implementation are extracted from the

raytracer’s result report and do not include the time it takes to parse the input data file.

They are accurate only to the nearest second and as such do not show small trends easily.

Figure 12: Single Sphere Results (Expanded View). Axes: # of Triangles versus Render Time (sec).

Figure 13: Single Sphere Test Set Results. Axes: # of Triangles versus Render Time (sec). Series: Raytracing Processor, POVray 3.0 ultraSPARC II, POVray 3.0 Athlon 1GHz.

Order   # Triangles   Cycles      Hardware   POVray 3.0†   POVray 3.0‡
0       8             9554009     0.191s     11s           2s
1       32            9468089     0.189s     11s           2s
2       128           9680532     0.193s     10s           2s
3       512           10886906    0.218s     11s           2s
4       2048          15792096    0.316s     11s           2s
5       8196          35154476    0.703s     11s           3s

†ultraSPARC II 450MHz   ‡Athlon 1GHz

Table 5: One Sphere Test Set Results

Figures 12 and 13 show that once the advantages of the bounding hierarchy are

removed the performance is reduced to a linear relation between the number of triangles

and the render time. This is expected as raytracing is O(n) when a bounding hierarchy is

not used. The fast barycentric ray triangle intersection unit in the raytracing processor is

also able to outperform the software implementation by a factor that varies from 4 to 10

times.

5.3.3 A Grid of Faceted Spheres

The previous test was designed to maintain a constant bounding hierarchy

structure while varying the number of triangles in each leaf. This test is designed to stress

the raytracing processor by maintaining a constant number of triangles in any given leaf

node while varying the number of leaf nodes that exist. The new leaf nodes will be

distributed uniformly across the image to ensure that every node is visible.

Table 6 summarizes the results of this test set. Once again the software numbers

provide only an approximate indication of average performance and are not accurate

enough to reveal any trends.

Order   # Triangles   Cycles      Hardware   POVray 3.0†   POVray 3.0‡
1       2048          7027597     0.141s     11s           2s
4       8192          10509865    0.210s     7s            2s
7       14336         14294711    0.286s     7s            2s
14      28672         22394396    0.448s     7s            2s
17      34816         25693610    0.514s     7s            2s
21      43008         29939633    0.599s     7s            2s
23      47104         32425081    0.649s     7s            2s
26      53248         36170720    0.723s     7s            2s
28      57344         38064065    0.761s     8s            3s

†ultraSPARC II 450MHz   ‡Athlon 1GHz

Table 6: Grid of Sphere Test Set Results

Figure 14: Grid of Sphere Test Set Results. Axes: # of Triangles (x 10^4) versus Render Time (sec). Series: Raytracing Processor, POVray 3.0 ultraSPARC II, POVray 3.0 Athlon 1GHz.

Figure 15: Grid of Sphere Test Set Results (Expanded View). Axes: # of Triangles (x 10^4) versus Render Time (sec).

Figures 14 and 15 once again show that the performance is close to linear with

respect to the number of triangles. This results from the structure of the bounding

hierarchy being uniformly distributed. Since the render time is proportional to the

number of pixel rays that hit a bounding box, doubling the number of bounding boxes

will also double the screen coverage that the boxes provide. The net result is a linear

relationship.

Even under this degenerate case the raytracing processor is still 4 to 15 times

faster than the software implementation.

5.3.4 Cache Incoherent Test Set

The primary advantage that a software implementation has over the hardware

raytracing processor is its effective memory bandwidth. Through the use of highspeed

data caching, a software implementation can access far more data than the uncached

raytracing processor. However these caches are relatively small and great care must be

taken to ensure that data is accessed coherently to avoid cache misses. This was not an

issue in the scenes tested previously as they were small and very coherent. However, a

real world scene tends to be far less coherent. It is this incoherency that removes most of

the benefit that caching could provide. To examine this behaviour, several test sets that

have been designed to be difficult to cache will be tested.

To ensure that caching effects are at a minimum, a scene must be designed such

that a large number of triangles need to be tested against a given ray. This will ensure that

more triangle data must be accessed than can be stored in the cache. To achieve this

effect a test scene was created by placing several objects behind each other. The net

effect is that any ray that strikes the front object’s bounding box, but misses the contained

object triangles, will also intersect the bounding boxes of the remaining objects. This

will require the ray to be intersected with every triangle for every object.

Test Set    Cycles      Hardware   POVray †   POVray ‡
SphereZ     17890974    0.358s     46s        6s
SphereZ2    24931862    0.499s     84s        10s
SphereZ3    16568132    0.331s     94s        10s

† UltraSPARC II 450 MHz
‡ Athlon 1 GHz

Table 7: Cache Incoherent Test Set Results

Table 7 summarizes the results for three test sets. It is clear that even a few cache misses seriously slow the performance of the software implementation. Instead of being 5 to 15 times faster, as it was for coherent scenes, the raytracing processor is 17 to 30 times faster in these test sets.
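The ratios quoted above can be re-derived from Table 7. The short Python sketch below divides each POVray time by the hardware time, taking the hardware time as the cycle count divided by the 50 MHz clock quoted in the conclusions. The function and variable names are illustrative only and are not part of the thesis.

```python
# Re-derive the speedup figures from Table 7.
# Assumption: hardware time = cycle count / 50 MHz (clock rate stated in Section 6.1.1).
CLOCK_HZ = 50_000_000

# (test set, hardware cycles, POVray s on UltraSPARC II 450 MHz, POVray s on Athlon 1 GHz)
TABLE_7 = [
    ("SphereZ", 17_890_974, 46.0, 6.0),
    ("SphereZ2", 24_931_862, 84.0, 10.0),
    ("SphereZ3", 16_568_132, 94.0, 10.0),
]


def speedups(rows, clock_hz=CLOCK_HZ):
    """Return {name: (hardware_seconds, speedup_vs_sparc, speedup_vs_athlon)}."""
    result = {}
    for name, cycles, sparc_s, athlon_s in rows:
        hw_s = cycles / clock_hz
        result[name] = (hw_s, sparc_s / hw_s, athlon_s / hw_s)
    return result


if __name__ == "__main__":
    for name, (hw_s, vs_sparc, vs_athlon) in speedups(TABLE_7).items():
        print(f"{name}: {hw_s:.3f}s hardware, "
              f"{vs_sparc:.0f}x vs UltraSPARC, {vs_athlon:.0f}x vs Athlon")
```

Against the 1 GHz Athlon column this gives ratios of roughly 17, 20 and 30, which matches the 17 to 30 range quoted in the text.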


6 Conclusions and Future Work

6.1 Conclusions

6.1.1 Barycentric Ray Triangle Intersection Unit

The core of the raytracing processor, the barycentric ray triangle intersection unit, ran at least an order of magnitude faster than the common software raytracer POVray 3.0. Even when compared to an idealized parallel implementation that ignored memory access time, the hardware approach was still 30% faster. Overall, the hardware implementation, which ran at 50 MHz, easily outperformed the software implementations running on a processor with a clock rate of 1 GHz. The main reason the raytracing processor ended up so much faster than software raytracers is that it could utilize memory better.

Even though the hardware and software approaches have similar maximum memory bandwidth, the software method is unable to pipeline reads. The custom nature of the raytracing processor allows it to preload all its required data and so fully utilize the memory subsystem. A software implementation does not have this freedom and, as such, must wait for memory transactions to complete prior to requesting the next block of data. It is this factor, more than any other, that gives the raytracing processor its biggest advantage.
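The effect of pipelining reads can be illustrated with a toy cycle-count model. The sketch below assumes a fixed memory latency and a memory that accepts one new request per cycle. The latency value and the function names are hypothetical and are not measurements from the TM3a system.

```python
def blocking_read_cycles(num_reads: int, latency: int) -> int:
    """Each read is issued only after the previous one has returned."""
    return num_reads * latency


def pipelined_read_cycles(num_reads: int, latency: int) -> int:
    """One request is issued per cycle; data streams back after the initial latency."""
    if num_reads == 0:
        return 0
    return latency + (num_reads - 1)


if __name__ == "__main__":
    latency = 4  # hypothetical number of cycles from request to data
    for n in (1, 16, 1024):
        b = blocking_read_cycles(n, latency)
        p = pipelined_read_cycles(n, latency)
        print(f"{n:5d} reads: blocking {b:5d} cycles, pipelined {p:5d} cycles")
```

In this model a long burst of pipelined reads approaches one read per cycle, while blocking reads stay at one read per latency period.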

6.1.2 Hierarchy Controller

Although the performance of the barycentric ray triangle intersection unit is very good, it is far too slow to render large images due to its O(n) complexity. The hierarchy controller allowed the processor to take advantage of a potential O(log n) render time by using tree structures. Initially it was unclear whether the sequential nature of tree traversals could be accelerated using hardware, but the results speak for themselves. Although the hardware implementation provides little parallel acceleration or algorithmic improvement, it does provide much better memory access.
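As a rough illustration of the O(n) versus O(log n) argument, the sketch below counts intersection tests for a single ray in an idealized model. It assumes a perfectly balanced hierarchy in which the ray descends into exactly one child per level. The branching factor and leaf size are hypothetical parameters, not values taken from the thesis hardware, and real rays usually intersect several boxes per level, so actual counts are higher.

```python
def flat_tests(num_triangles: int) -> int:
    """Without a hierarchy every triangle is tested against the ray."""
    return num_triangles


def hierarchy_tests(num_triangles: int, branching: int = 8, leaf_size: int = 8) -> int:
    """Idealized count: test every child box at each level, follow one child,
    then test the triangles of the single leaf that is reached."""
    tests = 0
    remaining = num_triangles
    while remaining > leaf_size:
        tests += branching                      # bounding box tests at this level
        remaining = -(-remaining // branching)  # ceiling division: triangles under one child
    return tests + remaining                    # triangle tests in the final leaf


if __name__ == "__main__":
    for n in (64, 512, 64_000):
        print(f"{n:6d} triangles: flat {flat_tests(n):6d} tests, "
              f"hierarchy {hierarchy_tests(n):3d} tests")
```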

Under the raytracing processor's worst-case operating conditions it still outperformed the corresponding software implementation by 500% to 1500%. When compared against the software's worst-case operation, poor cache coherence, the processor performed 1700% to 3000% faster. Considering that the total available memory bandwidth and system clock rate are much lower than those of a corresponding software implementation, these numbers are quite impressive.

6.1.3 General Discussion

The raytracing algorithm offers many possible degrees of parallelization and is also highly predictable. It is possible, through clever hardware controllers, to predict every memory read sufficiently far in advance to fully utilize the available memory bandwidth. This is not to say that a data cache might not be useful in a future raytracing processor.

The ability of the bounding controller to operate with intersection units that have arbitrarily high latency allows the system clock rate to be ramped as high as needed by deepening the pipeline. In fact, data path speeds that require more data than external memory could provide would still be useful, provided an internal high-speed cache is used. Although there is no guarantee that a small group of random rays will share any coherence, a large enough group of rays would have a much higher chance. If a number of rays are accessible to the raytracing processor at any given time, then this additional coherence could be exploited and a memory cache used.

To summarize, this project revealed the memory advantages that a hardware implementation could utilize, and found that pipeline depth can easily be handled. Unfortunately, the resulting system was still too slow to be used for real time graphics. As an accelerated offline renderer it performs much better than software, but it remains too slow for real time use. Perhaps, if transistor technology continues to advance at its current rate, real time raytracing will be possible in a decade, but for now it remains just a dream.


Figure 16: Alternative Raytracing Architecture

6.2 Future Work

This project concentrated primarily on accelerating the math behind raytracing through deep pipelines and parallel implementations. However, it was discovered that the real advantages for raytracing come from predictive memory access and clever caching methods. A very interesting future project would be to design a system that operates using high-speed external memory and works on a large number of rays in parallel to exploit possible memory coherency.

Such a system could be designed as shown in Figure 16. A number of rays are first entered into a ray state buffer, as shown. A unit would then read these rays sequentially and, using some hierarchical structure, determine which leaf node should be intersected next based on the input ray's state. The ray, along with the leaf node ID, would be passed to a process queue. This queue would be examined by a controller whose task is to pass rays from the queue to the processing unit such that cache misses are minimized. Additionally, this controller could preemptively load the triangle cache if it foresees a future cache miss. Once a leaf node has been processed, the ray is either retired or returned to the ray state buffer for further hierarchical processing. Ideally, if a large enough number of rays are processed simultaneously, cache misses can be nearly eliminated.
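To make the idea concrete, the following sketch simulates the process queue in software. It assumes a small least-recently-used triangle cache keyed by leaf node ID, and compares issuing rays in arrival order with issuing them grouped by leaf. The cache size, the queue contents and all names are hypothetical; the thesis does not specify a scheduling policy.

```python
from collections import OrderedDict


def count_misses(leaf_ids, cache_size):
    """Count cache misses for a sequence of leaf node IDs using an LRU triangle cache."""
    cache = OrderedDict()
    misses = 0
    for leaf in leaf_ids:
        if leaf in cache:
            cache.move_to_end(leaf)        # mark as most recently used
        else:
            misses += 1
            cache[leaf] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used leaf
    return misses


def schedule_by_leaf(queue):
    """Reorder (ray_id, leaf_id) pairs so that rays sharing a leaf are issued together."""
    groups = OrderedDict()
    for ray_id, leaf_id in queue:
        groups.setdefault(leaf_id, []).append((ray_id, leaf_id))
    return [entry for group in groups.values() for entry in group]


if __name__ == "__main__":
    # Hypothetical queue: 12 rays cycling over 4 leaf nodes.
    queue = [(ray, ray % 4) for ray in range(12)]
    fifo = count_misses([leaf for _, leaf in queue], cache_size=2)
    grouped = count_misses([leaf for _, leaf in schedule_by_leaf(queue)], cache_size=2)
    print(f"arrival order: {fifo} misses, grouped by leaf: {grouped} misses")
```

With this particular queue and a two-entry cache, arrival order misses on every access, while the grouped order misses only once per leaf.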

Another benefit of this architecture is that the units contained within the dashed box can be duplicated for higher performance. In theory, each of these duplicated subsystems could be implemented on a separate chip with its own external memory. These two memories would not have to contain duplicate data: if the leaf nodes are randomly distributed between the duplicated systems and the number of rays being processed is large enough, then both units would be highly utilized processing whichever nodes they have access to. The net effect would be a higher total memory bandwidth without the associated high pin counts or bus speeds.

Such a system could easily achieve an order of magnitude speed increase over the

implementation presented in this paper. If the data path clock speeds could be ramped

high enough then it could be possible to achieve another order of magnitude increase.

Although this speed would still be too slow for real time recursive raytracing, it would be

another step in the right direction.

6.3 Closing Words

Although raytracing appears mathematically superior to raster graphics, its large constant still makes it too slow to be competitive. Perhaps as processing power increases, and scene sizes with it, the logarithmic nature of raytracing will win out, but for now and the near future, raster graphics are king.


7 References

1. Kornel, K. 2D and 3D Perspective Transformations. In Computers & Graphics, 1990.

2. Samet, H.J. Design and Analysis of Spatial Data Structures: Quadtrees, Octrees, and other Hierarchical Methods. Addison-Wesley, Reading, MA, 1989.

3. Coorg, S. Teller, S. Real-Time Occlusion Culling for Models with Large Occluders. Proceedings of the Symposium on Interactive 3D Graphics, 1997.

4. Luebke, D. Georges, C. Portals and mirrors: Simple, fast evaluation of potentially visible sets. In ACM Interactive 3D Graphics Conference, Monterey, CA, 1995.

5. Clark, J.H. Hierarchical geometric models for visible surface algorithms. Communications of the ACM, 19(10):547-554, 1976.

6. Foley, J. Van Dam, A. Hughes, J. Feiner, S. Computer Graphics: Principles and Practice. Addison Wesley, Reading, Mass., 1990.

7. Akeley, K. Segal, M. The OpenGL Graphics System: A Specification (Version 1.3). http://www.opengl.org/developers/documentation/version1_3/glspec13.pdf, 2001.

8. Pixar. The RenderMan Interface (Version 3.2). http://www.pixar.com/renderman/developers_corner/rispec/rispec_pdf/RISpec3_2.pdf

9. Badouel, D. An Efficient Ray-Polygon Intersection. In Graphics Gems, 390-393. Academic Press Inc. 1990.

10. Stolfi, J. Oriented Projective Geometry. Academic Press. 1991.

11. Möller, T. Trumbore, B. Fast, minimum storage ray-triangle intersection. In the Journal of Graphics Tools. 2(1):21-28, 1997.

12. Havran, V. Kopal, T. Bittner, J. Zara, J. Fast Robust Bsp Tree Traversal Algorithm for Ray Tracing. In the Journal of Graphics Tools. 2(4):15-23, 1997.

13. Agate, M. Grimsdale, L. Lister, P.F. The HERO Algorithm for Ray-Tracing Octrees. In Advances in Computer Graphics Hardware IV. 61-73, Springer Verlag. 1991.

14. Havran, V. Sixta, F. Comparison of Hierarchical Grids. In Ray Tracing News, 12(1), http://www.acm.org/pubs/tog/resources/RTNews/html/rtnv12n1.html#art3. 1999.

15. Ercegovac, M.D. On-line Arithmetic: An Overview. In Real Time Signal Processing VII. 86-93, SPIE, 1984.

16. Avizienis. Signed-Digit Number Representations For Fast Parallel Arithmetic. In IRE Trans. Electron. Comput. Vol EC-10. 389-400, 1961.

17. POVray 3.0 Persistence of Vision Raytracer. http://www.povray.com

18. I. Wald, C. Benthin, M. Wagner, P. Slusallek. Interactive Rendering With Coherent Ray-Tracing. In Computer Graphics Forum, Eurographics. 2001.


Appendix A: Bound Controller State Machine


Appendix B: VHDL Code

**** boundcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity boundcontroller is
  generic(
    master : std_logic := '0';
    unitID : std_logic_vector(1 downto 0) := "00");
  port(
    max : out std_logic_vector(31 downto 0);
    maxwe : out std_logic;
    raygroupout : out std_logic_vector(1 downto 0);
    raygroupwe : out std_logic;
    raygroupid : out std_logic_vector(1 downto 0);
    enablenear : out std_logic;
    -- Bus Signals (to Ray Generation Unit)
    raygroup : in std_logic_vector(1 downto 0);
    validraygroup : in std_logic;
    busy : out std_logic;
    -- Bus Signals (to RayTri Unit)
    triIDvalid : out std_logic;
    triID : out std_logic_vector(15 downto 0);
    wanttriID : in std_logic;
    -- Sorted stack & list buffer signals
    l0reset : out std_logic;
    baseaddress : out std_logic_vector(1 downto 0);
    newdata : in std_logic;
    boundNodeIDout: buffer std_logic_vector(9 downto 0);
    resultID : in std_logic_vector(1 downto 0);
    -- List Handler Signals
    hitmask : buffer std_logic_vector(2 downto 0);
    ldataready,lempty : in std_logic;
    llevel : in std_logic_vector(1 downto 0);
    lmax0,lmax1,lmax2 : in std_logic_vector(31 downto 0);
    lboundNodeID : in std_logic_vector(9 downto 0);
    lack : out std_logic;
    lhreset : out std_logic;
    -- Leaf Indirection Memory Interface
    addrind : out std_logic_vector(9 downto 0);
    addrindvalid : out std_logic;
    dataind : in std_logic_vector(31 downto 0);
    dataindvalid : in std_logic;
    -- Triangle List Memory Interface
    tladdr : buffer std_logic_vector(17 downto 0);
    tladdrvalid : out std_logic;
    tldata : in std_logic_vector(63 downto 0);
    tldatavalid : in std_logic;
    -- Result Inputs
    t1in,t2in,t3in : in std_logic_vector(31 downto 0);
    u1in,u2in,u3in,v1in,v2in,v3in : in std_logic_vector(15 downto 0);
    id1in,id2in,id3in : in std_logic_vector(15 downto 0);
    hit1in,hit2in,hit3in : in std_logic;
    -- Result Outputs
    t1,t2,t3 : out std_logic_vector(31 downto 0);
    u1,u2,u3,v1,v2,v3 : out std_logic_vector(15 downto 0);
    id1,id2,id3 : out std_logic_vector(15 downto 0);
    hit1,hit2,hit3 : out std_logic;
    bcvalid : out std_logic;
    -- Counter interface Signals
    done : in std_logic_vector(1 downto 0);
    resetcnt : out std_logic;
    -- Handshake signals
    passCTSout : out std_logic;
    passCTSin : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic;
    -- debugging interface
    statepeek : out std_logic_vector(4 downto 0);
    debugstoplevel : in std_logic_vector(1 downto 0);
    debugleafbreak : in std_logic;
    debugsubcount : out std_logic_vector(1 downto 0);
    debugcount : out std_logic_vector(13 downto 0));
end;

architecture rtl of boundcontroller is
  type state_type is (S_IDLE,S_WAITCTS,S_ACTIVE,S_SEND,S_WAIT1,S_WAIT2,S_WAITCOMPLETE,
                      S_WAITCTS2,S_GETCTS,S_NODECOMPLETE,S_LEAFINDIRECT,S_LEAFACTIVE,
                      S_LEAFSEND,S_LEAFWAIT1,S_LEAFWAIT2,S_LEAFCOMPLETE,S_LEAFGIVECTS,
                      S_LEAFGIVEDONE,S_LEAFPROCESS,S_LEAFGETCTS);
  signal state : state_type;
  signal next_state : state_type;
  signal max0,max1,max2 : std_logic_vector(31 downto 0);
  signal cts : std_logic;
  signal addr,startAddr : std_logic_vector(11 downto 0);
  signal resetcount : std_logic_vector(2 downto 0);
  -- Leaf Node Signals
  signal count : std_logic_vector(13 downto 0);
  signal triDatalatch : std_logic_vector(63 downto 0);
  signal subcount : std_logic_vector(1 downto 0);
  signal maskcount : std_logic_vector(1 downto 0);
  signal debug : std_logic;
begin
  debugsubcount <= subcount;
  debugcount <= count;

  process (state)
  begin
    case state is
      when S_IDLE => statepeek <= "00001";
      when S_WAITCTS => statepeek <= "00010";
      when S_ACTIVE => statepeek <= "00011";
      when S_SEND => statepeek <= "00100";
      when S_WAIT1 => statepeek <= "00101";
      when S_WAIT2 => statepeek <= "00110";
      when S_WAITCOMPLETE => statepeek <= "00111";
      when S_WAITCTS2 => statepeek <= "01001";
      when S_GETCTS => statepeek <= "01010"; -- 10
      when S_NODECOMPLETE => statepeek <= "01011";
      when S_LEAFINDIRECT => statepeek <= "01100";
      when S_LEAFACTIVE => statepeek <= "01101";
      when S_LEAFSEND => statepeek <= "01110";
      when S_LEAFWAIT1 => statepeek <= "01111";
      when S_LEAFWAIT2 => statepeek <= "10000";
      when S_LEAFCOMPLETE => statepeek <= "10001";
      when S_LEAFGIVECTS => statepeek <= "10010";
      when S_LEAFGIVEDONE => statepeek <= "10011";
      when S_LEAFPROCESS => statepeek <= "10100";
      when S_LEAFGETCTS => statepeek <= "10101";
      when others => statepeek <= "11111";
    end case;
  end process;

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      raygroupout <= (others => '0');
      cts <= master;
      passCTSout <= '0';
      max0 <= (others => '0');
      max1 <= (others => '0');
      max2 <= (others => '0');
      addr <= (others => '0');
      startAddr <= (others => '0');
      boundNodeIDout <= (others => '0');
      resetcount <= (others => '0');
      hitmask <= (others => '1');
      lack <= '0';
      baseAddress <= (others => '0');
      l0reset <= '0';
      resetcnt <= '0';
      triIDvalid <= '0';
      triID <= (others => '0');
      addrind <= (others => '0');
      addrindvalid <= '0';
      tladdrvalid <= '0';
      tladdr <= (others => '0');
      tridatalatch <= (others => '0');
      maskcount <= (others => '0');
      subcount <= (others => '0');
      count <= (others => '0');
      hit1 <= '0'; hit2 <= '0'; hit3 <= '0';
      t1 <= (others => '0'); t2 <= (others => '0'); t3 <= (others => '0');
      u1 <= (others => '0'); u2 <= (others => '0'); u3 <= (others => '0');
      v1 <= (others => '0'); v2 <= (others => '0'); v3 <= (others => '0');
      id1 <= (others => '0'); id2 <= (others => '0'); id3 <= (others => '0');
      debug <= '0';

    elsif (rising_edge(clk)) then
      state <= next_state;
      addrind <= (others => '0');
      l0reset <= '0';
      lack <= '0';
      triIDvalid <= '0';
      triID <= (others => '0');
      if newdata = '1' and resultID = unitID then
        boundNodeIDout <= boundNodeIDout + 1;
      end if;
      if (done = unitID) or (state = S_LEAFCOMPLETE and newdata = '1' and resultID = unitID) then
        resetcnt <= '1';
      else
        resetcnt <= '0';
      end if;
      case state is
        when S_IDLE =>
          if validraygroup = '1' and cts = '1' then
            raygroupout <= raygroup;
          end if;
          if validraygroup = '1' and cts = '0' then
            cts <= '1';
            passCTSout <= '1';
          elsif validraygroup = '0' and cts = '1' and passCTSin = '1' then
            cts <= '0';
            passCTSout <= '1';
          end if;
        when S_WAITCTS =>
          if passCTSin = cts then
            passCTSout <= '0';
          end if;
        when S_ACTIVE =>
          resetcount <= "100";
          l0reset <= '1';
          addr <= (others => '0');
          startAddr <= (others => '0');
          boundNodeIDout <= (others => '0');
          baseAddress <= (others => '0');
          max0 <= (others => '1');
          max1 <= (others => '1');
          max2 <= (others => '1');
          hitMask <= (others => '1');
          hit1 <= '0'; hit2 <= '0'; hit3 <= '0';
        when S_SEND =>
          if (addr-startaddr /= 48) and (addr-startaddr /= 49) then
            triIDvalid <= '1';
          end if;
          triID <= "0000" & addr;
          addr <= addr+1;
          if resetcount = 5 then
            resetcount <= "000";
          else
            resetcount <= resetcount+1;
          end if;
        when S_WAITCOMPLETE =>
          if passCTSin = '1' and cts = '1' then
            cts <= '0';
            passCTSout <= '1';
          elsif done = unitID and cts = '0' then
            cts <= '1';
            passCTSout <= '1';
          end if;
        when S_WAITCTS2 =>
          if passCTSin = '0' then
            passCTSout <= '0';
          end if;
        when S_GETCTS =>
          if passCTSin = '0' then
            passCTSout <= '0';
          end if;
        when S_NODECOMPLETE =>
          resetcount <= "100";
          baseAddress <= llevel+1;
          boundNodeIDout <= (lBoundNodeID+1)(6 downto 0) & "000";
          addr <= (((lBoundNodeID+1)(7 downto 0) & "0000")+
                   ((lBoundNodeID+1)(6 downto 0) & "00000")) (11 downto 0);
          startaddr <= (((lBoundNodeID+1)(7 downto 0) & "0000")+
                        ((lBoundNodeID+1)(6 downto 0) & "00000")) (11 downto 0);
          if ldataready = '1' and (wantTriID = '1' or llevel = "10") and (llevel < debugstoplevel) then
            lack <= '1';
            l0reset <= '1';
          end if;
          if ldataready = '1' and llevel = "10" then
            addrind <= lboundNodeID-72;
            addrindvalid <= '1';
          end if;
        when S_LEAFINDIRECT =>
          tlAddr <= dataind(17 downto 0);
          count <= dataind(31 downto 18);
          if dataindvalid = '1' then
            addrindvalid <= '0';
            tladdrvalid <= '1';
          end if;
        when S_LEAFACTIVE =>
          tridatalatch <= tldata;
          subcount <= "10";
          maskcount <= "00";
          if (wanttriID = '1' and tldatavalid = '1') or (count = 0 or count = 1) then
            tladdr <= tladdr+1;
            tladdrvalid <= '0';
          end if;
        when S_LEAFSEND =>
          if maskcount = "11" then
            triID <= triDataLatch(15 downto 0);
          elsif maskcount = "10" then
            triID <= triDataLatch(31 downto 16);
          elsif maskcount = "01" then
            triID <= triDataLatch(47 downto 32);
          else
            triID <= triDataLatch(63 downto 48);
          end if;
          if count /= 0 then
            count <= count - 1;
            if count /= 1 then
              triIDvalid <= '1';
            end if;
            if maskcount = "01" then
              tladdrvalid <= '1';
            end if;
          end if;
        when S_LEAFWAIT2 =>
          if subcount /= 0 then
            subcount <= subcount - 1;
          end if;
          if maskcount = "11" then
            tlAddr <= tlAddr+1;
            tladdrvalid <= '0';
            triDataLatch <= tldata;
          end if;
          maskcount <= maskcount + 1;
        when S_LEAFCOMPLETE =>
          if (newdata = '0' or resultID /= unitID) and CTS = '1' and passCTSin = '1' then
            cts <= '0';
            passCTSout <= '1';
          end if;
        when S_LEAFGIVECTS =>
          if passCTSin = '0' and (newdata = '0' or resultID /= unitID) then
            passCTSout <= '0';
          end if;

        when S_LEAFGIVEDONE =>
          if passCTSin = '0' then
            passCTSout <= '0';
          end if;
        when S_LEAFPROCESS =>
          -- latch new hits
          if hit1in = '1' and hitmask(0) = '1' then
            t1 <= t1in; u1 <= u1in; v1 <= v1in; id1 <= id1in; hit1 <= '1'; hitmask(0) <= '0';
          end if;
          if hit2in = '1' and hitmask(1) = '1' then
            t2 <= t2in; u2 <= u2in; v2 <= v2in; id2 <= id2in; hit2 <= '1'; hitmask(1) <= '0';
          end if;
          if hit3in = '1' and hitmask(2) = '1' then
            t3 <= t3in; u3 <= u3in; v3 <= v3in; id3 <= id3in; hit3 <= '1'; hitmask(2) <= '0';
          end if;
          if cts='0' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1) = '1' and hit2in = '0') or
                          (hitmask(2) = '1' and hit3in = '0')) then
            passCTSout <= '1';
            cts <= '1';
          end if;
        when S_LEAFGETCTS =>
          if passCTSin = '1' then
            passCTSout <= '0';
          end if;
      end case;
    end if;
  end process;

  busy <= '0' when (state = S_IDLE) else '1';

  process (state,validraygroup,cts,passCTSin,wantTriID,addr,startAddr,done,ldataready,
           lempty,llevel,max0,max1,max2,resetcount,dataindvalid,tldatavalid,
           hit1in,hit2in,hit3in,hitmask,resultId,newdata,subcount,count)
  begin
    max <= (others => '0');
    maxwe <= '0';
    raygroupID <= (others => '0');
    enablenear <= '0';
    raygroupwe <= '0';
    bcvalid <= '0';
    lhreset <= '0';
    case state is
      when S_IDLE =>
        lhreset <= '1';
        if validraygroup = '1' and cts = '1' then
          next_state <= S_ACTIVE;
        elsif validraygroup = '1' and cts = '0' then
          next_state <= S_WAITCTS;
        elsif validraygroup = '0' and passCTSin = '1' and cts = '1' then
          next_state <= S_WAITCTS;
        else
          next_state <= S_IDLE;
        end if;
      when S_WAITCTS =>
        if passCTSin = cts then
          next_state <= S_IDLE;
        else
          next_state <= S_WAITCTS;
        end if;
      when S_ACTIVE =>
        if wantTriID = '1' then
          next_state <= S_SEND;
        else
          next_state <= S_ACTIVE;
        end if;
      when S_SEND =>
        if addr = startAddr then
          max <= max0;
          maxwe <= '1';
        end if;
        if (addr-startAddr >= 1) and (addr-startAddr /= 49) then
          raygroupID <= unitID;
        end if;

        next_State <= S_WAIT1;
        if resetcount = 5 then
          raygroupwe <= '1';
        end if;
        enablenear <= '1';
      when S_WAIT1 =>
        if addr = startAddr then
          max <= max1;
          maxwe <= '1';
          raygroupID <= "01";
        end if;
        if addr-startaddr=49 then
          next_state <= S_WAITCOMPLETE;
        else
          next_state <= S_WAIT2;
        end if;
      when S_WAIT2 =>
        if addr = startAddr then
          max <= max2;
          maxwe <= '1';
          raygroupID <= "10";
        end if;
        next_state <= S_SEND;
      when S_WAITCOMPLETE =>
        if passCTSin = '1' and cts = '1' then
          next_state <= S_WAITCTS2;
        elsif done = unitID and cts = '0' then
          next_state <= S_GETCTS;
        elsif done = unitID and cts = '1' then
          next_state <= S_NODECOMPLETE;
        else
          next_state <= S_WAITCOMPLETE;
        end if;
      when S_WAITCTS2 =>
        if passCTSin = '0' then
          next_state <= S_WAITCOMPLETE;
        else
          next_state <= S_WAITCTS2;
        end if;
      when S_GETCTS =>
        if passCTSin = '1' then
          next_state <= S_NODECOMPLETE;
        else
          next_state <= S_GETCTS;
        end if;
      when S_NODECOMPLETE =>
        if lempty = '1' then
          next_state <= S_IDLE;
          bcvalid <= '1';
        elsif ldataready = '1' and llevel = "10" and (debugstoplevel > "10") then
          next_state <= S_LEAFINDIRECT;
        elsif ldataready = '1' and wantTriID = '1' and llevel < debugstoplevel then
          next_state <= S_SEND;
        else
          next_state <= S_NODECOMPLETE;
        end if;
      when S_LEAFINDIRECT =>
        if dataindvalid = '1' then
          next_state <= S_LEAFACTIVE;
        else
          next_state <= S_LEAFINDIRECT;
        end if;
      when S_LEAFACTIVE =>
        if count = 0 or count = 1 then
          next_state <= S_NODECOMPLETE;
        elsif wanttriID = '1' and tldatavalid = '1' then
          next_state <= S_LEAFSEND;
        else
          next_state <= S_LEAFACTIVE;
        end if;
      when S_LEAFSEND =>
        if count /= 0 then
          next_state <= S_LEAFWAIT1;
        else
          next_state <= S_LEAFCOMPLETE;

        end if;
        if subcount = "10" then
          max <= max0;
          maxwe <= '1';
        end if;
        if (subcount = "01") then
          raygroupID <= unitID;
        else
          raygroupID <= "00";
        end if;
        enablenear <= '0';
        if subcount = "01" or count = 0 then
          raygroupwe <= '1';
        end if;
      when S_LEAFWAIT1 =>
        next_state <= S_LEAFWAIT2;
        if subcount = "10" then
          max <= max1;
          maxwe <= '1';
          raygroupID <= "01";
        end if;
      when S_LEAFWAIT2 =>
        next_state <= S_LEAFSEND;
        if subcount = "10" then
          max <= max2;
          maxwe <= '1';
          raygroupID <= "10";
        end if;
      when S_LEAFCOMPLETE =>
        if (newdata = '0' or resultID /= unitID) and CTS = '1' and passCTSin = '1' then
          next_state <= S_LEAFGIVECTS;
        elsif newdata = '1' and resultID = unitID then
          next_state <= S_LEAFPROCESS;
        else
          next_state <= S_LEAFCOMPLETE;
        end if;
      when S_LEAFGIVECTS =>
        if newdata = '1' and resultID = unitID then
          next_state <= S_LEAFGIVEDONE;
        elsif passCTSin = '0' then
          next_state <= S_LEAFCOMPLETE;
        else
          next_state <= S_LEAFGIVECTS;
        end if;
      when S_LEAFGIVEDONE =>
        if passCTSin = '0' then
          next_state <= S_LEAFPROCESS;
        else
          next_state <= S_LEAFGIVEDONE;
        end if;
      when S_LEAFPROCESS =>
        if debugLeafBreak = '1' then
          next_state <= S_IDLE;
        elsif cts = '0' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1)='1' and hit2in= '0') or
                             (hitmask(2) = '1' and hit3in = '0')) then
          next_state <= S_LEAFGETCTS;
        elsif cts = '1' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1)='1' and hit2in= '0') or
                             (hitmask(2) = '1' and hit3in = '0')) then
          next_state <= S_NODECOMPLETE;
        else
          next_state <= S_IDLE;
          bcvalid <= '1';
        end if;
      when S_LEAFGETCTS =>
        if passCTSin = '0' then
          next_state <= S_LEAFGETCTS;
        else
          next_state <= S_NODECOMPLETE;
        end if;
    end case;
  end process;

end rtl;

**** crossproduct.vhd *****
----------------------------------------------
-- Pipelined Vector Cross Product Component
-- C = A x B
-- Performs a vector cross product in 2
-- clock cycles. Synplify's pipeline
-- option should be enabled to better
-- balance the pipeline cycles.
----------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity crossproduct is
  generic (
    widthA : natural := 32;
    widthB : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(widthA-1 downto 0);
    Bx,By,Bz : in std_logic_vector(widthB-1 downto 0);
    Cx,Cy,Cz : out std_logic_vector(widthA+widthB downto 0);
    clk : in std_logic);
end;

architecture rtl of crossproduct is
  signal AyBz, AzBy, AzBx : std_logic_vector(widthA+widthB-1 downto 0);
  signal AxBz, AxBy, AyBx : std_logic_vector(widthA+widthB-1 downto 0);
begin
  process(clk)
  begin
    if (rising_edge(clk)) then
      -- Stage 1: register the six partial products
      AyBz <= Ay*Bz;
      AzBy <= Az*By;
      AzBx <= Az*Bx;
      AxBz <= Ax*Bz;
      AxBy <= Ax*By;
      AyBx <= Ay*Bx;

      -- Stage 2: sign-extend by one bit and subtract
      Cx <= (AyBz(widthA+widthB-1) & AyBz) - (AzBy(widthA+widthB-1) & AzBy);
      Cy <= (AzBx(widthA+widthB-1) & AzBx) - (AxBz(widthA+widthB-1) & AxBz);
      Cz <= (AxBy(widthA+widthB-1) & AxBy) - (AyBx(widthA+widthB-1) & AyBx);
    end if;
  end process;

end rtl;
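As a cross-check on the two-cycle latency stated in the header comment, the sketch below is a small Python behavioural model of the same register structure. It is an illustration written for this appendix, not part of the thesis code: the class name is hypothetical, and because Python integers are unbounded the fixed output width of the VHDL is not modelled.

```python
class CrossProductPipeline:
    """Behavioural model of a two-stage registered cross product C = A x B."""

    def __init__(self):
        # Stage 1 registers: AyBz, AzBy, AzBx, AxBz, AxBy, AyBx
        self.products = (0, 0, 0, 0, 0, 0)
        # Stage 2 registers: Cx, Cy, Cz
        self.out = (0, 0, 0)

    def clock(self, a, b):
        """Apply one rising clock edge with inputs a and b; return the registered output."""
        ay_bz, az_by, az_bx, ax_bz, ax_by, ay_bx = self.products
        # Stage 2 reads the product registers as they were before this edge.
        next_out = (ay_bz - az_by, az_bx - ax_bz, ax_by - ay_bx)
        ax, ay, az = a
        bx, by, bz = b
        self.products = (ay * bz, az * by, az * bx, ax * bz, ax * by, ay * bx)
        self.out = next_out
        return self.out


if __name__ == "__main__":
    pipe = CrossProductPipeline()
    pipe.clock((1, 0, 0), (0, 1, 0))          # x cross y enters the pipeline
    print(pipe.clock((0, 0, 0), (0, 0, 0)))   # result appears on the second edge
```

The result for a given input pair appears on the second clock edge after it is applied, which corresponds to the two clock cycles described in the component header.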

**** delay.vhd ****
library ieee;
use ieee.std_logic_1164.all;

entity delay is
  generic (
    width : natural := 32;
    depth : natural := 1);
  port(
    datain : in std_logic_vector(width-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of delay is
  type delayarray is array (0 to depth-1) of std_logic_vector (width-1 downto 0);
  signal buff : delayarray;
begin
  dataout <= buff(depth-1);

  process(clk)
  begin
    if (rising_edge (clk)) then
      buff(0) <= datain;
      if (depth > 1) then
        row : for k in 0 to depth-2 loop
          buff(k+1) <= buff(k);
        end loop row;
      end if;
    end if;
  end process;
end rtl;


**** divide.vhd ****
--------------------------------------------------
-- Parameterized Fixed Point Divide Component
--
-- Qout = (A / B)*2^widthFrac
--
-- Performs unsigned fixed point division
-- between 2 numbers. The divide is pipelined
-- such that 1 quotient bit is generated per
-- clock cycle. The throughput is one divide
-- per cycle for any size input.
--
-- widthOut specifies the total output width
-- widthFrac specifies how many of the output
-- bits are in fact fractional
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity divide is
  generic (
    widthA : natural := 64;
    widthOut : natural := 32; -- Width of the output
    widthB : natural := 64;
    widthFrac : natural := 15); -- Fraction bits in output
  port(
    A : in std_logic_vector(widthA-1 downto 0);
    B : in std_logic_vector(widthB-1 downto 0);
    Qout : out std_logic_vector(widthOut-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of divide is
  type stdlogicarrayn is array(0 to widthOut-1) of std_logic_vector(widthA+widthFrac-1 downto 0);
  type stdlogicarraym is array(0 to widthOut-1) of std_logic_vector(widthOut-1 downto 0);
  type stdlogicarrayo is array(0 to widthOut-1) of std_logic_vector(widthB-1 downto 0);

  signal c : stdlogicarrayn;
  signal q : stdlogicarraym;
  signal bp : stdlogicarrayo;
begin
  c(0)(widthA+widthFrac-1 downto widthFrac) <= A;
  c(0)(widthFrac-1 downto 0) <= (others => '0');
  q(0) <= (others => '0');
  bp(0) <= B;

  process (clk)
  begin
    if (clk'event and clk = '1') then
      row: for k in 0 to widthOut-2 loop
        if (c(k)(widthA+widthFrac-1 downto widthOut-1-k)-bp(k) >= 0) then
          q(k+1) <= q(k)(widthOut-2 downto 0) & '1';
          c(k+1) <= (c(k)(widthA+widthFrac-1 downto widthOut-1-k)-bp(k))
                    (k+(widthA-widthOut)+widthFrac downto 0) &
                    c(k)(widthOut-k-2 downto 0);
        else
          q(k+1) <= q(k)(widthOut-2 downto 0) & '0';
          c(k+1) <= c(k);
        end if;
        bp(k+1) <= bp(k);
      end loop row;

      if (c(widthOut-1)-bp(widthOut-1) >= 0) then
        Qout <= q(widthOut-1)(widthOut-2 downto 0) & '1';
      else
        Qout <= q(widthOut-1)(widthOut-2 downto 0) & '0';
      end if;
    end if;
  end process;

end rtl;
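The bit-per-stage scheme used by the divider can be checked against a small software reference. The Python sketch below is an illustrative model written for this appendix, with a hypothetical function name. It assumes a non-negative dividend, a positive divisor, and a quotient that fits in widthOut bits. Each loop iteration corresponds to one pipeline stage, which compares the shifted remainder with the divisor and produces one quotient bit.

```python
def divide_fixed(a: int, b: int, width_out: int = 32, width_frac: int = 15) -> int:
    """Reference model of Qout = (A / B) * 2^widthFrac using restoring division.

    Assumes a >= 0, b > 0 and a quotient that fits in width_out bits.
    """
    remainder = a << width_frac          # dividend with fractional bits appended
    quotient = 0
    for shift in range(width_out - 1, -1, -1):
        quotient <<= 1
        if (remainder >> shift) >= b:    # does the divisor fit at this bit position?
            remainder -= b << shift
            quotient |= 1
    return quotient


if __name__ == "__main__":
    q = divide_fixed(3, 2)               # 1.5 with 15 fractional bits
    print(q, q / 2**15)
```

Under these assumptions the result equals the integer quotient of (A * 2^widthFrac) by B, so it can serve as a reference when simulating the VHDL with positive operands.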


**** dotproduct.vhd ****
--------------------------------------------
-- Pipelined Vector Dot Product Component
-- C = A . B
-- Performs a vector dot product in 2
-- clock cycles. Synplify's pipeline
-- option should be enabled to better
-- balance the pipeline cycles.
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity dotproduct is
  generic (
    widthA : natural := 32;
    widthB : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(widthA-1 downto 0);
    Bx,By,Bz : in std_logic_vector(widthB-1 downto 0);
    C : out std_logic_vector(widthA+widthB+1 downto 0);
    clk : in std_logic);
end;

architecture rtl of dotproduct is
  signal AxBx, AyBy, AzBz : std_logic_vector(widthA+widthB-1 downto 0);
begin
  process(clk)
  begin
    if (rising_edge(clk)) then
      AxBx <= Ax*Bx;
      AyBy <= Ay*By;
      AzBz <= Az*Bz;
      C <= (AxBx(widthA+widthB-1) & AxBx(widthA+widthB-1) & AxBx) +
           (AyBy(widthA+widthB-1) & AyBy(widthA+widthB-1) & AyBy) +
           (AzBz(widthA+widthB-1) & AzBz(widthA+widthB-1) & AzBz);
    end if;
  end process;

end rtl;

**** dpram.vhd ****
-------------------------------------------------------
-- Dual Ported Ram Module w/Registered Output
-- - Synplify should infer ram from the coding style
-- - The virtex distributed ram is 1bitx16
-- - Uses approximately 2 LUTs per bit wide
-------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;

entity dpram is
  generic(
    width : natural := 16);
  port(
    we : in std_logic;
    raddr, waddr : in std_logic_vector(3 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    datain : in std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of dpram is
  type memarray is array(15 downto 0) of std_logic_vector(width-1 downto 0);
  signal mem : memarray;
  signal data : std_logic_vector(width-1 downto 0);
begin
  data <= mem(conv_integer(raddr));

  process(clk,we,waddr)
  begin
    if (rising_edge (clk)) then
      dataout <= data;
      if (we = '1') then
        mem(conv_integer(waddr)) <= datain;
      end if;
    end if;
  end process;

end rtl;

**** exchange.vhd ****
----------------------------------
-- Scalar Mux Component
-- C = A when ABn = '1' else B
----------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity exchange is
  generic (

    A : in std_logic_vector(width-1 downto 0);
    B : in std_logic_vector(width-1 downto 0);
    C : out std_logic_vector(width-1 downto 0);
    ABn : in std_logic);
end;

architecture rtl of exchange is
begin
  C <= A when (ABn = '1') else B;
end rtl;

**** fifo3.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity fifo3 is
  generic (
    datawidth : natural := 18);
  port(
    datain : in std_logic_vector(datawidth-1 downto 0);
    writeen : in std_logic;
    dataout : out std_logic_vector(datawidth-1 downto 0);
    shiften : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of fifo3 is
  type stdlogicarray is array(0 to 2) of std_logic_vector(datawidth-1 downto 0);

  signal data : stdlogicarray;
  signal pos : std_logic_vector(1 downto 0);
begin
  dataout <= data(0);

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      pos <= "00";
      data(0) <= (others => '0');
      data(1) <= (others => '0');
      data(2) <= (others => '0');
    elsif rising_edge(clk) then
      if writeen = '1' and shiften = '1' then
        case (pos) is
          when "00" =>
            data(0) <= (others => '-');
            data(1) <= (others => '-');
            data(2) <= (others => '-');
          when "01" =>
            data(0) <= datain;
            data(1) <= (others => '-');
            data(2) <= (others => '-');
          when "10" =>
            data(0) <= data(1);
            data(1) <= datain;
            data(2) <= (others => '-');
          when "11" =>
            data(0) <= data(1);
            data(1) <= data(2);
            data(2) <= datain;
        end case;
      elsif shiften = '1' then
        data(0) <= data(1);
        data(1) <= data(2);
        pos <= pos-1;
      elsif writeen = '1' then
        case (pos) is
          when "00" => data(0) <= datain;
          when "01" => data(1) <= datain;
          when "10" => data(2) <= datain;
          when others =>
        end case;
        pos <= pos + 1;
      end if;
    end if;
  end process;

end rtl;

**** listbuffer.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;

entity listbuffer is
  generic(
    width : natural := 48;
    subdepth : natural := 3;
    totaldepth : natural := 5);
  port(
    peekdata : in std_logic_vector(width*(2**subdepth)-1 downto 0);
    commit : in std_logic;
    nextaddr : in std_logic;
    baseaddress : in std_logic_vector(totaldepth-subdepth-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of listbuffer is
  type state_type is (S_IDLE,S_WRITE);
  signal state : state_type;
  signal next_state : state_type;

  signal we : std_logic;
  signal address : std_logic_vector(totaldepth-1 downto 0);
  signal datain : std_logic_vector(width-1 downto 0);
begin
  ram : spram
    generic map(width,totaldepth)
    port map(we,address,dataout,datain,clk);

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      address <= (others => '0');
    elsif (rising_edge(clk)) then
      state <= next_state;
      case state is
        when S_IDLE =>
          if commit = '1' then
            address(totaldepth-1 downto subdepth) <= baseaddress;
            address(subdepth-1 downto 0) <= (others => '0');
          end if;
          if nextaddr = '1' then
            address(subdepth-1 downto 0) <= address(subdepth-1 downto 0) + 1;
          end if;
        when S_WRITE =>
          address(subdepth-1 downto 0) <= address(subdepth-1 downto 0) + 1;
        when others =>
      end case;
    end if;
  end process;

  process (state,commit,address,peekdata)
  begin
    we <= '0';
    datain <= (others => '-');
    case state is
      when S_IDLE =>
        if commit = '1' then
          next_state <= S_WRITE;
        else
          next_state <= S_IDLE;
        end if;
      when S_WRITE =>
        writelp : for k in 0 to (2**subdepth)-1 loop
          if k=address(subdepth-1 downto 0) then
            datain <= peekdata((k+1)*width-1 downto k*width);
          end if;
        end loop writelp;
        we <= '1';
        if address(subdepth-1 downto 0) = (2**subdepth)-1 then
          next_state <= S_IDLE;
        else
          next_state <= S_WRITE;
        end if;
    end case;
  end process;

end rtl;

**** listhandler.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

library work;
use work.complib.all;

entity listhandler is
  port(
    dataarrayin : in std_logic_vector(8*109-1 downto 0);
    commit : in std_logic;

    hitmask : in std_logic_vector(2 downto 0);
    ack : in std_logic;
    max0,max1,max2 : out std_logic_vector(31 downto 0);
    boundnodeID : out std_logic_vector(9 downto 0);
    level : out std_logic_vector(1 downto 0);
    empty,dataready : buffer std_logic;

    reset : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic;

    peekoffset0,peekoffset1,peekoffset2 : out std_logic_vector(2 downto 0);
    peekhit : out std_logic;
    peekstate : out std_logic_vector(1 downto 0) );
end;

architecture rtl of listhandler is
  type state_type is (S_IDLE,S_WRITE,S_ALIGN);
  signal next_state, state : state_type;

  signal readlevel, writelevel : std_logic_vector(1 downto 0);
  signal offset0, offset1, offset2 : std_logic_vector(2 downto 0);
  signal address : std_logic_vector(4 downto 0);
  signal we : std_logic;
  signal datain,dataout : std_logic_vector(109-1 downto 0);
  signal lvempty : std_logic_vector(2 downto 0);
  signal busy : std_logic;
begin
  -- Debug Stuff
  peekoffset0 <= offset0;
  peekoffset1 <= offset1;
  peekoffset2 <= offset2;
  peekhit <= '1' when (datain(108) = '1' or datain(107) = '1' or datain(106) = '1') else
             '0';

  process (state)
  begin
    case (state) is
      when S_IDLE => peekstate <= "01";
      when S_WRITE => peekstate <= "10";
      when S_ALIGN => peekstate <= "11";
      when others => peekstate <= "00";
    end case;
  end process;

  -- Real Code
  ram : spram
    generic map(109,5)
    port map(we, address,dataout,datain,clk);

  level <= readlevel;
  max0 <= dataout(41 downto 10) when dataout(106) = '1' else (others => '0');
  max1 <= dataout(73 downto 42) when dataout(107) = '1' else (others => '0');
  max2 <= dataout(105 downto 74) when dataout(108) = '1' else (others => '0');
  boundnodeID <= dataout(9 downto 0);

  empty <= '1' when (lvempty = "111" and busy = '0') else '0';
  dataready <= '1' when ((dataout(106) = '1' and hitmask(0) = '1') or
                         (dataout(107) = '1' and hitmask(1) = '1') or
                         (dataout(108) = '1' and hitmask(2) = '1')) and
                        (empty = '0') and (busy = '0') else '0';

  address(4 downto 3) <= readlevel;

  process (offset0,offset1,offset2,address)
  begin
    if address(4 downto 3) = "00" then
      address(2 downto 0) <= offset0;
    elsif address(4 downto 3) = "01" then
      address(2 downto 0) <= offset1;
    elsif address(4 downto 3) = "10" then
      address(2 downto 0) <= offset2;
    else
      address(2 downto 0) <= (others => '-');
    end if;
  end process;

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      lvempty <= (others => '1');
      busy <= '0';
      readlevel <= "00";
      writelevel <= "00";
      offset0 <= "000";
      offset1 <= "000";
      offset2 <= "000";
    elsif (rising_edge(clk)) then
      state <= next_state;
      case state is
        when S_IDLE =>
          if (reset = '1') then
            busy <= '0';
            lvempty <= (others => '1');
            readlevel <= "00"; writelevel <= "00";
            offset0 <= "000"; offset1 <= "000"; offset2 <= "000";
          elsif (commit = '1') then
            busy <= '1';
            if writelevel = "00" then
              offset0 <= "000";
            elsif writelevel = "01" then
              offset1 <= "000";
            elsif writelevel = "10" then
              offset2 <= "000";
            end if;
            readlevel <= writelevel;
          elsif (ack = '1') then
            writelevel <= readlevel+1;
            busy <= '1'; -- This will ensure that align skips one
          end if;
        when S_WRITE =>
          if readlevel = "00" then
            offset0 <= offset0 + 1;
          elsif readlevel = "01" then
            offset1 <= offset1 + 1;
          elsif readlevel = "10" then
            offset2 <= offset2 + 1;
          end if;
          if address(2 downto 0) = "111" then
            busy <= '0';
          end if;
          if datain(108) = '1' or datain(107) = '1' or datain(106) = '1' then
            if readlevel = "00" then
              lvempty(0) <= '0';
            elsif readlevel = "01" then
              lvempty(1) <= '0';
            elsif readlevel = "10" then
              lvempty(2) <= '0';
            end if;
          end if;
        when S_ALIGN =>
          busy <= '0';
          if empty = '0' and dataready = '0' then
            if readlevel = "00" then
              if offset0 = "111" then
                lvempty(0) <= '1';
              else
                offset0 <= offset0 + 1;
              end if;
            elsif readlevel = "01" then
              if offset1 = "111" then
                lvempty(1) <= '1';
                readlevel <= "00";
              else
                offset1 <= offset1 + 1;
              end if;
            elsif readlevel = "10" then
              if offset2 = "111" then
                lvempty(2) <= '1';
                if lvempty(1) = '1' then
                  readlevel <= "00";
                else
                  readlevel <= "01";
                end if;
              else
                offset2 <= offset2 + 1;
              end if;
            end if;
          end if;
      end case;
    end if;
  end process;

  process (state,commit,ack,address,dataarrayin,reset,dataready,empty)
  begin
    we <= '0';
    datain <= (others => '-');
    case state is
      when S_IDLE =>
        if reset = '1' then
          next_state <= S_IDLE;
        elsif commit = '1' then
          next_state <= S_WRITE;
        elsif (ack = '1') or (dataready = '0' and empty = '0') then
          next_state <= S_ALIGN;
        else
          next_state <= S_IDLE;
        end if;
      when S_WRITE =>
        writelp : for k in 0 to 7 loop
          if k=address(2 downto 0) then
            datain <= dataarrayin((k+1)*109-1 downto k*109);
          end if;
        end loop writelp;
        we <= '1';
        if address(2 downto 0) = "111" then
          next_state <= S_ALIGN;
        else
          next_state <= S_WRITE;
        end if;
      when S_ALIGN =>
        if empty = '0' and dataready = '0' then
          next_state <= S_ALIGN;
        else
          next_state <= S_IDLE;
        end if;
    end case;
  end process;

end rtl;

**** memoryinterface.vhd ****
------------------------------------------
-- Triangle Memory Controller Component
--
-- There are 2 nibble bus signals that
-- allow the component to download
-- memory contents from the Sun. First
-- the starting address is written,
-- then data is written in 64bit
-- chunks. The address is auto inc'd
--
-- The dataout and datavalid signals
-- contain the triangle data
--
-- wanttriID is high to request a new
-- triangle ID for 2nd cycle. A high
-- triIDvalid signal indicates
-- that the user has applied that
-- signal to the triID port.
--
-- cyclenum is a control signal that
-- counts from 0-2. This signal
-- determines the ray to be sent to
-- the ray tri unit as well as which
-- nearest compare unit to use
------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity memoryinterface is
  port(
    want_addr : out std_logic;
    addr_ready : in std_logic;
    addrin : in std_logic_vector(17 downto 0);
    want_data : out std_logic;
    data_ready : in std_logic;
    datain : in std_logic_vector(63 downto 0);

    dataout : out std_logic_vector(191 downto 0);
    triIDout : out std_logic_vector(15 downto 0);
    datavalid : out std_logic;

    triIDvalid : in std_logic;
    triID : in std_logic_vector(15 downto 0);
    wanttriID : out std_logic;
    cyclenum : out std_logic_vector(1 downto 0);

    tm3_sram_data : inout std_logic_vector(63 downto 0);
    tm3_sram_addr : out std_logic_vector(18 downto 0);
    tm3_sram_we : out std_logic_vector(7 downto 0);
    tm3_sram_oe : out std_logic_vector(1 downto 0);
    tm3_sram_adsp : out std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of memoryinterface is
  type state_type is (S_READ1,S_READ2,S_READ3,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE);
  signal state : state_type;
  signal next_state : state_type;

  signal address,oldaddress : std_logic_vector(15 downto 0);
  signal waddress : std_logic_Vector(17 downto 0);
  signal databuff : std_logic_vector(127 downto 0);
  signal addrvalid, oldaddrvalid : std_logic;
begin

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_READ1;
      addrvalid <= '0';
      oldaddrvalid <= '0';
      address <= (others => '0');
      waddress <= (others => '0');
      databuff <= (others => '0');
      dataout <= (others => '0');
      triIDout <= (others => '0');
      oldaddress <= (others => '0');
      datavalid <= '0';
      wanttriID <= '0';
    elsif (rising_edge (clk)) then
      state <= next_state;
      wanttriID <= '0';
      case (state) is
        when S_READ1 =>
          if (addr_ready = '1') then
            waddress <= addrin;
          end if;
          databuff(63 downto 0) <= tm3_sram_data;
        when S_READ2 =>
          databuff(127 downto 64) <= tm3_sram_data;
          oldaddrvalid <= addrvalid;
          oldaddress <= address;
          if (triIDvalid = '1') then
            addrvalid <= '1';
            address <= triID;
          else
            addrvalid <= '0';
          end if;
          wanttriID <= '1';
        when S_READ3 =>
          dataout <= tm3_sram_data & databuff;
          datavalid <= oldaddrvalid;
          triIDout <= oldaddress;
        when S_WRITE2 =>
          if (data_ready = '1') then
            waddress <= waddress+1;
          end if;
        when S_WRITEDONE =>
          addrvalid <= '0';
        when others =>
      end case;
    end if;
  end process;

  process (state,address,addr_ready,data_ready,waddress,datain)
  begin
    tm3_sram_we <= "11111111";
    tm3_sram_oe <= "11";
    tm3_sram_adsp <= '1';
    tm3_sram_data <= (others => 'Z');
    tm3_sram_addr <= (others => '-');
    cyclenum <= (others => '-');
    want_addr <= '1';
    want_data <= '0';
    case (state) is
      when S_READ1 =>
        tm3_sram_addr <= '0' & address & "01";
        tm3_sram_adsp <= '0';
        tm3_sram_oe <= "01";
        cyclenum <= "00";
        if (addr_ready = '1') then
          next_state <= S_WRITE1;
        else
          next_state <= S_READ2;
        end if;
      when S_READ2 =>
        tm3_sram_addr <= '0' & address & "10";
        tm3_sram_adsp <= '0';
        tm3_sram_oe <= "01";
        cyclenum <= "01";
        next_state <= S_READ3;
      when S_READ3 =>
        tm3_sram_addr <= '0' & address & "00";
        tm3_sram_adsp <= '0';
        tm3_sram_oe <= "01";
        cyclenum <= "10";
        next_state <= S_READ1;
      when S_WRITE1 =>
        want_addr <= '0';
        want_data <= '1';
        if (addr_ready = '1') then
          next_state <= S_WRITE1;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITE2 =>
        want_data <= '1';
        tm3_sram_addr <= '0' & waddress;
        tm3_sram_data <= datain;
        if (addr_ready = '1') then
          next_state <= S_WRITEDONE;
        elsif (data_ready = '1') then
          tm3_sram_we <= "00000000";
          tm3_sram_adsp <= '0';
          next_state <= S_WRITE3;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITE3 =>
        if (data_ready = '1') then
          next_state <= S_WRITE3;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITEDONE =>
        want_addr <= '0';
        if (addr_ready = '1') then
          next_state <= S_WRITEDONE;
        else
          next_state <= S_READ1;
        end if;
    end case;
  end process;

end rtl;

**** nearcmp.vhd ****
--------------------------------------------
-- Nearest Triangle Hit Compare Component
--
-- This unit keeps track of the closest
-- triangle that has currently been hit
-- This unit also tracks the farthest
-- hit distance, but not the triID
--
-- tin,uin,vin,triIDin,hit are inputs
-- t,u,v,triID,anyhit are outputs
-- enable must be high for compare
-- reset will allow a new hit to be
-- found during the reset cycle
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nearcmp is
  port(
    tin : in std_logic_vector(31 downto 0);
    uin,vin : in std_logic_vector(15 downto 0);
    triIDin : in std_logic_vector(15 downto 0);
    hit : in std_logic;

    t : buffer std_logic_vector(31 downto 0);
    tfar : buffer std_logic_vector(31 downto 0);
    u,v : out std_logic_vector(15 downto 0);
    triID : out std_logic_vector(15 downto 0);
    anyhit : out std_logic;

    maxdist : in std_logic_vector(31 downto 0);
    enable : in std_logic;
    reset : in std_logic;

    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of nearcmp is
  type nc_state_type is (S_RESET,S_EXISTS);

  signal state,next_state : nc_state_type;
  signal latchnear, latchfar : std_logic;
begin
  anyhit <= '1' when (state = S_EXISTS) else '0';

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_RESET;
      t <= (others => '0');
      tfar <= (others => '1');
      u <= (others => '0');
      v <= (others => '0');
      triID <= (others => '0');
    elsif (rising_edge(clk)) then
      state <= next_state;
      if latchfar = '1' then
        tfar <= tin;
      end if;
      if latchnear = '1' then
        t <= tin;
        u <= uin;
        v <= vin;
        triID <= triIDin;
      end if;
    end if;
  end process;

  process (state,tin,t,enable,hit,reset,maxdist, tfar)
  begin
    latchnear <= '0';
    latchfar <= '0';
    case state IS
      when S_RESET =>
        if (enable = '1') and (hit = '1') and (tin < maxdist) then
          next_state <= S_EXISTS;
          latchnear <= '1';
          latchfar <= '1';
        else
          next_state <= S_RESET;
        end if;
      when S_EXISTS =>
        if (reset = '1') then
          if (enable = '1') and (hit = '1') and (tin < maxdist) then
            latchfar <= '1';
            latchnear <= '1';
            next_state <= S_EXISTS;
          else
            next_state <= S_RESET;
          end if;
        else
          if (enable = '1') and (hit = '1') and (tin < maxdist) then
            if (tin >= tfar) then
              latchfar <= '1';
            end if;
            if (tin < t) then
              latchnear <= '1';
            end if;
          end if;
          next_state <= S_EXISTS;
        end if;
    end case;
  end process;

end rtl;
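
The rule that nearcmp uses to decide when a new hit replaces the stored nearest hit can be written as a pure function. The sketch below is not part of the original design: the package nearcmp_sketch, the function take_near and the have_hit argument (standing in for the S_EXISTS state) are hypothetical names. It uses numeric_std with explicit unsigned comparisons in place of std_logic_unsigned, ignores the reset input, and needs VHDL-93 or later for the hexadecimal literals.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package nearcmp_sketch is
  subtype dist_t is std_logic_vector(31 downto 0);

  -- True when the candidate distance tin should be latched as the
  -- new nearest hit. have_hit = '1' models the S_EXISTS state.
  function take_near(tin, t, maxdist : dist_t;
                     hit, enable, have_hit : std_logic) return boolean;
end package nearcmp_sketch;

package body nearcmp_sketch is
  function take_near(tin, t, maxdist : dist_t;
                     hit, enable, have_hit : std_logic) return boolean is
    variable valid : boolean;
  begin
    -- A candidate only counts if the unit is enabled, the ray hit the
    -- triangle, and the hit lies inside the maximum distance.
    valid := (enable = '1') and (hit = '1')
             and (unsigned(tin) < unsigned(maxdist));
    if have_hit = '0' then
      return valid;                      -- first hit is always taken
    else
      return valid and (unsigned(tin) < unsigned(t));
    end if;
  end function;
end package body nearcmp_sketch;

library ieee;
use ieee.std_logic_1164.all;
use work.nearcmp_sketch.all;

entity nearcmp_sketch_tb is
end entity nearcmp_sketch_tb;

architecture sim of nearcmp_sketch_tb is
  constant MAXD : dist_t := x"00000100";
begin
  process
  begin
    -- No stored hit yet: any valid candidate is accepted.
    assert take_near(x"00000010", x"00000000", MAXD, '1', '1', '0')
      report "first hit should be taken" severity error;
    -- Stored hit at 0x10: a farther candidate is rejected.
    assert not take_near(x"00000020", x"00000010", MAXD, '1', '1', '1')
      report "farther hit should be rejected" severity error;
    -- Stored hit at 0x10: a nearer candidate replaces it.
    assert take_near(x"00000008", x"00000010", MAXD, '1', '1', '1')
      report "nearer hit should be taken" severity error;
    -- Candidates at or beyond maxdist are never accepted.
    assert not take_near(x"00000200", x"00000000", MAXD, '1', '1', '0')
      report "hit beyond maxdist should be rejected" severity error;
    wait;
  end process;
end architecture sim;
```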

**** nearcmpspec.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nearcmpspec is
  port(
    tin : in std_logic_vector(31 downto 0);
    uin,vin : in std_logic_vector(15 downto 0);
    triIDin : in std_logic_vector(15 downto 0);
    hit : in std_logic;

    t : buffer std_logic_vector(31 downto 0);
    tfar : buffer std_logic_vector(31 downto 0);
    u,v : out std_logic_vector(15 downto 0);
    triID : out std_logic_vector(15 downto 0);
    anyhit : out std_logic;

    maxdist : in std_logic_vector(31 downto 0);
    enable : in std_logic;
    enablenear : in std_logic;
    reset : in std_logic;

    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of nearcmpspec is
  type nc_state_type is (S_RESET,S_NOHIT,S_EXISTS);

  signal state,next_state : nc_state_type;
  signal latchnear, latchfar : std_logic;
begin
  anyhit <= '1' when (state = S_EXISTS) else '0';

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_RESET;
      t <= (others => '0');
      tfar <= (others => '1');
      u <= (others => '0');
      v <= (others => '0');
      triID <= (others => '0');
    elsif (rising_edge(clk)) then
      state <= next_state;
      if latchfar = '1' then
        tfar <= tin;
      end if;
      if latchnear = '1' then
        t <= tin;
        u <= uin;
        v <= vin;
        triID <= triIDin;
      end if;
    end if;
  end process;

  process (state,tin,t,enable,enablenear,hit,reset,maxdist, tfar)
  begin
    latchnear <= '0';
    latchfar <= '0';
    case state IS
      when S_RESET =>
        if (enable = '1') and (hit = '1') and (tin < maxdist) then
          next_state <= S_EXISTS;
          latchnear <= '1';
          latchfar <= '1';
        elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) then
          latchnear <= '1';
          next_state <= S_NOHIT;
        else
          next_state <= S_RESET;
        end if;
      when S_NOHIT =>
        if (reset = '1') then
          if (enable = '1') and (hit = '1') and (tin < maxdist) then
            latchfar <= '1';
            latchnear <= '1';
            next_state <= S_EXISTS;
          elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) then
            latchfar <= '1';
            latchnear <= '1';
            next_state <= S_NOHIT;
          else
            next_state <= S_RESET;
          end if;
        elsif (enable = '1') and (hit = '1') and (tin < maxdist) then
          latchfar <= '1';
          if (tin < t) then
            latchnear <= '1';
          end if;
          next_state <= S_EXISTS;
        else
          next_state <= S_NOHIT;
        end if;
      when S_EXISTS =>
        if (reset = '1') then
          if (enable = '1') and (hit = '1') and (tin < maxdist) then
            latchfar <= '1';
            latchnear <= '1';
            next_state <= S_EXISTS;
          elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) then
            latchfar <= '1';
            latchnear <= '1';
            next_state <= S_NOHIT;
          else
            next_state <= S_RESET;
          end if;
        else
          if (enable = '1') and (hit = '1') and (tin < maxdist) then
            if (tin >= tfar) then
              latchfar <= '1';
            end if;
            if (tin < t) then
              latchnear <= '1';
            end if;
          elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) then
            if (tin <= t) then
              latchnear <= '1';
            end if;
          end if;
          next_state <= S_EXISTS;
        end if;
    end case;
  end process;

end rtl;


**** onlyonecycle.vhd ****
-- A debugging circuit that allows a single cycle pulse to be
-- generated through the ports package

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity onlyonecycle is
  generic(
    pulselength : natural := 1);
  port(
    trigger : in std_logic;
    output : out std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of onlyonecycle is
  type state_type is (S_IDLE,S_TRIGGERED,S_WAIT);

  signal state : state_type;
  signal next_state : state_type;
  signal count : integer range 0 to pulselength-1;
begin
  Process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      count <= 0;
    elsif (rising_edge(clk)) then
      state <= next_state;
      case state is
        when S_IDLE =>
          count <= pulselength-1;
        when S_TRIGGERED =>
          count <= count-1;
        when others =>
      end case;
    end if;
  end process;

  process (state, trigger,count)
  Begin
    output <= '0';
    case state IS
      when S_IDLE =>
        if trigger = '1' then
          next_state <= S_TRIGGERED;
        else
          next_state <= S_IDLE;
        end if;
      when S_TRIGGERED =>
        output <= '1';
        if count = 0 then
          next_state <= S_WAIT;
        else
          next_state <= S_TRIGGERED;
        end if;
      when S_WAIT =>
        if trigger = '0' then
          next_state <= S_IDLE;
        else
          next_state <= S_WAIT;
        end if;
    end case;
  end process;

end rtl;

**** raybuffer.vhd ****
--------------------------------------------------
-- Ray Buffer, Output Selection & Bus Interface
--
-- Writes are enabled through the bus
-- WE   Function
-- 000  Idle
-- 001  origx <= raydata 27..0
-- 010  origy <= raydata 27..0
-- 011  origz <= raydata 27..0
-- 100  dirx <= raydata 15..0
--      diry <= raydata 31..16
-- 101  dirz <= raydata 15..0
--      swap <= raydata 16
-- 110  maxbuff[rayaddr] <= raydata 31..0
-- 111  activeraygroup <= rayaddr 1..0
--      enablenear <= raydata 0
--
-- subraynum is not latched
-- The output ray data is latched
--------------------------------------------------



library ieee;
use ieee.std_logic_1164.all;

library work;
use work.complib.all;

entity raybuffer is
  port(
    origx, origy, origz : out std_logic_vector(27 downto 0);
    dirx, diry, dirz : out std_logic_vector(15 downto 0);
    maxdist : out std_logic_vector(31 downto 0);
    raygroupID : out std_logic_vector(1 downto 0);
    swap : out std_logic;
    resetout : out std_logic;
    enablenear : out std_logic;

    raydata : in std_logic_vector(31 downto 0);
    rayaddr : in std_logic_vector(3 downto 0);
    raywe : in std_logic_vector(2 downto 0); -- May need to be expanded

    subraynum : in std_logic_vector(1 downto 0);
    clk : in std_logic);
end;

architecture rtl of raybuffer is
  signal origxwe, origywe, origzwe : std_logic;
  signal dirxwe, dirywe, dirzwe : std_logic;
  signal swapwe,raygroupwe : std_logic;
  signal maxwe : std_logic;

  signal raddr : std_logic_vector(3 downto 0);
  signal activeraygroup : std_logic_vector(1 downto 0);
  signal swapvect : std_logic_vector(0 downto 0);
  signal resetl : std_logic;
  signal maxdist0,maxdist1,maxdist2 : std_logic_vector(31 downto 0);
  signal raygroupIDl : std_logic_vector(1 downto 0);
  signal maxbuf0,maxbuf1,maxbuf2 : std_logic_vector(31 downto 0);
  signal enablenearl : std_logic;
begin
  -- Ray output address logic
  raddr <= activeraygroup & subraynum;
  process (clk)
  begin
    if (rising_edge (clk)) then
      resetl <= raygroupwe;
      resetout <= resetl;
      raygroupID <= raygroupIDl;
      enablenear <= enablenearl;
      if subraynum = "00" then
        maxdist <= maxdist0;
      elsif subraynum = "01" then
        maxdist <= maxdist1;
      elsif subraynum = "10" then
        maxdist <= maxdist2;
      end if;

      if (raygroupwe = '1') then
        activeraygroup <= rayaddr(1 downto 0);
        maxdist0 <= maxbuf0;
        maxdist1 <= maxbuf1;
        maxdist2 <= maxbuf2;
        enablenearl <= raydata(0);
        raygroupIDl <= rayaddr(3 downto 2);
      end if;
      if (maxwe = '1') then
        if rayaddr(1 downto 0) = "00" then
          maxbuf0 <= raydata;
        elsif rayaddr(1 downto 0) = "01" then
          maxbuf1 <= raydata;
        elsif rayaddr(1 downto 0) = "10" then
          maxbuf2 <= raydata;
        end if;
      end if;
    end if;
  end process;

  -- Decode the write enable signals
  origxwe <= '1' when (raywe = "001") else '0';
  origywe <= '1' when (raywe = "010") else '0';
  origzwe <= '1' when (raywe = "011") else '0';
  dirxwe <= '1' when (raywe = "100") else '0';
  dirywe <= '1' when (raywe = "100") else '0';
  dirzwe <= '1' when (raywe = "101") else '0';
  swapwe <= '1' when (raywe = "101") else '0';
  maxwe <= '1' when (raywe = "110") else '0';
  raygroupwe <= '1' when (raywe = "111") else '0';

  -- Instantiate all the required ram elements
  origxram : dpram
    generic map (28)
    port map (origxwe, raddr, rayaddr, origx, raydata(27 downto 0), clk);

  origyram : dpram
    generic map (28)
    port map (origywe, raddr, rayaddr, origy, raydata(27 downto 0), clk);

  origzram : dpram
    generic map (28)
    port map (origzwe, raddr, rayaddr, origz, raydata(27 downto 0), clk);

  dirxram : dpram
    generic map (16)
    port map (dirxwe, raddr, rayaddr, dirx, raydata(15 downto 0), clk);

  diryram : dpram
    generic map (16)
    port map (dirywe, raddr, rayaddr, diry, raydata(31 downto 16), clk);

  dirzram : dpram
    generic map (16)
    port map (dirzwe, raddr, rayaddr, dirz, raydata(15 downto 0), clk);

  swapram : dpram
    generic map (1)
    port map (swapwe, raddr, rayaddr, swapvect, raydata(16 downto 16), clk);

  swap <= swapvect(0);
end rtl;
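
The write enable codes listed in the raybuffer header comment appear as bare literals in raybuffer, rayinterface and raysend. One possible way to keep them consistent is a shared package of named constants. The sketch below is not part of the original design: the package name raywe_codes and all constant names are hypothetical, and only the bit patterns are taken from the header comment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package raywe_codes is
  subtype raywe_t is std_logic_vector(2 downto 0);

  constant WE_IDLE     : raywe_t := "000";  -- no write
  constant WE_ORIGX    : raywe_t := "001";  -- origx <= raydata 27..0
  constant WE_ORIGY    : raywe_t := "010";  -- origy <= raydata 27..0
  constant WE_ORIGZ    : raywe_t := "011";  -- origz <= raydata 27..0
  constant WE_DIRXY    : raywe_t := "100";  -- dirx <= 15..0, diry <= 31..16
  constant WE_DIRZSWAP : raywe_t := "101";  -- dirz <= 15..0, swap <= bit 16
  constant WE_MAXDIST  : raywe_t := "110";  -- maxbuff[rayaddr] <= raydata
  constant WE_RAYGROUP : raywe_t := "111";  -- activeraygroup, enablenear
end package raywe_codes;
```

With such a package in scope, a decode line such as origxwe <= '1' when (raywe = "001") else '0'; could instead compare against WE_ORIGX.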

**** raygencont.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity raygencont is
  generic(
    id : std_logic);
  port(
    go : in std_logic;
    initcount : in std_logic_vector(14 downto 0);
    busyout : out std_logic;
    cycles : buffer std_logic_vector(30 downto 0);
    nextaddr : out std_logic_vector(17 downto 0);
    nas : out std_logic;

    -- Memory Controller Interface
    dirReady : in std_logic;
    wantDir : out std_logic;
    dirIn : in std_logic_vector(47 downto 0);
    addrIn : in std_logic_vector(15 downto 0);

    -- RayInterface Interface
    as : out std_logic;
    addr : buffer std_logic_vector(3 downto 0);
    ack : in std_logic;
    dir : out std_logic_vector(47 downto 0);

    -- Bound Controller Interface
    raygroup : buffer std_logic_vector(1 downto 0);
    raygroupvalid : out std_logic;
    busy : in std_logic;

    globalreset : in std_logic;
    clk : in std_logic;
    statepeek : out std_logic_vector(2 downto 0));
end;

architecture rtl of raygencont is
  type state_type is (S_IDLE,S_SENDSET,S_WAITSENT,S_ENABLEBOUND);
  signal state : state_type;
  signal next_state : state_type;
  signal groupID : std_logic;
  signal count : std_logic_vector(14 downto 0);
  signal first : std_logic;
  signal destaddr : std_logic_vector(17 downto 0);
begin
  process(state)
  begin
    case (state) is
      when S_IDLE => statepeek <= "001";
      when S_SENDSET => statepeek <= "010";
      when S_WAITSENT => statepeek <= "011";
      when S_ENABLEBOUND => statepeek <= "100";
      when others => statepeek <= "000";
    end case;
  end process;

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      cycles <= (others => '0');
      dir <= (others => '0');
      addr(1 downto 0) <= "00";
      groupID <= '0';
      count <= (others => '0');
      first <= '0';
      destAddr <= (others => '0');
      raygroupvalid <= '0';
    elsif (rising_edge (clk)) then
      state <= next_state;
      if (state /= S_IDLE) then
        cycles <= cycles + 1;
      end if;
      case (state) is
        when S_IDLE =>
          if go = '1' then
            cycles <= (others => '0');
          end if;
          addr(1 downto 0) <= "00";
          groupID <= '0';
          count <= initcount;
        when S_SENDSET =>
          dir <= dirIn;
        when S_WAITSENT =>
          if (ack = '1') and (addr(1 downto 0) /= "10") then
            addr(1 downto 0) <= addr(1 downto 0) + "01";
          end if;
          if (ack = '1') and addr(1 downto 0) = "10" and busy = '0' then
            raygroupvalid <= '1';
          end if;
        when S_ENABLEBOUND =>
          if busy = '1' then
            groupID <= not groupID;
            raygroupvalid <= '0';
            count <= count - 1;
          end if;
          addr(1 downto 0) <= "00";
      end case;
    end if;
  end process;

  addr(3 downto 2) <= raygroup;
  busyout <= '0' when state = S_IDLE else '1';
  raygroup <= id & groupID;
  nextaddr <= "11" & addrIn;
  nas <= '1' when (state = S_SENDSET and addr(1 downto 0) = "00" and dirReady = '1') else
         '0';

  process (state,go,ack,busy,dirReady,addr,count)
  begin
    as <= '0';
    wantDir <= '0';
    case (state) is
      when S_IDLE =>
        if (go = '1') then
          next_state <= S_SENDSET;
        else
          next_state <= S_IDLE;
        end if;
      when S_SENDSET =>
        as <= dirReady;
        wantdir <= '1';
        if dirReady = '1' then
          next_state <= S_WAITSENT;
        else
          next_State <= S_SENDSET;
        end if;
      when S_WAITSENT =>
        wantdir <= '0';
        as <= '1';
        if (ack = '1') and (addr(1 downto 0) /= "10") then
          next_state <= S_SENDSET;
        elsif (ack = '1') and (busy = '0') then
          next_state <= S_ENABLEBOUND;
        else
          next_state <= S_WAITSENT;
        end if;
      when S_ENABLEBOUND =>
        if busy = '0' then
          next_state <= S_ENABLEBOUND;
        elsif count > 0 then
          next_state <= S_SENDSET;
        else
          next_state <= S_IDLE;
        end if;
    end case;
  end process;

end rtl;

**** raygentop.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;

entity raygentop is
  port(
    -- Ports Package Signals
    rgwant_addr : out std_logic;
    rgwant_data : out std_logic;
    rgread_ready : out std_logic;
    rgaddr_ready : in std_logic;
    rgdata_ready : in std_logic;
    rgwant_read : in std_logic;
    rgdatain : in std_logic_vector(63 downto 0);
    rgdataout : out std_logic_vector(63 downto 0);
    rgaddrin : in std_logic_vector(17 downto 0);
    origx : in std_logic_vector(27 downto 0);
    origy : in std_logic_vector(27 downto 0);
    origz : in std_logic_vector(27 downto 0);
    rgcont : in std_logic_vector(31 downto 0);
    rgstat : out std_logic_vector(31 downto 0);
    -- Memory Signals
    tm3_sram_data : inout std_logic_vector(63 downto 0);
    tm3_sram_addr : out std_logic_vector(18 downto 0);
    tm3_sram_we : out std_logic_vector(7 downto 0);
    tm3_sram_oe : out std_logic_vector(1 downto 0);
    tm3_sram_adsp : out std_logic;
    tm3_clk_v0 : in std_logic;

    -- Interchip signals
    raygroup01 : out std_logic_vector(1 downto 0);
    raygroupvalid01 : out std_logic;
    busy01 : in std_logic;
    raygroup10 : out std_logic_vector(1 downto 0);
    raygroupvalid10 : out std_logic;
    busy10 : in std_logic;

    globalreset : in std_logic;

    rgData : out std_logic_vector(31 downto 0);
    rgAddr : out std_logic_vector(3 downto 0);
    rgWE : out std_logic_vector(2 downto 0);
    rgAddrValid : out std_logic;
    rgDone : in std_logic;

    rgResultData : in std_logic_vector(31 downto 0);
    rgResultReady : in std_logic;
    rgResultSource : in std_logic_vector(1 downto 0));
end;

architecture rtl of raygentop is

  signal statepeek,statepeek2 : std_logic_vector(2 downto 0);
  signal as01,as10,ack01,ack10 : std_logic;
  signal addr01, addr10 : std_logic_vector( 3 downto 0);
  signal dir01,dir10,dir : std_logic_vector(47 downto 0);
  signal dirReady01, dirReady10, wantDir01, wantDir10 : std_logic;
  signal address : std_logic_vector(15 downto 0);
  signal cyclecounter : std_logic_vector(30 downto 0);
  signal nas01,nas10 : std_logic;
  signal go : std_logic;
  signal statepeekct : std_logic_vector(2 downto 0);
  -- result Signals
  signal valid01,valid10 : std_logic;
  signal id01a,id01b,id01c : std_logic_vector(15 downto 0);
  signal id10a,id10b,id10c : std_logic_vector(15 downto 0);
  signal hit01a,hit01b,hit01c : std_logic;
  signal hit10a,hit10b,hit10c : std_logic;
  signal wantwriteback, writebackack : std_logic;
  signal writebackdata : std_logic_vector(63 downto 0);
  signal writebackaddr : std_logic_vector(17 downto 0);
  signal nextaddr01,nextaddr10 : std_logic_vector(17 downto 0);
begin

  onlyeonecycleinst : onlyonecycle
    port map(rgCont(0),go,globalreset,tm3_clk_v0);

  sramcont : RGsramcontroller
    port map(rgwant_addr,rgaddr_ready,rgaddrin,rgwant_data,rgdata_ready,rgdatain,
             rgwant_read,rgread_ready,rgdataout,dirReady01,dirReady10,
             wantDir01,wantDir10,dir,address,wantwriteback,writebackack,
             writebackdata,writebackaddr,tm3_sram_data,tm3_sram_addr,
             tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,globalreset, tm3_clk_v0,
             statepeek);

  raysendinst : raysend
    port map(as01,as10,ack01,ack10,addr01,addr10,dir01,dir10,origx,origy,origz,
             rgData,rgAddr, rgWE,rgAddrValid, rgDone, globalreset,tm3_clk_v0,
             statepeek2);

  raygencontinst : raygencont
    generic map('1')
    port map(go, rgCont(15 downto 1),rgStat(31), cyclecounter, nextaddr01, nas01,
             dirReady01, wantDir01, dir, address, as01,addr01,ack01,dir01,
             raygroup01,raygroupvalid01,busy01, globalreset,tm3_clk_v0,
             statepeekct);

  resultrecieveinst : resultrecieve
    port map(valid01,valid10,id01a,id01b,id01c,id10a,id10b,id10c,
             hit01a,hit01b,hit01c,hit10a,hit10b,hit10c,rgResultData,
             rgResultReady,rgResultSource, globalreset,tm3_clk_v0);

  resultwriteinst : resultwriter
    port map(valid01,valid10,id01a,id01b,id01c,id10a,id10b,id10c,
             hit01a,hit01b,hit01c,hit10a,hit10b,hit10c,nextaddr01,nextaddr10,
             nas01,nas10,writebackdata,writebackaddr,wantwriteback,
             writebackack,globalreset,tm3_clk_v0);

  rgStat(30 downto 0) <= cyclecounter;

  as10 <= '0';
  nas10 <= '0';
  raygroupvalid10 <= '0';
  wantdir10 <= '0';

end rtl;

**** rayinterface.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity rayinterface is
  port(
    max : in std_logic_vector(31 downto 0);
    maxwe : in std_logic;
    raygroup : in std_logic_vector(1 downto 0);
    raygroupwe : in std_logic;
    raygroupid : in std_logic_vector(1 downto 0);
    enablenear : in std_logic;

    -- Interchip Bus Signals (Ray Generation Chip)
    rgData : in std_logic_vector(31 downto 0);
    rgAddr : in std_logic_vector(3 downto 0);
    rgWE : in std_logic_vector(2 downto 0);
    rgAddrValid : in std_logic;
    rgDone : buffer std_logic;

    -- Interchip Bus Signals (Ray Tri Chip)
    raydata : out std_logic_vector(31 downto 0);
    rayaddr : out std_logic_vector(3 downto 0);
    raywe : out std_logic_vector(2 downto 0);

    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of rayinterface is
begin

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      raydata <= (others => '0');
      rayaddr <= (others => '0');
      raywe <= (others => '0');
      rgDone <= '0';
    elsif (rising_edge(clk)) then
      raywe <= (others => '0');
      if rgAddrValid = '0' then
        rgDone <= '0';
      end if;
      if raygroupwe = '1' then
        raydata(0) <= enablenear;
        raydata(31 downto 1) <= (others => '0');
        raywe <= "111";
        rayaddr <= raygroupid & raygroup;
      elsif maxwe = '1' then
        raydata <= max;
        raywe <= "110";
        rayaddr <= "00" & raygroupid;
      elsif rgAddrValid = '1' and rgDone = '0' then
        raydata <= rgData;
        raywe <= rgWe;
        rayaddr <= rgAddr;
        rgDone <= '1';
      end if;
    end if;
  end process;

end rtl;

**** raysend.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity raysend is
  port(
    as01,as10 : in std_logic;
    ack01,ack10 : buffer std_logic;
    addr01, addr10 : in std_logic_vector(3 downto 0);
    dir01, dir10 : in std_logic_vector(47 downto 0);
    origx,origy,origz : in std_logic_vector(27 downto 0);

    rgData : out std_logic_vector(31 downto 0);
    rgAddr : out std_logic_vector(3 downto 0);
    rgWE : out std_logic_vector(2 downto 0);
    rgAddrValid : out std_logic;
    rgDone : in std_logic;

    globalreset : in std_logic;
    clk : in std_logic;
    statepeek : out std_logic_vector(2 downto 0));
end;

architecture rtl of raysend is
  type state_type is (S_IDLE,S_ORIGX,S_ORIGY,S_ORIGZ,S_DIRXY,S_DIRZ,
                      S_ORIGXWAIT,S_ORIGYWAIT,S_ORIGZWAIT,S_DIRXYWAIT);
  signal state : state_type;
  signal next_state : state_type;

  signal unitselect : std_logic;
  signal dir : std_logic_vector(47 downto 0);
begin
  process(state)
  begin
    case state is
      when S_IDLE => statepeek <= "001";
      when S_ORIGX => statepeek <= "010";
      when S_ORIGY => statepeek <= "011";
      when S_ORIGZ => statepeek <= "100";
      when S_DIRXY => statepeek <= "101";
      when S_DIRZ => statepeek <= "110";
      when others => statepeek <= "000";
    end case;
  end process;

  dir <= dir01 when unitselect = '1' else dir10;

  process (clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      ack01 <= '0';
      ack10 <= '0';
      unitselect <= '1';
      rgWe <= "000";
      rgData <= (others => '0');
      rgAddrValid <= '0';
      rgAddr <= (others => '0');
    elsif (rising_edge (clk)) then
      state <= next_state;
      case (state) is
        when S_IDLE =>
          if ((as01 = '1') and (ack01 = '0')) or
             ((as10 = '1') and (ack10 = '0')) then
            rgData <= "0000" & origx;
            rgWe <= "001";
            rgAddrValid <= '1';
          end if;
          if (as01 = '1') and (ack01 = '0') then
            rgAddr <= addr01;
            unitselect <= '1';
          else
            rgAddr <= addr10;
            unitselect <= '0';
          end if;
          if (as01 = '0' and ack01 = '1') then
            ack01 <= '0';
          end if;
          if (as10 = '0' and ack10 = '1') then
            ack10 <= '0';
          end if;
        when S_ORIGX =>
          if rgDONE = '1' then
            rgAddrValid <= '0';
          end if;
        when S_ORIGXWAIT =>
          rgData <= "0000" & origy;
          rgWe <= "010";
          rgAddrValid <= '1';
        when S_ORIGY =>
          if rgDONE = '1' then
            rgAddrValid <= '0';
          end if;
        when S_ORIGYWAIT =>
          rgData <= "0000" & origz;
          rgWe <= "011";
          rgAddrValid <= '1';
        when S_ORIGZ =>
          if rgDONE = '1' then
            rgAddrValid <= '0';
          end if;
        when S_ORIGZWAIT =>
          rgData <= dir(31 downto 16) & dir(47 downto 32);
          rgWe <= "100";
          rgAddrValid <= '1';
        when S_DIRXY =>
          if rgDONE = '1' then
            rgAddrValid <= '0';
          end if;
        when S_DIRXYWAIT =>
          rgData <= "0000000000000000" & dir(15 downto 0);
          rgWe <= "101";
          rgAddrValid <= '1';
        when S_DIRZ =>
          if unitselect = '1' then
            ack01 <= '1';
          else
            ack10 <= '1';
          end if;
          if rgDONE = '1' then
            rgAddrValid <= '0';
          end if;
        when others =>
      end case;
    end if;
  end process;

  process (state,origx,origy,origz,dir,ack01,ack10,as10,as01,rgdone)
  begin
    case (state) is
      when S_IDLE =>
        if ((as01 = '1') and (ack01 = '0')) or
           ((as10 = '1') and (ack10 = '0')) then
          next_state <= S_ORIGX;
        else
          next_state <= S_IDLE;
        end if;
      when S_ORIGX =>
        if rgDone = '1' then
          next_state <= S_ORIGXWAIT;
        else
          next_state <= S_ORIGX;
        end if;
      when S_ORIGXWAIT =>
        next_state <= S_ORIGY;
      when S_ORIGY =>
        if rgDone = '1' then
          next_state <= S_ORIGYWAIT;
        else
          next_state <= S_ORIGY;
        end if;
      when S_ORIGYWAIT =>
        next_state <= S_ORIGZ;
      when S_ORIGZ =>
        if rgDone = '1' then
          next_state <= S_ORIGZWAIT;
        else
          next_state <= S_ORIGZ;
        end if;
      when S_ORIGZWAIT =>
        next_state <= S_DIRXY;
      when S_DIRXY =>
        if rgDone = '1' then
          next_state <= S_DIRXYWAIT;
        else
          next_state <= S_DIRXY;
        end if;
      when S_DIRXYWAIT =>
        next_state <= S_DIRZ;
      when S_DIRZ =>
        if rgDone = '1' then
          next_state <= S_IDLE;
        else
          next_state <= S_DIRZ;
        end if;
    end case;
  end process;

end rtl;

**** raytri.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity raytri is
    port(
        clk : in std_logic;

        tout : out std_logic_vector(31 downto 0);
        uout : out std_logic_vector(15 downto 0);
        vout : out std_logic_vector(15 downto 0);
        triIDout : out std_logic_vector(15 downto 0);
        hitout : out std_logic;

        vert0x,vert0y,vert0z : in std_logic_vector(27 downto 0);
        origx,origy,origz : in std_logic_vector(27 downto 0);
        dirx,diry,dirz : in std_logic_vector(15 downto 0);
        edge1x,edge1y, edge1z : in std_logic_vector(15 downto 0);
        edge1size : in std_logic_vector(1 downto 0);
        edge2x,edge2y, edge2z : in std_logic_vector(15 downto 0);
        edge2size : in std_logic_vector(1 downto 0);
        config : in std_logic_vector(0 downto 0);
        exchangeEdges : in std_logic;
        triID : in std_logic_vector(15 downto 0);

        debugdetneg : out std_logic;
        debugsuneg : out std_logic;
        debugvneg : out std_logic;
        debugsugtdet : out std_logic;
        debugvgtdet : out std_logic;
        debugtneg : out std_logic;
        debughitinter : out std_logic;
        debughit : out std_logic
    );
end;

architecture rtl of raytri is

    -- Latch Connected Signals
    signal tvecxl,tvecyl,tveczl : std_logic_vector(28 downto 0);
    signal edge1xr,edge1yr,edge1zr : std_logic_vector(15 downto 0);
    signal edge1xla,edge1yla,edge1zla : std_logic_vector(15 downto 0);
    signal edge1xlb,edge1ylb,edge1zlb : std_logic_vector(15 downto 0);
    signal edge2xr,edge2yr,edge2zr : std_logic_vector(15 downto 0);
    signal edge2xla,edge2yla,edge2zla : std_logic_vector(15 downto 0);
    signal edge2xlb,edge2ylb,edge2zlb : std_logic_vector(15 downto 0);
    signal dirxla,diryla,dirzla : std_logic_vector(15 downto 0);
    signal dirxlb,dirylb,dirzlb : std_logic_vector(15 downto 0);
    signal detl : std_logic_vector(50 downto 0);
    signal hitl : std_logic_vector(0 downto 0);
    signal configl : std_logic_vector(0 downto 0);
    signal edge1sizer, edge2sizer : std_logic_vector(1 downto 0);
    signal edge1sizel, edge2sizel : std_logic_vector(1 downto 0);

    -- Intermediate Signals
    signal pvecx,pvecy,pvecz : std_logic_vector(32 downto 0);
    signal det : std_logic_vector(50 downto 0);
    signal tvecx,tvecy,tvecz : std_logic_vector(28 downto 0);
    signal qvecx,qvecy,qvecz : std_logic_vector(45 downto 0);
    signal u,su : std_logic_vector(63 downto 0);
    signal v,usv : std_logic_vector(63 downto 0);
    signal t : std_logic_vector(63 downto 0);
    signal uv : std_logic_vector(64 downto 0);
    signal hitinter : std_logic;

    -- Output Signals
    signal hit : std_logic_vector(0 downto 0);
    signal ru : std_logic_vector(15 downto 0);
    signal rv : std_logic_vector(15 downto 0);

begin
    -- Level 1 Math
    pvec : crossproduct
        generic map (16,16)
        port map (dirxla,diryla,dirzla,edge2xla,edge2yla,edge2zla,pvecx,pvecy,pvecz,clk);

    tvec : vectsub
        generic map (28)
        port map (origx,origy,origz,vert0x,vert0y,vert0z,tvecx,tvecy,tvecz,clk);

    tvecdelay : vectdelay
        generic map (29,2)
        port map (tvecx,tvecy,tvecz,tvecxl,tvecyl,tveczl,clk);

    edge1exchange : vectexchange
        generic map (16)
        port map (edge2x, edge2y, edge2z, edge1x, edge1y, edge1z,
                  edge1xr,edge1yr,edge1zr,exchangeEdges);

    edge2exchange : vectexchange
        generic map (16)
        port map (edge1x, edge1y, edge1z, edge2x, edge2y, edge2z,
                  edge2xr,edge2yr,edge2zr,exchangeEdges);

    -- changed to delay 1
    edge1adelay : vectdelay
        generic map (16,1)
        port map (edge1xr,edge1yr,edge1zr,edge1xla,edge1yla,edge1zla,clk);

    -- changed to delay 2
    edge1bdelay : vectdelay
        generic map (16,2)
        port map (edge1xla,edge1yla,edge1zla,edge1xlb,edge1ylb,edge1zlb,clk);

    qvec : crossproduct
        generic map (29,16)
        port map (tvecx,tvecy,tvecz,edge1xla,edge1yla,edge1zla,qvecx,qvecy,qvecz,clk);


    det : dotproduct
        generic map (16,33)
        port map(edge1xlb,edge1ylb,edge1zlb,pvecx,pvecy,pvecz,det,clk);

    ui : dotproduct
        generic map (29,33)
        port map (tvecxl,tvecyl,tveczl,pvecx,pvecy,pvecz,u,clk);

    dirdelaya : vectdelay
        generic map(16,1)
        port map(dirx,diry,dirz,dirxla,diryla,dirzla,clk);

    dirdelayb : vectdelay
        generic map(16,2)
        port map(dirxla,diryla,dirzla,dirxlb,dirylb,dirzlb,clk);

    vi : dotproduct
        generic map (16,46)
        port map (dirxlb,dirylb,dirzlb,qvecx,qvecy,qvecz,usv,clk);

    edge2delaya : vectdelay
        generic map(16,1)
        port map (edge2xr,edge2yr,edge2zr,edge2xla,edge2yla,edge2zla,clk);

    edge2delayb : vectdelay
        generic map(16,2)
        port map (edge2xla,edge2yla,edge2zla,edge2xlb,edge2ylb,edge2zlb,clk);

    ti : dotproduct
        generic map (16,46)
        port map (edge2xlb,edge2ylb,edge2zlb,qvecx,qvecy,qvecz,t,clk);

    configdelay : delay
        generic map (1,6)
        port map(config,configl,clk);

    detdelay : delay
        generic map (51,1)
        port map(det,detl,clk);

    divt : divide
        generic map(64,32,51,18)
        port map(t,det,tout,clk);

    divu : divide
        generic map(64,16,51,16) -- Changed fraction part to 16
        port map(su,det,ru,clk);

    divv : divide
        generic map(64,16,51,16) -- Changed fraction part to 16
        port map(v,det,rv,clk);

    rudelay : delay
        generic map (16,16)
        port map(ru,uout,clk);

    rvdelay : delay
        generic map (16,16)
        port map (rv, vout,clk);

    triIDdelay : delay
        generic map (16,37)
        port map (triID,triIDout,clk);

    -- Shifter section
    edge1sizeexchange : exchange
        generic map(2)
        port map (edge2size, edge1size, edge1sizer, exchangeEdges);

    edge2sizeexchange : exchange
        generic map(2)
        port map (edge1size, edge2size, edge2sizer, exchangeEdges);

    edge1sizeDelay : delay
        generic map (2,5)
        port map(edge1sizer,edge1sizel,clk);


    edge2sizeDelay : delay
        generic map (2,5)
        port map(edge2sizer,edge2sizel,clk);

    shifter1 : shifter
        generic map (64)
        port map(usv,v,edge1sizel);

    shifter2 : shifter
        generic map (64)
        port map(u,su,edge2sizel);

    -- Sun interface (address mapped input registers)

    hitdelay : delay
        generic map (1,30)
        port map (hit,hitl,clk);

    hitout <= hitl(0);

    debugdetneg <= '1' when (det < 0) else '0';
    debugsuneg <= '1' when (su < 0) else '0';
    debugvneg <= '1' when (v < 0) else '0';
    debugsugtdet <= '1' when (su > det) else '0';
    debugvgtdet <= '1' when (v > det) else '0';
    debugtneg <= '1' when (t < 0) else '0';
    debughitinter <= hitinter;
    debughit <= hit(0);

    process(clk)
    begin
        if (rising_edge(clk)) then
            -- Hit detection Logic (2 cycles)
            uv <= (su(63) & su)+(v(63) & v);
            if ((det < 0) or (su < 0) or (v < 0) or (su > det) or (v > det) or (t <= 0)) then
                hitinter <= '0';
            else
                hitinter <= '1';
            end if;
            if ((hitinter = '0') or ((configl(0) = '0') and (uv > detl))) then
                hit(0) <= '0';
            else
                hit(0) <= '1';
            end if;
            -- Hit Detection Logic Ends
        end if;
    end process;

end rtl;

**** resultcounter.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity resultcounter is
    port(
        resultID : in std_logic_vector(1 downto 0);
        newresult : in std_logic;
        done : out std_logic_vector(1 downto 0);
        reset : in std_logic;
        globalreset : in std_logic;
        clk : in std_logic);
end;

architecture rtl of resultcounter is
    signal count : std_logic_vector(3 downto 0);
    signal curr : std_logic_vector(1 downto 0);
begin
    done <= curr when count = 0 else "00";

    process(clk,globalreset,reset)
    begin
        if (globalreset = '1') or (reset = '1') then
            count <= "1000";
            curr <= (others => '0');
        elsif (rising_edge(clk)) then
            if (resultID /= 0) and (newresult = '1') and (count /= 0) then
                count <= count - 1;
                curr <= resultID;
            end if;
        end if;


**** resultinterface.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resultinterface isport(

t1b,t2b,t3b : out std_logic_vector(31 downto 0);tf1b,tf2b,tf3b : out std_logic_vector(31 downto 0);u1b,u2b,u3b,v1b,v2b,v3b : out std_logic_vector(15 downto 0);id1b,id2b,id3b : out std_logic_vector(15 downto 0);hit1b,hit2b,hit3b : out std_logic;resultID : out std_logic_vector(1 downto 0);newdata : out std_logic;resultready : in std_logic;resultdata : in std_logic_vector(31 downto 0);globalreset : in std_logic;clk : in std_logic);

end;

architecture rtl of resultinterface istype state_type is (S_IDLE,S_READ1,S_READ2,S_READ3,S_READ4,S_READ5,S_READ6,

S_READ7,S_READ8,S_READ9,S_READ10,S_READ11);signal state : state_type;signal next_state : state_type;

beginProcess(clk,globalreset)begin

if (globalreset = '1') thenstate <= S_IDLE;t1b <= (others => '0'); t2b <= (others => '0'); t3b <= (others => '0');tf1b <= (others => '0'); tf2b <= (others => '0'); tf3b <= (others => '0');u1b <= (others => '0'); u2b <= (others => '0'); u3b <= (others => '0');v1b <= (others => '0'); v2b <= (others => '0'); v3b <= (others => '0');id1b <= (others => '0'); id2b <= (others => '0'); id3b <= (others => '0');hit1b <= '0'; hit2b <= '0'; hit3b <= '0';resultID <= (others => '0');newdata <= '0';

elsif (rising_edge(clk)) thenstate <= next_state;newdata <= '0';case state iswhen S_IDLE =>if (resultready = '1') thent1b <= resultdata;

end if;when S_READ1 =>tf1b <= resultdata;

when S_READ2 =>u1b <= resultdata(31 downto 16);v1b <= resultdata(15 downto 0);

when S_READ3 =>id1b <= resultdata(15 downto 0);hit1b <= resultdata(16);resultID <= resultdata(18 downto 17);

when S_READ4 =>t2b <= resultdata;

when S_READ5 =>tf2b <= resultdata;


when S_READ7 =>id2b <= resultdata(15 downto 0);hit2b <= resultdata(16);

when S_READ8 =>


t3b <= resultdata;when S_READ9 =>tf3b <= resultdata;


when S_READ11 =>id3b <= resultdata(15 downto 0);hit3b <= resultdata(16);newdata <= '1';

end case;end if;

end process;

process (state, resultready)Begin

case state ISwhen S_IDLE =>if (resultready = '1') thennext_state <= S_READ1;


end if;when S_READ1 =>next_state <= S_READ2;

when S_READ2 =>next_state <= S_READ3;









when S_READ11 =>next_state <= S_IDLE;


end rtl;

**** resultrecieve.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resultrecieve isport(

valid01,valid10 : out std_logic;id01a,id01b,id01c : out std_logic_vector(15 downto 0);id10a,id10b,id10c : out std_logic_vector(15 downto 0);hit01a,hit01b,hit01c : out std_logic;hit10a,hit10b,hit10c : out std_logic;

rgResultData : in std_logic_vector(31 downto 0);rgResultReady : in std_logic;rgResultSource : in std_logic_vector(1 downto 0);


end;

architecture rtl of resultrecieve istype state_type is (S_IDLE,S_READ01,S_READ10);signal state : state_type;signal next_state : state_type;

begin



if (globalreset = '1') thenstate <= S_IDLE;valid01 <= '0'; valid10 <= '0';hit01a <= '0'; hit01b <= '0'; hit01c <= '0';hit10a <= '0'; hit10b <= '0'; hit10c <= '0';id01a <= (others => '0'); id01b <= (others => '0'); id01c <= (others => '0');id10a <= (others => '0'); id10b <= (others => '0'); id10c <= (others => '0');

elsif (rising_edge (clk)) thenstate <= next_state;valid01 <= '0';valid10 <= '0';case (state) iswhen S_IDLE =>if rgResultReady = '1' and rgResultSource = "01" thenid01a <= rgResultData(31 downto 16);id01b <= rgResultData(15 downto 0);

elsif rgResultReady = '1' and rgResultSource = "10" thenid10a <= rgResultData(31 downto 16);id10b <= rgResultData(15 downto 0);

end if;when S_READ01 =>id01c <= rgResultData(15 downto 0);hit01a <= rgResultData(18);hit01b <= rgResultData(17);hit01c <= rgResultData(16);valid01 <= '1';

when S_READ10 =>id10c <= rgResultData(15 downto 0);hit10a <= rgResultData(18);hit10b <= rgResultData(17);hit10c <= rgResultData(16);valid10 <= '1';


end if;end process;

process (state,rgResultReady,rgResultSource)begin

case (state) iswhen S_IDLE =>if rgResultReady = '1' and rgResultSource = "01" thennext_state <= S_READ01;

elsif rgResultReady = '1' and rgResultSource = "10" thennext_state <= S_READ10;


end if;when S_READ01 =>next_state <= S_IDLE;

when S_READ10 =>next_state <= S_IDLE;


end rtl;

**** resulttransmit.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resulttransmit isport(

valid01,valid10 : in std_logic;
id01a,id01b,id01c : in std_logic_vector(15 downto 0);
id10a,id10b,id10c : in std_logic_vector(15 downto 0);
hit01a,hit01b,hit01c : in std_logic;
hit10a,hit10b,hit10c : in std_logic;

-- Interchip Bus Signals
rgResultData : out std_logic_vector(31 downto 0);
rgResultReady : out std_logic;
rgResultSource : out std_logic_vector(1 downto 0);



end;

architecture rtl of resulttransmit istype state_type is (S_IDLE,S_SEND01A,S_SEND01B,S_SEND10A,S_SEND10B);signal state : state_type;signal next_state : state_type;signal pending01,pending10 : std_logic;

begin


if (globalreset = '1') thenstate <= S_IDLE;pending01 <= '0';pending10 <= '0';rgresultdata <= (others => '0');rgresultsource <= (others => '0');rgresultready <= '0';

elsif (rising_edge (clk)) thenif valid01 = '1' thenpending01 <= '1';

end if;if valid10 = '1' thenpending10 <= '1';

end if;rgResultReady <= '0';state <= next_state;case (state) iswhen S_SEND01A =>rgResultData <= id01a & id01b;rgResultReady <= '1';rgResultSource <= "01";

when S_SEND01B =>rgResultData <= "0000000000000" & hit01a & hit01b & hit01c & id01c;rgResultReady <= '0';rgResultSource <= "01";pending01 <= '0';

when S_SEND10A =>rgResultData <= id10a & id10b;rgResultReady <= '1';rgResultSource <= "10";

when S_SEND10B =>rgResultData <= "0000000000000" & hit10a & hit10b & hit10c & id10c;rgResultReady <= '0';rgResultSource <= "10";pending10 <= '0';


end if;end process;

process (state,pending01,pending10)begin

case (state) iswhen S_IDLE =>if pending01 = '1' thennext_state <= S_SEND01A;

elsif pending10 = '1' thennext_state <= S_SEND10A;


end if;when S_SEND01A =>next_state <= S_SEND01B;

when S_SEND01B =>next_state <= S_IDLE;

when S_SEND10A =>next_state <= S_SEND10B;

when S_SEND10B =>next_state <= S_IDLE;


end rtl;


**** resultwriter.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;


entity resultwriter isport(

valid01,valid10 : in std_logic;id01a,id01b,id01c : in std_logic_vector(15 downto 0);id10a,id10b,id10c : in std_logic_vector(15 downto 0);hit01a,hit01b,hit01c : in std_logic;hit10a,hit10b,hit10c : in std_logic;addr01, addr10 : in std_logic_vector(17 downto 0);as01,as10 : in std_logic;

dataout : out std_logic_vector(63 downto 0);addrout : out std_logic_vector(17 downto 0);write : out std_logic;ack : in std_logic;globalreset : in std_logic;clk : in std_logic);

end;

architecture rtl of resultwriter istype state_type is (S_IDLE,S_PROCESS);signal state : state_type;signal next_state : state_type;

signal pending01, pending10 : std_logic;signal addrout01, addrout10 : std_logic_vector(17 downto 0);signal shiften01,shiften10 : std_logic;

begin

fifo3insta : fifo3port map(addr01,as01,addrout01,shiften01,globalreset,clk);

fifo3instb : fifo3port map(addr10,as10,addrout10,shiften10,globalreset,clk);

shiften01 <= '1' when pending01 = '1' and (state = S_PROCESS) and ack = '1' else '0';shiften10 <= '1' when pending10 = '1' and pending01 ='0' and (state = S_PROCESS) and ack

= '1' else '0';


if (globalreset = '1') thenstate <= S_IDLE;pending01 <= '0';pending10 <= '0';

elsif (rising_edge (clk)) thenstate <= next_state;if valid01 = '1' thenpending01 <= '1';

end if;if valid10 = '1' thenpending10 <= '1';

end if;case (state) iswhen S_PROCESS =>if ack = '1' and pending01 = '1' thenpending01 <= '0';

elsif ack = '1' and pending10 = '1' thenpending10 <= '0';


end case;end if;

end process;

dataout <= ('0' & hit01a & "000000" & hit01a & "000000" & hit01a & "000000" &hit01b & "000000" & hit01b & "000000" & hit01b & "000000" &hit01c & "000000" & hit01c & "000000" & hit01c & "000000") when


pending01 = '1' else('0' & hit10a & "000000" & hit10a & "000000" & hit10a & "000000" &

hit10b & "000000" & hit10b & "000000" & hit10b & "000000" &hit10c & "000000" & hit10c & "000000" & hit10c & "000000");

addrout <= addrout01 when pending01 = '1' else addrout10;write <= '1' when state = S_PROCESS else '0';

process (state,pending01,pending10,ack)begin

case (state) iswhen S_IDLE =>if pending01 = '1' or pending10 = '1' thennext_state <= S_PROCESS;


end if;when S_PROCESS =>if ack = '1' thennext_state <= S_IDLE;

elsenext_state <= S_PROCESS;

end if;end case;

end process;

end rtl;

**** Rgsramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity RGsramcontroller isport(

want_addr : out std_logic;addr_ready : in std_logic;addrin : in std_logic_vector(17 downto 0);want_data : out std_logic;data_ready : in std_logic;datain : in std_logic_vector(63 downto 0);want_read : in std_logic;read_ready : out std_logic;dataout : out std_logic_vector(63 downto 0);

dirReady01,dirReady10 : out std_logic;wantDir01,wantDir10 : in std_logic;dir : out std_logic_vector(47 downto 0);addr : out std_logic_vector(15 downto 0);

wantwriteback : in std_logic;writebackack : out std_logic;writebackdata : in std_logic_vector(63 downto 0);writebackaddr : in std_logic_vector(17 downto 0);

tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;globalreset : in std_logic;clk : in std_logic;statepeek : out std_logic_vector(2 downto 0));

end;

architecture rtl of RGsramcontroller istype state_type is

(S_IDLE,S_LATCHADDR,S_READ,S_WRITE,S_WAIT,S_READ01,S_READ10,S_WRITEBACK);signal state : state_type;signal next_state : state_type;

signal waddress : std_logic_vector(17 downto 0);begin

process(state)


begincase state is

when S_IDLE => statepeek <= "000";when S_LATCHADDR => statepeek <= "001";when S_READ => statepeek <= "010";when S_WRITE => statepeek <= "011";when S_WAIT => statepeek <= "100";when S_READ01 => statepeek <= "101";when S_READ10 => statepeek <= "110";when S_WRITEBACK => statepeek <= "111";


dataout <= tm3_sram_data;dir <= tm3_sram_data(47 downto 0);addr <= tm3_sram_data(63 downto 48);


if (globalreset = '1') thenstate <= S_IDLE;waddress <= (others => '0');


case (state) iswhen S_IDLE =>if (addr_ready = '1') thenwaddress <= addrin;

end if;when S_WRITE =>waddress <= waddress+1;

when S_READ =>if (want_read = '0') thenwaddress <= waddress+1;

end if;when S_READ01 =>if wantDir01 = '0' thenwaddress <= waddress+1;

end if;when S_READ10 =>if wantDir10 = '0' thenwaddress <= waddress+1;


end case;end if;

end process;

process(state,addr_ready,data_ready,waddress,datain,wantdir10,wantdir01,want_read,wantwriteback,writebackdata,writebackaddr)

begintm3_sram_we <= "11111111";tm3_sram_oe <= "01";tm3_sram_adsp <= '0';tm3_sram_data <= (others => 'Z');tm3_sram_addr <= '0' & waddress;want_addr <= '1';want_data <= '1';read_ready <= '1';dirReady01 <= '0';dirReady10 <= '0';writebackack <= '0';case (state) is

when S_IDLE =>if (addr_ready = '1') thennext_state <= S_LATCHADDR;

elsif (want_read = '1') thennext_state <= S_READ;

elsif (data_ready = '1') thennext_state <= S_WRITE;

elsif (wantDir01 = '1') thennext_state <= S_READ01;

elsif (wantDir10 = '1') thennext_state <= S_READ10;

elsif (wantWriteback = '1') thennext_state <= S_WRITEBACK;



end if;when S_READ10 =>dirReady10 <= '1';if wantDir10 = '0' thennext_state <= S_IDLE;


end if;when S_READ01 =>dirReady01 <= '1';if wantDir01 = '0' thennext_state <= S_IDLE;


end if;when S_LATCHADDR =>want_addr <= '0';if (addr_ready = '0') thennext_state <= S_IDLE;

elsenext_state <= S_LATCHADDR;

end if;when S_READ =>read_ready <= '0';if (want_read = '1') thennext_state <= S_READ;


end if;when S_WRITEBACK =>tm3_sram_data <= writebackdata;tm3_sram_we <= "00000000";tm3_sram_oe <= "11";tm3_sram_adsp <= '0';tm3_sram_addr <= '0' & writebackaddr;writebackAck <= '1';next_state <= S_IDLE;

when S_WRITE =>tm3_sram_data <= datain;tm3_sram_we <= "00000000";tm3_sram_oe <= "11";tm3_sram_adsp <= '0';want_data <= '0';next_state <= S_WAIT;

when S_WAIT =>if data_ready = '1' thennext_state <= S_WAIT;


end if;want_data <= '0';


end rtl;

**** shifter.vhd ****
--------------------------------------------
-- Variable Combinational Shift Component --
--                                        --
-- B = A shifted right by specified amt   --
--                                        --
-- Factor   Bits Shifted Right            --
--   00            0             1        --
--   01            4             1/16     --
--   10            8             1/256    --
--   11           12             1/4096   --
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity shifter is
    generic (

    A : in std_logic_vector(width-1 downto 0);
    B : out std_logic_vector(width-1 downto 0);
    factor : in std_logic_vector(1 downto 0));
end;

architecture rtl of shifter is
begin
    process (factor,A)
    begin
        case (factor) is
            when "00" => B <= A;
            when "01" => B <= "0000" & A(width-1 downto 4);
            when "10" => B <= "00000000" & A(width-1 downto 8);
            when "11" => B <= "000000000000" & A(width-1 downto 12);


end rtl;
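
The shift table in the header of shifter.vhd can be checked with a small simulation. The following is a minimal sketch rather than part of the thesis sources: the testbench name shifter_tb, the constant WIDTH = 16 and the test value are assumptions made here, and the case statement from the listing is inlined (with an added "when others" branch) so that the sketch compiles on its own. It also shows that the listing pads with zeros, i.e. the shift is logical and does not replicate the sign bit.

```vhdl
-- Stand-alone simulation sketch; not part of the thesis sources.
-- shifter_tb, WIDTH and the stimulus value are hypothetical choices.
library ieee;
use ieee.std_logic_1164.all;

entity shifter_tb is
end entity shifter_tb;

architecture sim of shifter_tb is
    constant WIDTH : natural := 16;
    signal a, b   : std_logic_vector(WIDTH-1 downto 0) := (others => '0');
    signal factor : std_logic_vector(1 downto 0) := "00";
begin
    -- Same selection logic as the shifter listing, inlined here.
    shift : process (factor, a)
    begin
        case factor is
            when "00"   => b <= a;
            when "01"   => b <= "0000" & a(WIDTH-1 downto 4);
            when "10"   => b <= "00000000" & a(WIDTH-1 downto 8);
            when "11"   => b <= "000000000000" & a(WIDTH-1 downto 12);
            when others => b <= (others => 'X');
        end case;
    end process shift;

    -- Apply each factor in turn and check the zero-filled right shift.
    stimulus : process
    begin
        a <= x"ABCD";
        factor <= "00"; wait for 10 ns;
        assert b = x"ABCD" report "factor 00: expected no shift" severity error;
        factor <= "01"; wait for 10 ns;
        assert b = x"0ABC" report "factor 01: expected shift right by 4" severity error;
        factor <= "10"; wait for 10 ns;
        assert b = x"00AB" report "factor 10: expected shift right by 8" severity error;
        factor <= "11"; wait for 10 ns;
        assert b = x"000A" report "factor 11: expected shift right by 12" severity error;
        wait;
    end process stimulus;
end architecture sim;
```

Each increment of factor divides the value by a further 16, which matches the 1, 1/16, 1/256 and 1/4096 column of the header comment.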

**** sortedstack.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity sortedstack is
    generic (
        keywidth : natural := 32;
        datawidth : natural := 32+16;
        depth : natural := 8);
    port(
        keyin : in std_logic_vector(keywidth-1 downto 0);
        datain : in std_logic_vector(datawidth-1 downto 0);
        write : in std_logic;
        reset : in std_logic;
        peekdata : out std_logic_vector(datawidth*depth-1 downto 0);
        globalreset : in std_logic;
        clk : in std_logic);
end;

architecture rtl of sortedstack is
    type stdlogicarraykey is array(0 to depth-1) of std_logic_vector(keywidth-1 downto 0);
    type stdlogicarraydata is array(0 to depth-1) of std_logic_vector(datawidth-1 downto 0);
    type stdlogicarraybit is array(0 to depth-1) of std_logic;

    signal key : stdlogicarraykey;
    signal data : stdlogicarraydata;
    signal full : stdlogicarraybit;
    signal location : integer range 0 to depth-1;
begin
    peeklp : for k in 0 to depth-1 generate
        peekdata((k+1)*(datawidth)-1 downto k*(datawidth)) <= data(k) when full(k)='1' else (others=>'0');
    end generate peeklp;

    -- Select the proper insertion point
    process (keyin,key,full)
    begin
        location <= depth-1;
        nrst: for k in depth-2 downto 0 loop
            if ((keyin < key(k)) or (full(k) = '0')) then
                location <= k;
            end if;
        end loop nrst;
    end process;

    process (clk,globalreset,reset)
    begin
        if ((globalreset = '1') or (reset = '1')) then
            clr: for k in 0 to depth-1 loop
                full(k) <= '0';
                key(k) <= (others => '0');
                data(k) <= (others => '0');
            end loop clr;
        elsif rising_edge(clk) then
            if (write = '1') then
                key(location) <= keyin;
                data(location) <= datain;
                full(location) <= '1';
                shft: for k in 0 to depth-2 loop
                    if (k >= location) then
                        key(k+1) <= key(k);
                        data(k+1) <= data(k);
                        full(k+1) <= full(k);
                    end if;
                end loop shft;
            end if;
        end if;
    end process;

end rtl;
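
The insertion behaviour of sortedstack.vhd (a new entry is placed at the lowest slot whose key is larger or which is empty, and the entries from that slot on are shifted up by one) can be exercised with a short testbench. This is a sketch under stated assumptions, not part of the thesis sources: it assumes the sortedstack entity from the listing has been compiled into library work and is accepted by the simulator in use, and the testbench name, the reduced generic values (8-bit keys and data, depth 4) and the key/data values are all hypothetical.

```vhdl
-- Simulation sketch only; assumes work.sortedstack from the listing above.
-- sortedstack_tb, KEYW, DATAW, DEPTH and all stimulus values are hypothetical.
library ieee;
use ieee.std_logic_1164.all;

entity sortedstack_tb is
end entity sortedstack_tb;

architecture sim of sortedstack_tb is
    constant KEYW  : natural := 8;
    constant DATAW : natural := 8;
    constant DEPTH : natural := 4;

    signal keyin       : std_logic_vector(KEYW-1 downto 0)  := (others => '0');
    signal datain      : std_logic_vector(DATAW-1 downto 0) := (others => '0');
    signal write       : std_logic := '0';
    signal reset       : std_logic := '0';
    signal globalreset : std_logic := '1';
    signal clk         : std_logic := '0';
    signal peekdata    : std_logic_vector(DATAW*DEPTH-1 downto 0);
    signal finished    : boolean := false;
begin
    -- 10 ns clock that stops once the checks are done.
    clk <= not clk after 5 ns when not finished else '0';

    dut : entity work.sortedstack
        generic map (keywidth => KEYW, datawidth => DATAW, depth => DEPTH)
        port map (keyin => keyin, datain => datain, write => write,
                  reset => reset, peekdata => peekdata,
                  globalreset => globalreset, clk => clk);

    stimulus : process
        -- Present one key/data pair for a single rising clock edge.
        procedure push (k, d : in std_logic_vector(7 downto 0)) is
        begin
            keyin <= k;
            datain <= d;
            write <= '1';
            wait until rising_edge(clk);
        end procedure push;
    begin
        wait for 12 ns;              -- hold the asynchronous reset briefly
        globalreset <= '0';
        push(x"05", x"A5");          -- keys arrive out of order
        push(x"03", x"A3");
        push(x"07", x"A7");
        write <= '0';
        wait for 1 ns;               -- let the registered outputs settle
        -- Slot 0 should hold the entry with the smallest key.
        assert peekdata(7 downto 0)   = x"A3" report "slot 0 mismatch" severity error;
        assert peekdata(15 downto 8)  = x"A5" report "slot 1 mismatch" severity error;
        assert peekdata(23 downto 16) = x"A7" report "slot 2 mismatch" severity error;
        assert peekdata(31 downto 24) = x"00" report "slot 3 should be empty" severity error;
        finished <= true;
        wait;
    end process stimulus;
end architecture sim;
```

The keys are compared as signed values because the listing uses ieee.std_logic_signed, so the small positive keys chosen here avoid any sign-related surprises in the check.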

**** spram.vhd ****
-------------------------------------------------------
-- Single Ported Ram Module                          --
--                                                   --
-- - Synplify should infer ram from the coding style --
-- - The depth of the ram is equal to 2**depth       --
--                                                   --
-------------------------------------------------------
-- Further Reading: RAM Inferencing with Synplify
-- http://www.synplicity.com/literature/pdf/ram_inferencing.pdf

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;

entity spram isgeneric(


port(we : in std_logic;addr : in std_logic_vector(depth-1 downto 0);dataout : out std_logic_vector(width-1 downto 0);datain : in std_logic_vector(width-1 downto 0);clk : in std_logic);

end;

architecture rtl of spram istype memarray is array(2**depth-1 downto 0) of

std_logic_vector(width-1 downto 0);signal mem : memarray;

begindataout <= mem(conv_integer(addr));

process(clk,we,addr)begin

if (rising_edge (clk)) thenif (we = '1') thenmem(conv_integer(addr)) <= datain;

end if;end if;

end process;

end rtl;

**** spramblock.vhd ****
-------------------------------------------------------
-- Single Ported Ram Module w/Registered Output      --
-- - Synplify should infer ram from the coding style --
-- - Depth is the number of bits of address          --
--   the true depth is 2**depth                      --
-------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
library synplify;
use synplify.attributes.all;


entity spramblock isgeneric(


port(we : in std_logic;addr : in std_logic_vector(depth-1 downto 0);datain : in std_logic_vector(width-1 downto 0);dataout : out std_logic_vector(width-1 downto 0);clk : in std_logic);

end;

architecture rtl of spramblock istype memarray is array(2**depth-1 downto 0) of std_logic_vector(width-1 downto 0);

signal raddr : std_logic_vector(depth-1 downto 0);signal mem : memarray;attribute syn_ramstyle of mem : signal is "no_rw_check";

begindataout <= mem(conv_integer(raddr));process(clk,we,addr)begin

if (rising_edge (clk)) thenraddr <= addr;if (we = '1') thenmem(conv_integer(addr)) <= datain;

end if;end if;


**** sramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity sramcontroller isport(

want_addr : out std_logic;addr_ready : in std_logic;addrin : in std_logic_vector(17 downto 0);want_data : out std_logic;data_ready : in std_logic;datain : in std_logic_vector(63 downto 0);

addr : in std_logic_vector(17 downto 0);addrvalid : in std_logic;data : out std_logic_vector(63 downto 0);datavalid : buffer std_logic;

tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;globalreset : in std_logic;clk : in std_logic;statepeek : out std_logic_vector(2 downto 0));

end;

architecture rtl of sramcontroller istype state_type is (S_IDLE,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE,S_READ);signal state : state_type;signal next_state : state_type;

signal waddress : std_logic_vector(17 downto 0);begin

process(state)begin

case state iswhen S_IDLE => statepeek <= "001";when S_WRITE1 => statepeek <= "010";when S_WRITE2 => statepeek <= "011";when S_WRITE3 => statepeek <= "100";when S_WRITEDONE => statepeek <= "101";


when S_READ => statepeek <= "110";when others => statepeek <= "000";



if (globalreset = '1') thenstate <= S_IDLE;waddress <= (others => '0');data <= (others => '0');datavalid <= '0';


case (state) iswhen S_IDLE =>if (addr_ready = '1') thenwaddress <= addrin;

end if;if addrvalid = '0' thendatavalid <= '0';

end if;when S_WRITE2 =>if (data_ready = '1') thenwaddress <= waddress+1;

end if;when S_READ =>data <= tm3_sram_data;datavalid <= '1';


end if;end process;

process (state,addr_ready,data_ready,waddress,datain,addrvalid,datavalid,addr)begin

tm3_sram_we <= "11111111";tm3_sram_oe <= "11";tm3_sram_adsp <= '1';tm3_sram_data <= (others => 'Z');tm3_sram_addr <= (others => '-');want_addr <= '1';want_data <= '0';case (state) is

when S_IDLE =>if (addr_ready = '1') thennext_state <= S_WRITE1;

elsif addrvalid = '1' and datavalid = '0' thennext_state <= S_READ;tm3_sram_addr <= '0' & addr;tm3_sram_adsp <= '0';tm3_sram_oe <= "01";


end if;when S_READ =>next_state <= S_IDLE;



end if;when S_WRITE2 =>want_data <= '1';tm3_sram_addr <= '0' & waddress;tm3_sram_data <= datain;if (addr_ready = '1') thennext_state <= S_WRITEDONE;

elsif (data_ready = '1') thentm3_sram_we <= "00000000";tm3_sram_adsp <= '0';next_state <= S_WRITE3;







end if;end case;


**** test.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;


entity test isport(

triIDvalid : out std_logic;triID : out std_logic_vector(15 downto 0);wanttriID : in std_logic;raydata : out std_logic_vector(31 downto 0);rayaddr : out std_logic_vector(3 downto 0);raywe : out std_logic_vector(2 downto 0);resultready : in std_logic;resultdata : in std_logic_vector(31 downto 0);globalreset : out std_logic;

want_braddr : out std_logic;braddr_ready : in std_logic;braddrin : in std_logic_vector(9 downto 0);want_brdata : out std_logic;brdata_ready : in std_logic;brdatain : in std_logic_vector(31 downto 0);

want_addr2 : out std_logic;addr2_ready : in std_logic;addr2in : in std_logic_vector(17 downto 0);want_data2 : out std_logic;data2_ready : in std_logic;data2in : in std_logic_vector(63 downto 0);

pglobalreset : in std_logic;tm3_clk_v0 : in std_logic;tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;

-- Bus Signals (To Ray Generator Unit)
raygroup01 : in std_logic_vector(1 downto 0);
raygroupvalid01 : in std_logic;
busy01 : out std_logic;
raygroup10 : in std_logic_vector(1 downto 0);
raygroupvalid10 : in std_logic;
busy10 : out std_logic;

rgData : in std_logic_vector(31 downto 0);rgAddr : in std_logic_vector(3 downto 0);rgWE : in std_logic_vector(2 downto 0);rgAddrValid : in std_logic;rgDone : out std_logic;


rgResultData : out std_logic_vector(31 downto 0);rgResultReady : out std_logic;rgResultSource : out std_logic_vector(1 downto 0);

t1a : out std_logic_vector(31 downto 0);t1b : out std_logic_vector(31 downto 0);u1a : out std_logic_vector(15 downto 0);u1b : out std_logic_vector(15 downto 0);v1a : out std_logic_vector(15 downto 0);v1b : out std_logic_vector(15 downto 0);id1a : out std_logic_vector(15 downto 0);id1b : out std_logic_vector(15 downto 0);hit1a : out std_logic;hit1b : out std_logic;

debug1 : out std_logic_vector(31 downto 0);debug2 : out std_logic_vector(31 downto 0);debug3 : out std_logic_vector(31 downto 0);input1 : in std_logic;input2 : in std_logic;input3 : in std_logic_vector(31 downto 0));

end;

architecture rtl of test issignal max,max01,max10 : std_logic_vector(31 downto 0);signal maxwe,maxwe01,maxwe10 : std_logic;signal raygroupwe,raygroupwe01,raygroupwe10 : std_logic;signal raygroupout,raygroupout01,raygroupout10 : std_logic_vector(1 downto 0);signal raygroupid, raygroupid01,raygroupid10 : std_logic_vector(1 downto 0);signal resultid : std_logic_vector(1 downto 0);signal t1i,t2i,t3i,tf1i,tf2i,tf3i : std_logic_vector(31 downto 0);signal u1i,u2i,u3i,v1i,v2i,v3i : std_logic_vector(15 downto 0);signal id1i,id2i,id3i : std_logic_vector(15 downto 0);signal hit1i,hit2i,hit3i : std_logic;signal newresult : std_logic;signal write,reset,reset01,reset10 : std_logic;signal peekdata,peeklatch : std_logic_vector(871 downto 0);signal commit01,commit10 : std_logic;signal baseaddress01,baseaddress10 : std_logic_vector(1 downto 0);signal done : std_logic_vector(1 downto 0);signal cntreset,cntreset01,cntreset10 : std_logic;signal passCTS01, passCTS10 : std_logic;signal triIDvalid01, triIDvalid10 : std_logic;signal triID01, triID10 : std_logic_vector(15 downto 0);signal gnd : std_logic;signal boundNodeID,BoundNodeID01, BoundNodeID10 : std_logic_vector(9 downto 0);signal enablenear,enablenear01,enablenear10 : std_logic;signal max0_01,max1_01,max2_01,max0_10,max1_10,max2_10 : std_logic_vector(31 downto 0);signal ack01,ack10,empty01,dataready01,empty10,dataready10,lhreset01,lhreset10 :

std_logic;signal boundnodeIDout01,boundnodeIDout10 : std_logic_vector(9 downto 0);signal level01,level10 : std_logic_vector(1 downto 0);signal hitmask01,hitmask10 : std_logic_vector(2 downto 0);

-- Offset Block Ram Read Signals
signal ostaddr,addrind01,addrind10 : std_logic_vector(9 downto 0);
signal ostaddrvalid,addrindvalid01,addrindvalid10,ostdatavalid : std_logic;
signal ostdata : std_logic_vector(31 downto 0);
-- Tri List Ram Read Signals
signal tladdr,tladdr01,tladdr10 : std_logic_vector(17 downto 0);
signal tladdrvalid,tladdrvalid01,tladdrvalid10,tldatavalid : std_logic;
signal tldata : std_logic_vector(63 downto 0);
-- Final Result Signals
signal t1_01,t2_01,t3_01,t1_10,t2_10,t3_10 : std_logic_vector(31 downto 0);
signal v1_01,v2_01,v3_01,v1_10,v2_10,v3_10 : std_logic_vector(15 downto 0);
signal u1_01,u2_01,u3_01,u1_10,u2_10,u3_10 : std_logic_vector(15 downto 0);
signal id1_01,id2_01,id3_01,id1_10,id2_10,id3_10 : std_logic_Vector(15 downto 0);
signal hit1_01,hit2_01,hit3_01,hit1_10,hit2_10,hit3_10 : std_logic;
signal bcvalid01, bcvalid10 : std_logic;

signal peekoffset1a,peekoffset1b,peekoffset0a,peekoffset0b : std_logic_vector(2 downto0);

signal peekoffset2a,peekoffset2b : std_logic_vector(2 downto 0);signal peekaddressa,peekaddressb : std_logic_vector(4 downto 0);

signal doutput,dack : std_logic;signal state01,state10 : std_logic_vector(4 downto 0);

90

signal junk1,junk1b : std_logic_vector(2 downto 0);signal junk2,junk2a : std_logic;signal junk3,junk4 : std_logic_vector(1 downto 0);signal d1 : std_logic_vector(31 downto 0);signal debugstoplevel01,debugstoplevel10 : std_logic_vector(1 downto 0);signal debugleafbreak : std_logic;signal debugcount01,debugcount10 : std_logic_vector(13 downto 0);signal debugsubcount01, debugsubcount10 : std_logic_vector(1 downto 0);signal statesram : std_logic_vector(2 downto 0);

begin
    d1(12 downto 7) <= (others => '0');

    debugstoplevel01 <= input3(1 downto 0);
    debugstoplevel10 <= input3(3 downto 2);
    debugleafbreak <= input3(4);

    t1a <= t1_01;
    t1b <= t1_10;
    u1a <= u1_01;
    u1b <= u1_10;
    v1a <= v1_01;
    v1b <= v1_10;
    id1a <= id1_01;
    id1b <= id1_10;
    hit1a <= hit1_01;
    hit1b <= hit1_10;

    oc : onlyonecycle
        port map(input1,doutput,pglobalreset,tm3_clk_v0);

    debug1 <= d1;
    d1(0) <= empty01;
    d1(1) <= dataready01;
    d1(3 downto 2) <= level01;
    d1(25 downto 16) <= boundnodeIDout01;
    d1(15 downto 13) <= (others => '0');
    d1(6 downto 4) <= (others => '0');
    debug2 <= max0_01;
    debug3 <= max1_01;

-- Real Stuff Starts Here

    ostaddr <= addrind01 or addrind10;
    ostaddrvalid <= addrindvalid01 or addrindvalid10;

    offsettable : vblockramcontroller
        generic map(32,10)
        port map(want_braddr,braddr_ready,braddrin,want_brdata,brdata_ready,brdatain,
                 ostaddr,ostaddrvalid,ostdata,ostdatavalid,pglobalreset,tm3_clk_v0);

    tladdr <= tladdr01 or tladdr10;
    tladdrvalid <= tladdrvalid01 or tladdrvalid10;

    trilist : sramcontroller
        port map(want_addr2,addr2_ready,addr2in,want_data2,data2_ready,data2in,
                 tladdr,tladdrvalid,tldata,tldatavalid,
                 tm3_sram_data,tm3_sram_addr,tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,
                 pglobalreset,tm3_clk_v0, statesram);

    globalreset <= pglobalreset;

    ri : resultinterface
        port map(t1i,t2i,t3i,tf1i,tf2i,tf3i,u1i,u2i,u3i,
                 v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,
                 resultID,newresult,resultready,resultdata,pglobalreset,tm3_clk_v0);

    rayint : rayinterface
        port map(max,maxwe, raygroupout,raygroupwe,raygroupid,enablenear,
                 rgData,rgAddr,rgWe,rgAddrvalid,rgDone,raydata,rayaddr,raywe,
                 pglobalreset,tm3_clk_v0);

    boundcont01 : boundcontroller
        generic map('1',"01")
        port map(max01,maxwe01,raygroupout01,raygroupwe01,raygroupid01,
                 enablenear01,raygroup01,raygroupvalid01,busy01,
                 triIDvalid01, triID01,wanttriID,reset01,baseaddress01,newresult,boundNodeID01,
                 resultID,hitmask01,dataready01,empty01,level01,max0_01,max1_01,max2_01,
                 boundNodeIDout01,ack01,lhreset01,addrind01,addrindvalid01,ostdata,ostdatavalid,
                 tladdr01,tladdrvalid01,tldata,tldatavalid,
                 t1i,t2i,t3i,u1i,u2i,u3i,v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,
                 t1_01,t2_01,t3_01,u1_01,u2_01,u3_01,v1_01,v2_01,v3_01,
                 id1_01,id2_01,id3_01,hit1_01,hit2_01,hit3_01,
                 bcvalid01,done,cntreset01,passCTS01,passCTS10,pglobalreset,tm3_clk_v0,
                 state01,debugstoplevel01,debugleafbreak,debugsubcount01,debugcount01);

    boundcont10 : boundcontroller
        generic map('0',"10")
        port map(max10,maxwe10,raygroupout10,raygroupwe10,raygroupid10,
                 enablenear10, raygroup10, raygroupvalid10, busy10,
                 triIDvalid10, triID10, wanttriID,reset10,baseaddress10,newresult,BoundNodeID10,
                 resultID,hitmask10,dataready10,empty10,level10,max0_10,max1_10,max2_10,
                 boundNodeIDout10,ack10, lhreset10,addrind10,addrindvalid10,ostdata,ostdatavalid,
                 tladdr10,tladdrvalid10,tldata,tldatavalid,
                 t1i,t2i,t3i,u1i,u2i,u3i,v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,
                 t1_10,t2_10,t3_10,u1_10,u2_10,u3_10,v1_10,v2_10,v3_10,
                 id1_10,id2_10,id3_10,hit1_10,hit2_10,hit3_10,
                 bcvalid10,done,cntreset10,passCTS10,passCTS01,pglobalreset,tm3_clk_v0,
                 state10,debugstoplevel10,debugleafbreak,debugsubcount10,debugcount10);

    restransinst : resulttransmit
        port map(bcvalid01,bcvalid10,id1_01,id2_01,id3_01,id1_10,id2_10,id3_10,
                 hit1_01,hit2_01,hit3_01,hit1_10,hit2_10,hit3_10,
                 rgResultData,rgResultReady,rgResultSource, pglobalreset,tm3_clk_v0);

gnd <= '0';

    raygroupout <= raygroupout01 or raygroupout10;
    raygroupwe <= raygroupwe01 or raygroupwe10;
    raygroupid <= raygroupid01 or raygroupid10;
    triIDvalid <= triIDvalid01 or triIDvalid10;
    enablenear <= enablenear01 or enablenear10;
    triID <= triID01 or triID10;
    cntreset <= cntreset01 or cntreset10;
    reset <= reset01 or reset10;
    max <= max01 or max10;
    maxwe <= maxwe01 or maxwe10;

    process (boundNodeID01,boundNodeID10,resultID)
    begin
        if resultID = "01" then
            boundNodeID <= BoundNodeID01;
        elsif resultID = "10" then
            boundNodeID <= BoundNodeID10;
        else
            boundNodeID <= (others => '-');
        end if;
    end process;

    write <= '1' when (newresult = '1') and (resultID /= 0) and
                      ((hit1i = '1') or (hit2i = '1') or (hit3i = '1')) else '0';

    st : sortedstack
        generic map(32, 109, 8)
        port map(t1i,hit3i&hit2i&hit1i&tf3i&tf2i&tf1i&boundNodeID,write,reset,peekdata,
                 pglobalreset,tm3_clk_v0);

    commit01 <= '1' when done = "01" else '0';
    commit10 <= '1' when done = "10" else '0';

    dack <= doutput or ack01;

    lh01 : listhandler
        port map(peeklatch,commit01,hitmask01,dack,max0_01,max1_01,max2_01,
                 boundnodeIDout01,level01,empty01,dataready01,lhreset01,
                 pglobalreset,tm3_clk_v0,peekoffset0a,peekoffset1a, peekoffset2a,junk2a,junk4);

    lh02 : listhandler
        port map(peeklatch,commit10,hitmask10,ack10,max0_10,max1_10,max2_10,
                 boundnodeIDout10,level10,empty10,dataready10,lhreset10,
                 pglobalreset,tm3_clk_v0,junk1,junk1b,peekoffset2b,junk2,junk3);

    process (tm3_clk_v0,pglobalreset)
    begin
        -- The reset is only for debugging
        if (pglobalreset = '1') then
            d1(31 downto 26) <= (others => '0');
            peeklatch <= (others => '0');
        elsif rising_edge(tm3_clk_v0) then
            if newresult = '1' then
                d1(31 downto 26) <= d1(31 downto 26) + 1;
            end if;
            if (done /= 0) then
                peeklatch <= peekdata;
            end if;
        end if;
    end process;

    rc : resultcounter
        port map(resultID,newresult,done,cntreset,pglobalreset,tm3_clk_v0);

end rtl;

**** top.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity top is
    port(
        want_saddr : out std_logic;
        saddr_ready : in std_logic;
        saddrin : in std_logic_vector(17 downto 0);
        want_sdata : out std_logic;
        sdata_ready : in std_logic;
        sdatain : in std_logic_vector(63 downto 0);

        tm3_sram_data : inout std_logic_vector(63 downto 0);
        tm3_sram_addr : out std_logic_vector(18 downto 0);
        tm3_sram_we : out std_logic_vector(7 downto 0);
        tm3_sram_oe : out std_logic_vector(1 downto 0);
        tm3_sram_adsp : out std_logic;

        triIDvalid : in std_logic;
        triID : in std_logic_vector(15 downto 0);
        wanttriID : out std_logic;
        raydata : in std_logic_vector(31 downto 0);
        rayaddr : in std_logic_vector(3 downto 0);
        raywe : in std_logic_vector(2 downto 0);
        resultready : out std_logic;
        resultdata : out std_logic_vector(31 downto 0);

        tm3_io_3 : out std_logic_vector(31 downto 0);
        globalreset : in std_logic;
        tm3_clk_v0 : in std_logic
    );
end;

architecture rtl of top is
    type stdlogicarray32 is array(0 to 2) of std_logic_vector(31 downto 0);
    type stdlogicarray16 is array(0 to 2) of std_logic_vector(15 downto 0);

    -- Memory Interface Signals
    signal tridata : std_logic_vector(191 downto 0);
    signal triID_out : std_logic_vector(15 downto 0);
    signal cyclenum : std_logic_vector(1 downto 0);
    signal masterenable,masterenablel : std_logic_vector(0 downto 0);
    signal swap : std_logic;

    -- Ray Tri Interface Signals
    signal tout : std_logic_vector(31 downto 0);
    signal uout : std_logic_vector(15 downto 0);
    signal vout : std_logic_vector(15 downto 0);
    signal triIDout : std_logic_vector(15 downto 0);
    signal hitout : std_logic;
    signal origx,origy,origz : std_logic_vector(27 downto 0);
    signal dirx,diry,dirz : std_logic_vector(15 downto 0);

    -- Nearest Unit Signals
    signal nt,ft : stdlogicarray32;
    signal nu,nv,ntriID : stdlogicarray16;
    signal anyhit : std_logic_vector(2 downto 0);
    signal n0enable, n1enable, n2enable,nxenable : std_logic;
    signal enablenear,enablenearl : std_logic_vector(0 downto 0);
    signal resetl,reset : std_logic_vector(0 downto 0);
    signal maxdist, maxdistl : std_logic_vector(31 downto 0);
    signal raygroupID, raygroupIDl : std_logic_vector(1 downto 0);

    -- Debug signals
    signal pod1 : std_logic_vector(15 downto 1);
    signal pod2 : std_logic_vector(15 downto 0);
    signal debugdetneg : std_logic;
    signal debugsuneg : std_logic;
    signal debugvneg : std_logic;
    signal debugsugtdet : std_logic;
    signal debugvgtdet : std_logic;
    signal debugtneg : std_logic;
    signal debughitinter : std_logic;
    signal debughit : std_logic;
begin

    tm3_io_3 <= pod2 & '0' & pod1;
    pod1(1) <= masterenable(0);
    pod1(2) <= n2enable;
    pod1(3) <= resetl(0);
    pod1(4) <= anyhit(2);
    pod1(5) <= debugdetneg;
    pod1(6) <= debugsuneg;
    pod1(7) <= debugvneg;
    pod1(8) <= debugsugtdet;
    pod1(9) <= debugvgtdet;
    pod1(10) <= debugtneg;
    pod1(11) <= debughitinter;
    pod1(12) <= debughit;
    pod1(13) <= hitout;
    pod1(15 downto 14) <= tridata(161 downto 160); -- vert0z

    pod2(3 downto 0) <= dirx(3 downto 0);
    pod2(5 downto 4) <= diry(1 downto 0);
    pod2(7 downto 6) <= dirz(1 downto 0);
    pod2(11 downto 8) <= tridata(99 downto 96); -- vert0x
    pod2(13 downto 12) <= tridata(1 downto 0); -- edge1x
    pod2(15 downto 14) <= tridata(65 downto 64); -- edge2y

    mem : memoryinterface
        port map(
            want_saddr,saddr_ready,saddrin,want_sdata,sdata_ready,sdatain,
            tridata, triID_out, masterenable(0), triIDvalid, triID, wanttriID,cyclenum,
            tm3_sram_data,tm3_sram_addr,tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,
            globalreset, tm3_clk_v0);

    triunit : raytri
        port map(
            tm3_clk_v0,tout,uout,vout,triIDout,hitout,
            tridata(123 downto 96), tridata(155 downto 128), tridata(187 downto 160),
            origx,origy,origz, dirx,diry,dirz,
            tridata(15 downto 0), tridata(31 downto 16), tridata(47 downto 32), tridata(125 downto 124),
            tridata(63 downto 48), tridata(79 downto 64), tridata(95 downto 80), tridata(157 downto 156),
            tridata(191 downto 191), swap,triID_out,
            debugdetneg,debugsuneg,debugvneg,debugsugtdet,debugvgtdet,debugtneg,debughitinter,debughit);

    nc0 : nearcmpspec
        port map(
            tout,uout,vout,triIDout,hitout,nt(0),ft(0),nu(0),nv(0),ntriID(0),anyhit(0),
            maxdistl,n0enable,nxenable,resetl(0),globalreset,tm3_clk_v0);

    nc1 : nearcmp
        port map(
            tout,uout,vout,triIDout,hitout,nt(1),ft(1),nu(1),nv(1),ntriID(1),anyhit(1),
            maxdistl,n1enable,resetl(0),globalreset,tm3_clk_v0);

    nc2 : nearcmp
        port map(
            tout,uout,vout,triIDout,hitout,nt(2),ft(2),nu(2),nv(2),ntriID(2),anyhit(2),
            maxdistl,n2enable,resetl(0),globalreset,tm3_clk_v0);

    n0enable <= '1' when (cyclenum = "10") and (masterenablel(0) = '1') else '0';
    n1enable <= '1' when (cyclenum = "00") and (masterenablel(0) = '1') else '0';
    n2enable <= '1' when (cyclenum = "01") and (masterenablel(0) = '1') else '0';
    nxenable <= '1' when (enablenearl(0) = '1') and (masterenablel(0) = '1') else '0';

    maxdelay : delay
        generic map (32,37)
        port map(maxdist,maxdistl,tm3_clk_v0);

    raygroupdelay : delay
        generic map (2,37+1) -- One delay level to account for near cmp internal latch
        port map(raygroupID,raygroupIDl,tm3_clk_v0);

    enableneardelay : delay
        generic map (1,37)
        port map(enablenear,enablenearl,tm3_clk_v0);

    mastdelay : delay
        generic map (1,37)
        port map(masterenable,masterenablel,tm3_clk_v0);

    resetdelay : delay
        generic map (1,37)
        port map(reset,resetl,tm3_clk_v0);

    resstate : resultstate
        port map (resetl(0),
                  nt(0),nt(1),nt(2),ft(0),ft(1),ft(2),nu(0),nu(1),nu(2),nv(0),nv(1),nv(2),
                  ntriID(0),ntriID(1),ntriID(2),anyhit(0),anyhit(1),anyhit(2),
                  raygroupIDl,resultready,resultdata,globalreset, tm3_clk_v0);

    raybuff : raybuffer
        port map ( origx, origy, origz, dirx, diry, dirz, maxdist, raygroupID, swap,
                   reset(0),enablenear(0),raydata, rayaddr, raywe, cyclenum,tm3_clk_v0);

end rtl;

**** vblockramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;


entity vblockramcontroller isgeneric(


port(want_addr : out std_logic;addr_ready : in std_logic;addrin : in std_logic_vector(depth-1 downto 0);want_data : out std_logic;data_ready : in std_logic;datain : in std_logic_vector(width-1 downto 0);


addr : in std_logic_vector(depth-1 downto 0);addrvalid : in std_logic;data : out std_logic_vector(width-1 downto 0);datavalid : buffer std_logic;


end;

architecture rtl of vblockramcontroller is
    type state_type is (S_IDLE,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE,S_READ);
    signal state : state_type;
    signal next_state : state_type;

    signal waddr,saddr : std_logic_vector(depth-1 downto 0);
    signal dataout : std_logic_vector(width-1 downto 0);
    signal we : std_logic;

begin

    saddr <= waddr when state /= S_IDLE else addr;

    ramblock : spramblock
        generic map (width,depth)
        port map(we,saddr,datain,dataout,clk);


if (globalreset = '1') thenstate <= S_IDLE;waddr <= (others => '0');data <= (others => '0');datavalid <= '0';


case (state) iswhen S_IDLE =>if (addr_ready = '1') thenwaddr <= addrin;

end if;if addrvalid = '0' thendatavalid <= '0';

end if;when S_WRITE2 =>if (data_ready = '1') thenwaddr <= waddr+1;

end if;when S_READ =>data <= dataout;datavalid <= '1';


end if;end process;

process (state,addr_ready,data_ready,addrvalid,datavalid)begin

we <= '0';want_addr <= '1';want_data <= '0';case (state) is

when S_IDLE =>if (addr_ready = '1') thennext_state <= S_WRITE1;

elsif addrvalid = '1' and datavalid = '0' thennext_state <= S_READ;


end if;when S_READ =>next_state <= S_IDLE;




end if;when S_WRITE2 =>want_data <= '1';if (addr_ready = '1') thennext_state <= S_WRITEDONE;

elsif (data_ready = '1') thenwe <= '1';next_state <= S_WRITE3;






end if;end case;

end process;

end rtl;

**** vectdelay.vhd ****
-------------------------------------------
-- Variable Length Vector Shift Register --
-- Provides a specified number of        --
-- clock cycle delay for 3 signals       --
-------------------------------------------

library ieee;
use ieee.std_logic_1164.all;

entity vectdelay isgeneric (


port(xin,yin,zin : in std_logic_vector(width-1 downto 0);xout,yout,zout : out std_logic_vector(width-1 downto 0);clk : in std_logic);

end;

architecture rtl of vectdelay is
    type delayarray is array(0 to depth-1) of std_logic_vector(width-1 downto 0);

    signal bufferx : delayarray;
    signal buffery : delayarray;
    signal bufferz : delayarray;

begin
    xout <= bufferx(depth-1);
    yout <= buffery(depth-1);
    zout <= bufferz(depth-1);

    process(clk)
    begin
        if (rising_edge(clk)) then
            bufferx(0) <= xin;
            buffery(0) <= yin;
            bufferz(0) <= zin;
            if (depth > 1) then
                row : for k in 0 to depth-2 loop
                    bufferx(k+1) <= bufferx(k);
                    buffery(k+1) <= buffery(k);
                    bufferz(k+1) <= bufferz(k);
                end loop row;
            end if;
        end if;
    end process;

end rtl;


**** vectexchange.vhd ****
---------------------------------
-- Vector Mux Component        --
-- C = A when ABn = '1' else B --
---------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity vectexchange isgeneric (


Ax,Ay,Az : in std_logic_vector(width-1 downto 0);Bx,By,Bz : in std_logic_vector(width-1 downto 0);Cx,Cy,Cz : out std_logic_vector(width-1 downto 0);ABn : in std_logic);

end;

architecture rtl of vectexchange is
begin

    Cx <= Ax when (ABn = '1') else Bx;
    Cy <= Ay when (ABn = '1') else By;
    Cz <= Az when (ABn = '1') else Bz;

end rtl;

**** vectsub.vhd ****
-----------------------------------------
-- Signed Vector Subtraction Component --
-- C = A - B                           --
-- The output, C, is latched           --
-----------------------------------------


entity vectsub isgeneric (


Ax,Ay,Az : in std_logic_vector(width-1 downto 0);Bx,By,Bz : in std_logic_vector(width-1 downto 0);Cx,Cy,Cz : out std_logic_vector(width downto 0);clk : in std_logic);

end;

architecture rtl of vectsub is
begin

    process(clk)
    begin
        if (rising_edge(clk)) then
            Cx <= (Ax(width-1) & Ax) - (Bx(width-1) & Bx);
            Cy <= (Ay(width-1) & Ay) - (By(width-1) & By);
            Cz <= (Az(width-1) & Az) - (Bz(width-1) & Bz);
        end if;
    end process;

end rtl;


Appendix C: C Code

**** load.c **** Raytracing processor interface program
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <math.h>
#include "portutil.h"
#include "framebuf.h"
#include "trilist.h"

#define TM3enable
#define PI 3.1415

typedef struct {
    float x,y,z;
} vect3f;

vect3f normalize(vect3f in) {
    vect3f result;
    float len;

    len = sqrt(in.x*in.x+in.y*in.y+in.z*in.z);
    result.x = in.x / len;
    result.y = in.y / len;
    result.z = in.z / len;
    return result;
}

vect3f cross(vect3f a, vect3f b) {
    vect3f result;
    result.x = a.y*b.z-a.z*b.y;
    result.y = a.z*b.x-a.x*b.z;
    result.z = a.x*b.y-a.y*b.x;
    return result;
}

long long int packray(vect3f ray) {
    signed short int x,y,z;
    x = ray.x;
    y = ray.y;
    z = ray.z;

    return ((((unsigned long long int) x) << 32) +
            (((unsigned long long int) y) << 16) +
            (((unsigned long long int) z))) & (0x0000ffffffffffffl);
}
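packray places the three 16-bit direction components in the low 48 bits of a word as x:y:z. Because the listing adds sign-extended values, a negative lower component borrows one from the field above it before the final mask is applied. The sketch below is an alternative that masks each field before combining, so the fields are independent. The names pack_dir48 and unpack_dir48 are hypothetical, and whether the hardware expects the borrowing or the independent encoding is not stated in the listing.

```c
#include <stdint.h>

/* Pack three signed 16-bit components into bits 47..0 as x:y:z.
   Each field is reduced to 16 bits before shifting, so a negative
   y or z cannot disturb the field above it. */
uint64_t pack_dir48(int16_t x, int16_t y, int16_t z)
{
    return ((uint64_t)(uint16_t)x << 32) |
           ((uint64_t)(uint16_t)y << 16) |
            (uint64_t)(uint16_t)z;
}

/* Recover one component: field 0 is z, 1 is y, 2 is x. */
int16_t unpack_dir48(uint64_t packed, int field)
{
    return (int16_t)(uint16_t)(packed >> (16 * field));
}
```

Calling unpack_dir48(pack_dir48(x, y, z), 2) returns x unchanged for any sign combination of y and z.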

void tmSendRays(vect3f orig, vect3f dir, vect3f up, float view_x, float view_y) {
    int rgdatain,rgaddrin,origx,origy,origz;
    unsigned long long int *data;
    int x,y;
    float sx,sy;
    vect3f leftn,dirn,upn;
    vect3f raydir;
    float tanx, tany;

    dirn = normalize(dir);
    upn = normalize(up);
    leftn = normalize(cross(up,dir));
    tanx = tan( (float)view_x/360*PI);
    tany = tan( (float)view_y/360*PI);
    data = (unsigned long long int *) malloc(8*321*240);

    for(y = 0; y < 240; ++y) {
        for(x = 0; x < 320; ++x) {
            sx = 2*tanx*(x-160)/320;
            sy = 2*tany*(y-120)/240;
            raydir.x = dirn.x+leftn.x*sx+upn.x*sy;
            raydir.y = dirn.y+leftn.y*sx+upn.y*sy;
            raydir.z = dirn.z+leftn.z*sx+upn.z*sy;
            raydir = normalize(raydir);
            raydir.x *= 32767;
            raydir.y *= 32767;
            raydir.z *= 32767;
            data[y*321+x] = packray(raydir) +
                (((unsigned long long int)(y*107+floor(x/3))) << 48);
        }
        data[y*321+320] = 0xffff000000000000l;
    }

#ifdef TM3enable
    origx = openPort("origx","w");
    origy = openPort("origy","w");
    origz = openPort("origz","w");
    x = orig.x; writeIntPort(origx,"OrigX",x);
    x = orig.y; writeIntPort(origy,"OrigY",x);
    x = orig.z; writeIntPort(origz,"OrigZ",x);
    tm_close(origx);
    tm_close(origy);
    tm_close(origz);

    // write ray direction data
    rgaddrin = openPort("rgaddrin","w");
    rgdatain = openPort("rgdatain","w");
    write3BytesPort(rgaddrin,"Addr",0);
    writePort(rgdatain,"Data in",(char *) data,8*321*240);
    printf("Rays written to TM3\n");
    tm_close(rgaddrin);
    tm_close(rgdatain);
#endif
    free(data);
}
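Each ray word in tmSendRays carries a 16-bit tag in bits 63..48 equal to y*107 + x/3, and each row of 320 rays is followed by one terminator word, giving 321 words per row. The tag appears to name the three-pixel group whose result word tm3writeTGA later reads at index y*107+x. A minimal sketch of that indexing follows; the function and constant names are hypothetical.

```c
#include <stdint.h>

enum { GROUPS_PER_ROW = 107, WORDS_PER_ROW = 321 };

/* Tag stored in bits 63..48 of a ray word: index of the 3-pixel group. */
uint16_t ray_group_tag(int x, int y)
{
    return (uint16_t)(y * GROUPS_PER_ROW + x / 3);
}

/* Index of the ray word for pixel (x, y) in the upload buffer. */
int ray_word_index(int x, int y)
{
    return y * WORDS_PER_ROW + x;
}

/* Combine a 48-bit direction payload with its group tag. */
uint64_t tag_ray_word(uint64_t dir48, int x, int y)
{
    return (dir48 & 0x0000FFFFFFFFFFFFULL) |
           ((uint64_t)ray_group_tag(x, y) << 48);
}
```

For a 320 by 240 image the largest tag is 239*107 + 106 = 25679, which fits in the 16-bit field and stays below the 0xffff value used by the row terminator.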

void tm3go() {
    int pglobalreset,rgcont,rgstat,rgaddrin,input3;

#ifdef TM3enable
    input3 = openPort("input3","w");
    writeIntPort(input3,"Stop Level",3);
    tm_close(input3);

    printf("Rendering Image\n");
    pglobalreset = openPort("pglobalreset","w");
    toggleBitPort(pglobalreset,"Global Reset",1);
    tm_close(pglobalreset);

    rgaddrin = openPort("rgaddrin","w");
    write3BytesPort(rgaddrin,"Addr",0);
    tm_close(rgaddrin);

    rgcont = openPort("rgcont","w");
    writeIntPort(rgcont,"Control Port",(321*240/3)*2+1);
    writeIntPort(rgcont,"Control Port",(321*240/3)*2);
    tm_close(rgcont);

    rgstat = openPort("rgstat","r");
    while((readPort4(rgstat,"Status Port") & 0x80000000) != 0);
    printf("Total Cycles: %u\n",readPort4(rgstat,"Status Port"));
    tm_close(rgstat);
#endif
}

void writeSRAM0TM3(unsigned long long int *data, int address, int bytes) {
    int saddrin,sdatain;

    saddrin = openPort("saddrin","w");
    sdatain = openPort("sdatain","w");
    writeMemoryPort3(saddrin,sdatain,"TriData: addr","TriData: data",address,bytes,(char *)data);
    tm_close(saddrin);
    tm_close(sdatain);
}

void writeSRAM1TM3(unsigned long long int *data, int address, int bytes) {
    int addr2in,data2in;

    addr2in = openPort("addr2in","w");
    data2in = openPort("data2in","w");
    writeMemoryPort3(addr2in,data2in,"TriList: addr","TriList: data",address,bytes,(char *)data);
    tm_close(addr2in);
    tm_close(data2in);
}

void writeIndMemTM3(unsigned int *data, int address, int bytes) {
    int braddrin,brdatain;

    braddrin = openPort("braddrin","w");
    brdatain = openPort("brdatain","w");
    writeMemoryPort2(braddrin,brdatain,"indmem:addr","indmem:data",address,bytes,(char *)data);
    tm_close(braddrin);
    tm_close(brdatain);
}

void tm3writeTGA() {
    FrameBuffer *buf;
    unsigned long long int *data;
    int x,y;
    int rgaddrin,rgdataout;

#ifdef TM3enable
    rgaddrin = openPort("rgaddrin","w");
    rgdataout = openPort("rgdataout","r");
    write3BytesPort(rgaddrin,"Addr",0x30000);
    data = (unsigned long long int *) malloc(8*107*240);
    readPort(rgdataout,"Data out",(char *) data,8*107*240);
    tm_close(rgaddrin);
    tm_close(rgdataout);

    buf = createFrameBuffer(320,240);
    for (y = 0; y < 240; ++y) {
        for(x = 0; x < 107; ++x) {
            setPixel(buf,(x*3) ,y,
                     2*((data[y*107+x] >> 56) & 0x7f),
                     2*((data[y*107+x] >> 49) & 0x7f),
                     2*((data[y*107+x] >> 42) & 0x7f));
            setPixel(buf,(x*3)+1,y,
                     2*((data[y*107+x] >> 35) & 0x7f),
                     2*((data[y*107+x] >> 28) & 0x7f),
                     2*((data[y*107+x] >> 21) & 0x7f));
            if (x != 106)
                setPixel(buf,(x*3)+2,y,
                         2*((data[y*107+x] >> 14) & 0x7f),
                         2*((data[y*107+x] >> 7) & 0x7f),
                         2*(data[y*107+x] & 0x7f));
        }
    }
    writeTGA(buf,"dataout.tga");
    free(data);
    // destroyFrameBuffer(buf);
#endif
}
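Each 64-bit result word read back in tm3writeTGA holds nine 7-bit channel values, three per pixel, at shifts 56, 49, 42, 35, 28, 21, 14, 7 and 0, and each value is doubled before it is stored. The sketch below expresses the same extraction as a loop. The names result_channel and result_pixel are hypothetical, and the red, green, blue ordering is taken from the argument order of the setPixel calls.

```c
#include <stdint.h>

/* Extract channel c (0..8) from a result word. Channel 0 sits in bits
   62..56 and channel 8 in bits 6..0; the value is doubled as in the listing. */
uint8_t result_channel(uint64_t word, int c)
{
    int shift = 56 - 7 * c;
    return (uint8_t)(2 * ((word >> shift) & 0x7F));
}

/* Fill rgb[0..2] for pixel p (0..2) of the three pixels held in one word. */
void result_pixel(uint64_t word, int p, uint8_t rgb[3])
{
    int i;
    for (i = 0; i < 3; i++)
        rgb[i] = result_channel(word, 3 * p + i);
}
```

The last group in each row (x = 106) only contributes two pixels, since 106*3+2 = 320 lies outside the 320-pixel row; the listing skips that third pixel for the same reason.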

int readintn(FILE *f) {
    int result;
    fscanf(f,"%d\n",&result);
    return result;
}

int readint(FILE *f) {
    int result;
    fscanf(f,"%d",&result);
    return result;
}

void loadcamera(FILE *f, int *line) {
    char buf[50];
    vect3f orig,dir,up;
    float view_x,view_y;

    while (!feof(f)) {
        fscanf(f,"%s",buf); (*line)++;
        if (strcasecmp(buf,"origx") == 0) orig.x = readintn(f);
        else if (strcasecmp(buf,"origy") == 0) orig.y = readintn(f);
        else if (strcasecmp(buf,"origz") == 0) orig.z = readintn(f);
        else if (strcasecmp(buf,"dirx") == 0) dir.x = readintn(f);
        else if (strcasecmp(buf,"diry") == 0) dir.y = readintn(f);
        else if (strcasecmp(buf,"dirz") == 0) dir.z = readintn(f);
        else if (strcasecmp(buf,"upx") == 0) up.x = readintn(f);
        else if (strcasecmp(buf,"upy") == 0) up.y = readintn(f);
        else if (strcasecmp(buf,"upz") == 0) up.z = readintn(f);
        else if (strcasecmp(buf,"FOVX") == 0) fscanf(f,"%g\n",&view_x);
        else if (strcasecmp(buf,"FOVY") == 0) fscanf(f,"%g\n",&view_y);
        else if (strcasecmp(buf,"endcamera") == 0) {
            tmSendRays(orig,dir,up,view_x,view_y);
            fscanf(f,"\n"); return;
        }
        else {
            printf("Line %d: Expected endcamera found %s instead\n",*line,buf);
        }
    }
    printf("Line %d: Expected endcamera found EOF instead\n",*line);
    exit(1);
}

sPolygon loadPoly(FILE *f, char square) {
    sPolygon result;
    result.vert0x = readint(f);
    result.vert0y = readint(f);
    result.vert0z = readint(f);
    result.vert1x = readint(f);
    result.vert1y = readint(f);
    result.vert1z = readint(f);
    result.vert2x = readint(f);
    result.vert2y = readint(f);
    result.vert2z = readint(f);
    result.square = square;
    return result;
}

void loadleaf(FILE *f, int *line) {
    char buf[50];

    while (!feof(f)) {
        fscanf(f,"%s",buf); (*line)++;
        if (strcasecmp(buf,"poly") == 0) {
            addObjectPoly(loadPoly(f,0));
        } else if (strcasecmp(buf,"square") == 0) {
            addObjectPoly(loadPoly(f,1));
        } else if (strcasecmp(buf,"endleaf") == 0) {
            fscanf(f,"\n");
            return;
        } else {
            printf("Line %d: Expected endleaf found %s instead\n",*line,buf);
            exit(1);
        }
    }
    printf("Line %d: Expected endleaf found EOF instead\n",*line);
    exit(1);
}

void loadlevel2(FILE *f, int *line) {
    char buf[50];
    char count = 0;

    if (push() == 1) {
        printf("Line %d: Only 8 level2 bounding boxes supported\n",*line);
        exit(1);
    }

    while (!feof(f)) {
        fscanf(f,"%s",buf); (*line)++;
        if (strcasecmp(buf,"leaf") == 0) {
            loadleaf(f,line);
        } else if (strcasecmp(buf,"poly") == 0) {
            addBoundPoly(loadPoly(f,0)); count++;
        } else if (strcasecmp(buf,"square") == 0) {
            addBoundPoly(loadPoly(f,1)); count++;
        } else if (strcasecmp(buf,"endlevel2") == 0) {
            fscanf(f,"\n");
            pop();
            if (count == 6) return;
            printf("Line %d: A bounding box requires 6 bounding polys\n",*line);
            exit(1);
        } else {
            printf("Line %d: Expected endlevel2 found %s instead\n",*line,buf);
            exit(1);
        }
    }
    printf("Line %d: Expected endlevel2 found EOF instead\n",*line);
    exit(1);
}



}

while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"level2") == 0) {loadlevel2(f,line);

} else if (strcasecmp(buf,"poly") == 0) {addBoundPoly(loadPoly(f,0)); count++;

} else if (strcasecmp(buf,"square") == 0) {addBoundPoly(loadPoly(f,1)); count++;

} else if (strcasecmp(buf,"endlevel1") == 0) {fscanf(f,"\n");pop();if (count == 6) return;printf("Line %d: A bounding box requires 6 bounding polys\n",*line);exit(1);



}



}

while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"level1") == 0) {

loadlevel1(f,line);} else if (strcasecmp(buf,"poly") == 0) {

addBoundPoly(loadPoly(f,0)); count++;} else if (strcasecmp(buf,"square") == 0) {

addBoundPoly(loadPoly(f,1)); count++;} else if (strcasecmp(buf,"endlevel0") == 0) {

fscanf(f,"\n");pop();if (count == 6) return;printf("Line %d: A bounding box requires 6 bounding polys\n",*line);exit(1);



}

void loadSceneData(FILE *f, int *line) {
    char buf[50];
    unsigned long long int *test;
    int x;

    initTriData();
    initIndirect();

    while (!feof(f)) {
        fscanf(f,"%s\n",buf); (*line)++;
        if (strcasecmp(buf,"level0") == 0) loadlevel0(f,line);
        else if (strcasecmp(buf,"endscenedata") == 0) {
            finalizeIndirect();
#ifdef TM3enable
            printf("Writing Triangle Data\n");
            writeSRAM0TM3(gettridata(),0,65535*4*8);
            printf("Writing Leaf Node Data\n");
            writeSRAM1TM3(getindirect(),0,1024*512*2);
            printf("Writing Indirection Data\n");
            writeIndMemTM3(getindirectcount(),0,512*2*2);
#endif
            return;
        } else {
            printf("Line %d: Expected endscenedata found %s instead\n",*line,buf);
            exit(1);
        }
    }
    printf("Line %d: Expected endscenedata found EOF instead\n",*line);
    exit(1);
}

int loadfile(FILE *f) {
    char buf[50];
    int line;

    /* Check for ID string */
    fscanf(f,"%s\n",buf); line = 1;
    if (strcasecmp(buf,"TMray") != 0) {
        printf("Incorrect input format\n");
        return 1;
    }

    while (!feof(f)) {
        fscanf(f,"%s\n",buf); line++;
        if (strcasecmp(buf,"Camera")==0) loadcamera(f, &line);
        else if (strcasecmp(buf,"SceneData")==0) loadSceneData(f,&line);
        else if (strcasecmp(buf,"WriteTGA")==0) tm3writeTGA();
        else if (strcasecmp(buf,"go")==0) tm3go();
        else {
            printf("Line %d: Expected Valid Keyword found '%s' instead\n",line,buf);
            return 1;
        }
    }
    return 0;
}

int main (int argc, char *argv[]) {
    FILE *f;

    if (argc != 2) {
        printf("Usage: %s filename\n",argv[0]);
        return 1;
    }
    if (!(f = fopen(argv[1],"r"))) {
        printf("File %s not found\n",argv[1]);
        exit(1);
    }
#ifdef TM3enable
    tm_init("");
#endif

    if (loadfile(f)) exit(1);

    fclose(f);
    return 0;
}

**** framebuf.c ****
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include "framebuf.h"

void writeTGA(FrameBuffer *buf, char *name) {
    unsigned short temp;
    int x;
    FILE *fout;

    assert(buf != NULL);
    /* Initialize the File Header */
    fout = fopen(name,"wb");
    assert(fout != NULL);
    temp = 0; fwrite(&temp,sizeof(temp),1,fout);
    temp = 2 << 8; fwrite(&temp,sizeof(temp),1,fout);
    for (x = 0; x < 4; x++) {
        temp = 0; fwrite(&temp,sizeof(temp),1,fout);
    }
    temp = (buf->Width << 8) + (buf->Width >> 8);
    fwrite(&temp,sizeof(temp),1,fout);
    temp = (buf->Height << 8) + (buf->Height >> 8);
    fwrite(&temp,sizeof(temp),1,fout);
    temp = 0x1830; fwrite(&temp,sizeof(temp),1,fout);
    fwrite(buf->Data,buf->Width*buf->Height*3,1,fout);
    fclose(fout);
}
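writeTGA emits the 18-byte TGA header as nine 16-bit writes. The constants 2 << 8 and 0x1830, together with the byte swaps on width and height, only give the expected byte order when the host stores shorts most significant byte first, so the listing seems to assume a big-endian host. The sketch below builds the same header one byte at a time and does not depend on host byte order. The name tga_header is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Fill an 18-byte header for an uncompressed 24-bit true-colour TGA image.
   Descriptor 0x30 is the value used in the listing (bits 5 and 4 set). */
void tga_header(uint8_t hdr[18], unsigned width, unsigned height)
{
    memset(hdr, 0, 18);                  /* ID length, colour map and origin fields all zero */
    hdr[2]  = 2;                         /* image type: uncompressed true-colour */
    hdr[12] = (uint8_t)(width & 0xFF);   /* width, low byte first */
    hdr[13] = (uint8_t)(width >> 8);
    hdr[14] = (uint8_t)(height & 0xFF);  /* height, low byte first */
    hdr[15] = (uint8_t)(height >> 8);
    hdr[16] = 24;                        /* bits per pixel */
    hdr[17] = 0x30;                      /* image descriptor */
}
```

The header would be written with a single fwrite(hdr, 1, 18, fout) before the pixel data.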

FrameBuffer *createFrameBuffer(int width, int height) {
    FrameBuffer *result;

    result= (FrameBuffer *) malloc(sizeof(FrameBuffer));
    assert(result != NULL);
    result->Width = width;
    result->Height = height;
    result->Data= (char *) malloc(width*height*3);
    /* Check the allocation before clearing it */
    assert(result->Data != NULL);
    memset(result->Data,0,width*height*3);
    return result;
}

void setPixel(FrameBuffer *buf, int x, int y, char red, char green, char blue) {
    assert(buf != NULL);
    assert(buf->Data != NULL);
    assert( (x >= 0) && (x < buf->Width) );
    assert( (y >= 0) && (y < buf->Height) );
    buf->Data[(y*buf->Width+buf->Width-x-1)*3+2] = red;
    buf->Data[(y*buf->Width+buf->Width-x-1)*3+1] = green;
    buf->Data[(y*buf->Width+buf->Width-x-1)*3] = blue;
}

**** portutil.c ****
#include <stdio.h>
#include <stdlib.h>
#include "portutil.h"

unsigned char readPort1(int port, char *name) {
    unsigned char temp;

    if(tm_read(port, &temp, 1) != 1) {
        fprintf(stderr, "ERROR: Error reading %s port\n",name);
        exit(1);
    }
    return temp;
}

unsigned short int readPort2(int port, char *name) {unsigned short int temp;


}return temp;

}

unsigned int readPort4(int port, char *name) {unsigned int temp;


}return temp;

}


int openPort(char *name, char *mode) {
    int temp;
    if ((temp = tm_open(name,mode)) < 0) {
        fprintf(stderr,"ERROR: Can't open port %s in mode %s\n",name,mode);
        exit(1);
    }
    return temp;
}

void writePort(int port, char *name, char *data, int bytes) {
    if(tm_write(port, data, bytes) != bytes) {
        fprintf(stderr, "ERROR: Unable to write %u bytes to port %s [%u]\n",bytes,name,port);
        exit(1);
    }
}

void readPort(int port, char *name, char *data, int bytes) {
    if(tm_read(port, data, bytes) != bytes) {
        fprintf(stderr, "ERROR: Unable to read %u bytes from port %s [%u]\n",bytes,name,port);
        exit(1);
    }
}

void writeCharPort(int port, char *name, char val) {
    if(tm_write(port, &val, 1) != 1) {
        fprintf(stderr, "ERROR: Unable to write %u to port %s [%u]\n",val,name,port);
        exit(1);
    }
}

void writeIntPort(int port, char *name, unsigned int val) {if(tm_write(port, &val, 4) != 4) {


}}

void writeShortIntPort(int port, char *name, unsigned short int val) {if(tm_write(port, &val, 2) != 2) {


}}

void write3BytesPort(int port, char *name, unsigned int val) {int temp;

temp = val << 8;if(tm_write(port, &temp, 3) != 3) {


}}
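write3BytesPort shifts the address left by 8 bits and passes the first three bytes of the resulting int to tm_write. On a host that stores integers most significant byte first, those three bytes are the low 24 bits of the address in big-endian order; on a little-endian host they would not be. The sketch below produces the same three bytes explicitly, under the assumption that the big-endian ordering is what the port expects. The name addr24_bytes is hypothetical.

```c
#include <stdint.h>

/* Store the low 24 bits of addr in out[0..2], most significant byte first. */
void addr24_bytes(uint32_t addr, uint8_t out[3])
{
    out[0] = (uint8_t)(addr >> 16);
    out[1] = (uint8_t)(addr >> 8);
    out[2] = (uint8_t)addr;
}
```

For the result-buffer address 0x30000 used in tm3writeTGA this yields the bytes 03 00 00.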

void toggleBitPort(int port, char *name, char val) {
    writeCharPort(port,name,val);
    if (val == 0)
        writeCharPort(port,name,1);
    else
        writeCharPort(port,name,0);
}

/* Writes using standard memory interface method with a 3 byte address */
void writeMemoryPort3(int addrport, int dataport, char *addrname, char *dataname,
                      unsigned int addr, int bytes, char *data) {
    write3BytesPort(addrport,addrname,addr);
    writePort(dataport,dataname,data,bytes);
    write3BytesPort(addrport,addrname,addr);
}

/* Writes using standard memory interface method with a 2 byte address */
void writeMemoryPort2(int addrport, int dataport, char *addrname, char *dataname,
                      unsigned short int addr, int bytes, char *data) {
    writeShortIntPort(addrport,addrname,addr);
    writePort(dataport,dataname,data,bytes);
    writeShortIntPort(addrport,addrname,addr);
}


**** trilist.c ****
#include <stdio.h>
#include <stdlib.h>
#include "assert.h"
#include "trilist.h"

/* Note: I may have to rearrange the data such that endianness is correct */
unsigned short int indirect[512][1024];
unsigned long long int *tridata = NULL;

/* Note: [0] is the count but it must be shifted left by 2 bits */
/* The count must be larger than 8 or so to prevent result collision */
unsigned short int indirectcount[512][2];

int activelevel,trinum0,trinum1,trinum2,level0,level1,level2,triindex;

unsigned long long int *gettridata() {
    return tridata;
}

unsigned long long int *getindirect() {
    return (unsigned long long int *) indirect;
}

unsigned int *getindirectcount() {
    return (unsigned int *) indirectcount;
}

void initIndirect() {
    int x;
    for (x=0; x < 512; x++) {
        indirectcount[x][0] = 0; /* Count *4 */
        indirectcount[x][1] = x*1024 / 4;
    }
    level0 = 0; level1 = 0; level2 = 0;
    activelevel = 0;
    trinum0 = 0;
    trinum1 = 0;
    trinum2 = 0;
    triindex = 65534;
}

void finalizeIndirect() {
    int x,y;

    for (x=0; x < 512; x++) {
        /* Make sure every node meets the minimum requirement of 8 triangles */
        /* Pad if Necessary (Empty nodes are exempt) */
        if ((indirectcount[x][0] > 0) && (indirectcount[x][0] < 8)) {
            for (y = indirectcount[x][0]; y < 8; y++) indirect[x][y] = 0xffff;
            indirectcount[x][0] = 7;
        }
        if (indirectcount[x][0] != 0) {
            indirectcount[x][0] += 1;
        }
        indirectcount[x][0] *= 4; /* Scale to fit proper bit position */
    }
}
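The count stored for each leaf node by finalizeIndirect follows three rules: empty nodes stay at zero, nodes with fewer than 8 triangle IDs are padded with 0xffff up to slot 7, and every non-empty count is incremented by one and then multiplied by 4. The sketch below restates that as a pure function of the original count; the name finalized_count is hypothetical and the function does not touch the tables themselves.

```c
/* Value stored in indirectcount[x][0] after finalizing a node that
   holds n triangle IDs, following the rules in finalizeIndirect. */
unsigned finalized_count(unsigned n)
{
    if (n == 0)
        return 0;        /* empty nodes are exempt from padding */
    if (n < 8)
        n = 7;           /* slots n..7 are padded with 0xffff */
    return (n + 1) * 4;  /* one extra entry, then shifted left by 2 bits */
}
```

Any node with 1 to 7 triangles therefore ends up with the stored value 32, and a node with 8 triangles with 36.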

void addTritoNode(unsigned short int triID, unsigned short int nodeID) {
    indirect[nodeID-72][ indirectcount[nodeID-72][0]++ ] = triID;
}

/* Packs a triangle into 24 bytes */
void packTriangle( unsigned long long int *data,
                   int vert0x, int vert0y, int vert0z,
                   int edge1x, int edge1y, int edge1z,
                   int edge2x, int edge2y, int edge2z, char square) {

    data[0] = (((long long int)edge1x) & 0xffff)+
              (((long long int)edge1y << 16) & 0xFFFF0000) +
              (((long long int)edge1z << 32) & 0xFFFF00000000) +
              (((long long int)edge2x << 48) );

    /* printf("Pack 0: e1x %x e1y %x e1z %x e2x %d Packed:%016llx\n",
              edge1x,edge1y,edge1z,edge2x,data[0]);*/

    data[1] = (((long long int) edge2y) & 0xFFFF)+
              (((long long int) edge2z << 16) & 0xFFFF0000) +
              (((long long int) vert0x << 32) & 0x0FFFFFFF00000000l);

    /* printf("Pack 1: e2y %x e2z %x v0x %x Packed:%016llx\n",
              edge2y,edge2z,vert0x,data[1]);*/

    data[2] = (((long long int) vert0y) & 0x0FFFFFFF)+
              ((((long long int) vert0z) << 32) & 0x0fffffff00000000l)+
              ((long long int)square << 63);

    data[3] = 0;
}
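The masks and shifts in packTriangle fix the position of each field in the three 64-bit words: the six edge components take 16 bits each, the three vert0 components take 28 bits each, and the square flag sits in bit 63 of the third word. These positions agree with the tridata slices commented in top.vhd (edge1x at bits 1..0, edge2y at 65..64, vert0x at 99..96, vert0z at 161..160). The sketch below decodes the fields again, which can be useful for checking a packed buffer on the host. The names sext, tri_fields and unpack_triangle are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of v. */
static int32_t sext(uint64_t v, int bits)
{
    uint64_t m = 1ULL << (bits - 1);
    v &= (1ULL << bits) - 1;
    return (int32_t)((int64_t)(v ^ m) - (int64_t)m);
}

typedef struct {
    int32_t vert0[3];   /* 28-bit signed fields */
    int32_t edge1[3];   /* 16-bit signed fields */
    int32_t edge2[3];   /* 16-bit signed fields */
    int square;         /* bit 63 of the third word */
} tri_fields;

/* Decode the three data words written by packTriangle. */
tri_fields unpack_triangle(const uint64_t d[3])
{
    tri_fields t;
    t.edge1[0] = sext(d[0], 16);
    t.edge1[1] = sext(d[0] >> 16, 16);
    t.edge1[2] = sext(d[0] >> 32, 16);
    t.edge2[0] = sext(d[0] >> 48, 16);
    t.edge2[1] = sext(d[1], 16);
    t.edge2[2] = sext(d[1] >> 16, 16);
    t.vert0[0] = sext(d[1] >> 32, 28);
    t.vert0[1] = sext(d[2], 28);
    t.vert0[2] = sext(d[2] >> 32, 28);
    t.square   = (int)(d[2] >> 63);
    return t;
}
```

The fourth word of each triangle record is always zero in the listing, so the decoder ignores it.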

void clearTriData() {
    int x;

    for (x = 0; x < 65536; x++) {
        tridata[x*4] = 0;
        tridata[x*4+1] = 0;
        tridata[x*4+2] = 0;
        tridata[x*4+3] = 0;
    }
}

void packVTriangle(unsigned long long int *data,
                   int vert0x, int vert0y, int vert0z,
                   int vert1x, int vert1y, int vert1z,
                   int vert2x, int vert2y, int vert2z, char square) {

    packTriangle(data, vert1x, vert1y, vert1z,
                 vert2x-vert1x, vert2y-vert1y, vert2z-vert1z,
                 vert0x-vert1x, vert0y-vert1y, vert0z-vert1z, square);

    /* printf("%016llx\n",data[0]);
       printf("%016llx\n",data[1]);
       printf("%016llx\n",data[2]);*/
}

void initTriData() {
    tridata = (unsigned long long int *) malloc(sizeof(long long int) * 65536 * 4);
    assert(tridata != NULL);
    clearTriData(); /* takes no arguments; it clears the global tridata */
}

void addBoundPoly(sPolygon p) {
    int polyID;

    switch (activelevel) {
    case(1):
        polyID = (level0-1)*6+trinum0;
        trinum0++;
        break;
    case(2):
        polyID = (level0-1)*8*6+(level1-1)*6+trinum1+48;
        trinum1++;
        break;
    case(3):
        polyID = (level0-1)*8*8*6+(level1-1)*8*6+(level2-1)*6+trinum2+432;
        trinum2++;
        break;
    }

    /* printf("Adding Bound Poly Level %d ID %d\n",activelevel,polyID);
       printf(" (%d %d %d) (%d %d %d) (%d %d %d)\n",
              p.vert0x,p.vert0y,p.vert0z,
              p.vert1x,p.vert1y,p.vert1z,
              p.vert2x,p.vert2y,p.vert2z);
    */
    packVTriangle(&(tridata[polyID*4]),
                  p.vert0x, p.vert0y, p.vert0z,
                  p.vert1x, p.vert1y, p.vert1z,
                  p.vert2x, p.vert2y, p.vert2z, p.square);
}

void addObjectPoly(sPolygon p) {
    int nodeID;

    nodeID = (level0-1)*8*8+(level1-1)*8+(level2-1)+72;

    /* printf("Adding Object Poly %d to node %d\n",triindex, nodeID);
       printf(" (%d %d %d) (%d %d %d) (%d %d %d)\n",
              p.vert0x,p.vert0y,p.vert0z,
              p.vert1x,p.vert1y,p.vert1z,
              p.vert2x,p.vert2y,p.vert2z);
    */
    addTritoNode(triindex, nodeID);
    packVTriangle(&(tridata[triindex*4]),
                  p.vert0x, p.vert0y, p.vert0z,
                  p.vert1x, p.vert1y, p.vert1z,
                  p.vert2x, p.vert2y, p.vert2z, p.square);
    triindex--;
}

int push() {
    switch(activelevel) {
    case(0):
        if (level0 == 8) return 1;
        level0++;
        level1 = 0;
        trinum0 = 0;
        break;
    case(1):
        if (level1 == 8) return 1;
        level1++;
        level2 = 0;
        trinum1 = 0;
        break;
    case(2):
        if (level2 == 8) return 1;
        level2++;
        trinum2 = 0;
        break;
    }
    activelevel++;
    return 0;
}

void pop() {
    activelevel--;
}
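
The bit layout written by packTriangle above can be checked by decoding the three packed words again. The following is a minimal sketch of such a decoder, not part of the thesis code. It assumes the 16-bit edge fields and 28-bit vertex fields are meant to be read as two's-complement signed values, which the listing suggests (edges are vertex differences and may be negative) but does not state. The names sign_extend, UnpackedTriangle and unpack_triangle are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: interpret the low `bits` bits of v as a
   two's-complement signed value (assumed field encoding). */
static int32_t sign_extend(uint64_t v, unsigned bits) {
    assert(bits > 0 && bits < 32);
    uint64_t mask = (1ULL << bits) - 1;
    uint64_t sign = 1ULL << (bits - 1);
    v &= mask;
    return (int32_t)((int64_t)(v ^ sign) - (int64_t)sign);
}

/* Hypothetical struct mirroring the arguments of packTriangle. */
typedef struct {
    int32_t vert0[3];   /* base vertex, 28 bits per component */
    int32_t edge1[3];   /* first edge vector, 16 bits per component */
    int32_t edge2[3];   /* second edge vector, 16 bits per component */
    int square;         /* flag stored in bit 63 of word 2 */
} UnpackedTriangle;

/* Decode the three payload words using the field positions that
   packTriangle writes. */
static UnpackedTriangle unpack_triangle(const uint64_t data[3]) {
    UnpackedTriangle t;
    t.edge1[0] = sign_extend(data[0], 16);        /* word 0 [15:0]  */
    t.edge1[1] = sign_extend(data[0] >> 16, 16);  /* word 0 [31:16] */
    t.edge1[2] = sign_extend(data[0] >> 32, 16);  /* word 0 [47:32] */
    t.edge2[0] = sign_extend(data[0] >> 48, 16);  /* word 0 [63:48] */
    t.edge2[1] = sign_extend(data[1], 16);        /* word 1 [15:0]  */
    t.edge2[2] = sign_extend(data[1] >> 16, 16);  /* word 1 [31:16] */
    t.vert0[0] = sign_extend(data[1] >> 32, 28);  /* word 1 [59:32] */
    t.vert0[1] = sign_extend(data[2], 28);        /* word 2 [27:0]  */
    t.vert0[2] = sign_extend(data[2] >> 32, 28);  /* word 2 [59:32] */
    t.square   = (int)(data[2] >> 63);            /* word 2 [63]    */
    return t;
}
```

Passing the words produced by packTriangle to unpack_triangle should return the original arguments, provided each edge component fits in 16 signed bits and each vertex component in 28 signed bits. Values outside those ranges are truncated by the masks in packTriangle and cannot be recovered.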


Appendix D: Brute Force Test Images
