The Design and Implementation of
a Hardware Accelerated
Raytracer Using the TM3a
FPGA Prototyping System
By
J. Fender
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR OF APPLIED SCIENCE
DIVISION OF ENGINEERING SCIENCE
FACULTY OF APPLIED SCIENCE AND ENGINEERING
UNIVERSITY OF TORONTO
Supervisor: J. Rose
March 2002
The computational complexity of raytracing is such that as scene complexity grows it will eventually outperform raster graphics methods. Currently raytracing is still too slow, as the algorithm has a large constant associated with a given computation step. This thesis presents an architecture, and an implementation, of a raytracing processor that is designed to minimize this constant and to provide insight into the possibility of future real time implementations.
This raytracing processor consists of a high-speed barycentric ray triangle intersection core that can easily outperform a software implementation, and a hierarchical controller unit. The hierarchical controller is able to traverse a tree of bounding boxes in such a way as to maximize memory bandwidth utilization, through pipelined reads, to make significant speed gains. The net result is a circuit that is able to comfortably outperform software implementations while running at only a twentieth of the clock speed.
The resulting system is able to beat current software implementations by an order of magnitude or more but is still too slow to be considered for any real time implementations.
Acknowledgements
I would like to thank Marcus van Ierssel and David Galloway for building the Transmogrifier 3a development system and for helping to solve the many development system issues that I ran across. I would also like to acknowledge David Auclair, not just for his input into the project but for his interest. He forced me to try to organize the jumble of information in my head into clear-cut explanations; for that, and for acting as a sounding board, I am grateful. Next I would like to thank Professor Jonathan Rose for his advice and for the freedom he allowed me during development. And finally I would like to thank my girlfriend of four years, Lisa Scarfo, for putting up with the occasional late night in the lab and for providing the much needed escapes from work.
Table of Contents
Acknowledgements
Table of Contents
List of Symbols
List of Figures
List of Tables
1 Introduction
    1.1 Project Overview
    1.2 Thesis Overview
2 Background
    2.1 3D Rendering
    2.2 The Graphics Pipeline
    2.3 Scan Line Renderer
        2.3.1 Overview
        2.3.2 Computational Complexity
    2.4 Raytracing
        2.4.1 Overview
        2.4.2 Ray Object Intersection Test
        2.4.3 Barycentric Ray Triangle Intersection Test
        2.4.4 Hierarchical Acceleration Methods
        2.4.5 Computational Complexity
    2.5 Transmogrifier 3a
3 Ray Triangle Intersection Unit
    3.1 Overview
    3.2 Barycentric Ray Triangle Intersection Unit
        3.2.1 Overview
        3.2.2 Numeric Format
        3.2.3 Design Decisions
        3.2.4 Barycentric Extension to Parallelograms
    3.3 Memory Interface
        3.3.1 Overview
        3.3.2 Design Decisions
    3.4 Nearest Comparison Units
        3.4.1 Overview
4 Hierarchy Controller
    4.1 Overview
    4.2 Bounding Hierarchy Controller
        4.2.1 Overview
        4.2.2 Barycentric Ray Triangle Intersection Unit Utilization
        4.2.3 State Machine
        4.2.4 Required Memories
        4.2.5 Parallel Ray Processing Issue
        4.2.6 Scalability Considerations
    4.3 Sorted List
        4.3.1 Overview
        4.3.2 Size and Speed Issues
    4.4 List Handler
        4.4.1 Overview
        4.4.2 List Handler Operation
5 Results
    5.1 Overview
    5.2 Barycentric Ray Triangle Intersection Unit Performance
    5.3 Bounding Hierarchy Performance
        5.3.1 Overview
        5.3.2 Single Large Faceted Sphere Tests
        5.3.3 A Grid of Faceted Spheres
        5.3.4 Cache Incoherent Test Set
6 Conclusions and Future Work
    6.1 Conclusions
        6.1.1 Barycentric Ray Triangle Intersection Unit
        6.1.2 Hierarchy Controller
        6.1.3 General Discussion
    6.2 Future Work
    6.3 Closing Words
7 References
Appendix A: Bound Controller State Machine
Appendix B: VHDL Code
Appendix C: C Code
Appendix D: Brute Force Test Images
List of Symbols
Cavg          Average coverage of a projected object (percentage of image size)
(Sw, Sh)      Width and height of an image
N             Number of objects in a scene
O             3D Point: Origin of a Ray
D             3D Normalized Vector: Direction of a Ray
V0, V1, V2    3D Point: Triangle Vertices
(u, v)        Barycentric Coordinate
E1, E2        3D Vector: Triangle Edges
List of Figures
Figure 1: Perspective Transform
Figure 2: Light propagation through a 3D world
Figure 3: A Shadow Ray
Figure 4: Bounding Object Hierarchy
Figure 5: Bounding Object Sorting Problem
Figure 6: TM3a System Diagram
Figure 7: Ray Triangle Unit Overview
Figure 8: Barycentric Ray Triangle Intersection Unit Pipeline
Figure 9: Bounding Hierarchy Structure
Figure 10: Bounding Hierarchy Controller System Overview
Figure 11: Restriction on Parallel Rays
Figure 12: Single Sphere Results (Expanded View)
Figure 13: Single Sphere Test Set Results
Figure 14: Grid of Sphere Test Set Results
Figure 15: Grid of Sphere Test Set Results (Expanded View)
Figure 16: Alternative Raytracing Architecture
List of Tables
Table 1: Input Data Widths
Table 2: Pseudo Floating Point Scaling Factors
Table 3: Memory Widths
Table 4: Brute Force Performance Numbers
Table 5: One Sphere Test Set Results
Table 6: Grid of Sphere Test Set Results
Table 7: Cache Incoherent Test Set Results
1 Introduction
1.1 Project Overview
It has long been known that of the two major ways to render a three-dimensional
computer image, raytracing and raster methods, it is raytracing that has the lower
computational complexity. Unfortunately, even though raytracing has a
complexity advantage, it suffers from a very large constant cost per computational step. This large
constant causes raster graphics to always outperform raytracing in real world
implementations. If scene sizes continue to grow, the lower computational
complexity will eventually dominate the larger constant and raytracing will win out.
The purpose of this project is not to solve the step constant problem or to compete
with raster graphics but, instead, to assess what is possible with current technology and to
determine what hardware factors limit raytracing. The final goal of this thesis is to design
a hardware raytracing processor that can outperform a corresponding software
implementation. This will be accomplished by exploiting the parallel nature of the raytracing
algorithm and by optimizing the processor’s memory interface to account for the
deterministic nature of raytracing’s memory accesses.
The design will be such that it can easily be implemented in existing
reconfigurable hardware and tested at a low clock rate. It is hoped that under these
constraints the raytracing processor can still outperform a software implementation while
the complexity remains manageable. To make the comparison fair the hardware raytracer
will employ a spatial hierarchy that accelerates rendering, as any software implementation
would. This will allow a direct comparison to current state of the art software
implementations that use such acceleration techniques.
1.2 Thesis Overview
This thesis is divided into seven chapters. The first provides a brief introduction
to the raytracing project's goals and methods. The second chapter introduces some basic
background knowledge that is required to understand this thesis. The next two chapters
describe the functional architecture chosen for the raytracing processor. Chapter three
describes the barycentric ray triangle intersection unit and chapter four describes the
hierarchical controllers. Chapter five details the various test cases designed to evaluate
the performance of the raytracing processor and chapter six discusses these results.
Finally, chapter seven lists the references used for this thesis.
Figure 1: Perspective Transform
2 Background
2.1 3D Rendering
3D rendering is the process of creating a 2D digital image from a mathematical
model of a 3D world. Fundamentally
this involves a potentially nonlinear
projection of the 3D data onto a 2D
image plane. Figure 1 shows how a
common projection method, the
perspective transform [1], can be used to
generate a 2D image.
This transform projects the 3D
world toward an eye point and onto an
image plane. If one ignores focal depth and binocular vision then this transform results in
an image that is indistinguishable, by the viewer, from the original scene. This results
from the fact that, under these two assumptions, human vision depends only on the
direction from which incident light arrives and not the distance of its source. This means
that there is no way to differentiate between the original 3D scene and the 2D projection,
as the perspective transform maintains the angular information of the scene.
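The property described above can be illustrated with a minimal sketch of a perspective projection. The function name `project`, the `focal_length` parameter, and the choice of an eye at the origin looking along +z are assumptions made for illustration and are not taken from the thesis.

```python
from typing import Tuple

Point3 = Tuple[float, float, float]


def project(point: Point3, focal_length: float = 1.0) -> Tuple[float, float]:
    """Project a 3D point onto the image plane z = focal_length.

    Assumes (for illustration only) an eye at the origin looking along +z.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the eye (z > 0)")
    scale = focal_length / z  # similar triangles: image size shrinks with depth
    return (x * scale, y * scale)


# Two points that lie along the same direction from the eye, at different
# distances, land on the same image location: only direction is preserved.
near = project((1.0, 2.0, 4.0))
far = project((2.0, 4.0, 8.0))
```

Both calls return the same image coordinates, which is the sense in which the projection keeps angular information and discards distance.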
2.2 The Graphics Pipeline
In general it is a very difficult task to render an image using projection methods.
Current solutions address this problem by restricting the scene complexity and by
breaking the rendering problem into several separate stages known as the graphics
pipeline. For the purpose of this document the graphics pipeline will be defined as
consisting of the following stages: object transformation, potential visibility
determination, pixel visibility, and pixel colouring.
The first stage, object transformation, allows a 3D world to be specified for later
rendering. The next two stages, potential visibility determination and pixel visibility,
determine which objects are visible and then which objects are projected onto a given
pixel. Finally the last stage determines what colour a pixel will be by applying the
various surface properties of the objects that have been projected onto the pixel.
2.3 Scan Line Renderer
2.3.1 Overview
Scan line rendering is the current standard in 3D computer graphics. This method
solves the rendering problem by using an object centric model. That is, each object is
processed to find what pixels it covers, as opposed to determining what objects cover any
given pixel.
The first stage in a scan line renderer is to determine which objects are potentially
visible. This determination is an active area of research that has resulted in many
different algorithms. Examples include kD-trees [2], object occluders [3], portals [4],
and view-frustum culling [5], but there are many more. Once a set of potentially visible
objects has been determined, it is passed on to the pixel visibility stage.
This next stage is responsible for determining which objects are visible at which
pixels. The scan line approach projects each object onto the viewing plane and
determines which pixels the object covers. To ensure that farther objects do not occlude
nearer ones, it is necessary to use depth sorting [6], or z-buffering, so that
distant objects are not drawn over top of nearer ones. Once completed, this leaves
only the pixel colouring stage.
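As a rough illustration of the z-buffering idea mentioned above, the following sketch keeps, for every pixel, only the nearest fragment seen so far. The fragment format `(x, y, depth, object_id)` and the function name are hypothetical; a real renderer would generate fragments by scan converting each projected object.

```python
import math


def render_with_zbuffer(width, height, fragments):
    """Resolve pixel visibility with a z-buffer.

    fragments: iterable of (x, y, depth, object_id) tuples, in any order.
    Returns a height x width grid holding the id of the nearest object
    at each pixel (None where nothing was drawn).
    """
    depth = [[math.inf] * width for _ in range(height)]
    owner = [[None] * width for _ in range(height)]
    for x, y, z, obj in fragments:
        if z < depth[y][x]:  # nearer than anything drawn so far at this pixel
            depth[y][x] = z
            owner[y][x] = obj
    return owner


# Objects may arrive in any order; the nearest one still wins each pixel.
frags = [
    (0, 0, 5.0, "far"),
    (0, 0, 2.0, "near"),
    (1, 0, 3.0, "only"),
    (0, 0, 9.0, "farthest"),
]
image = render_with_zbuffer(2, 1, frags)
```

The per-pixel comparison is what removes the need to sort whole objects by depth before drawing.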
The pixel colouring stage is beyond the scope of this thesis but, put simply, a pixel
colour is determined through modelling of object surface properties and lighting effects.
For a more detailed overview of a colouring architecture, see either the OpenGL [7] or
RenderMan [8] specifications.
Figure 2: Light propagation through a 3D world
2.3.2 Computational Complexity
The computational complexity of a scan line renderer can be derived as follows:
    If the average object covers Cavg percent of an image
    and the image has a resolution of Sw by Sh,
    then an average object covers Cavg x Sw x Sh pixels.
    If there are N objects in a scene,
    then the rendering complexity is O(N x Cavg x Sw x Sh).
This leads to a complexity that is linear, O(N), in the number of objects in a scene as well as in the
image resolution.
2.4 Raytracing
2.4.1 Overview
Raytracing is a rendering
method based on modelling the way
light rays propagate through a 3D
scene. Figure 2 shows how a light ray
can be reflected around a scene until it
eventually reaches the viewer’s eye.
Implementing this physical model
would be far too computationally
intensive as there are an infinite
number of possible paths between the light source and the viewer. To deal with this
problem raytracing attempts to find an approximate solution.
If one were to ignore the physical model of light propagation and instead think of
the process in reverse then the problem becomes simpler. By tracing the path a ray takes
from the eye through the image plane, it is possible to determine what object a ray
intersects. Once an intersection point is found, it is necessary to determine its colour.
This is performed using a two-step process.
First it is necessary to determine how much direct light falls on the object. This is
calculated by spawning a new ray from the point of intersection toward the light source.
Figure 3: A Shadow Ray
If this ray strikes an object, prior to the light source, then the object is shadowed,
otherwise the object is lit. Figure 3 shows how the cube is determined to be shadowed by
the plane through the use of a ray
spawned toward the light source from
the point of intersection. The second
step involves modelling the surface
properties of the object.
If the object’s surface is
reflective then it is necessary to
determine what colour this reflection
should be. By using the reverse light
tracing method it is easy to determine what direction a reflected light ray would have
arrived from. Knowing this, a new ray can be spawned in that direction and the process
repeated recursively to determine this new ray’s colour.
To summarize, the raytracing algorithm can be broken into five steps:
1. Generate a ray for each pixel in the image plane
2. Find the nearest object intersected by this ray
3. Generate a shadow ray for each light source to determine lighting
4. Apply the surface model and generate reflected rays as required
5. Add the surface's and the reflected rays' colours to determine the final pixel colour
2.4.2 Ray Object Intersection Test
Unlike raster graphics, where only polygons can be easily rendered, raytracing can
handle any form of an object. The only restriction is that there exists a solution to the
object-ray intersect equation. For simple objects such as polygons, conics, and low-order
polynomial patches, closed form solutions can easily be found. However for more
complicated objects there are no closed form solutions and slow root solving methods are
required. These iterative algorithms, and complicated objects, are not well suited for
hardware implementation so this thesis will deal with ray triangle intersections only. This
is acceptable as any object can be approximated by a mesh of triangles.
Many different algorithms have been proposed to perform the ray triangle intersection
test. They fall into three major groups: those that
use the plane equation [9], those that use 6D Plücker space [10], and those that use barycentric
coordinates [11]. Descriptions of the first two methods can be found in their referenced
articles, whereas the last method is described in the next section.
2.4.3 Barycentric Ray Triangle Intersection Test
Tomas Möller and Ben Trumbore introduced a simple algorithm for calculating
the intersection point of a ray and a triangle in the 1997 paper titled Fast, Minimum
Storage Ray/Triangle Intersection. Their results indicated that it was the fastest algorithm
available that did not require precomputed values. It works as follows.
Given a ray R(t) with origin O and direction D,

$$R(t) = O + tD \tag{1}$$

and a triangle with vertices V0, V1, and V2, a point T(u, v) on the triangle is given by

$$T(u, v) = (1 - u - v)V_0 + uV_1 + vV_2 \tag{2}$$

where (u, v) are the barycentric coordinates, which must meet the following constraints:

$$u \geq 0, \quad v \geq 0, \quad u + v \leq 1 \tag{3}$$
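Equating equations (1) and (2) and writing the result in matrix form gives

$$\begin{bmatrix} -D & E_1 & E_2 \end{bmatrix} \begin{bmatrix} t \\ u \\ v \end{bmatrix} = T, \quad \text{where} \quad E_1 = V_1 - V_0, \quad E_2 = V_2 - V_0, \quad T = O - V_0 \tag{4}$$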
By using Cramer’s rule and factoring out common terms we find the solution
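
$$\begin{bmatrix} t \\ u \\ v \end{bmatrix} = \frac{1}{P \cdot E_1} \begin{bmatrix} Q \cdot E_2 \\ P \cdot T \\ Q \cdot D \end{bmatrix}, \quad \text{where} \quad P = D \times E_2, \quad Q = T \times E_1 \tag{5}$$
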
If the resulting value of (u,v) meets the constraints given in equation (3) then the ray
intersects the triangle at a distance t along the ray.
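A minimal floating point sketch of the test follows, written directly from equations (3) through (5). The function and helper names are illustrative, and the `eps` guard against a near-zero determinant is an assumption added for numerical safety. The sketch applies only the constraints of equation (3); a negative t (a hit behind the ray origin) is returned as is and left for the caller to reject.

```python
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) if the ray's line crosses the triangle, else None."""
    e1 = sub(v1, v0)                 # E1 = V1 - V0
    e2 = sub(v2, v0)                 # E2 = V2 - V0
    p = cross(direction, e2)         # P = D x E2
    det = dot(p, e1)                 # P . E1, the common denominator
    if abs(det) < eps:
        return None                  # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    t_vec = sub(origin, v0)          # T = O - V0
    u = dot(p, t_vec) * inv_det      # u = (P . T) / (P . E1)
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)             # Q = T x E1
    v = dot(q, direction) * inv_det  # v = (Q . D) / (P . E1)
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(q, e2) * inv_det         # t = (Q . E2) / (P . E1)
    return (t, u, v)


tri = ((0.0, 0.0, 5.0), (4.0, 0.0, 5.0), (0.0, 4.0, 5.0))
hit = intersect_triangle((1.0, 1.0, 0.0), (0.0, 0.0, 1.0), *tri)
miss = intersect_triangle((3.0, 3.0, 0.0), (0.0, 0.0, 1.0), *tri)
```

For the first ray the result is t = 5 with (u, v) = (0.25, 0.25); the second ray reaches the triangle's plane at u + v = 1.5, outside the constraint in equation (3), so it is reported as a miss.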
Figure 4: Bounding Object Hierarchy
Figure 5: Bounding Object Sorting Problem
2.4.4 Hierarchical Acceleration Methods
There is more to finding the nearest object intersected by a ray than just the ray-object
intersection test. This test only deals with individual objects, so a method that
extends it to a scene full of objects is required. The simplest approach to this problem
is to intersect a ray with every object in the world one by one. By keeping track of the
distance to each hit object, the nearest intersected object can be found. This method
always gives the correct result but is very inefficient.
A better solution would be an algorithm that could cull a large
number of objects with only a few
intersection tests. The simplest of
these approaches is to use a bounding
object methodology. This system
involves placing an invisible object that
is easy to intersect with, around a more
complicated one. The end result is that
if a ray misses the bounding object then
it will also miss any objects contained within. By taking this process further, as shown in
Figure 4, and placing the previously created bounding objects into larger bounding
objects, further performance increases can be achieved. Ideally a tree of bounding
objects can be created such that if a ray misses any node then its children need not be
tested. This allows for a formidable performance increase, but the algorithm has one major
drawback: to ensure that objects are drawn correctly it is necessary to
perform an expensive sort operation.
Figure 5 shows why the
bounding object algorithm requires a
sort. The figure shows a sample scene
consisting of three bounding objects,
the boxes, that have been struck by a
ray. It is clear that checking only the
nearest bounding object will produce the incorrect result that the ray does not hit any
object, since the ray hits bounding box one but misses the object it contains. This means that
it is necessary to sort the intersected bounding objects by distance and test their
children in depth order, from nearest to furthest. This allows the intersection test to
terminate as soon as the first object-ray intersection is found. The alternative would be to
intersect the ray with every bounding box and keep only the nearest object
intersection. This would work but would limit the performance increase provided by the
bounding hierarchy; paying the cost of the sort is still the faster option.
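The sort-then-terminate idea can be sketched for a single level of bounding boxes, as in Figure 5. Several things here are assumptions made for illustration: the boxes are axis-aligned and tested with the common slab method rather than with the intersection unit described later in this thesis, the contained "objects" are themselves small boxes so that the example stays self-contained, and all names are hypothetical. The loop stops once the next box begins beyond the best hit found so far, which for non-overlapping boxes is the same as stopping at the first hit.

```python
import math


def ray_box(origin, direction, lo, hi):
    """Slab test: entry distance into an axis-aligned box, or None on a miss."""
    t_near, t_far = 0.0, math.inf
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return None  # parallel to this slab and outside it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def nearest_object(origin, direction, bounds):
    """bounds: list of (lo, hi, children); each child is (name, lo, hi).

    Returns ((distance, name) or None, number of child tests performed).
    """
    entered = []
    for lo, hi, children in bounds:
        t = ray_box(origin, direction, lo, hi)
        if t is not None:
            entered.append((t, children))
    entered.sort(key=lambda item: item[0])  # nearest bounding box first
    best, tests = None, 0
    for t_box, children in entered:
        if best is not None and best[0] <= t_box:
            break  # every remaining box starts behind the current hit
        for name, lo, hi in children:
            tests += 1
            t = ray_box(origin, direction, lo, hi)
            if t is not None and (best is None or t < best[0]):
                best = (t, name)
    return best, tests


# Three boxes along +x, listed out of order. The ray enters box one but
# misses its object "A", hits "B" in box two, and never tests "C".
scene = [
    ((7, -1, -1), (9, 1, 1), [("C", (7.5, -0.5, -0.5), (8.5, 0.5, 0.5))]),
    ((1, -1, -1), (3, 1, 1), [("A", (1.5, 0.5, -0.5), (2.5, 0.9, 0.5))]),
    ((4, -1, -1), (6, 1, 1), [("B", (4.5, -0.5, -0.5), (5.5, 0.5, 0.5))]),
]
result, child_tests = nearest_object((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), scene)
```

Only two of the three contained objects are tested, which is the saving the sort buys over testing the contents of every struck box.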
A number of algorithms have been introduced that simplify the front to back
ordering by accounting for it within the tree. These methods include binary space
partition (BSP) trees [12], which allow a front to back traversal of a data set without sorting; octrees [13],
a simplified BSP tree that allows faster traversal; and three-dimensional grids, which place
objects within volume pixels such that a ray can traverse the grid front to back
using line drawing methods. A good overview of three-dimensional grid methods can be
found in [14]. Current research has found that a hybrid of these algorithms provides the
fastest software performance, but this area is still open to research.
2.4.5 Computational Complexity
The computational complexity of a raytracer that does not use a bounding hierarchy
and does not perform recursive raytracing can be derived as follows:
    If the image has a resolution of Sw by Sh
    and there are N objects in a scene,
    then the rendering complexity is O(N x Sw x Sh).
Like raster graphics, a raytracer is linear in both the number of objects and the image
resolution. However, comparing this to the raster graphics complexity, O(N x Cavg x
Sw x Sh), we see that it is missing the Cavg term. Expressed as a fraction, this term is always less than one, and usually
much less, since it represents the average pixel coverage of an object. This means that
raster graphics will have an advantage. On top of this problem, the constant involved in
raytracing is much larger than that of raster graphics, so at first it would seem that raytracing is
far too slow to be useful. It is only once the bounding hierarchy is considered that the
situation improves greatly.
Figure 6: TM3a System Diagram
Due to the variety of bounding hierarchies and the complexity's dependence on
the scene being rendered, a bounding hierarchy provides no guaranteed improvement in the
worst case. This is because it is always possible to construct a degenerate
hierarchy that would require O(N) intersection tests, where N is the number of objects, to resolve a ray.
However, on average, the complexity can be approximately written as O(log N x Sw x Sh).
This result places raytracing in a much more favourable light than the simplistic
algorithm does. Unfortunately, for the sizes of scenes currently rendered, N is still too small to
overcome the large difference in constants between raster graphics and raytracing
methods.
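The trade-off between the two complexity expressions can be made concrete with a toy cost model. The constants below (a per-step cost of 1 for raster graphics, 1000 for raytracing, and an average coverage of 1%) are invented purely for illustration and are not measurements from this thesis; only the shapes of the two formulas come from the text.

```python
import math


def raster_cost(n, coverage, width, height, c_raster=1.0):
    """Cost model O(N x Cavg x Sw x Sh) with an explicit constant."""
    return c_raster * n * coverage * width * height


def raytrace_cost(n, width, height, c_ray=1000.0):
    """Average-case cost model O(log N x Sw x Sh) with an explicit constant."""
    return c_ray * math.log2(n) * width * height


def crossover(coverage, c_raster, c_ray, limit=2 ** 40):
    """Smallest power-of-two scene size at which raytracing becomes cheaper."""
    n = 2
    while n <= limit:
        if raytrace_cost(n, 1, 1, c_ray) < raster_cost(n, coverage, 1, 1, c_raster):
            return n
        n *= 2
    return None  # no crossover found below the search limit


# Hypothetical constants: raytracing is 1000x more expensive per step.
n_star = crossover(coverage=0.01, c_raster=1.0, c_ray=1000.0)
```

Under these made-up constants the crossover only arrives at roughly four million objects, which illustrates why a large constant can outweigh a better growth rate for the scene sizes in common use.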
2.5 Transmogrifier 3a
The circuits described in this
thesis have all been implemented on a
prototyping system created at the
University of Toronto known as the
Transmogrifier-3a (TM3a). This
system uses large Field Programmable
Gate Arrays (FPGAs) that allow
very complicated circuits to be tested at
speeds of 50 MHz and above.
Figure 6 shows a simplified system
diagram of the development system
that describes only the components
used in this thesis.
The core of the TM3a consists of four 560 pin Virtex2000 FPGAs manufactured
by Xilinx. These four chips provide more than 150,000 four-input lookup table and
flip-flop pairs, as well as various specialized functional units such as shift registers and
internal memory. The FPGAs are fully interconnected to each other by six 98-bit buses,
and are also connected to the computer interface section by a low data rate nibble bus.
Additionally, each of the FPGAs has access to its own independent external SRAM
modules. These modules are rated for 50 MHz, have a 64-bit wide data bus and contain a
total of 2 MB of RAM, resulting in a total memory bandwidth of 3.2 Gbit/s (64 bits x 50 MHz). These
hardware components allow very large and complicated circuits to be easily tested,
but there is another side to the TM3a development system.
The hardware is supported by a software development flow that consists of both
custom software and state of the art commercial tools. The custom software routes
signals between the FPGAs and automatically generates circuitry that allows the FPGAs
to communicate with the computer interface. The commercial tools include Synplicity’s
Synplify Pro, which is used to synthesize circuits, and Xilinx’s place and route software, which
is used for physical layout.
Figure 7: Ray Triangle Unit Overview
3 Ray Triangle Intersection Unit
3.1 Overview
The first major functional
component of the raytracing processor
is the ray triangle intersection unit.
This component implements the pixel
visibility functionality of the graphics
pipeline through the use of raytracing
methodologies. That is, it takes the list
of objects in the scene and a ray
corresponding to a given pixel as input,
and returns the object that is visible for
the given pixel.
The implementation of this
functionality is divided into three
separate components: the barycentric
ray triangle intersection unit, the
memory interface, and the nearest
comparison unit, shown in Figure 7.
The core unit is the barycentric ray
triangle intersection unit that
determines if a ray intersects a single
given object. The memory interface is
responsible for passing the proper objects to the intersection unit, and the nearest
comparison unit is responsible for tabulating the final results.
3.2 Barycentric Ray Triangle Intersection Unit
3.2.1 Overview
The barycentric ray triangle intersection unit is the workhorse of the raytracing
processor. Put simply, it is a deeply pipelined unit that is capable of solving the
barycentric intersection algorithm described in section 2.4.3. It has a maximum
throughput of one intersection test per clock cycle, a latency of 38 clock cycles, and a
maximum clock speed of 50 MHz.
Figure 8: Barycentric Ray Triangle Intersection Unit Pipeline
Figure 8 shows the functional layout of the pipeline as well as the system's inputs
and outputs. As input, the system takes a ray, specified by an origin and direction; a triangle,
specified by a vertex, edge one, edge two, and their associated scale factors; and
several configuration bits. It outputs a boolean hit flag, the triangle ID of the
processed triangle, and the parameters t, u, and v from the barycentric algorithm.
Ray Origin          3 x 28 bits
Ray Direction       3 x 16 bits
Triangle Vertex     3 x 28 bits
Triangle Edge       3 x 16 bits
Table 1: Input Data Widths
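The throughput and latency figures quoted for this unit imply a simple timing model, sketched below. It assumes the pipeline is kept fully fed with one new test every cycle and never stalls, which is an idealization; the function names are hypothetical.

```python
def pipeline_cycles(num_tests, latency_cycles=38):
    """Cycles until the last result emerges from a fully fed pipeline.

    The first result appears after `latency_cycles`; each further test
    adds one cycle because a new test is accepted every clock.
    """
    if num_tests <= 0:
        return 0
    return latency_cycles + (num_tests - 1)


def pipeline_seconds(num_tests, latency_cycles=38, clock_hz=50_000_000):
    """Wall-clock time for the same batch at the quoted 50 MHz clock."""
    return pipeline_cycles(num_tests, latency_cycles) / clock_hz


single = pipeline_cycles(1)
million = pipeline_cycles(1_000_000)
```

For large batches the 38-cycle latency is negligible: one million tests take about 20 ms, so the peak rate approaches 50 million intersection tests per second.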
3.2.2 Numeric Format
A conventional computer-based implementation of the barycentric algorithm
usually uses floating point arithmetic. For the purposes of this design, a
floating point implementation would have taken up too much area and introduced a far
deeper pipeline. To avoid this, a hybrid system of fixed point and pseudo floating point
numbers is used. The input bit widths were determined by
working backwards from an internal constraint.
The constraint derived from the need to
ensure that the divide operations could process a single
bit per clock cycle. To meet this constraint, it was
necessary to restrict the largest input signals to the
divider to 64 bits or less. This constraint was then
back propagated through the arithmetic units to find
the required input widths. A summary of the resulting
bit widths is shown in Table 1.
These numbers allow for a usable 3D space of 28 bits for locating an object or
viewpoint but only 16 bits for defining the triangle’s edges. At first it would appear that
this is unacceptable since a typical 3D scene has a large dynamic range of object sizes.
There are often very large triangles that might make up a landscape, and very small
triangles, that might describe a facial feature, but this problem can be easily solved by
introducing a pseudo floating point format that exploits a simple property of the
barycentric algorithm.
Factor Shift Bits Scale Factor
00 0 1
01 4 1/16
10 8 1/256
11 12 1/4096
Table 2: Pseudo Floating Point Scaling Factors
The u and v parameters, described in equation (2), can be interpreted as the
intersection point of a ray with the plane that the triangle lies in, expressed using the
triangle’s edges as basis vectors. The constraints shown in equation (3) usually require u
and v to be less than one. However, the basis can be scaled to result in a larger effective
triangle. That is, if u and v are
required to be less than two, instead
of the usual one, then the triangle will
become twice as big. By allowing
each triangle’s edges to have a scaling
factor, it is possible to introduce a
pseudo floating point system that
requires only a right shift of the u and
v coordinates.
The chosen implementation was to add two additional bits to the description of
each edge. These bits, shown in Table 2, control how the basis vectors are scaled to
allow a large dynamic range of sizes. For example, the largest setting (factor 11, a 12 bit shift) scales the edges by 4096, which allows for
triangles as large as the available 3D space but restricts the sizing resolution to steps of
4096.
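As an illustration of the scaling scheme in Table 2, the following Python sketch models the right shift applied to u and v. It is a software model under stated assumptions rather than the hardware itself: the fixed point fraction width FRACTION_BITS and all function names are hypothetical and chosen only for this example.

```python
# Software model of the Table 2 pseudo floating point scaling.
# FRACTION_BITS is an assumed fixed point format, not taken from the thesis.
FRACTION_BITS = 16
ONE = 1 << FRACTION_BITS  # fixed point representation of 1.0


def shift_bits(factor: int) -> int:
    """Map the 2-bit factor field to a shift of 0, 4, 8 or 12 bits (Table 2)."""
    if not 0 <= factor <= 3:
        raise ValueError("factor is a 2-bit field")
    return 4 * factor


def scale_coordinate(coord_fixed: int, factor: int) -> int:
    """Right shift a u or v value computed against the stored 16-bit edge."""
    return coord_fixed >> shift_bits(factor)


def effective_edge(stored_edge, factor: int):
    """Edge vector that the scaled triangle effectively spans."""
    return tuple(component << shift_bits(factor) for component in stored_edge)
```

With factor 3, a point that lies 4096 stored edge lengths along an edge maps back to exactly one after the shift, which is why the usual hit constraints can be reused unchanged.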
3.2.3 Design Decisions
The implementation of the barycentric algorithm required several design decisions
other than just bit widths. It was necessary to determine whether or not to include the
divisions, the styles of arithmetic elements to use, and how to handle negative
intermediate values.
A custom implementation of a raytracing processor probably would not include
the divisions in the pipeline as they consume over half the total area and are not always
necessary. The divisions are only used to calculate u, v, and t, when there is a hit, so most
of the time they sit idle. In the case of the FPGA implementation, described in this thesis,
the divisions were placed in the pipeline regardless of this fact. It was found that there
was excess space on the FPGA that was going to waste, so the division units could be
added to the pipeline to simplify the control of the system. The next design decision
involved the actual implementation of these division units as well as the other arithmetic
components.
The choice of how to implement the arithmetic elements was important for two
reasons: first, to meet the clock rate requirements, and second, to minimize the area to
allow for possible parallel implementations. The first decision was to choose between
parallel and serial-based implementations. Since the algorithm requires division,
simplistic serial methods do not apply. Instead a method known as on-line arithmetic [15]
was investigated. This method uses a redundant number set to process data serially using a
very fast clock rate. Although on-line arithmetic holds promise for a very fast clock as
well as low area, the serial nature reduces the throughput too much for it to be a viable
option for FPGA implementation. This left only the parallel implementation option.
Basic math functions, such as addition, subtraction, and multiplication, have
highly optimized solutions that are technology dependent when implemented in an FPGA. It
is because of this dependence that the synthesis of these elements was left to the
Synplify Pro tool. This tool was able to map the simple math functions to specialized
blocks within the FPGA that perform better than any gate level design could, but it was
unable to synthesize the necessary dividers. The dividers had to be created by hand, so an
algorithm needed to be chosen.
The choice of which division algorithm to use depended on both speed and area.
Several approaches were examined including basic radix-2 and several higher radix [16]
implementations. Ideally a high radix solution would be best as it would limit the depth
of the pipeline but unfortunately it was found that these types of divisions do not map
well to FPGAs. This left only the radix-2 solution that solves division one bit per cycle,
and it is this solution that is used in the raytracing processor.
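The one bit per cycle behaviour of a radix-2 divider can be sketched in software as follows. The thesis does not state which radix-2 variant was built, so this sketch assumes a simple unsigned restoring divider; the function name and the width parameter are illustrative only. Each loop iteration corresponds to one step that resolves a single quotient bit.

```python
def radix2_divide(dividend: int, divisor: int, width: int):
    """Unsigned restoring division producing one quotient bit per iteration.

    Assumed variant for illustration; returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be positive")
    if dividend < 0 or dividend >= (1 << width):
        raise ValueError("dividend must fit in `width` unsigned bits")

    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift in the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Trial subtraction: keep it only if the result is non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Because every iteration depends only on the previous remainder, unrolling the loop yields a pipeline with one stage per quotient bit, which is consistent with the pipeline depth being driven by the divider width.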
To extend the division algorithm to include signed division would have required
additional hardware on both the input and the output of a division unit. This could have
been implemented but would have increased the pipeline depth and complexity of the
circuit. To prevent this, the barycentric algorithm was modified slightly to restrict the
possible ray triangle intersections to only those requiring positive divisions.
Geometrically this is equivalent to defining a one-sided triangle, that is a triangle that is
visible only from one side and not the other.
3.2.4 Barycentric Extension to Parallelograms
The barycentric ray triangle intersection algorithm also admits one further
modification that is useful in raytracing. It is possible to adjust the constraints that define a hit
so that a ray can be intersected with a parallelogram. If the constraints in equation (6)
are used instead of those shown in equation (3), then the result
is a ray parallelogram intersection.
$$u \le 1, \qquad v \le 1 \qquad (6)$$
The config input to the barycentric unit allows for a triangle to be extended into a
parallelogram by ignoring the conditions in equation (3) and only using those in equation
(6).
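The one-sided hit test and the parallelogram extension can be sketched together in software. This is a floating point model in the style of the well known Moller-Trumbore formulation, not the fixed point pipeline of the thesis. Since equation (3) is not reproduced in this section, the sketch assumes it is the usual triangle constraint u >= 0, v >= 0, u + v <= 1, assumes the lower bounds are kept in parallelogram mode, and assumes hits behind the ray origin are rejected. All function names are hypothetical.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


MISS = (False, None, None, None)


def intersect(origin, direction, vertex, edge1, edge2, parallelogram=False):
    """Return (hit, t, u, v) for a one-sided triangle or parallelogram."""
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if det <= 0:
        # Back-facing or parallel: rejecting these keeps every division positive.
        return MISS
    tvec = sub(origin, vertex)
    u_num = dot(tvec, p)
    q = cross(tvec, edge1)
    v_num = dot(direction, q)
    t_num = dot(edge2, q)
    # The hit decision only needs the numerators because det > 0.
    if u_num < 0 or v_num < 0 or t_num < 0:
        return MISS
    if parallelogram:
        inside = u_num <= det and v_num <= det   # equation (6)
    else:
        inside = u_num + v_num <= det            # assumed form of equation (3)
    if not inside:
        return MISS
    # Divisions are only required once a hit is known.
    return (True, t_num / det, u_num / det, v_num / det)
```

Note that the hit flag is decided before any division takes place, which mirrors the observation in section 3.2.3 that the dividers are only needed to report t, u, and v for a hit.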
3.3 Memory Interface
3.3.1 Overview
The barycentric ray triangle intersection unit described previously was only
capable of intersecting a ray with a single triangle. To add the ability to intersect a ray
with the entire world of triangles, it is necessary to read the world from memory and pass
these triangles one by one to the barycentric unit.
Put simply this unit will cycle through the entire list of triangles, stored in an
external memory, and pass them to the barycentric unit. From a more complicated view
the unit must also have the ability to communicate to the host computer to allow the
triangle data to be written to memory, and provide control signals to the rest of the
system.
Vertex 3x28
Edges 1 & 2 6x16
Edge Scale 2x2
Config 1
======
Total: 185 bits
Table 3: Memory Widths
3.3.2 Design Decisions
The design of this unit was primarily guided by
memory bandwidth issues. Table 3 shows that for each
triangle it is necessary to read 185 bits of memory. Since
the TM3a has a 64 bit databus, this requires three read
cycles. This would seem to suggest that it is not possible to
keep the barycentric pipeline full as it has a throughput of
one triangle per cycle, but this is not the case. Instead of providing a new triangle each
cycle it is possible to provide a new ray each cycle. By processing three rays at the same
time it becomes possible for a single triangle to be checked against all three rays while the
next triangle is being loaded. Using this method 100% pipeline utilization is achieved.
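The arithmetic behind this argument is short enough to check directly. The sketch below uses the field widths from Table 3 and the 64 bit databus stated above; the variable and function names are illustrative only.

```python
# Field widths in bits, taken from Table 3.
FIELD_BITS = {
    "vertex": 3 * 28,
    "edges_1_and_2": 6 * 16,
    "edge_scale": 2 * 2,
    "config": 1,
}
BUS_WIDTH_BITS = 64  # TM3a databus width

BITS_PER_TRIANGLE = sum(FIELD_BITS.values())
# Ceiling division: number of bus reads needed to fetch one triangle.
READS_PER_TRIANGLE = -(-BITS_PER_TRIANGLE // BUS_WIDTH_BITS)


def pipeline_utilization(rays_in_flight: int) -> float:
    """Fraction of cycles with a new ray/triangle pair when each triangle is
    reused for `rays_in_flight` rays while the next triangle is fetched."""
    return min(1.0, rays_in_flight / READS_PER_TRIANGLE)
```

Under this model one ray per triangle would leave the pipeline idle two cycles out of three, while three rays per triangle fill every cycle.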
3.4 Nearest Comparison Units
3.4.1 Overview
The previous two units allow a world to be intersected by a ray and output a
list of triangle identifications and the hit information of each. However, this is not what is
required for solving the pixel visibility problem. This is only a list of triangles that could
be visible from the pixel if they are not occluded by a closer object. To determine which
triangle is visible out of the list of possible hits, it is necessary to find which intersection
is closest to the viewer. That is the intersection with the lowest value of parameter t. It is
this process that the nearest comparison unit is responsible for.
The nearest comparison unit is designed to process the outputs from the
barycentric ray triangle intersection unit and dynamically keep track of the nearest
intersection point so far. This is accomplished quite simply by having an internal register
that latches the new intersection data if it is closer than the old data stored within. The
only twist is that there are three rays in flight through the pipeline at any given time
whose results must be kept separated. This is accomplished by having three nearest
comparison units that are enabled only when their ray’s result is available.
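A minimal software model of this arrangement is sketched below. It assumes that results leave the pipeline interleaved by ray in a fixed round robin order (ray 0, ray 1, ray 2 for one triangle, then the next triangle), which is one plausible reading of the three-rays-per-triangle scheme; the class and function names are hypothetical.

```python
class NearestRegister:
    """Holds the closest hit seen so far for a single ray."""

    def __init__(self):
        self.triangle_id = None
        self.t = None

    def update(self, hit, triangle_id, t):
        # Latch the new result only if it is a hit and closer than the stored one.
        if hit and (self.t is None or t < self.t):
            self.triangle_id = triangle_id
            self.t = t


def nearest_per_ray(results, rays_in_flight=3):
    """results: sequence of (hit, triangle_id, t) in pipeline output order."""
    units = [NearestRegister() for _ in range(rays_in_flight)]
    for cycle, (hit, triangle_id, t) in enumerate(results):
        # Only the unit that owns this cycle's ray is enabled.
        units[cycle % rays_in_flight].update(hit, triangle_id, t)
    return [(unit.triangle_id, unit.t) for unit in units]
```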
Through the combination of these three units, the memory interface, and the
barycentric ray triangle intersection unit, a complete system that takes three rays as input
and returns the triangles they strike first is implemented.
Figure 9: Bounding Hierarchy Structure
4 Hierarchy Controller
4.1 Overview
The barycentric ray triangle intersection unit does its job very fast but does not
have an advanced enough controller to benefit from the logarithmic nature of hierarchical
raytracing. This chapter describes the supporting circuitry that is required to implement
the potential visibility stage of the graphics pipeline through the use of an entirely
hardware implementation of a bounding hierarchy algorithm.
Unlike a software implementation of a spatial hierarchy that has complete
flexibility, a hardware implementation requires strict constraints to operate at maximum
speed. Of the various hierarchical methods available it was determined that a bounding
hierarchy could be implemented with minimum memory requirements and with a
relatively simple controller compared to BSP trees and other hierarchical methods. The
bounding hierarchy also benefits from being able to exploit the barycentric ray triangle
unit in its hierarchy traversal.
The bounding hierarchy is
further constrained by restricting a
bounding volume to be defined
by six arbitrary triangles or
parallelograms, by limiting the
hierarchical tree to a depth of
three, and by restricting the
maximum number of children
any node can have to eight.
Figure 9 shows a summary of this
structure. There can be up to
eight root nodes, and each of these can have up to eight children. These children can then
have up to eight children of their own, and finally a leaf node can have any number of
object triangles contained within it. This structure allows for a maximum of 512 leaf
nodes.
Figure 10: Bounding Hierarchy Controller System Overview
Implementing this algorithm requires an advanced controller, a memory buffer
that tracks which bounding nodes must be traversed, and a very fast sorting algorithm to
ensure the nodes are traversed in the correct order. An overview of the system that
implements these functions is shown in figure 10. The bound controllers provide the
control logic necessary, the list handlers track which bounding nodes need to be visited,
and the list sort unit ensures that the bounding nodes are traversed in the proper order. In
addition to the control units the ray interface and result receiver units are necessary to
handle the inter chip buses.
4.2 Bounding Hierarchy Controller
4.2.1 Overview
The brain of the completed system is contained within the bounding hierarchy
controller. This controller is responsible for receiving rays from a user circuit,
performing the necessary functions to calculate the resulting hit data, and then returning
the result to the user. The basic algorithm implemented by this controller is as follows:
1. Intersect the requested ray with the eight root bounding volumes.
2. Intersect the requested ray with the nearest hit bounding volume’s children and save the other hit nodes for traversal later. If no child nodes are hit then traverse up the tree processing the next nearest hit nodes.
3. Repeat step 2 until a leaf node is hit or there are no outstanding hit nodes.
4. Intersect the requested ray with the object triangles contained within the leaf node. If there is a hit then return it, if not traverse up the tree and continue from 2.
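The four steps above can be sketched as a small software model for a single ray. To keep the example self-contained, the intersection results are assumed to be precomputed and stored on each node: distance is the distance to the node's bounding volume (None for a miss) and triangle_hit is the nearest object triangle hit inside a leaf (None for no hit). The Node class and function names are hypothetical and do not describe the hardware data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    distance: Optional[float]                  # None: ray misses this bounding volume
    children: List["Node"] = field(default_factory=list)
    triangle_hit: Optional[str] = None         # leaves only: nearest triangle hit


def sorted_hits(nodes):
    """Keep the bounding volumes that were hit, nearest first."""
    hits = [node for node in nodes if node.distance is not None]
    return sorted(hits, key=lambda node: node.distance)


def traverse(roots):
    # One list of outstanding hit nodes per hierarchy level (step 1 fills level 0).
    pending = [sorted_hits(roots)]
    while pending:
        level = pending[-1]
        if not level:
            pending.pop()          # nothing left here: traverse up the tree
            continue
        node = level.pop(0)        # nearest outstanding hit node
        if node.children:
            # Step 2: descend, saving the other hit nodes in `level` for later.
            pending.append(sorted_hits(node.children))
        elif node.triangle_hit is not None:
            return node.triangle_hit   # step 4: leaf contained a hit
    return None                    # no outstanding hit nodes remain
```

In the hardware the bookkeeping done here by the pending lists is the part handed to the list handler and sorting units described in later sections.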
To keep this controller simple the functionality that tracks which nodes have been
hit and which node should be processed next is delegated to the list handler unit. This
means the controller’s portion of this algorithm is restricted to interpreting the bounding
hierarchy and sending the proper triangles to the barycentric ray triangle intersection unit.
4.2.2 Barycentric Ray Triangle Intersection Unit Utilization
The algorithm described above is inherently sequential in nature. That is it is
necessary to examine one entire level of the bounding hierarchy before a decision can be
made on which node to examine next. If this algorithm is directly implemented then the
deep pipeline in the barycentric ray triangle intersection unit would cause a large bubble
to form between each level of the hierarchy. To avoid this problem several controllers
can be used that share the barycentric ray triangle unit by interleaving their usage.
The first controller would transmit the required triangles to the barycentric ray
triangle unit while the second controller waits. Upon completion of the transmission the
first controller passes the ownership token to the second controller, which is then free to use
the barycentric ray triangle unit while the first controller’s triangles are being flushed
through the pipeline. Upon the second unit completing its transmission, the token is
returned to the first controller, which is then free to continue. This allows the barycentric
ray triangle unit to be nearly 100% utilized.
4.2.3 State Machine
Appendix A shows a simplified description of the controlling state machine.
There are three dominant stages of the controller. The first stage involves acquiring the
clear to send (CTS) signal from the other controller, if necessary. The second stage
traverses through the bounding hierarchy by sending the proper triangle identification
numbers to the barycentric ray triangle intersection unit. Several internal variables keep
track of which node the controller is working on and the three states: S_SEND, S_WAIT1,
and S_WAIT2 transmit the triangle identification numbers to the barycentric ray triangle
intersection unit.
The final stage is responsible for processing the object triangles contained within
a given leaf node. First the stage reads from an indirection table to determine where to
find the list of object triangles in memory and how many triangles there are. There is then
another set of three states: S_LEAFSEND, S_LEAFWAIT1, and S_LEAFWAIT2 that read
the triangle identifications from memory and transmit them to the barycentric ray triangle
intersection unit. If this stage finds a hit then the process is complete, otherwise the state
machine returns to the second stage to continue traversing the hierarchy.
4.2.4 Required Memories
The bound controllers require two different types of data. The first is the leaf
node indirection data. This data is stored in the internal FPGA memory. There are 512
entries, one for each leaf node in the bounding hierarchy, each describing how many object
triangles are contained within the leaf node and where in memory the list of triangles can
be found. The second type of data is the triangle lists. These lists are stored in an
external SRAM bank as the data is too large to fit internally. The lists consist of an array
of 16-bit triangle identification numbers.
Figure 11: Restriction on Parallel Rays
These memories cannot be directly accessed by the bound controllers because
conflicts could arise. Instead an intermediate memory controller handles all requests from
the bounding controllers. In addition to simplifying memory access this controller also
allows for the possibility of two controllers accessing the memory at the same time. This
feature is of no use for the currently implemented system but it allows
another instance of the barycentric ray triangle intersection unit to be controlled by
another set of bound controllers using the same hierarchy data.
4.2.5 Parallel Ray Processing Issue
The algorithm described above is for processing a single ray at a time, but the
barycentric ray triangle intersection unit is designed to process three rays in parallel. To
solve this problem three rays are traversed through the hierarchy such that if one ray
strikes a node then all three rays are intersected with that node. This is potentially
wasteful as there is a worst-case degeneracy where the three rays traverse entirely
different paths through the hierarchy, resulting in the pipeline effectively processing the
rays sequentially. However, this is not usually a problem as rays are often coherent, that is,
they travel in nearly the same direction through space, and therefore through the hierarchy.
There is one case, however, where this algorithm could fail. Consider the
diagram shown in figure 11. There are
two rays which strike the same two
bounding boxes, but in different orders. If
the algorithm processes bounding box one
first and then box two, it is possible that the
algorithm will return the wrong intersection for ray B. If ray B intersects an object in
box one the algorithm will stop searching for other intersections even though a closer
intersection in box two might exist.
This problem is also not that significant in practice as rays are often coherent in
their directions, but to ensure the error does not occur a further constraint is required. The
easiest way to eliminate this case is to require that all three rays have the same
origin. This will ensure that all three rays will traverse through the bounding hierarchy in
the same direction.
4.2.6 Scalability Considerations
The bounding hierarchy has the benefit of being designed in such a way that any
latency in the barycentric ray triangle intersection unit can be accommodated. The token
passing methodology can scale to include an arbitrary number of controllers, instead of
just two. For example if the pipeline is so deep that both controllers are waiting for
results then a third controller could be added to use the idle time. Since the token passing
implementation involves a request line and a grant line between two controllers, it is
possible to have a chain of controllers that will constantly pass the token around the
circle.
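A rough way to reason about how many controllers are needed is sketched below. It is a simplified model, not a description of the implemented state machine: it assumes every controller transmits for a fixed number of cycles and must then wait the full pipeline latency for its results before transmitting again. The 38 cycle latency comes from section 3.2.1, while the transmit lengths used in the example are hypothetical.

```python
import math

PIPELINE_LATENCY = 38  # cycles, from section 3.2.1


def controllers_needed(latency: int, send_cycles: int) -> int:
    """Smallest number of token-sharing controllers that hides the latency."""
    return 1 + math.ceil(latency / send_cycles)


def utilization(latency: int, send_cycles: int, controllers: int) -> float:
    """Fraction of cycles in which some controller is feeding the pipeline."""
    round_trip = send_cycles + latency
    return min(1.0, controllers * send_cycles / round_trip)


# Example with an assumed 48-cycle transmission per hierarchy level:
# two controllers are enough, matching the implemented configuration.
example = controllers_needed(PIPELINE_LATENCY, 48)
```

Under the same model a much shorter transmission, or a much deeper pipeline, pushes the required count above two, which is the situation the token ring is meant to accommodate.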
4.3 Sorted List
4.3.1 Overview
Section 2.4.4 described the importance of traversing the bounding hierarchy in a
front to back manner to ensure the correct intersection is found. To facilitate this, it is
necessary to sort the resulting hit bounding volumes by distance from the view point.
There are a number of ways that a sort can be implemented but few are suitable because
their sequential nature would be too slow. The speed of the barycentric ray triangle
intersection unit results in a new element being generated every six cycles. This must
then be inserted into a sorted list with a maximum of eight elements. As such, any purely
sequential algorithm would require a worst case of 8 cycles, which is unacceptable.
The chosen solution is similar to a content addressable memory in that the
sorting unit compares the new element with every element already in the list in parallel.
The decisions on which insertion point to use can be made in one cycle and the result
latched into its memory location at the same time. The result is a very fast sorting unit
that can easily handle the amount of data required.
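The parallel insertion can be modelled in a few lines. In the sketch below the list comprehension stands in for the bank of comparators that all evaluate in the same cycle, and the insertion point is simply the number of stored keys that are not farther than the new one. The function name and the tie-breaking rule (equal keys keep their arrival order) are assumptions made for this example.

```python
def parallel_insert(sorted_keys, new_key, capacity=8):
    """Insert new_key into an ascending list using one comparison per slot."""
    if len(sorted_keys) >= capacity:
        raise OverflowError("sorted list is full")
    # One comparator per occupied slot; in hardware these run concurrently.
    not_farther = [key <= new_key for key in sorted_keys]
    # Because the list is already sorted, the count of True outputs is the
    # index at which the new element belongs.
    position = sum(not_farther)
    return sorted_keys[:position] + [new_key] + sorted_keys[position:]
```

The slice and concatenate step corresponds to every register at or after the insertion point latching its neighbour's value while the selected slot latches the new element.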
4.3.2 Size and Speed Issues
Although the sorting unit is fast, it comes at quite a cost. The sorting key is the
intersection point’s distance, which is 32 bits in length. This requires eight 32-bit
comparators, and eight registers to store the sorted list. This by itself consumes a lot of
area, but there are additional area requirements as well. Since three rays are processed in
parallel in the barycentric ray triangle intersection unit, the results of all three rays must
be stored in the sorting unit. This doesn’t require any additional comparators but does
increase the number of registers required substantially.
4.4 List Handler
4.4.1 Overview
Once all eight results for any given hierarchy level have been sorted by the
sorting unit it is necessary to store these for later retrieval by the bound controller. This
task is delegated to the list handler units. These units must keep track of the results of all
three internal levels of the bounding hierarchy and be able to return the next bounding
node to be processed when requested by the bound controller. This is accomplished
through two different phases.
The first phase writes the results from the sorting unit to a memory location based
on which level of the hierarchy the result is from. The second phase involves
determining the next node to process based on which of the three rays are still active.
For example, if the bound controller requests the next node that is struck by either ray one
or ray two then the list handler must search its list for the first node that has been hit by
either ray.
4.4.2 List Handler Operation
The actual algorithm implemented within the list handler is quite simple. The
algorithm that replies to a request from the bounding controller can be conceptually
described as:
1. Search level two for the first bounding node that has not been processed and has been hit by one of the rays still active.
2. If no node exists search level one for the first node that meets this requirement.
3. If no node exists search level zero for the first node that meets this requirement.
4. If no node exists then the rays do not intersect any objects.
This algorithm is conceptual only because it is not necessary to perform the algorithm
after a request from the bounding controller. Instead the algorithm is performed during
available cycles and the result saved for later requests. This is possible because the
bounding controller is constantly informing the list handler unit which rays are still
active.
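The conceptual search can be written down directly. The sketch below represents the rays as bits of a mask (bit i set means ray i) and stores, for each hierarchy level, the sorted list of entries written by the sorting unit. The Entry class, its field names, and the mask encoding are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    node_id: int
    hit_mask: int            # bit i set if ray i hit this bounding node
    processed: bool = False


def next_node(levels, active_mask):
    """levels[k] is the nearest-first entry list for hierarchy level k (0..2).

    Returns (level, node_id) for the next node to process, or None if the
    active rays do not intersect anything further.
    """
    for level in (2, 1, 0):                      # deepest level first
        for entry in levels[level]:
            if not entry.processed and (entry.hit_mask & active_mask):
                return level, entry.node_id
    return None
```

Since the result depends only on the stored entries and the active mask, it can be recomputed whenever either changes, which is what allows the answer to be ready before the request arrives.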
Implementation Intersection Tests per Second
Raytracing Processor 50,000,000
POVray 3.0 (ultraSPARC II) ~2,500,000
POVray 3.0 (Athlon 1GHz) ~4,000,000
Intel SSE (800 MHz) 36,000,000
Table 4: Brute Force Performance Numbers
5 Results
5.1 Overview
The performance of the raytracing processor will be examined in two parts. The
first will involve comparing the performance of the barycentric ray triangle intersection
unit to several state of the art software implementations. The second part will examine
the entire bounding hierarchy unit’s performance compared with a software
implementation running on several different computer architectures.
Since the raytracing processor does not have a defined interface, it was necessary
to create a test jig to provide the rays and to receive the results. The jig for both the
inputs and the outputs consists of a simple memory buffer that can be written to and read
from the SUN workstation connected to the prototyping system. This buffer is necessary
because the connection between the SUN and the prototyping system is such that the time
required to transmit the data is several orders of magnitude higher than the time to
process it. In order to achieve an accurate view of the performance of the individual units
this bottleneck was removed.
5.2 Barycentric Ray Triangle Intersection Unit Performance
The results for the barycentric
ray triangle intersection unit are quite
clear cut. Since the unit performs one
ray triangle intersection test every
clock cycle and runs at 50MHz, the
total throughput is 50 M ray triangle
intersection tests per second. As a
comparison, several test scenes, shown in appendix D, were run through POVray 3.0 [17] on
an ultraSPARC II 450MHz machine and an AMD Athlon 1GHz machine. The results
were highly dependent on the cache coherence of the rendered image and as such it was
difficult to find an exact number for intersection tests per second. Table 4 summarizes
these results as well as listing the results of an Intel SSE implementation presented in the
paper Interactive Rendering with Coherent Ray Tracing [18].
It should be noted that the result using the SSE method was derived by the author
by profiling the intersection test code at the cycle level. Although this can result in very
accurate timing information for a given test case, it does not take into account cache
issues. This means that in practice the performance of this method would likely be
substantially lower.
5.3 Bounding Hierarchy Performance
5.3.1 Overview
The performance of the completed bounding hierarchy unit is highly dependent on
the type of scene and the quality of the bounding hierarchy used. To try and gain an
accurate insight into the performance of this unit several synthetic test images will be
used that are designed to stress the system. The first two sets of tests are such that they
minimize the advantage that the bounding hierarchy provides the raytracing processor,
thereby stressing it. The other set of tests is designed to produce test images that offer
very little cache coherence. These types of tests are designed to stress the software
raytracers that are used for comparison.
5.3.2 Single Large Faceted Sphere Tests
The following test sets consist of a single sphere of a constant radius. The sphere
is contained within a fixed bounding hierarchy and only the number of triangles that
approximate the sphere is varied. This test is designed to increase the number of triangles
contained within a leaf node while maintaining a constant bounding hierarchy structure.
This will effectively eliminate the logarithmic benefit of the bound system.
Table 5 summarizes the results using triangle counts from 8 to 8196. The timing
results for the raytracing processor are derived from a cycle accurate count of the render
times. The performance numbers for the software implementation are extracted from the
raytracer’s result report and do not include the time it takes to parse the input data file.
They are accurate only to the nearest second and as such do not show small trends easily.
[Plot: Single Large Sphere Test Sets (Expanded View); render time (sec) versus number of triangles]
Figure 12: Single Sphere Results (Expanded View)
[Plot: Single Large Sphere Test Sets; render time (sec) versus number of triangles for the Raytracing Processor, POVray 3.0 ultraSPARC II, and POVray 3.0 Athlon 1GHz]
Figure 13: Single Sphere Test Set Results
Order # Triangles Cycles Hardware POVray 3.0† POVray 3.0‡
0 8 9554009 0.191s 11s 2s
1 32 9468089 0.189s 11s 2s
2 128 9680532 0.193s 10s 2s
3 512 10886906 0.218s 11s 2s
4 2048 15792096 0.316s 11s 2s
5 8196 35154476 0.703s 11s 3s
†ultraSPARC II 450MHz ‡Athlon 1GHz
Table 5: One Sphere Test Set Results
Figures 12 and 13 show that once the advantages of the bounding hierarchy are
removed the performance is reduced to a linear relation between the number of triangles
and the render time. This is expected as raytracing is O(n) when a bounding hierarchy is
not used. The fast barycentric ray triangle intersection unit in the raytracing processor is
also able to outperform the software implementation by a factor of 4 to 10.
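The hardware times and the quoted speedup range can be reproduced from the cycle counts in Table 5. The short sketch below assumes the hardware time is simply the cycle count divided by the 50MHz clock and compares against the Athlon column; the names are illustrative, and the software times are only accurate to the nearest second, so the ratios are approximate.

```python
CLOCK_HZ = 50_000_000  # raytracing processor clock rate

# (triangles, hardware cycles, POVray 3.0 Athlon 1GHz seconds) from Table 5.
TABLE_5 = [
    (8, 9554009, 2),
    (32, 9468089, 2),
    (128, 9680532, 2),
    (512, 10886906, 2),
    (2048, 15792096, 2),
    (8196, 35154476, 3),
]


def hardware_seconds(cycles: int) -> float:
    return cycles / CLOCK_HZ


def speedup(cycles: int, software_seconds: float) -> float:
    return software_seconds / hardware_seconds(cycles)


speedups = [speedup(cycles, seconds) for _, cycles, seconds in TABLE_5]
```

The first row gives a ratio a little above 10 and the last row a ratio a little above 4, which is where the quoted range comes from.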
5.3.3 A Grid of Faceted Spheres
The previous test was designed to maintain a constant bounding hierarchy
structure while varying the number of triangles in each leaf. This test is designed to stress
the raytracing processor by maintaining a constant number of triangles in any given leaf
node while varying the number of leaf nodes that exist. The new leaf nodes will be
distributed uniformly across the image to ensure that every node is visible.
Table 6 summarizes the results of this test set. Once again the software numbers
provide only an approximate indication of average performance and are not accurate
enough to reveal any trends.
Order # Triangles Cycles Hardware POVray 3.0† POVray 3.0‡
1 2048 7027597 0.141s 11s 2s
4 8192 10509865 0.210s 7s 2s
7 14336 14294711 0.286s 7s 2s
14 28672 22394396 0.448s 7s 2s
17 34816 25693610 0.514s 7s 2s
21 43008 29939633 0.599s 7s 2s
23 47104 32425081 0.649s 7s 2s
26 53248 36170720 0.723s 7s 2s
28 57344 38064065 0.761s 8s 3s
†ultraSPARC II 450MHz ‡Athlon 1GHz
Table 6: Grid of Sphere Test Set Results
[Plot: Grid of Spheres Test Sets; render time (sec) versus number of triangles (x 10^4) for the Raytracing Processor, POVray 3.0 ultraSPARC II, and POVray 3.0 Athlon 1GHz]
Figure 14: Grid of Sphere Test Set Results
[Plot: Grid of Spheres Test Sets (Expanded View); render time (sec) versus number of triangles (x 10^4)]
Figure 15: Grid of Sphere Test Set Results (Expanded View)
Figures 14 and 15 once again show that the performance is close to linear with
respect to the number of triangles. This results from the structure of the bounding
hierarchy being uniformly distributed. Since the render time is proportional to the
number of pixel rays that hit a bounding box, doubling the number of bounding boxes
will also double the screen coverage that the boxes provide. The net result is a linear
relationship.
Even under this degenerate case the raytracing processor is still 4 to 15 times
faster than the software implementation.
5.3.4 Cache Incoherent Test Set
The primary advantage that a software implementation has over the hardware
raytracing processor is its effective memory bandwidth. Through the use of highspeed
data caching a software implementation can access far more data then the uncached
raytracing processor. However these caches are relatively small and great care must be
taken to ensure that data is accessed coherently to avoid cache misses. This was not an
issue in the scenes tested previously as they were small and very coherent. However, a
real world scene tends to be far less coherent. It is this incoherency that removes most of
the benefit that caching could provide. To examine this behaviour, several test sets that
have been designed to be difficult to cache will be tested.
To ensure that caching effects are at a minimum a scene must be designed such
that a large number of triangles need to be tested against a given ray. This will ensure that
more triangle data must be accessed than can be stored in the cache. To achieve this
effect a test scene was created by placing several objects behind each other. The net
effect is that any ray that strikes the front object’s bounding box, but misses the contained
object triangles, will also intersect the bounding boxes of the remaining objects. This
will require the ray to be intersected with every triangle for every object.
Test Set Cycles Hardware POVray † POVray ‡
SphereZ 17890974 0.358s 46s 6s
SphereZ2 24931862 0.499s 84s 10s
SphereZ3 16568132 0.331s 94s 10s
†ultraSPARC II 450MHz ‡Athlon 1GHz
Table 7: Cache Incoherent Test Set Results
Table 7 summarizes the results for three test sets. It is clear that even a few cache
misses seriously slow the performance of the software implementation. Instead of the
raytracing processor being 5 to 15 times faster, as it is for coherent scenes, it is 17 to 30 times
faster in these test sets.
6 Conclusions and Future Work
6.1 Conclusions
6.1.1 Barycentric Ray Triangle Intersection Unit
The core of the raytracing processor, the barycentric ray triangle intersection unit,
ran at least an order of magnitude faster than the common software raytracer POVray 3.0.
Even when compared to an idealized parallel implementation, that ignored memory
access time, the hardware approach was still 30% faster. Overall the hardware
implementation, that ran at 50Mhz, was easily able to out perform the software
implementations, running on a processor with a clock rate of 1GHz. The main reason
that the raytracing processor ended up being so much faster then software raytracers are,
was because it could utilize memory better.
Even though the hardware and software approaches have similar maximum
memory bandwidth, the software method is unable to pipeline reads. The custom nature
of the raytracing processor allows it to preload all its required data and fully utilize the
memory subsystem. A software implementation does not have this freedom and, as such,
must wait for memory transactions to complete before requesting the next block of data.
It is this factor, more than any other, that gives the raytracing processor its biggest
advantage.
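The difference between blocking and pipelined reads can be illustrated with a deliberately simple cycle-count model. This is a sketch under stated assumptions, not a model of the TM3a memory system: it assumes a fixed read latency, one read issued per cycle when pipelined, and no bank conflicts. The function names and the latency of 8 cycles are hypothetical.

```python
def blocking_cycles(num_reads, latency):
    """Each read is issued only after the previous read has returned."""
    return num_reads * latency


def pipelined_cycles(num_reads, latency):
    """One read is issued per cycle; data streams back after the first latency."""
    if num_reads == 0:
        return 0
    return latency + (num_reads - 1)


reads, latency = 100, 8  # illustrative values only
print(blocking_cycles(reads, latency), pipelined_cycles(reads, latency))
```

In this model the pipelined case approaches one result per cycle as the number of reads grows, while the blocking case always pays the full latency for every read.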
6.1.2 Hierarchy Controller
Although the barycentric ray triangle intersection unit’s performance is very good,
it is far too slow to render large images due to its O(n) complexity. The hierarchy
controller allowed the processor to take advantage of a potential O(log n) render time by
using tree structures. Initially it was unclear whether the sequential nature of tree traversals
could be accelerated in hardware, but the results speak for themselves. Although the
hardware implementation provides little parallel acceleration or algorithmic
improvement, it does provide much better memory access.
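The gap between O(n) and O(log n) can be made concrete by counting intersection tests per ray. The sketch below is a best-case estimate that assumes a balanced bounding-box tree in which the ray descends into exactly one child per level. The branching factor of 8 and the leaf size of 16 triangles are illustrative assumptions, not measured parameters of the system.

```python
def flat_tests(num_triangles):
    """Without a hierarchy, every triangle is tested against the ray."""
    return num_triangles


def hierarchy_tests(num_triangles, branching=8, leaf_size=16):
    """Best-case test count for a balanced tree: test all children of one
    node per level, then every triangle in a single leaf."""
    leaves = (num_triangles + leaf_size - 1) // leaf_size
    depth, nodes = 0, 1
    while nodes < leaves:  # integer loop avoids floating-point log errors
        nodes *= branching
        depth += 1
    return depth * branching + min(leaf_size, num_triangles)


for n in (1_000, 1_000_000):
    print(n, flat_tests(n), hierarchy_tests(n))
```

A real ray usually strikes several bounding boxes per level, so the actual count lies between these two extremes.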
Under the raytracing processor's worst-case operating conditions it still
outperformed the corresponding software implementation by 500% to 1500%. When
compared against the software's worst-case operation, poor cache coherence, the processor
performed 1700% to 3000% faster. Considering that the total available memory
bandwidth and system clock rate are much lower than those of a corresponding software
implementation, these numbers are quite impressive.
6.1.3 General Discussion
The raytracing algorithm offers many degrees of parallelization
and is also highly predictable. It is possible, through clever hardware controllers, to predict
every memory read sufficiently far in advance to utilize 100% of the available memory
bandwidth. This is not to say that a data cache might not be useful in a future raytracing
processor.
The ability of the bounding controller to operate with intersection units that have
arbitrarily high latency allows the system clock rate to be raised as high as needed by
deepening the pipeline. In fact, data path speeds that require more data than external
memory could provide would still be useful, provided an internal highspeed cache is
used. Although there is no guarantee that a small group of random rays will share any
coherence, a large enough group of rays would have a much higher chance. If a number
of rays are accessible to the raytracing processor at any given time, then this additional
coherence could be exploited and a memory cache used.
To summarize, this project revealed the memory advantages that a hardware
implementation could exploit, and found that pipeline depth can easily be handled.
As an accelerated offline renderer the resulting system performs much better than software.
Unfortunately, it is still too slow to be used for real time graphics.
Perhaps if transistor technology continues to advance at its current rate, then
in a decade real time raytracing will be possible, but for now it remains just a dream.
Figure 16: Alternative Raytracing Architecture
6.2 Future Work
This project concentrated primarily on accelerating the math behind raytracing
through deep pipelines and parallel implementations. However, it was discovered that the
real advantages available to raytracing hardware are predictive memory access and clever
caching methods. A very interesting future project would be to design a system that
operates using highspeed external memory and works on a large number of rays in
parallel to exploit possible memory coherency.
Such a system could be designed as shown in Figure 16. A number of rays are
first entered into a ray state buffer, as shown. A unit would then read these rays
sequentially and, using some hierarchical structure, determine which leaf node should
be intersected next based on the input ray's state. The ray, along with the leaf node ID,
would be passed to a process queue. This queue would be examined by a controller
whose task is to pass rays from the queue to the processing unit such that cache misses
are minimized. Additionally, this controller could preemptively load the triangle cache
if it foresees a future cache miss. Once a leaf node has been processed, the ray is either
retired or returned to the ray state buffer for further hierarchical processing. Ideally, if a
large enough number of rays are processed simultaneously, then cache misses can be nearly
eliminated.
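The effect of such a queue controller can be sketched with a small simulation. The code below is an illustration under stated assumptions rather than the proposed hardware: the triangle cache is modelled as an LRU cache holding whole leaf nodes, the scheduling policy simply groups queued rays by leaf node ID, and all names are hypothetical.

```python
from collections import OrderedDict, defaultdict


def count_misses(leaf_sequence, cache_size):
    """Count misses of an LRU triangle cache that holds `cache_size` leaf nodes."""
    cache = OrderedDict()
    misses = 0
    for leaf in leaf_sequence:
        if leaf in cache:
            cache.move_to_end(leaf)        # mark as most recently used
        else:
            misses += 1
            cache[leaf] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used leaf
    return misses


def schedule_by_leaf(queue):
    """Reorder (ray_id, leaf_id) pairs so rays sharing a leaf run back to back."""
    groups = defaultdict(list)
    for ray_id, leaf_id in queue:
        groups[leaf_id].append(ray_id)
    return [(ray_id, leaf_id) for leaf_id, rays in groups.items() for ray_id in rays]


queue = [(ray, ray % 4) for ray in range(16)]  # 16 rays spread over 4 leaf nodes
arrival_order = [leaf for _, leaf in queue]
grouped_order = [leaf for _, leaf in schedule_by_leaf(queue)]
print(count_misses(arrival_order, 2), count_misses(grouped_order, 2))
```

With a two-entry cache, processing the rays in arrival order misses on every access, while the grouped order misses only once per leaf node.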
Another benefit of this architecture is that the units contained within the dashed
box can be duplicated for higher performance. In theory each of these duplicated
subsystems could be implemented on a separate chip with its own external memory.
These two memories would not have to contain duplicate data: if the leaf nodes are
randomly distributed between the duplicated systems and the number of rays being
processed is large enough, then both units would be highly utilized processing whichever
nodes they have access to. The net effect would be a higher total memory bandwidth
without the associated high pin counts or bus speeds.
Such a system could easily achieve an order of magnitude speed increase over the
implementation presented in this paper. If the data path clock speeds could be ramped
high enough then it could be possible to achieve another order of magnitude increase.
Although this speed would still be too slow for real time recursive raytracing, it would be
another step in the right direction.
6.3 Closing Words
Although raytracing appears mathematically superior to raster graphics, its
large constant still makes it too slow to be competitive. Perhaps as processing power
increases, and scene sizes with it, the logarithmic nature of raytracing will win
out, but for now and the near future, raster graphics are king.
7 References
1. Kornel, K. 2D and 3D Perspective Transformations. In Computers & Graphics, 1990.
2. Samet, H.J. Design and analysis of Spatial Data Structures: Quadtrees, Octrees, and other Hierarchical Methods. Addison-Wesley, Reading, MA, 1989.
3. Coorg, S. Teller, S. Real-Time Occlusion Culling for Models with Large Occluders. Proceedings of the Symposium on Interactive 3D Graphics, 1997.
4. Luebke, D. Georges, C. Portals and mirrors: Simple, fast evaluation of potentially visible sets. In ACM Interactive 3D Graphics Conference, Monterey, CA, 1995.
5. Clark, J.H. Hierarchical geometric models for visible surface algorithms. Communications of the ACM, 19(10):547-554, 1976.
6. Foley, J. Van Dam, A. Hughes, J. Feiner, S. Computer Graphics: Principles and Practice. Addison Wesley, Reading, Mass., 1990.
7. Akeley, K. Segal, M. The OpenGL Graphics System: A Specification (Version 1.3). http://www.opengl.org/developers/documentation/version1_3/glspec13.pdf, 2001.
8. Pixar. The RenderMan Interface (Version 3.2). http://www.pixar.com/renderman/developers_corner/rispec/rispec_pdf/RISpec3_2.pdf
9. Badouel, D. An Efficient Ray-Polygon Intersection. In Graphics Gems, 390-393. Academic Press Inc. 1990.
10. Stolfi, J. Oriented Projective Geometry. Academic Press. 1991.
11. Möller, T. Trumbore, B. Fast, minimum storage ray-triangle intersection. In the Journal of Graphics Tools. 2(1):21-28, 1997.
12. Havran, V. Kopal, T. Bittner, J. Zara, J. Fast Robust Bsp Tree Traversal Algorithm for Ray Tracing. In the Journal of Graphics Tools. 2(4):15-23, 1997.
13. Agate, M. Grimsdale, L. Lister, P.F. The HERO Algorithm for Ray-Tracing Octrees. In Advances in Computer Graphics Hardware IV. 61-73, Springer Verlag. 1991.
14. Havran, V. Sixta, F. Comparison of Hierarchical Grids. In Ray Tracing News, 12(1), http://www.acm.org/pubs/tog/resources/RTNews/html/rtnv12n1.html#art3. 1999.
15. Ercegovac, M.D. On-line Arithmetic: An Overview. In Real Time Signal Processing VII. 86-93, SPIE, 1984.
16. Avizienis. Signed-Digit Number Representations For Fast Parallel Arithmetic. In IRE Trans. Electron. Comput. Vol EC-10. 389-400, 1961.
17. POVray 3.0 Persistence of Vision Raytracer. http://www.povray.com
18. I. Wald, C. Benthin, M. Wagner, P. Slusallek. Interactive Rendering With Coherent Ray-Tracing. In Computer Graphics Forum, Eurographics. 2001.
Appendix A: Bound Controller State Machine
Appendix B: VHDL Code
**** boundcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity boundcontroller is
  generic(
    master : std_logic := '0';
    unitID : std_logic_vector(1 downto 0) := "00");
  port(
    max : out std_logic_vector(31 downto 0);
    maxwe : out std_logic;
    raygroupout : out std_logic_vector(1 downto 0);
    raygroupwe : out std_logic;
    raygroupid : out std_logic_vector(1 downto 0);
    enablenear : out std_logic;
    -- Bus Signals (to Ray Generation Unit)
    raygroup : in std_logic_vector(1 downto 0);
    validraygroup : in std_logic;
    busy : out std_logic;
    -- Bus Signals (to RayTri Unit)
    triIDvalid : out std_logic;
    triID : out std_logic_vector(15 downto 0);
    wanttriID : in std_logic;
    -- Sorted stack & list buffer signals
    l0reset : out std_logic;
    baseaddress : out std_logic_vector(1 downto 0);
    newdata : in std_logic;
    boundNodeIDout: buffer std_logic_vector(9 downto 0);
    resultID : in std_logic_vector(1 downto 0);
    -- List Handler Signals
    hitmask : buffer std_logic_vector(2 downto 0);
    ldataready,lempty : in std_logic;
    llevel : in std_logic_vector(1 downto 0);
    lmax0,lmax1,lmax2 : in std_logic_vector(31 downto 0);
    lboundNodeID : in std_logic_vector(9 downto 0);
    lack : out std_logic;
    lhreset : out std_logic;
    -- Leaf Indirection Memory Interface
    addrind : out std_logic_vector(9 downto 0);
    addrindvalid : out std_logic;
    dataind : in std_logic_vector(31 downto 0);
    dataindvalid : in std_logic;
    -- Triangle List Memory Interface
    tladdr : buffer std_logic_vector(17 downto 0);
    tladdrvalid : out std_logic;
    tldata : in std_logic_vector(63 downto 0);
    tldatavalid : in std_logic;
    -- Result Inputs
    t1in,t2in,t3in : in std_logic_vector(31 downto 0);
    u1in,u2in,u3in,v1in,v2in,v3in : in std_logic_vector(15 downto 0);
    id1in,id2in,id3in : in std_logic_vector(15 downto 0);
    hit1in,hit2in,hit3in : in std_logic;
    -- Result Outputs
    t1,t2,t3 : out std_logic_vector(31 downto 0);
    u1,u2,u3,v1,v2,v3 : out std_logic_vector(15 downto 0);
    id1,id2,id3 : out std_logic_vector(15 downto 0);
    hit1,hit2,hit3 : out std_logic;
    bcvalid : out std_logic;
    -- Counter interface Signals
    done : in std_logic_vector(1 downto 0);
    resetcnt : out std_logic;
    -- Handshake signals
    passCTSout : out std_logic;
    passCTSin : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic;
    -- debugging interface
    statepeek : out std_logic_vector(4 downto 0);
    debugstoplevel : in std_logic_vector(1 downto 0);
    debugleafbreak : in std_logic;
    debugsubcount : out std_logic_vector(1 downto 0);
    debugcount : out std_logic_vector(13 downto 0));
end;
architecture rtl of boundcontroller is
  type state_type is (S_IDLE,S_WAITCTS,S_ACTIVE,S_SEND,S_WAIT1,S_WAIT2,S_WAITCOMPLETE,
                      S_WAITCTS2,S_GETCTS,S_NODECOMPLETE,S_LEAFINDIRECT,S_LEAFACTIVE,
                      S_LEAFSEND,S_LEAFWAIT1,S_LEAFWAIT2,S_LEAFCOMPLETE,S_LEAFGIVECTS,
                      S_LEAFGIVEDONE,S_LEAFPROCESS,S_LEAFGETCTS);
  signal state : state_type;
  signal next_state : state_type;
  signal max0,max1,max2 : std_logic_vector(31 downto 0);
  signal cts : std_logic;
  signal addr,startAddr : std_logic_vector(11 downto 0);
  signal resetcount : std_logic_vector(2 downto 0);
  -- Leaf Node Signals
  signal count : std_logic_vector(13 downto 0);
  signal triDatalatch : std_logic_vector(63 downto 0);
  signal subcount : std_logic_vector(1 downto 0);
  signal maskcount : std_logic_vector(1 downto 0);
  signal debug : std_logic;
begin
  debugsubcount <= subcount;
  debugcount <= count;

  process (state)
  begin
    case state is
      when S_IDLE => statepeek <= "00001";
      when S_WAITCTS => statepeek <= "00010";
      when S_ACTIVE => statepeek <= "00011";
      when S_SEND => statepeek <= "00100";
      when S_WAIT1 => statepeek <= "00101";
      when S_WAIT2 => statepeek <= "00110";
      when S_WAITCOMPLETE => statepeek <= "00111";
      when S_WAITCTS2 => statepeek <= "01001";
      when S_GETCTS => statepeek <= "01010"; -- 10
      when S_NODECOMPLETE => statepeek <= "01011";
      when S_LEAFINDIRECT => statepeek <= "01100";
      when S_LEAFACTIVE => statepeek <= "01101";
      when S_LEAFSEND => statepeek <= "01110";
      when S_LEAFWAIT1 => statepeek <= "01111";
      when S_LEAFWAIT2 => statepeek <= "10000";
      when S_LEAFCOMPLETE => statepeek <= "10001";
      when S_LEAFGIVECTS => statepeek <= "10010";
      when S_LEAFGIVEDONE => statepeek <= "10011";
      when S_LEAFPROCESS => statepeek <= "10100";
      when S_LEAFGETCTS => statepeek <= "10101";
      when others => statepeek <= "11111";
    end case;
  end process;
Process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;raygroupout <= (others => '0');cts <= master;passCTSout <= '0';max0 <= (others => '0');max1 <= (others => '0');max2 <= (others => '0');addr <= (others => '0');startAddr <= (others => '0');boundNodeIDout <= (others => '0');resetcount <= (others => '0');hitmask <= (others => '1');lack <= '0';baseAddress <= (others => '0');l0reset <= '0';resetcnt <= '0';triIDvalid <= '0';triID <= (others => '0');addrind <= (others => '0');addrindvalid <= '0';
tladdrvalid <= '0';tladdr <= (others => '0');tridatalatch <= (others => '0');maskcount <= (others => '0');subcount <= (others => '0');count <= (others => '0');hit1 <= '0'; hit2 <= '0'; hit3 <= '0';t1 <= (others => '0'); t2 <= (others => '0'); t3 <= (others => '0');u1 <= (others => '0'); u2 <= (others => '0'); u3 <= (others => '0');v1 <= (others => '0'); v2 <= (others => '0'); v3 <= (others => '0');id1 <= (others => '0'); id2 <= (others => '0'); id3 <= (others => '0');debug <= '0';
elsif (rising_edge(clk)) thenstate <= next_state;addrind <= (others => '0');l0reset <= '0';lack <= '0';triIDvalid <= '0';triID <= (others => '0');if newdata = '1' and resultID = unitID thenboundNodeIDout <= boundNodeIDout + 1;
end if;if (done = unitID) or (state = S_LEAFCOMPLETE and newdata = '1' and resultID =
unitID) thenresetcnt <= '1';
elseresetcnt <= '0';
end if;case state iswhen S_IDLE =>if validraygroup = '1' and cts = '1' thenraygroupout <= raygroup;
end if;if validraygroup = '1' and cts = '0' thencts <= '1';passCTSout <= '1';
elsif validraygroup = '0' and cts = '1' and passCTSin = '1' thencts <= '0';passCTSout <= '1';
end if;when S_WAITCTS =>if passCTSin = cts thenpassCTSout <= '0';
end if;when S_ACTIVE =>resetcount <= "100";l0reset <= '1';addr <= (others => '0');startAddr <= (others => '0');boundNodeIDout <= (others => '0');baseAddress <= (others => '0');max0 <= (others => '1');max1 <= (others => '1');max2 <= (others => '1');hitMask <= (others => '1');hit1 <= '0'; hit2 <= '0'; hit3 <= '0';
when S_SEND =>if (addr-startaddr /= 48) and (addr-startaddr /= 49) thentriIDvalid <= '1';
end if;triID <= "0000" & addr;addr <= addr+1;if resetcount = 5 thenresetcount <= "000";
elseresetcount <= resetcount+1;
end if;when S_WAITCOMPLETE =>if passCTSin = '1' and cts = '1' thencts <= '0';passCTSout <= '1';
elsif done = unitID and cts = '0' thencts <= '1';passCTSout <= '1';
end if;when S_WAITCTS2 =>if passCTSin = '0' then
passCTSout <= '0';end if;
when S_GETCTS =>if passCTSin = '0' thenpassCTSout <= '0';
end if;when S_NODECOMPLETE =>resetcount <= "100";baseAddress <= llevel+1;boundNodeIDout <= (lBoundNodeID+1)(6 downto 0) & "000";addr <= (((lBoundNodeID+1)(7 downto 0) & "0000")+
((lBoundNodeID+1)(6 downto 0) & "00000")) (11 downto 0);startaddr <= (((lBoundNodeID+1)(7 downto 0) & "0000")+
((lBoundNodeID+1)(6 downto 0) & "00000")) (11 downto 0);if ldataready = '1' and (wantTriID = '1' or llevel = "10") and (llevel <
debugstoplevel)thenlack <= '1';l0reset <= '1';
end if;if ldataready = '1' and llevel = "10" thenaddrind <= lboundNodeID-72;addrindvalid <= '1';
end if;when S_LEAFINDIRECT =>tlAddr <= dataind(17 downto 0);count <= dataind(31 downto 18);if dataindvalid = '1' thenaddrindvalid <= '0';tladdrvalid <= '1';
end if;when S_LEAFACTIVE =>tridatalatch <= tldata;subcount <= "10";maskcount <= "00";if (wanttriID = '1' and tldatavalid = '1') or (count = 0 or count = 1) thentladdr <= tladdr+1;tladdrvalid <= '0';
end if;when S_LEAFSEND =>if maskcount = "11" thentriID <= triDataLatch(15 downto 0);
elsif maskcount = "10" thentriID <= triDataLatch(31 downto 16);
elsif maskcount = "01" thentriID <= triDataLatch(47 downto 32);
elsetriID <= triDataLatch(63 downto 48);
end if;if count /= 0 thencount <= count - 1;if count /= 1 thentriIDvalid <= '1';
end if;if maskcount = "01" thentladdrvalid <= '1';
end if;end if;
when S_LEAFWAIT2 =>if subcount /= 0 thensubcount <= subcount - 1;
end if;if maskcount = "11" thentlAddr <= tlAddr+1;tladdrvalid <= '0';triDataLatch <= tldata;
end if;maskcount <= maskcount + 1;
when S_LEAFCOMPLETE =>if (newdata = '0' or resultID /= unitID) and CTS = '1' and passCTSin = '1' thencts <= '0';passCTSout <= '1';
end if;when S_LEAFGIVECTS =>if passCTSin = '0' and (newdata = '0' or resultID /= unitID) thenpassCTSout <= '0';
end if;
when S_LEAFGIVEDONE =>if passCTSin = '0' thenpassCTSout <= '0';
end if;when S_LEAFPROCESS =>-- latch new hitsif hit1in = '1' and hitmask(0) = '1' thent1 <= t1in; u1 <= u1in; v1 <= v1in; id1 <= id1in; hit1 <= '1'; hitmask(0) <=
'0';end if;if hit2in = '1' and hitmask(1) = '1' thent2 <= t2in; u2 <= u2in; v2 <= v2in; id2 <= id2in; hit2 <= '1'; hitmask(1) <=
'0';end if;if hit3in = '1' and hitmask(2) = '1' thent3 <= t3in; u3 <= u3in; v3 <= v3in; id3 <= id3in; hit3 <= '1'; hitmask(2) <=
'0';end if;if cts='0' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1) = '1' and hit2in
= '0') or(hitmask(2) = '1' and hit3in = '0')) then
passCTSout <= '1';cts <= '1';
end if;when S_LEAFGETCTS =>if passCTSin = '1' thenpassCTSout <= '0';
end if;end case;
end if;end process;
busy <= '0' when (state = S_IDLE) else '1';
process (state,validraygroup,cts,passCTSin,wantTriID,addr,startAddr,done,ldataready,lempty,llevel,max0,max1,max2,resetcount,dataindvalid,tldatavalid,hit1in,hit2in,hit3in,hitmask,resultId,newdata,subcount,count)
Beginmax <= (others => '0');maxwe <= '0';raygroupID <= (others => '0');enablenear <= '0';raygroupwe <= '0';bcvalid <= '0';lhreset <= '0';case state IS
when S_IDLE =>lhreset <= '1';if validraygroup = '1' and cts = '1' thennext_state <= S_ACTIVE;
elsif validraygroup = '1' and cts = '0' thennext_state <= S_WAITCTS;
elsif validraygroup = '0' and passCTSin = '1' and cts = '1' thennext_state <= S_WAITCTS;
elsenext_state <= S_IDLE;
end if;when S_WAITCTS =>if passCTSin = cts thennext_state <= S_IDLE;
elsenext_state <= S_WAITCTS;
end if;when S_ACTIVE =>if wantTriID = '1' thennext_state <= S_SEND;
elsenext_state <= S_ACTIVE;
end if;when S_SEND =>if addr = startAddr thenmax <= max0;maxwe <= '1';
end if;if (addr-startAddr >= 1) and (addr-startAddr /= 49) thenraygroupID <= unitID;
end if;
next_State <= S_WAIT1;if resetcount = 5 thenraygroupwe <= '1';
end if;enablenear <= '1';
when S_WAIT1 =>if addr = startAddr thenmax <= max1;maxwe <= '1';raygroupID <= "01";
end if;if addr-startaddr=49 thennext_state <= S_WAITCOMPLETE;
elsenext_state <= S_WAIT2;
end if;when S_WAIT2 =>if addr = startAddr thenmax <= max2;maxwe <= '1';raygroupID <= "10";
end if;next_state <= S_SEND;
when S_WAITCOMPLETE =>if passCTSin = '1' and cts = '1' thennext_state <= S_WAITCTS2;
elsif done = unitID and cts = '0' thennext_state <= S_GETCTS;
elsif done = unitID and cts = '1' thennext_state <= S_NODECOMPLETE;
elsenext_state <= S_WAITCOMPLETE;
end if;when S_WAITCTS2 =>if passCTSin = '0' thennext_state <= S_WAITCOMPLETE;
elsenext_state <= S_WAITCTS2;
end if;when S_GETCTS =>if passCTSin = '1' thennext_state <= S_NODECOMPLETE;
elsenext_state <= S_GETCTS;
end if;when S_NODECOMPLETE =>if lempty = '1' thennext_state <= S_IDLE;bcvalid <= '1';
elsif ldataready = '1' and llevel = "10" and (debugstoplevel > "10") thennext_state <= S_LEAFINDIRECT;
elsif ldataready = '1' and wantTriID = '1' and llevel < debugstoplevel thennext_state <= S_SEND;
elsenext_state <= S_NODECOMPLETE;
end if;when S_LEAFINDIRECT =>if dataindvalid = '1' thennext_state <= S_LEAFACTIVE;
elsenext_state <= S_LEAFINDIRECT;
end if;when S_LEAFACTIVE =>if count = 0 or count = 1 thennext_state <= S_NODECOMPLETE;
elsif wanttriID = '1' and tldatavalid = '1' thennext_state <= S_LEAFSEND;
elsenext_state <= S_LEAFACTIVE;
end if;when S_LEAFSEND =>if count /= 0 thennext_state <= S_LEAFWAIT1;
elsenext_state <= S_LEAFCOMPLETE;
end if;if subcount = "10" then
max <= max0;maxwe <= '1';
end if;if (subcount = "01") thenraygroupID <= unitID;
elseraygroupID <= "00";
end if;enablenear <= '0';if subcount = "01" or count = 0 thenraygroupwe <= '1';
end if;when S_LEAFWAIT1 =>next_state <= S_LEAFWAIT2;if subcount = "10" thenmax <= max1;maxwe <= '1';raygroupID <= "01";
end if;when S_LEAFWAIT2 =>next_state <= S_LEAFSEND;if subcount = "10" thenmax <= max2;maxwe <= '1';raygroupID <= "10";
end if;when S_LEAFCOMPLETE =>if (newdata = '0' or resultID /= unitID) and CTS = '1' and passCTSin = '1' thennext_state <= S_LEAFGIVECTS;
elsif newdata = '1' and resultID = unitID thennext_state <= S_LEAFPROCESS;
elsenext_state <= S_LEAFCOMPLETE;
end if;when S_LEAFGIVECTS =>if newdata = '1' and resultID = unitID thennext_state <= S_LEAFGIVEDONE;
elsif passCTSin = '0' thennext_state <= S_LEAFCOMPLETE;
elsenext_state <= S_LEAFGIVECTS;
end if;when S_LEAFGIVEDONE =>if passCTSin = '0' thennext_state <= S_LEAFPROCESS;
elsenext_state <= S_LEAFGIVEDONE;
end if;when S_LEAFPROCESS =>if debugLeafBreak = '1' thennext_state <= S_IDLE;
elsif cts = '0' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1)='1' and hit2in= '0') or
(hitmask(2) = '1' and hit3in = '0')) thennext_state <= S_LEAFGETCTS;
elsif cts = '1' and ((hitmask(0)='1' and hit1in='0') or (hitmask(1)='1' and hit2in= '0') or
(hitmask(2) = '1' and hit3in = '0')) thennext_state <= S_NODECOMPLETE;
elsenext_state <= S_IDLE;bcvalid <= '1';
end if;when S_LEAFGETCTS =>if passCTSin = '0' thennext_state <= S_LEAFGETCTS;
elsenext_state <= S_NODECOMPLETE;
end if;end case;
end process;
end rtl;
**** crossproduct.vhd ****
----------------------------------------------
-- Pipelined Vector Cross Product Component
-- C = A x B
-- Performs a vector cross product in 2
-- clock cycles. Synplify's pipeline
-- option should be enabled to better
-- balance the pipeline cycles.
----------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity crossproduct is
  generic (
    widthA : natural := 32;
    widthB : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(widthA-1 downto 0);
    Bx,By,Bz : in std_logic_vector(widthB-1 downto 0);
    Cx,Cy,Cz : out std_logic_vector(widthA+widthB downto 0);
    clk : in std_logic);
end;

architecture rtl of crossproduct is
  signal AyBz, AzBy, AzBx : std_logic_vector(widthA+widthB-1 downto 0);
  signal AxBz, AxBy, AyBx : std_logic_vector(widthA+widthB-1 downto 0);
begin
  process(clk)
  begin
    if (rising_edge(clk)) then
      -- Stage 1: register the six partial products
      AyBz <= Ay*Bz;
      AzBy <= Az*By;
      AzBx <= Az*Bx;
      AxBz <= Ax*Bz;
      AxBy <= Ax*By;
      AyBx <= Ay*Bx;
      -- Stage 2: sign-extend each product by one bit, then subtract
      Cx <= (AyBz(widthA+widthB-1) & AyBz) - (AzBy(widthA+widthB-1) & AzBy);
      Cy <= (AzBx(widthA+widthB-1) & AzBx) - (AxBz(widthA+widthB-1) & AxBz);
      Cz <= (AxBy(widthA+widthB-1) & AxBy) - (AyBx(widthA+widthB-1) & AyBx);
    end if;
  end process;
end rtl;
**** delay.vhd ****
library ieee;
use ieee.std_logic_1164.all;

entity delay is
  generic (
    width : natural := 32;
    depth : natural := 1);
  port(
    datain : in std_logic_vector(width-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of delay is
  type delayarray is array (0 to depth-1) of std_logic_vector (width-1 downto 0);
  signal buff : delayarray;
begin
  dataout <= buff(depth-1);

  process(clk)
  begin
    if (rising_edge (clk)) then
      buff(0) <= datain;
      if (depth > 1) then
        row : for k in 0 to depth-2 loop
          buff(k+1) <= buff(k);
        end loop row;
      end if;
    end if;
  end process;
end rtl;
**** divide.vhd ****
--------------------------------------------------
-- Parameterized Fixed Point Divide Component
--
-- Qout = (A / B)*2^widthfrac
--
-- Performs unsigned fixed point division
-- between 2 numbers. The divide is pipelined
-- such that 1 quotient bit is generated per
-- clock cycle. The throughput is one divide
-- per cycle for any size input.
--
-- widthOut specifies the total output width
-- widthFrac specifies how many of the output
-- bits are in fact fractional
--------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity divide is
  generic (
    widthA : natural := 64;
    widthOut : natural := 32;   -- Width of the output
    widthB : natural := 64;
    widthFrac : natural := 15); -- Fraction bits in output
  port(
    A : in std_logic_vector(widthA-1 downto 0);
    B : in std_logic_vector(widthB-1 downto 0);
    Qout : out std_logic_vector(widthOut-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of divide is
  type stdlogicarrayn is array(0 to widthOut-1) of std_logic_vector(widthA+widthFrac-1 downto 0);
  type stdlogicarraym is array(0 to widthOut-1) of std_logic_vector(widthOut-1 downto 0);
  type stdlogicarrayo is array(0 to widthOut-1) of std_logic_vector(widthB-1 downto 0);
  signal c : stdlogicarrayn;
  signal q : stdlogicarraym;
  signal bp : stdlogicarrayo;
begin
  c(0)(widthA+widthFrac-1 downto widthFrac) <= A;
  c(0)(widthFrac-1 downto 0) <= (others => '0');
  q(0) <= (others => '0');
  bp(0) <= B;

  process (clk)
  begin
    if (clk'event and clk = '1') then
      -- One pipeline stage per quotient bit, most significant bit first
      row: for k in 0 to widthOut-2 loop
        if (c(k)(widthA+widthFrac-1 downto widthOut-1-k)-bp(k) >= 0) then
          q(k+1) <= q(k)(widthOut-2 downto 0) & '1';
          c(k+1) <= (c(k)(widthA+widthFrac-1 downto widthOut-1-k)-bp(k))
                    (k+(widthA-widthOut)+widthFrac downto 0) &
                    c(k)(widthOut-k-2 downto 0);
        else
          q(k+1) <= q(k)(widthOut-2 downto 0) & '0';
          c(k+1) <= c(k);
        end if;
        bp(k+1) <= bp(k);
      end loop row;

      -- Final stage produces the least significant quotient bit
      if (c(widthOut-1)-bp(widthOut-1) >= 0) then
        Qout <= q(widthOut-1)(widthOut-2 downto 0) & '1';
      else
        Qout <= q(widthOut-1)(widthOut-2 downto 0) & '0';
      end if;
    end if;
  end process;
end rtl;
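
The divider above follows a shift-and-subtract (restoring) scheme, producing one quotient bit per stage. A minimal behavioural sketch in Python is given below as a possible reference model for checking Qout = (A / B)*2^widthFrac. It is an assumption-laden illustration, not a bit-exact model of the VHDL: it handles only non-negative operands whose quotient fits in width_out bits, and it ignores the pipeline registers and the signed comparison semantics of the VHDL package in use. The function name is hypothetical.

```python
def fixed_point_divide(a, b, width_out=32, width_frac=15):
    """Restoring shift-and-subtract divide: returns floor((a << width_frac) / b)
    for non-negative a and positive b, assuming the quotient fits in width_out bits."""
    if a < 0 or b <= 0:
        raise ValueError("model covers non-negative a and positive b only")
    remainder = a << width_frac          # append the fractional zero bits
    quotient = 0
    for k in range(width_out):           # one iteration per quotient bit, MSB first
        shift = width_out - 1 - k
        quotient <<= 1
        if (remainder >> shift) >= b:    # trial subtraction succeeds
            remainder -= b << shift
            quotient |= 1
    return quotient


print(fixed_point_divide(3, 2))  # 1.5 scaled by 2**15
```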
**** dotproduct.vhd ****
--------------------------------------------
-- Pipelined Vector Dot Product Component
-- C = A . B
-- Performs a vector dot product in 2
-- clock cycles. Synplify's pipeline
-- option should be enabled to better
-- balance the pipeline cycles.
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity dotproduct is
  generic (
    widthA : natural := 32;
    widthB : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(widthA-1 downto 0);
    Bx,By,Bz : in std_logic_vector(widthB-1 downto 0);
    C : out std_logic_vector(widthA+widthB+1 downto 0);
    clk : in std_logic);
end;

architecture rtl of dotproduct is
  signal AxBx, AyBy, AzBz : std_logic_vector(widthA+widthB-1 downto 0);
begin
  process(clk)
  begin
    if (rising_edge(clk)) then
      -- Stage 1: register the three partial products
      AxBx <= Ax*Bx;
      AyBy <= Ay*By;
      AzBz <= Az*Bz;
      -- Stage 2: sign-extend each product by two bits, then sum
      C <= (AxBx(widthA+widthB-1) & AxBx(widthA+widthB-1) & AxBx) +
           (AyBy(widthA+widthB-1) & AyBy(widthA+widthB-1) & AyBy) +
           (AzBz(widthA+widthB-1) & AzBz(widthA+widthB-1) & AzBz);
    end if;
  end process;
end rtl;
**** dpram.vhd ****
-------------------------------------------------------
-- Dual Ported Ram Module w/Registered Output
-- - Synplify should infer ram from the coding style
-- - The virtex distributed ram is 1bitx16
-- - Uses approximately 2 LUTs per bit wide
-------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
-- Unsigned package so that conv_integer maps addresses 8..15 to valid
-- array indices; the signed package would yield negative indices in simulation.
use ieee.std_logic_unsigned.all;

entity dpram is
  generic(
    width : natural := 16);
  port(
    we : in std_logic;
    raddr, waddr : in std_logic_vector(3 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    datain : in std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of dpram is
  type memarray is array(15 downto 0) of std_logic_vector(width-1 downto 0);
  signal mem : memarray;
  signal data : std_logic_vector(width-1 downto 0);
begin
  data <= mem(conv_integer(raddr));

  process(clk,we,waddr)
  begin
    if (rising_edge (clk)) then
      dataout <= data;
      if (we = '1') then
        mem(conv_integer(waddr)) <= datain;
      end if;
    end if;
  end process;
end rtl;
**** exchange.vhd ****
----------------------------------
-- Scalar Mux Component
-- C = A when ABn = '1' else B
----------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity exchange is
  generic (
    width : natural := 32);
  port(
    A : in std_logic_vector(width-1 downto 0);
    B : in std_logic_vector(width-1 downto 0);
    C : out std_logic_vector(width-1 downto 0);
    ABn : in std_logic);
end;

architecture rtl of exchange is
begin
  C <= A when (ABn = '1') else B;
end rtl;
**** fifo3.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity fifo3 is
  generic (
    datawidth : natural := 18);
  port(
    datain : in std_logic_vector(datawidth-1 downto 0);
    writeen : in std_logic;
    dataout : out std_logic_vector(datawidth-1 downto 0);
    shiften : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of fifo3 is
  type stdlogicarray is array(0 to 2) of std_logic_vector(datawidth-1 downto 0);
  signal data : stdlogicarray;
  signal pos : std_logic_vector(1 downto 0);
begin
  dataout <= data(0);

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      pos <= "00";
      data(0) <= (others => '0');
      data(1) <= (others => '0');
      data(2) <= (others => '0');
    elsif rising_edge(clk) then
      if writeen = '1' and shiften = '1' then
        case (pos) is
          when "00" =>
            data(0) <= (others => '-');
            data(1) <= (others => '-');
            data(2) <= (others => '-');
          when "01" =>
            data(0) <= datain;
            data(1) <= (others => '-');
            data(2) <= (others => '-');
          when "10" =>
            data(0) <= data(1);
            data(1) <= datain;
            data(2) <= (others => '-');
          when "11" =>
            data(0) <= data(1);
            data(1) <= data(2);
            data(2) <= datain;
        end case;
      elsif shiften = '1' then
        data(0) <= data(1);
        data(1) <= data(2);
        pos <= pos-1;
      elsif writeen = '1' then
        case (pos) is
          when "00" => data(0) <= datain;
          when "01" => data(1) <= datain;
          when "10" => data(2) <= datain;
          when others =>
        end case;
        pos <= pos + 1;
      end if;
    end if;
  end process;
end rtl;
**** listbuffer.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;

entity listbuffer is
  generic(
    width : natural := 48;
    subdepth : natural := 3;
    totaldepth : natural := 5);
  port(
    peekdata : in std_logic_vector(width*(2**subdepth)-1 downto 0);
    commit : in std_logic;
    nextaddr : in std_logic;
    baseaddress : in std_logic_vector(totaldepth-subdepth-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of listbuffer is
  type state_type is (S_IDLE,S_WRITE);
  signal state : state_type;
  signal next_state : state_type;
  signal we : std_logic;
  signal address : std_logic_vector(totaldepth-1 downto 0);
  signal datain : std_logic_vector(width-1 downto 0);
begin
  ram : spram
    generic map(width,totaldepth)
    port map(we,address,dataout,datain,clk);

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      address <= (others => '0');
    elsif (rising_edge(clk)) then
      state <= next_state;
      case state is
        when S_IDLE =>
          if commit = '1' then
            address(totaldepth-1 downto subdepth) <= baseaddress;
            address(subdepth-1 downto 0) <= (others => '0');
          end if;
          if nextaddr = '1' then
            address(subdepth-1 downto 0) <= address(subdepth-1 downto 0) + 1;
          end if;
        when S_WRITE =>
          address(subdepth-1 downto 0) <= address(subdepth-1 downto 0) + 1;
        when others =>
      end case;
    end if;
  end process;

  process (state,commit,address,peekdata)
  begin
    we <= '0';
    datain <= (others => '-');
    case state is
      when S_IDLE =>
        if commit = '1' then
          next_state <= S_WRITE;
        else
          next_state <= S_IDLE;
        end if;
      when S_WRITE =>
        writelp : for k in 0 to (2**subdepth)-1 loop
          if k=address(subdepth-1 downto 0) then
            datain <= peekdata((k+1)*width-1 downto k*width);
          end if;
        end loop writelp;
        we <= '1';
        if address(subdepth-1 downto 0) = (2**subdepth)-1 then
          next_state <= S_IDLE;
        else
          next_state <= S_WRITE;
        end if;
    end case;
  end process;
end rtl;
**** listhandler.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
library work;use work.complib.all;
entity listhandler isport(
dataarrayin : in std_logic_vector(8*109-1 downto 0);commit : in std_logic;
hitmask : in std_logic_vector(2 downto 0);ack : in std_logic;max0,max1,max2 : out std_logic_vector(31 downto 0);boundnodeID : out std_logic_vector(9 downto 0);level : out std_logic_vector(1 downto 0);empty,dataready : buffer std_logic;
reset : in std_logic;globalreset : in std_logic;clk : in std_logic;
peekoffset0,peekoffset1,peekoffset2 : out std_logic_vector(2 downto 0);peekhit : out std_logic;peekstate : out std_logic_vector(1 downto 0) );
end;
architecture rtl of listhandler is
type state_type is (S_IDLE,S_WRITE,S_ALIGN);signal next_state, state : state_type;
signal readlevel, writelevel : std_logic_vector(1 downto 0);signal offset0, offset1, offset2 : std_logic_vector(2 downto 0);signal address : std_logic_vector(4 downto 0);signal we : std_logic;signal datain,dataout : std_logic_vector(109-1 downto 0);signal lvempty : std_logic_vector(2 downto 0);signal busy : std_logic;
begin
    -- Debug Stuff
    peekoffset0 <= offset0;
    peekoffset1 <= offset1;
    peekoffset2 <= offset2;
    peekhit <= '1' when (datain(108) = '1' or datain(107) = '1' or datain(106) = '1') else
               '0';
process (state)begin
case (state) iswhen S_IDLE => peekstate <= "01";when S_WRITE => peekstate <= "10";when S_ALIGN => peekstate <= "11";when others => peekstate <= "00";
end case;end process;
    -- Real Code
    ram : spram
        generic map(109,5)
        port map(we, address,dataout,datain,clk);
level <= readlevel;max0 <= dataout(41 downto 10) when dataout(106) = '1' else (others => '0');max1 <= dataout(73 downto 42) when dataout(107) = '1' else (others => '0');max2 <= dataout(105 downto 74) when dataout(108) = '1' else (others => '0');boundnodeID <= dataout(9 downto 0);
empty <= '1' when (lvempty = "111" and busy = '0') else '0';dataready <= '1' when ((dataout(106) = '1' and hitmask(0) = '1') or
(dataout(107) = '1' and hitmask(1) = '1') or(dataout(108) = '1' and hitmask(2) = '1')) and(empty = '0') and (busy = '0') else '0';
address(4 downto 3) <= readlevel;
process (offset0,offset1,offset2,address)begin
if address(4 downto 3) = "00" thenaddress(2 downto 0) <= offset0;
elsif address(4 downto 3) = "01" thenaddress(2 downto 0) <= offset1;
elsif address(4 downto 3) = "10" thenaddress(2 downto 0) <= offset2;
elseaddress(2 downto 0) <= (others => '-');
end if;end process;
process (clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;lvempty <= (others => '1');busy <= '0';readlevel <= "00";writelevel <= "00";offset0 <= "000";offset1 <= "000";offset2 <= "000";
elsif (rising_edge(clk)) thenstate <= next_state;case state iswhen S_IDLE =>if (reset = '1') thenbusy <= '0';
lvempty <= (others => '1');readlevel <= "00"; writelevel <= "00";offset0 <= "000"; offset1 <= "000"; offset2 <= "000";
elsif (commit = '1') thenbusy <= '1';if writelevel = "00" thenoffset0 <= "000";
elsif writelevel = "01" thenoffset1 <= "000";
elsif writelevel = "10" thenoffset2 <= "000";
end if;readlevel <= writelevel;
elsif (ack = '1') thenwritelevel <= readlevel+1;busy <= '1'; -- This will ensure that align skips one
end if;when S_WRITE =>if readlevel = "00" thenoffset0 <= offset0 + 1;
elsif readlevel = "01" thenoffset1 <= offset1 + 1;
elsif readlevel = "10" thenoffset2 <= offset2 + 1;
end if;if address(2 downto 0) = "111" thenbusy <= '0';
end if;if datain(108) = '1' or datain(107) = '1' or datain(106) = '1' thenif readlevel = "00" thenlvempty(0) <= '0';
elsif readlevel = "01" thenlvempty(1) <= '0';
elsif readlevel = "10" thenlvempty(2) <= '0';
end if;end if;
when S_ALIGN =>busy <= '0';if empty = '0' and dataready = '0' thenif readlevel = "00" thenif offset0 = "111" thenlvempty(0) <= '1';
elseoffset0 <= offset0 + 1;
end if;elsif readlevel = "01" thenif offset1 = "111" thenlvempty(1) <= '1';readlevel <= "00";
elseoffset1 <= offset1 + 1;
end if;elsif readlevel = "10" thenif offset2 = "111" thenlvempty(2) <= '1';if lvempty(1) = '1' thenreadlevel <= "00";
elsereadlevel <= "01";
end if;elseoffset2 <= offset2 + 1;
end if;end if;
end if;end case;
end if;end process;
process (state,commit,ack,address,dataarrayin,reset,dataready,empty)begin
we <= '0';datain <= (others => '-');case state is
when S_IDLE =>if reset = '1' then
next_state <= S_IDLE;elsif commit = '1' thennext_state <= S_WRITE;
elsif (ack = '1') or (dataready = '0' and empty = '0') thennext_state <= S_ALIGN;
elsenext_state <= S_IDLE;
end if;when S_WRITE =>writelp : for k in 0 to 7 loopif k=address(2 downto 0) thendatain <= dataarrayin((k+1)*109-1 downto k*109);
end if;end loop writelp;we <= '1';if address(2 downto 0) = "111" thennext_state <= S_ALIGN;
elsenext_state <= S_WRITE;
end if;when S_ALIGN =>if empty = '0' and dataready = '0' thennext_state <= S_ALIGN;
elsenext_state <= S_IDLE;
end if;end case;
end process;
end rtl;
**** memoryinterface.vhd ****
------------------------------------------
-- Triangle Memory Controller Component
--
-- There are 2 nibble bus signals that
-- allow the component to download
-- memory contents from the Sun. First
-- the starting address is written,
-- then data is written in 64-bit
-- chunks. The address is auto-incremented.
--
-- The dataout and datavalid signals
-- contain the triangle data.
--
-- wanttriID is high to request a new
-- triangle ID for the 2nd cycle. A high
-- triIDvalid signal indicates that
-- the user has applied a triangle ID
-- to the triID port.
--
-- cyclenum is a control signal that
-- counts from 0-2. This signal
-- determines the ray to be sent to
-- the ray tri unit as well as which
-- nearest compare unit to use.
------------------------------------------
library ieee;use ieee.std_logic_1164.all;use ieee.std_logic_arith.all;use ieee.std_logic_signed.all;
entity memoryinterface isport(
want_addr : out std_logic;addr_ready : in std_logic;addrin : in std_logic_vector(17 downto 0);want_data : out std_logic;data_ready : in std_logic;datain : in std_logic_vector(63 downto 0);
dataout : out std_logic_vector(191 downto 0);triIDout : out std_logic_vector(15 downto 0);datavalid : out std_logic;
triIDvalid : in std_logic;triID : in std_logic_vector(15 downto 0);wanttriID : out std_logic;cyclenum : out std_logic_vector(1 downto 0);
tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;globalreset : in std_logic;clk : in std_logic);
end;
architecture rtl of memoryinterface istype state_type is (S_READ1,S_READ2,S_READ3,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE);signal state : state_type;signal next_state : state_type;
signal address,oldaddress : std_logic_vector(15 downto 0);signal waddress : std_logic_Vector(17 downto 0);signal databuff : std_logic_vector(127 downto 0);signal addrvalid, oldaddrvalid : std_logic;
begin
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_READ1;addrvalid <= '0';oldaddrvalid <= '0';address <= (others => '0');waddress <= (others => '0');databuff <= (others => '0');dataout <= (others => '0');triIDout <= (others => '0');oldaddress <= (others => '0');datavalid <= '0';wanttriID <= '0';
elsif (rising_edge (clk)) thenstate <= next_state;wanttriID <= '0';case (state) iswhen S_READ1 =>if (addr_ready = '1') thenwaddress <= addrin;
end if;databuff(63 downto 0) <= tm3_sram_data;
when S_READ2 =>databuff(127 downto 64) <= tm3_sram_data;oldaddrvalid <= addrvalid;oldaddress <= address;if (triIDvalid = '1') thenaddrvalid <= '1';address <= triID;
elseaddrvalid <= '0';
end if;wanttriID <= '1';
when S_READ3 =>dataout <= tm3_sram_data & databuff;datavalid <= oldaddrvalid;triIDout <= oldaddress;
when S_WRITE2 =>if (data_ready = '1') thenwaddress <= waddress+1;
end if;when S_WRITEDONE =>addrvalid <= '0';
when others =>end case;
end if;end process;
process (state,address,addr_ready,data_ready,waddress,datain)begin
tm3_sram_we <= "11111111";
tm3_sram_oe <= "11";tm3_sram_adsp <= '1';tm3_sram_data <= (others => 'Z');tm3_sram_addr <= (others => '-');cyclenum <= (others => '-');want_addr <= '1';want_data <= '0';case (state) is
when S_READ1 =>tm3_sram_addr <= '0' & address & "01";tm3_sram_adsp <= '0';tm3_sram_oe <= "01";cyclenum <= "00";if (addr_ready = '1') thennext_state <= S_WRITE1;
elsenext_state <= S_READ2;
end if;when S_READ2 =>tm3_sram_addr <= '0' & address & "10";tm3_sram_adsp <= '0';tm3_sram_oe <= "01";cyclenum <= "01";next_state <= S_READ3;
when S_READ3 =>tm3_sram_addr <= '0' & address & "00";tm3_sram_adsp <= '0';tm3_sram_oe <= "01";cyclenum <= "10";next_state <= S_READ1;
when S_WRITE1 =>want_addr <= '0';want_data <= '1';if (addr_ready = '1') thennext_state <= S_WRITE1;
elsenext_state <= S_WRITE2;
end if;when S_WRITE2 =>want_data <= '1';tm3_sram_addr <= '0' & waddress;tm3_sram_data <= datain;if (addr_ready = '1') thennext_state <= S_WRITEDONE;
elsif (data_ready = '1') thentm3_sram_we <= "00000000";tm3_sram_adsp <= '0';next_state <= S_WRITE3;
elsenext_state <= S_WRITE2;
end if;when S_WRITE3 =>if (data_ready = '1') thennext_state <= S_WRITE3;
elsenext_state <= S_WRITE2;
end if;when S_WRITEDONE =>want_addr <= '0';if (addr_ready = '1') thennext_state <= S_WRITEDONE;
elsenext_state <= S_READ1;
end if;end case;
end process;
end rtl;
**** nearcmp.vhd ****
--------------------------------------------
-- Nearest Triangle Hit Compare Component
--
-- This unit keeps track of the closest
-- triangle that has currently been hit.
-- This unit also tracks the furthest
-- hit distance, but not the triID.
--
-- tin,uin,vin,triIDin,hit are inputs
-- t,u,v,triID,anyhit are outputs
-- enable must be high for compare
-- reset will allow a new hit to be
-- found during the reset cycle
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity nearcmp isport(
tin : in std_logic_vector(31 downto 0);uin,vin : in std_logic_vector(15 downto 0);triIDin : in std_logic_vector(15 downto 0);hit : in std_logic;
t : buffer std_logic_vector(31 downto 0);tfar : buffer std_logic_vector(31 downto 0);u,v : out std_logic_vector(15 downto 0);triID : out std_logic_vector(15 downto 0);anyhit : out std_logic;
maxdist : in std_logic_vector(31 downto 0);enable : in std_logic;reset : in std_logic;
globalreset : in std_logic;clk : in std_logic);
end;
architecture rtl of nearcmp istype nc_state_type is (S_RESET,S_EXISTS);
signal state,next_state : nc_state_type;signal latchnear, latchfar : std_logic;
beginanyhit <= '1' when (state = S_EXISTS) else '0';
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_RESET;t <= (others => '0');tfar <= (others => '1');u <= (others => '0');v <= (others => '0');triID <= (others => '0');
elsif (rising_edge(clk)) thenstate <= next_state;if latchfar = '1' thentfar <= tin;
end if;if latchnear = '1' thent <= tin;u <= uin;v <= vin;triID <= triIDin;
end if;end if;
end process;
process (state,tin,t,enable,hit,reset,maxdist, tfar)begin
latchnear <= '0';latchfar <= '0';case state IS
when S_RESET =>if (enable = '1') and (hit = '1') and (tin < maxdist) thennext_state <= S_EXISTS;latchnear <= '1';latchfar <= '1';
elsenext_state <= S_RESET;
end if;
when S_EXISTS =>if (reset = '1') thenif (enable = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';latchnear <= '1';next_state <= S_EXISTS;
elsenext_state <= S_RESET;
end if;elseif (enable = '1') and (hit = '1') and (tin < maxdist) thenif (tin >= tfar) thenlatchfar <= '1';
end if;if (tin < t) thenlatchnear <= '1';
end if;end if;next_state <= S_EXISTS;
end if;end case;
end process;
end rtl;
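The behaviour described in the nearcmp header comment can be exercised with a small self-checking testbench. The following is a minimal sketch, not part of the original design: the names `nearcmp_tb`, `offer` and `done`, the 10 ns clock period and the distance values are all assumptions chosen for illustration, and it assumes the listing above has been compiled into library `work` as `nearcmp`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (not in the original source).
entity nearcmp_tb is
end nearcmp_tb;

architecture sim of nearcmp_tb is
    signal clk                : std_logic := '0';
    signal globalreset        : std_logic := '1';
    signal enable, reset, hit : std_logic := '0';
    signal tin, maxdist       : std_logic_vector(31 downto 0) := (others => '0');
    signal uin, vin, triIDin  : std_logic_vector(15 downto 0) := (others => '0');
    signal t, tfar            : std_logic_vector(31 downto 0);
    signal u, v, triID        : std_logic_vector(15 downto 0);
    signal anyhit             : std_logic;
    signal done               : boolean := false;
begin
    -- Free-running clock that stops once the stimulus is finished
    clk <= not clk after 5 ns when not done else '0';

    dut : entity work.nearcmp
        port map (
            tin => tin, uin => uin, vin => vin, triIDin => triIDin, hit => hit,
            t => t, tfar => tfar, u => u, v => v, triID => triID, anyhit => anyhit,
            maxdist => maxdist, enable => enable, reset => reset,
            globalreset => globalreset, clk => clk);

    stim : process
        -- Present one candidate hit for a single clock cycle
        procedure offer(dist : std_logic_vector(31 downto 0);
                        id   : std_logic_vector(15 downto 0)) is
        begin
            tin <= dist;
            triIDin <= id;
            hit <= '1';
            wait until rising_edge(clk);
            wait for 1 ns;
            hit <= '0';
        end procedure;
    begin
        wait for 12 ns;
        globalreset <= '0';
        enable <= '1';
        maxdist <= x"0000FFFF";

        -- The first hit sets both the nearest and the furthest distance
        offer(x"00000100", x"0001");
        assert anyhit = '1' and t = x"00000100" and tfar = x"00000100" and triID = x"0001"
            report "first hit not latched" severity failure;

        -- A farther hit only moves tfar
        offer(x"00000200", x"0002");
        assert t = x"00000100" and tfar = x"00000200" and triID = x"0001"
            report "farther hit handled incorrectly" severity failure;

        -- A nearer hit replaces t and triID
        offer(x"00000080", x"0003");
        assert t = x"00000080" and tfar = x"00000200" and triID = x"0003"
            report "nearer hit handled incorrectly" severity failure;

        -- A hit at or beyond maxdist is ignored
        offer(x"00010000", x"0004");
        assert t = x"00000080" and tfar = x"00000200" and triID = x"0003"
            report "hit beyond maxdist was not ignored" severity failure;

        -- reset without a new hit returns the unit to the no-hit state
        reset <= '1';
        wait until rising_edge(clk);
        wait for 1 ns;
        reset <= '0';
        assert anyhit = '0'
            report "anyhit still set after reset" severity failure;

        report "nearcmp_tb finished" severity note;
        done <= true;
        wait;
    end process;
end sim;
```

The comparisons inside nearcmp are unsigned because the unit uses std_logic_unsigned, so the distances above are chosen as small positive values well below maxdist, plus one value just above it.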
**** nearcmpspec.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity nearcmpspec isport(
tin : in std_logic_vector(31 downto 0);uin,vin : in std_logic_vector(15 downto 0);triIDin : in std_logic_vector(15 downto 0);hit : in std_logic;
t : buffer std_logic_vector(31 downto 0);tfar : buffer std_logic_vector(31 downto 0);u,v : out std_logic_vector(15 downto 0);triID : out std_logic_vector(15 downto 0);anyhit : out std_logic;
maxdist : in std_logic_vector(31 downto 0);enable : in std_logic;enablenear : in std_logic;reset : in std_logic;
globalreset : in std_logic;clk : in std_logic);
end;
architecture rtl of nearcmpspec istype nc_state_type is (S_RESET,S_NOHIT,S_EXISTS);
signal state,next_state : nc_state_type;signal latchnear, latchfar : std_logic;
beginanyhit <= '1' when (state = S_EXISTS) else '0';
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_RESET;t <= (others => '0');tfar <= (others => '1');u <= (others => '0');v <= (others => '0');triID <= (others => '0');
elsif (rising_edge(clk)) thenstate <= next_state;if latchfar = '1' thentfar <= tin;
end if;if latchnear = '1' thent <= tin;
u <= uin;v <= vin;triID <= triIDin;
end if;end if;
end process;
process (state,tin,t,enable,hit,reset,maxdist, tfar)begin
latchnear <= '0';latchfar <= '0';case state IS
when S_RESET =>if (enable = '1') and (hit = '1') and (tin < maxdist) thennext_state <= S_EXISTS;latchnear <= '1';latchfar <= '1';
elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) thenlatchnear <= '1';next_state <= S_NOHIT;
elsenext_state <= S_RESET;
end if;when S_NOHIT =>if (reset = '1') thenif (enable = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';latchnear <= '1';next_state <= S_EXISTS;
elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';latchnear <= '1';next_state <= S_NOHIT;
elsenext_state <= S_RESET;
end if;elsif (enable = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';if (tin < t) thenlatchnear <= '1';
end if;next_state <= S_EXISTS;
elsenext_state <= S_NOHIT;
end if;when S_EXISTS =>if (reset = '1') thenif (enable = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';latchnear <= '1';next_state <= S_EXISTS;
elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) thenlatchfar <= '1';latchnear <= '1';next_state <= S_NOHIT;
elsenext_state <= S_RESET;
end if;elseif (enable = '1') and (hit = '1') and (tin < maxdist) thenif (tin >= tfar) thenlatchfar <= '1';
end if;if (tin < t) thenlatchnear <= '1';
end if;elsif (enablenear = '1') and (hit = '1') and (tin < maxdist) thenif (tin <= t) thenlatchnear <= '1';
end if;end if;next_state <= S_EXISTS;
end if;end case;
end process;
end rtl;
**** onlyonecycle.vhd ****
-- A debugging circuit that allows a single cycle pulse to be
-- generated through the ports package
library ieee;use ieee.std_logic_1164.all;use ieee.std_logic_arith.all;
entity onlyonecycle isgeneric(
pulselength : natural := 1);port(
trigger : in std_logic;output : out std_logic;globalreset : in std_logic;clk : in std_logic);
end;
architecture rtl of onlyonecycle istype state_type is (S_IDLE,S_TRIGGERED,S_WAIT);
signal state : state_type;signal next_state : state_type;signal count : integer range 0 to pulselength-1;
beginProcess(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;count <= 0;
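elsif (rising_edge(clk)) then
    state <= next_state;
    case state is
        when S_IDLE =>
            count <= pulselength-1;
        when S_TRIGGERED =>
            -- Stop at zero so that count stays inside its declared
            -- integer range (0 to pulselength-1) during simulation
            if count /= 0 then
                count <= count-1;
            end if;
        when others =>
    end case;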
end if;end process;
process (state, trigger,count)Begin
output <= '0';case state IS
when S_IDLE =>if trigger = '1' thennext_state <= S_TRIGGERED;
elsenext_state <= S_IDLE;
end if;when S_TRIGGERED =>output <= '1';if count = 0 thennext_state <= S_WAIT;
elsenext_state <= S_TRIGGERED;
end if;when S_WAIT =>if trigger = '0' thennext_state <= S_IDLE;
elsenext_state <= S_WAIT;
end if;end case;
end process;
end rtl;
**** raybuffer.vhd ****
--------------------------------------------------
-- Ray Buffer, Output Selection & Bus Interface
--
-- Writes are enabled through the bus
--
-- WE   Function
-- 000  Idle
-- 001  origx <= raydata 27..0
-- 010  origy <= raydata 27..0
-- 011  origz <= raydata 27..0
-- 100  dirx <= raydata 15..0
--      diry <= raydata 31..16
-- 101  dirz <= raydata 15..0
--      swap <= raydata 16
-- 110  maxbuff[rayaddr] <= raydata 31..0
-- 111  activeraygroup <= rayaddr 1..0
--      enablenear <= raydata 0
--
-- subraynum is not latched
-- The output ray data is latched
--------------------------------------------------
library ieee;use ieee.std_logic_1164.all;use ieee.std_logic_arith.all;use ieee.std_logic_signed.all;
library work;use work.complib.all;
entity raybuffer isport(
origx, origy, origz : out std_logic_vector(27 downto 0);dirx, diry, dirz : out std_logic_vector(15 downto 0);maxdist : out std_logic_vector(31 downto 0);raygroupID : out std_logic_vector(1 downto 0);swap : out std_logic;resetout : out std_logic;enablenear : out std_logic;
raydata : in std_logic_vector(31 downto 0);rayaddr : in std_logic_vector(3 downto 0);raywe : in std_logic_vector(2 downto 0); -- May need to be expanded
subraynum : in std_logic_vector(1 downto 0);clk : in std_logic);
end;
architecture rtl of raybuffer issignal origxwe, origywe, origzwe : std_logic;signal dirxwe, dirywe, dirzwe : std_logic;signal swapwe,raygroupwe : std_logic;signal maxwe : std_logic;
signal raddr : std_logic_vector(3 downto 0);signal activeraygroup : std_logic_vector(1 downto 0);signal swapvect : std_logic_vector(0 downto 0);signal resetl : std_logic;signal maxdist0,maxdist1,maxdist2 : std_logic_vector(31 downto 0);signal raygroupIDl : std_logic_vector(1 downto 0);signal maxbuf0,maxbuf1,maxbuf2 : std_logic_vector(31 downto 0);signal enablenearl : std_logic;
begin
    -- Ray output address logic
    raddr <= activeraygroup & subraynum;
    process (clk)
    begin
if (rising_edge (clk)) thenresetl <= raygroupwe;resetout <= resetl;raygroupID <= raygroupIDl;enablenear <= enablenearl;if subraynum = "00" thenmaxdist <= maxdist0;
elsif subraynum = "01" thenmaxdist <= maxdist1;
elsif subraynum = "10" thenmaxdist <= maxdist2;
end if;
if (raygroupwe = '1') thenactiveraygroup <= rayaddr(1 downto 0);
maxdist0 <= maxbuf0;maxdist1 <= maxbuf1;maxdist2 <= maxbuf2;enablenearl <= raydata(0);raygroupIDl <= rayaddr(3 downto 2);
end if;if (maxwe = '1') thenif rayaddr(1 downto 0) = "00" thenmaxbuf0 <= raydata;
elsif rayaddr(1 downto 0) = "01" thenmaxbuf1 <= raydata;
elsif rayaddr(1 downto 0) = "10" thenmaxbuf2 <= raydata;
end if;end if;
end if;end process;
    -- Decode the write enable signals
    origxwe <= '1' when (raywe = "001") else '0';
    origywe <= '1' when (raywe = "010") else '0';
    origzwe <= '1' when (raywe = "011") else '0';
    dirxwe <= '1' when (raywe = "100") else '0';
    dirywe <= '1' when (raywe = "100") else '0';
    dirzwe <= '1' when (raywe = "101") else '0';
    swapwe <= '1' when (raywe = "101") else '0';
    maxwe <= '1' when (raywe = "110") else '0';
    raygroupwe <= '1' when (raywe = "111") else '0';

    -- Instantiate all the required ram elements
    origxram : dpram
        generic map (28)
        port map (origxwe, raddr, rayaddr, origx, raydata(27 downto 0), clk);
    origyram : dpram
        generic map (28)
        port map (origywe, raddr, rayaddr, origy, raydata(27 downto 0), clk);
    origzram : dpram
        generic map (28)
        port map (origzwe, raddr, rayaddr, origz, raydata(27 downto 0), clk);
    dirxram : dpram
        generic map (16)
        port map (dirxwe, raddr, rayaddr, dirx, raydata(15 downto 0), clk);
    diryram : dpram
        generic map (16)
        port map (dirywe, raddr, rayaddr, diry, raydata(31 downto 16), clk);
    dirzram : dpram
        generic map (16)
        port map (dirzwe, raddr, rayaddr, dirz, raydata(15 downto 0), clk);
    swapram : dpram
        generic map (1)
        port map (swapwe, raddr, rayaddr, swapvect, raydata(16 downto 16), clk);

    swap <= swapvect(0);
end rtl;
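The write-enable codes listed in the raybuffer header comment could be collected as named constants, so that the units driving the bus do not repeat the bit patterns as literals. This is only a sketch: the package name `raywe_codes` and all constant names are hypothetical and do not appear in the original source; only the 3-bit values and their meanings are taken from the header comment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package (not in the original source). The values mirror
-- the WE table in the raybuffer.vhd header comment.
package raywe_codes is
    -- No write
    constant RAYWE_IDLE     : std_logic_vector(2 downto 0) := "000";
    -- origx <= raydata(27 downto 0)
    constant RAYWE_ORIGX    : std_logic_vector(2 downto 0) := "001";
    -- origy <= raydata(27 downto 0)
    constant RAYWE_ORIGY    : std_logic_vector(2 downto 0) := "010";
    -- origz <= raydata(27 downto 0)
    constant RAYWE_ORIGZ    : std_logic_vector(2 downto 0) := "011";
    -- dirx <= raydata(15 downto 0), diry <= raydata(31 downto 16)
    constant RAYWE_DIRXY    : std_logic_vector(2 downto 0) := "100";
    -- dirz <= raydata(15 downto 0), swap <= raydata(16)
    constant RAYWE_DIRZSWAP : std_logic_vector(2 downto 0) := "101";
    -- maximum-distance buffer selected by rayaddr <= raydata(31 downto 0)
    constant RAYWE_MAXDIST  : std_logic_vector(2 downto 0) := "110";
    -- activeraygroup <= rayaddr(1 downto 0), enablenear <= raydata(0)
    constant RAYWE_RAYGROUP : std_logic_vector(2 downto 0) := "111";
end raywe_codes;
```

With such a package made visible through `use work.raywe_codes.all;`, a decode line such as `origxwe <= '1' when (raywe = "001") else '0';` could instead compare against `RAYWE_ORIGX`.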
**** raygencont.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

entity raygencont is
generic(
    id : std_logic);
port(
    go : in std_logic;
    initcount : in std_logic_vector(14 downto 0);
    busyout : out std_logic;
    cycles : buffer std_logic_vector(30 downto 0);
    nextaddr : out std_logic_vector(17 downto 0);
    nas : out std_logic;

    -- Memory Controller Interface
    dirReady : in std_logic;
    wantDir : out std_logic;
    dirIn : in std_logic_vector(47 downto 0);
    addrIn : in std_logic_vector(15 downto 0);

    -- RayInterface Interface
    as : out std_logic;
    addr : buffer std_logic_vector(3 downto 0);
    ack : in std_logic;
    dir : out std_logic_vector(47 downto 0);

    -- Bound Controller Interface
    raygroup : buffer std_logic_vector(1 downto 0);
    raygroupvalid : out std_logic;
    busy : in std_logic;

    globalreset : in std_logic;
    clk : in std_logic;
    statepeek : out std_logic_vector(2 downto 0));
end;
architecture rtl of raygencont istype state_type is (S_IDLE,S_SENDSET,S_WAITSENT,S_ENABLEBOUND);signal state : state_type;signal next_state : state_type;signal groupID : std_logic;signal count : std_logic_vector(14 downto 0);signal first : std_logic;signal destaddr : std_logic_vector(17 downto 0);
beginprocess(state)begin
case (state) iswhen S_IDLE => statepeek <= "001";when S_SENDSET => statepeek <= "010";when S_WAITSENT => statepeek <= "011";when S_ENABLEBOUND => statepeek <= "100";when others => statepeek <= "000";
end case;end process;
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;cycles <= (others => '0');dir <= (others => '0');addr(1 downto 0) <= "00";groupID <= '0';count <= (others => '0');first <= '0';destAddr <= (others => '0');raygroupvalid <= '0';
elsif (rising_edge (clk)) thenstate <= next_state;if (state /= S_IDLE) thencycles <= cycles + 1;
end if;case (state) iswhen S_IDLE =>if go = '1' thencycles <= (others => '0');
end if;addr(1 downto 0) <= "00";groupID <= '0';count <= initcount;
when S_SENDSET =>dir <= dirIn;
when S_WAITSENT =>if (ack = '1') and (addr(1 downto 0) /= "10") thenaddr(1 downto 0) <= addr(1 downto 0) + "01";
end if;if (ack = '1') and addr(1 downto 0) = "10" and busy = '0' thenraygroupvalid <= '1';
end if;when S_ENABLEBOUND =>if busy = '1' thengroupID <= not groupID;raygroupvalid <= '0';count <= count - 1;
end if;addr(1 downto 0) <= "00";
when others =>end case;
end if;end process;
addr(3 downto 2) <= raygroup;busyout <= '0' when state = S_IDLE else '1';raygroup <= id & groupID;nextaddr <= "11" & addrIn;nas <= '1' when (state = S_SENDSET and addr(1 downto 0) = "00" and dirReady = '1') else
'0';
process (state,go,ack,busy,dirReady,addr,count)begin
as <= '0';wantDir <= '0';case (state) is
when S_IDLE =>if (go = '1') thennext_state <= S_SENDSET;
elsenext_state <= S_IDLE;
end if;when S_SENDSET =>as <= dirReady;wantdir <= '1';if dirReady = '1' thennext_state <= S_WAITSENT;
elsenext_State <= S_SENDSET;
end if;when S_WAITSENT =>wantdir <= '0';as <= '1';if (ack = '1') and (addr(1 downto 0) /= "10") thennext_state <= S_SENDSET;
elsif (ack = '1') and (busy = '0') thennext_state <= S_ENABLEBOUND;
elsenext_state <= S_WAITSENT;
end if;when S_ENABLEBOUND =>if busy = '0' thennext_state <= S_ENABLEBOUND;
elsif count > 0 thennext_state <= S_SENDSET;
elsenext_state <= S_IDLE;
end if;end case;
end process;
end rtl;
**** raygentop.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;

entity raygentop is
port(
    -- Ports Package Signals
    rgwant_addr : out std_logic;
    rgwant_data : out std_logic;
    rgread_ready : out std_logic;
    rgaddr_ready : in std_logic;
    rgdata_ready : in std_logic;
    rgwant_read : in std_logic;
    rgdatain : in std_logic_vector(63 downto 0);
    rgdataout : out std_logic_vector(63 downto 0);
    rgaddrin : in std_logic_vector(17 downto 0);
    origx : in std_logic_vector(27 downto 0);
    origy : in std_logic_vector(27 downto 0);
    origz : in std_logic_vector(27 downto 0);
    rgcont : in std_logic_vector(31 downto 0);
    rgstat : out std_logic_vector(31 downto 0);

    -- Memory Signals
    tm3_sram_data : inout std_logic_vector(63 downto 0);
    tm3_sram_addr : out std_logic_vector(18 downto 0);
    tm3_sram_we : out std_logic_vector(7 downto 0);
    tm3_sram_oe : out std_logic_vector(1 downto 0);
    tm3_sram_adsp : out std_logic;
    tm3_clk_v0 : in std_logic;

    -- Interchip signals
    raygroup01 : out std_logic_vector(1 downto 0);
    raygroupvalid01 : out std_logic;
    busy01 : in std_logic;
    raygroup10 : out std_logic_vector(1 downto 0);
    raygroupvalid10 : out std_logic;
    busy10 : in std_logic;

    globalreset : in std_logic;

    rgData : out std_logic_vector(31 downto 0);
    rgAddr : out std_logic_vector(3 downto 0);
    rgWE : out std_logic_vector(2 downto 0);
    rgAddrValid : out std_logic;
    rgDone : in std_logic;

    rgResultData : in std_logic_vector(31 downto 0);
    rgResultReady : in std_logic;
    rgResultSource : in std_logic_vector(1 downto 0));
end;
architecture rtl of raygentop is
    signal statepeek,statepeek2 : std_logic_vector(2 downto 0);
    signal as01,as10,ack01,ack10 : std_logic;
    signal addr01, addr10 : std_logic_vector( 3 downto 0);
    signal dir01,dir10,dir : std_logic_vector(47 downto 0);
    signal dirReady01, dirReady10, wantDir01, wantDir10 : std_logic;
    signal address : std_logic_vector(15 downto 0);
    signal cyclecounter : std_logic_vector(30 downto 0);
    signal nas01,nas10 : std_logic;
    signal go : std_logic;
    signal statepeekct : std_logic_vector(2 downto 0);
    -- result Signals
    signal valid01,valid10 : std_logic;
    signal id01a,id01b,id01c : std_logic_vector(15 downto 0);
    signal id10a,id10b,id10c : std_logic_vector(15 downto 0);
    signal hit01a,hit01b,hit01c : std_logic;
    signal hit10a,hit10b,hit10c : std_logic;
    signal wantwriteback, writebackack : std_logic;
    signal writebackdata : std_logic_vector(63 downto 0);
    signal writebackaddr : std_logic_vector(17 downto 0);
    signal nextaddr01,nextaddr10 : std_logic_vector(17 downto 0);
begin
onlyeonecycleinst : onlyonecycleport map(rgCont(0),go,globalreset,tm3_clk_v0);
sramcont : RGsramcontrollerport map(rgwant_addr,rgaddr_ready,rgaddrin,rgwant_data,rgdata_ready,rgdatain,
rgwant_read,rgread_ready,rgdataout,dirReady01,dirReady10,wantDir01,wantDir10,dir,address,wantwriteback,writebackack,writebackdata,writebackaddr,tm3_sram_data,tm3_sram_addr,tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,globalreset, tm3_clk_v0,statepeek);
raysendinst : raysendport map(as01,as10,ack01,ack10,addr01,addr10,dir01,dir10,origx,origy,origz,
rgData,rgAddr, rgWE,rgAddrValid, rgDone, globalreset,tm3_clk_v0, statepeek2);
raygencontinst : raygencontgeneric map('1')port map(go, rgCont(15 downto 1),rgStat(31), cyclecounter, nextaddr01, nas01,
dirReady01, wantDir01, dir, address, as01,addr01,ack01,dir01,
raygroup01,raygroupvalid01,busy01, globalreset,tm3_clk_v0,statepeekct);
resultrecieveinst : resultrecieveport map(valid01,valid10,id01a,id01b,id01c,id10a,id10b,id10c,
hit01a,hit01b,hit01c,hit10a,hit10b,hit10c,rgResultData,rgResultReady,rgResultSource, globalreset,tm3_clk_v0);
resultwriteinst : resultwriterport map(valid01,valid10,id01a,id01b,id01c,id10a,id10b,id10c,
hit01a,hit01b,hit01c,hit10a,hit10b,hit10c,nextaddr01,nextaddr10,nas01,nas10,writebackdata,writebackaddr,wantwriteback,writebackack,globalreset,tm3_clk_v0);
rgStat(30 downto 0) <= cyclecounter;
as10 <= '0';nas10 <= '0';raygroupvalid10 <= '0';wantdir10 <= '0';
end rtl;
**** rayinterface.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity rayinterface is
port(
    max : in std_logic_vector(31 downto 0);
    maxwe : in std_logic;
    raygroup : in std_logic_vector(1 downto 0);
    raygroupwe : in std_logic;
    raygroupid : in std_logic_vector(1 downto 0);
    enablenear : in std_logic;

    -- Interchip Bus Signals (Ray Generation Chip)
    rgData : in std_logic_vector(31 downto 0);
    rgAddr : in std_logic_vector(3 downto 0);
    rgWE : in std_logic_vector(2 downto 0);
    rgAddrValid : in std_logic;
    rgDone : buffer std_logic;

    -- Interchip Bus Signals (Ray Tri Chip)
    raydata : out std_logic_vector(31 downto 0);
    rayaddr : out std_logic_vector(3 downto 0);
    raywe : out std_logic_vector(2 downto 0);

    globalreset : in std_logic;
    clk : in std_logic);
end;
architecture rtl of rayinterface isbegin
Process(clk,globalreset)begin
if (globalreset = '1') thenraydata <= (others => '0');rayaddr <= (others => '0');raywe <= (others => '0');rgDone <= '0';
elsif (rising_edge(clk)) thenraywe <= (others => '0');if rgAddrValid = '0' thenrgDone <= '0';
end if;if raygroupwe = '1' thenraydata(0) <= enablenear;raydata(31 downto 1) <= (others => '0');raywe <= "111";rayaddr <= raygroupid & raygroup;
elsif maxwe = '1' thenraydata <= max;raywe <= "110";
rayaddr <= "00" & raygroupid;elsif rgAddrValid = '1' and rgDone = '0' thenraydata <= rgData;raywe <= rgWe;rayaddr <= rgAddr;rgDone <= '1';
end if;end if;
end process;
end rtl;
**** raysend.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;
entity raysend isport(
as01,as10 : in std_logic;ack01,ack10 : buffer std_logic;addr01, addr10 : in std_logic_vector(3 downto 0);dir01, dir10 : in std_logic_vector(47 downto 0);origx,origy,origz : in std_logic_vector(27 downto 0);
rgData : out std_logic_vector(31 downto 0);rgAddr : out std_logic_vector(3 downto 0);rgWE : out std_logic_vector(2 downto 0);rgAddrValid : out std_logic;rgDone : in std_logic;
globalreset : in std_logic;clk : in std_logic;statepeek : out std_logic_vector(2 downto 0));
end;
architecture rtl of raysend istype state_type is (S_IDLE,S_ORIGX,S_ORIGY,S_ORIGZ,S_DIRXY,S_DIRZ,
S_ORIGXWAIT,S_ORIGYWAIT,S_ORIGZWAIT,S_DIRXYWAIT);signal state : state_type;signal next_state : state_type;
signal unitselect : std_logic;signal dir : std_logic_vector(47 downto 0);
beginprocess(state)begin
case state iswhen S_IDLE => statepeek <= "001";when S_ORIGX => statepeek <= "010";when S_ORIGY => statepeek <= "011";when S_ORIGZ => statepeek <= "100";when S_DIRXY => statepeek <= "101";when S_DIRZ => statepeek <= "110";when others => statepeek <= "000";
end case;end process;
dir <= dir01 when unitselect = '1' else dir10;
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;ack01 <= '0';ack10 <= '0';unitselect <= '1';rgWe <= "000";rgData <= (others => '0');rgAddrValid <= '0';rgAddr <= (others => '0');
elsif (rising_edge (clk)) thenstate <= next_state;
case (state) iswhen S_IDLE =>if ((as01 = '1') and (ack01 = '0')) or
((as10 = '1') and (ack10 = '0')) thenrgData <= "0000" & origx;rgWe <= "001";rgAddrValid <= '1';
end if;if (as01 = '1') and (ack01 = '0') thenrgAddr <= addr01;unitselect <= '1';
elsergAddr <= addr10;unitselect <= '0';
end if;if (as01 = '0' and ack01 = '1') thenack01 <= '0';
end if;if (as10 = '0' and ack10 = '1') thenack10 <= '0';
end if;when S_ORIGX =>if rgDONE = '1' thenrgAddrValid <= '0';
end if;when S_ORIGXWAIT =>rgData <= "0000" & origy;rgWe <= "010";rgAddrValid <= '1';
when S_ORIGY =>if rgDONE = '1' thenrgAddrValid <= '0';
end if;when S_ORIGYWAIT =>rgData <= "0000" & origz;rgWe <= "011";rgAddrValid <= '1';
when S_ORIGZ =>if rgDONE = '1' thenrgAddrValid <= '0';
end if;when S_ORIGZWAIT =>rgData <= dir(31 downto 16) & dir(47 downto 32);rgWe <= "100";rgAddrValid <= '1';
when S_DIRXY =>if rgDONE = '1' thenrgAddrValid <= '0';
end if;when S_DIRXYWAIT =>rgData <= "0000000000000000" & dir(15 downto 0);rgWe <= "101";rgAddrValid <= '1';
when S_DIRZ =>if unitselect = '1' thenack01 <= '1';
elseack10 <= '1';
end if;if rgDONE = '1' thenrgAddrValid <= '0';
end if;when others =>
end case;end if;
end process;
process (state,origx,origy,origz,dir,ack01,ack10,as10,as01,rgdone)begin
case (state) iswhen S_IDLE =>if ((as01 = '1') and (ack01 = '0')) or
((as10 = '1') and (ack10 = '0')) thennext_state <= S_ORIGX;
elsenext_state <= S_IDLE;
end if;
when S_ORIGX =>if rgDone = '1' thennext_state <= S_ORIGXWAIT;
elsenext_state <= S_ORIGX;
end if;when S_ORIGXWAIT =>next_state <= S_ORIGY;
when S_ORIGY =>if rgDone = '1' thennext_state <= S_ORIGYWAIT;
elsenext_state <= S_ORIGY;
end if;when S_ORIGYWAIT =>next_state <= S_ORIGZ;
when S_ORIGZ =>if rgDone = '1' thennext_state <= S_ORIGZWAIT;
elsenext_state <= S_ORIGZ;
end if;when S_ORIGZWAIT =>next_state <= S_DIRXY;
when S_DIRXY =>if rgDone = '1' thennext_state <= S_DIRXYWAIT;
elsenext_state <= S_DIRXY;
end if;when S_DIRXYWAIT =>next_state <= S_DIRZ;
when S_DIRZ =>if rgDone = '1' thennext_state <= S_IDLE;
elsenext_state <= S_DIRZ;
end if;end case;
end process;
end rtl;
**** raytri.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;
library work;use work.complib.all;
entity raytri isport(
clk : in std_logic;
tout : out std_logic_vector(31 downto 0);uout : out std_logic_vector(15 downto 0);vout : out std_logic_vector(15 downto 0);triIDout : out std_logic_vector(15 downto 0);hitout : out std_logic;
vert0x,vert0y,vert0z : in std_logic_vector(27 downto 0);origx,origy,origz : in std_logic_vector(27 downto 0);dirx,diry,dirz : in std_logic_vector(15 downto 0);edge1x,edge1y, edge1z : in std_logic_vector(15 downto 0);edge1size : in std_logic_vector(1 downto 0);edge2x,edge2y, edge2z : in std_logic_vector(15 downto 0);edge2size : in std_logic_vector(1 downto 0);config : in std_logic_vector(0 downto 0);exchangeEdges : in std_logic;triID : in std_logic_vector(15 downto 0);
debugdetneg : out std_logic;debugsuneg : out std_logic;debugvneg : out std_logic;debugsugtdet : out std_logic;
73
debugvgtdet : out std_logic;debugtneg : out std_logic;debughitinter : out std_logic;debughit : out std_logic
);end;
architecture rtl of raytri is
-- Latch Connected Signalssignal tvecxl,tvecyl,tveczl : std_logic_vector(28 downto 0);signal edge1xr,edge1yr,edge1zr : std_logic_vector(15 downto 0);signal edge1xla,edge1yla,edge1zla : std_logic_vector(15 downto 0);signal edge1xlb,edge1ylb,edge1zlb : std_logic_vector(15 downto 0);signal edge2xr,edge2yr,edge2zr : std_logic_vector(15 downto 0);signal edge2xla,edge2yla,edge2zla : std_logic_vector(15 downto 0);signal edge2xlb,edge2ylb,edge2zlb : std_logic_vector(15 downto 0);signal dirxla,diryla,dirzla : std_logic_vector(15 downto 0);signal dirxlb,dirylb,dirzlb : std_logic_vector(15 downto 0);signal detl : std_logic_vector(50 downto 0);signal hitl : std_logic_vector(0 downto 0);signal configl : std_logic_vector(0 downto 0);signal edge1sizer, edge2sizer : std_logic_vector(1 downto 0);signal edge1sizel, edge2sizel : std_logic_vector(1 downto 0);
-- Intermediate Signalssignal pvecx,pvecy,pvecz : std_logic_vector(32 downto 0);signal det : std_logic_vector(50 downto 0);signal tvecx,tvecy,tvecz : std_logic_vector(28 downto 0);signal qvecx,qvecy,qvecz : std_logic_vector(45 downto 0);signal u,su : std_logic_vector(63 downto 0);signal v,usv : std_logic_vector(63 downto 0);signal t : std_logic_vector(63 downto 0);signal uv : std_logic_vector(64 downto 0);signal hitinter : std_logic;
-- Output Signalssignal hit : std_logic_vector(0 downto 0);signal ru : std_logic_vector(15 downto 0);signal rv : std_logic_vector(15 downto 0);
begin-- Level 1 Mathpvec : crossproduct
generic map (16,16)port map (dirxla,diryla,dirzla,edge2xla,edge2yla,edge2zla,pvecx,pvecy,pvecz,clk);
tvec : vectsubgeneric map (28)port map (origx,origy,origz,vert0x,vert0y,vert0z,tvecx,tvecy,tvecz,clk);
tvecdelay : vectdelaygeneric map (29,2)port map (tvecx,tvecy,tvecz,tvecxl,tvecyl,tveczl,clk);
edge1exchange : vectexchangegeneric map (16)port map (edge2x, edge2y, edge2z, edge1x, edge1y, edge1z,
edge1xr,edge1yr,edge1zr,exchangeEdges);
edge2exchange : vectexchangegeneric map (16)port map (edge1x, edge1y, edge1z, edge2x, edge2y, edge2z,
edge2xr,edge2yr,edge2zr,exchangeEdges);
-- changed to delay 1edge1adelay : vectdelay
generic map (16,1)port map (edge1xr,edge1yr,edge1zr,edge1xla,edge1yla,edge1zla,clk);
-- changed to delay 2edge1bdelay : vectdelay
generic map (16,2)port map (edge1xla,edge1yla,edge1zla,edge1xlb,edge1ylb,edge1zlb,clk);
qvec : crossproductgeneric map (29,16)port map (tvecx,tvecy,tvecz,edge1xla,edge1yla,edge1zla,qvecx,qvecy,qvecz,clk);
74
det : dotproductgeneric map (16,33)port map(edge1xlb,edge1ylb,edge1zlb,pvecx,pvecy,pvecz,det,clk);
ui : dotproductgeneric map (29,33)port map (tvecxl,tvecyl,tveczl,pvecx,pvecy,pvecz,u,clk);
dirdelaya : vectdelaygeneric map(16,1)port map(dirx,diry,dirz,dirxla,diryla,dirzla,clk);
dirdelayb : vectdelaygeneric map(16,2)port map(dirxla,diryla,dirzla,dirxlb,dirylb,dirzlb,clk);
vi : dotproductgeneric map (16,46)port map (dirxlb,dirylb,dirzlb,qvecx,qvecy,qvecz,usv,clk);
edge2delaya : vectdelaygeneric map(16,1)port map (edge2xr,edge2yr,edge2zr,edge2xla,edge2yla,edge2zla,clk);
edge2delayb : vectdelaygeneric map(16,2)port map (edge2xla,edge2yla,edge2zla,edge2xlb,edge2ylb,edge2zlb,clk);
ti : dotproductgeneric map (16,46)port map (edge2xlb,edge2ylb,edge2zlb,qvecx,qvecy,qvecz,t,clk);
configdelay : delaygeneric map (1,6)port map(config,configl,clk);
detdelay : delaygeneric map (51,1)port map(det,detl,clk);
divt : dividegeneric map(64,32,51,18)port map(t,det,tout,clk);
divu : dividegeneric map(64,16,51,16) -- Changed fraction part to 16port map(su,det,ru,clk);
divv : dividegeneric map(64,16,51,16) -- Changed fraction part to 16port map(v,det,rv,clk);
rudelay : delaygeneric map (16,16)port map(ru,uout,clk);
rvdelay : delaygeneric map (16,16)port map (rv, vout,clk);
triIDdelay : delaygeneric map (16,37)port map (triID,triIDout,clk);
-- Shifter sectionedge1sizeexchange : exchange
generic map(2)port map (edge2size, edge1size, edge1sizer, exchangeEdges);
edge2sizeexchange : exchangegeneric map(2)port map (edge1size, edge2size, edge2sizer, exchangeEdges);
edge1sizeDelay : delaygeneric map (2,5)port map(edge1sizer,edge1sizel,clk);
75
edge2sizeDelay : delaygeneric map (2,5)port map(edge2sizer,edge2sizel,clk);
shifter1 : shiftergeneric map (64)port map(usv,v,edge1sizel);
shifter2 : shiftergeneric map (64)port map(u,su,edge2sizel);
-- Sun interface (address mapped input registers)
hitdelay : delaygeneric map (1,30)port map (hit,hitl,clk);
hitout <= hitl(0);
debugdetneg <= '1' when (det < 0) else '0';debugsuneg <= '1' when (su < 0) else '0';debugvneg <= '1' when (v < 0) else '0';debugsugtdet <= '1' when (su > det) else '0';debugvgtdet <= '1' when (v > det) else '0';debugtneg <= '1' when (t < 0) else '0';debughitinter <= hitinter;debughit <= hit(0);
process(clk)begin
if (rising_edge(clk)) then-- Hit detection Logic (2 cycles)uv <= (su(63) & su)+(v(63) & v);if ((det < 0) or (su < 0) or (v < 0) or (su > det) or (v > det) or (t <= 0)) thenhitinter <= '0';
elsehitinter <= '1';
end if;if ((hitinter = '0') or ((configl(0) = '0') and (uv > detl))) thenhit(0) <= '0';
elsehit(0) <= '1';
end if;-- Hit Detection Logic Ends
end if;end process;
end rtl;
**** resultcounter.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity resultcounter is
  port(
    resultID : in std_logic_vector(1 downto 0);
    newresult : in std_logic;
    done : out std_logic_vector(1 downto 0);
    reset : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of resultcounter is
  signal count : std_logic_vector(3 downto 0);
  signal curr : std_logic_vector(1 downto 0);
begin
  done <= curr when count = 0 else "00";

  process(clk,globalreset,reset)
  begin
    if (globalreset = '1') or (reset = '1') then
      count <= "1000";
      curr <= (others => '0');
    elsif (rising_edge(clk)) then
      if (resultID /= 0) and (newresult = '1') and (count /= 0) then
        count <= count - 1;
        curr <= resultID;
      end if;
    end if;
  end process;
end rtl;
**** resultinterface.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resultinterface is
  port(
    t1b,t2b,t3b : out std_logic_vector(31 downto 0);
    tf1b,tf2b,tf3b : out std_logic_vector(31 downto 0);
    u1b,u2b,u3b,v1b,v2b,v3b : out std_logic_vector(15 downto 0);
    id1b,id2b,id3b : out std_logic_vector(15 downto 0);
    hit1b,hit2b,hit3b : out std_logic;
    resultID : out std_logic_vector(1 downto 0);
    newdata : out std_logic;
    resultready : in std_logic;
    resultdata : in std_logic_vector(31 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of resultinterface is
  type state_type is (S_IDLE,S_READ1,S_READ2,S_READ3,S_READ4,S_READ5,S_READ6,
                      S_READ7,S_READ8,S_READ9,S_READ10,S_READ11);
  signal state : state_type;
  signal next_state : state_type;
begin
  Process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      t1b <= (others => '0'); t2b <= (others => '0'); t3b <= (others => '0');
      tf1b <= (others => '0'); tf2b <= (others => '0'); tf3b <= (others => '0');
      u1b <= (others => '0'); u2b <= (others => '0'); u3b <= (others => '0');
      v1b <= (others => '0'); v2b <= (others => '0'); v3b <= (others => '0');
      id1b <= (others => '0'); id2b <= (others => '0'); id3b <= (others => '0');
      hit1b <= '0'; hit2b <= '0'; hit3b <= '0';
      resultID <= (others => '0');
      newdata <= '0';
    elsif (rising_edge(clk)) then
      state <= next_state;
      newdata <= '0';
      case state is
        when S_IDLE =>
          if (resultready = '1') then
            t1b <= resultdata;
          end if;
        when S_READ1 =>
          tf1b <= resultdata;
        when S_READ2 =>
          u1b <= resultdata(31 downto 16);
          v1b <= resultdata(15 downto 0);
        when S_READ3 =>
          id1b <= resultdata(15 downto 0);
          hit1b <= resultdata(16);
          resultID <= resultdata(18 downto 17);
        when S_READ4 =>
          t2b <= resultdata;
        when S_READ5 =>
          tf2b <= resultdata;
        when S_READ6 =>
          u2b <= resultdata(31 downto 16);
          v2b <= resultdata(15 downto 0);
        when S_READ7 =>
          id2b <= resultdata(15 downto 0);
          hit2b <= resultdata(16);
        when S_READ8 =>
          t3b <= resultdata;
        when S_READ9 =>
          tf3b <= resultdata;
        when S_READ10 =>
          u3b <= resultdata(31 downto 16);
          v3b <= resultdata(15 downto 0);
        when S_READ11 =>
          id3b <= resultdata(15 downto 0);
          hit3b <= resultdata(16);
          newdata <= '1';
      end case;
    end if;
  end process;

  process (state, resultready)
  Begin
    case state IS
      when S_IDLE =>
        if (resultready = '1') then
          next_state <= S_READ1;
        else
          next_state <= S_IDLE;
        end if;
      when S_READ1 =>
        next_state <= S_READ2;
      when S_READ2 =>
        next_state <= S_READ3;
      when S_READ3 =>
        next_state <= S_READ4;
      when S_READ4 =>
        next_state <= S_READ5;
      when S_READ5 =>
        next_state <= S_READ6;
      when S_READ6 =>
        next_state <= S_READ7;
      when S_READ7 =>
        next_state <= S_READ8;
      when S_READ8 =>
        next_state <= S_READ9;
      when S_READ9 =>
        next_state <= S_READ10;
      when S_READ10 =>
        next_state <= S_READ11;
      when S_READ11 =>
        next_state <= S_IDLE;
    end case;
  end process;
end rtl;
**** resultrecieve.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resultrecieve is
  port(
    valid01,valid10 : out std_logic;
    id01a,id01b,id01c : out std_logic_vector(15 downto 0);
    id10a,id10b,id10c : out std_logic_vector(15 downto 0);
    hit01a,hit01b,hit01c : out std_logic;
    hit10a,hit10b,hit10c : out std_logic;
    rgResultData : in std_logic_vector(31 downto 0);
    rgResultReady : in std_logic;
    rgResultSource : in std_logic_vector(1 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of resultrecieve is
  type state_type is (S_IDLE,S_READ01,S_READ10);
  signal state : state_type;
  signal next_state : state_type;
begin
  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      valid01 <= '0'; valid10 <= '0';
      hit01a <= '0'; hit01b <= '0'; hit01c <= '0';
      hit10a <= '0'; hit10b <= '0'; hit10c <= '0';
      id01a <= (others => '0'); id01b <= (others => '0'); id01c <= (others => '0');
      id10a <= (others => '0'); id10b <= (others => '0'); id10c <= (others => '0');
    elsif (rising_edge (clk)) then
      state <= next_state;
      valid01 <= '0';
      valid10 <= '0';
      case (state) is
        when S_IDLE =>
          if rgResultReady = '1' and rgResultSource = "01" then
            id01a <= rgResultData(31 downto 16);
            id01b <= rgResultData(15 downto 0);
          elsif rgResultReady = '1' and rgResultSource = "10" then
            id10a <= rgResultData(31 downto 16);
            id10b <= rgResultData(15 downto 0);
          end if;
        when S_READ01 =>
          id01c <= rgResultData(15 downto 0);
          hit01a <= rgResultData(18);
          hit01b <= rgResultData(17);
          hit01c <= rgResultData(16);
          valid01 <= '1';
        when S_READ10 =>
          id10c <= rgResultData(15 downto 0);
          hit10a <= rgResultData(18);
          hit10b <= rgResultData(17);
          hit10c <= rgResultData(16);
          valid10 <= '1';
        when others =>
      end case;
    end if;
  end process;

  process (state,rgResultReady,rgResultSource)
  begin
    case (state) is
      when S_IDLE =>
        if rgResultReady = '1' and rgResultSource = "01" then
          next_state <= S_READ01;
        elsif rgResultReady = '1' and rgResultSource = "10" then
          next_state <= S_READ10;
        else
          next_state <= S_IDLE;
        end if;
      when S_READ01 =>
        next_state <= S_IDLE;
      when S_READ10 =>
        next_state <= S_IDLE;
    end case;
  end process;
end rtl;
**** resulttransmit.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity resulttransmit is
  port(
    valid01,valid10 : in std_logic;
    id01a,id01b,id01c : in std_logic_vector(15 downto 0);
    id10a,id10b,id10c : in std_logic_vector(15 downto 0);
    hit01a,hit01b,hit01c : in std_logic;
    hit10a,hit10b,hit10c : in std_logic;
    -- Interchip Bus Signals
    rgResultData : out std_logic_vector(31 downto 0);
    rgResultReady : out std_logic;
    rgResultSource : out std_logic_vector(1 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of resulttransmit is
  type state_type is (S_IDLE,S_SEND01A,S_SEND01B,S_SEND10A,S_SEND10B);
  signal state : state_type;
  signal next_state : state_type;
  signal pending01,pending10 : std_logic;
begin
  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      pending01 <= '0';
      pending10 <= '0';
      rgresultdata <= (others => '0');
      rgresultsource <= (others => '0');
      rgresultready <= '0';
    elsif (rising_edge (clk)) then
      if valid01 = '1' then
        pending01 <= '1';
      end if;
      if valid10 = '1' then
        pending10 <= '1';
      end if;
      rgResultReady <= '0';
      state <= next_state;
      case (state) is
        when S_SEND01A =>
          rgResultData <= id01a & id01b;
          rgResultReady <= '1';
          rgResultSource <= "01";
        when S_SEND01B =>
          rgResultData <= "0000000000000" & hit01a & hit01b & hit01c & id01c;
          rgResultReady <= '0';
          rgResultSource <= "01";
          pending01 <= '0';
        when S_SEND10A =>
          rgResultData <= id10a & id10b;
          rgResultReady <= '1';
          rgResultSource <= "10";
        when S_SEND10B =>
          rgResultData <= "0000000000000" & hit10a & hit10b & hit10c & id10c;
          rgResultReady <= '0';
          rgResultSource <= "10";
          pending10 <= '0';
        when others =>
      end case;
    end if;
  end process;

  process (state,pending01,pending10)
  begin
    case (state) is
      when S_IDLE =>
        if pending01 = '1' then
          next_state <= S_SEND01A;
        elsif pending10 = '1' then
          next_state <= S_SEND10A;
        else
          next_state <= S_IDLE;
        end if;
      when S_SEND01A =>
        next_state <= S_SEND01B;
      when S_SEND01B =>
        next_state <= S_IDLE;
      when S_SEND10A =>
        next_state <= S_SEND10B;
      when S_SEND10B =>
        next_state <= S_IDLE;
    end case;
  end process;
end rtl;
**** resultwriter.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;
library work;
use work.complib.all;

entity resultwriter is
  port(
    valid01,valid10 : in std_logic;
    id01a,id01b,id01c : in std_logic_vector(15 downto 0);
    id10a,id10b,id10c : in std_logic_vector(15 downto 0);
    hit01a,hit01b,hit01c : in std_logic;
    hit10a,hit10b,hit10c : in std_logic;
    addr01, addr10 : in std_logic_vector(17 downto 0);
    as01,as10 : in std_logic;
    dataout : out std_logic_vector(63 downto 0);
    addrout : out std_logic_vector(17 downto 0);
    write : out std_logic;
    ack : in std_logic;
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of resultwriter is
  type state_type is (S_IDLE,S_PROCESS);
  signal state : state_type;
  signal next_state : state_type;
  signal pending01, pending10 : std_logic;
  signal addrout01, addrout10 : std_logic_vector(17 downto 0);
  signal shiften01,shiften10 : std_logic;
begin
  fifo3insta : fifo3
    port map(addr01,as01,addrout01,shiften01,globalreset,clk);

  fifo3instb : fifo3
    port map(addr10,as10,addrout10,shiften10,globalreset,clk);

  shiften01 <= '1' when pending01 = '1' and (state = S_PROCESS) and ack = '1' else '0';
  shiften10 <= '1' when pending10 = '1' and pending01 ='0' and (state = S_PROCESS) and ack = '1' else '0';

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      pending01 <= '0';
      pending10 <= '0';
    elsif (rising_edge (clk)) then
      state <= next_state;
      if valid01 = '1' then
        pending01 <= '1';
      end if;
      if valid10 = '1' then
        pending10 <= '1';
      end if;
      case (state) is
        when S_PROCESS =>
          if ack = '1' and pending01 = '1' then
            pending01 <= '0';
          elsif ack = '1' and pending10 = '1' then
            pending10 <= '0';
          end if;
        when others =>
      end case;
    end if;
  end process;

  dataout <= ('0' & hit01a & "000000" & hit01a & "000000" & hit01a & "000000" &
              hit01b & "000000" & hit01b & "000000" & hit01b & "000000" &
              hit01c & "000000" & hit01c & "000000" & hit01c & "000000") when pending01 = '1' else
             ('0' & hit10a & "000000" & hit10a & "000000" & hit10a & "000000" &
              hit10b & "000000" & hit10b & "000000" & hit10b & "000000" &
              hit10c & "000000" & hit10c & "000000" & hit10c & "000000");

  addrout <= addrout01 when pending01 = '1' else addrout10;
  write <= '1' when state = S_PROCESS else '0';

  process (state,pending01,pending10,ack)
  begin
    case (state) is
      when S_IDLE =>
        if pending01 = '1' or pending10 = '1' then
          next_state <= S_PROCESS;
        else
          next_state <= S_IDLE;
        end if;
      when S_PROCESS =>
        if ack = '1' then
          next_state <= S_IDLE;
        else
          next_state <= S_PROCESS;
        end if;
    end case;
  end process;
end rtl;
**** Rgsramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity RGsramcontroller is
  port(
    want_addr : out std_logic;
    addr_ready : in std_logic;
    addrin : in std_logic_vector(17 downto 0);
    want_data : out std_logic;
    data_ready : in std_logic;
    datain : in std_logic_vector(63 downto 0);
    want_read : in std_logic;
    read_ready : out std_logic;
    dataout : out std_logic_vector(63 downto 0);
    dirReady01,dirReady10 : out std_logic;
    wantDir01,wantDir10 : in std_logic;
    dir : out std_logic_vector(47 downto 0);
    addr : out std_logic_vector(15 downto 0);
    wantwriteback : in std_logic;
    writebackack : out std_logic;
    writebackdata : in std_logic_vector(63 downto 0);
    writebackaddr : in std_logic_vector(17 downto 0);
    tm3_sram_data : inout std_logic_vector(63 downto 0);
    tm3_sram_addr : out std_logic_vector(18 downto 0);
    tm3_sram_we : out std_logic_vector(7 downto 0);
    tm3_sram_oe : out std_logic_vector(1 downto 0);
    tm3_sram_adsp : out std_logic;
    globalreset : in std_logic;
    clk : in std_logic;
    statepeek : out std_logic_vector(2 downto 0));
end;

architecture rtl of RGsramcontroller is
  type state_type is
    (S_IDLE,S_LATCHADDR,S_READ,S_WRITE,S_WAIT,S_READ01,S_READ10,S_WRITEBACK);
  signal state : state_type;
  signal next_state : state_type;
  signal waddress : std_logic_vector(17 downto 0);
begin
  process(state)
  begin
    case state is
      when S_IDLE => statepeek <= "000";
      when S_LATCHADDR => statepeek <= "001";
      when S_READ => statepeek <= "010";
      when S_WRITE => statepeek <= "011";
      when S_WAIT => statepeek <= "100";
      when S_READ01 => statepeek <= "101";
      when S_READ10 => statepeek <= "110";
      when S_WRITEBACK => statepeek <= "111";
    end case;
  end process;

  dataout <= tm3_sram_data;
  dir <= tm3_sram_data(47 downto 0);
  addr <= tm3_sram_data(63 downto 48);

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      waddress <= (others => '0');
    elsif (rising_edge (clk)) then
      state <= next_state;
      case (state) is
        when S_IDLE =>
          if (addr_ready = '1') then
            waddress <= addrin;
          end if;
        when S_WRITE =>
          waddress <= waddress+1;
        when S_READ =>
          if (want_read = '0') then
            waddress <= waddress+1;
          end if;
        when S_READ01 =>
          if wantDir01 = '0' then
            waddress <= waddress+1;
          end if;
        when S_READ10 =>
          if wantDir10 = '0' then
            waddress <= waddress+1;
          end if;
        when others =>
      end case;
    end if;
  end process;

  process(state,addr_ready,data_ready,waddress,datain,wantdir10,wantdir01,
          want_read,wantwriteback,writebackdata,writebackaddr)
  begin
    tm3_sram_we <= "11111111";
    tm3_sram_oe <= "01";
    tm3_sram_adsp <= '0';
    tm3_sram_data <= (others => 'Z');
    tm3_sram_addr <= '0' & waddress;
    want_addr <= '1';
    want_data <= '1';
    read_ready <= '1';
    dirReady01 <= '0';
    dirReady10 <= '0';
    writebackack <= '0';
    case (state) is
      when S_IDLE =>
        if (addr_ready = '1') then
          next_state <= S_LATCHADDR;
        elsif (want_read = '1') then
          next_state <= S_READ;
        elsif (data_ready = '1') then
          next_state <= S_WRITE;
        elsif (wantDir01 = '1') then
          next_state <= S_READ01;
        elsif (wantDir10 = '1') then
          next_state <= S_READ10;
        elsif (wantWriteback = '1') then
          next_state <= S_WRITEBACK;
        else
          next_state <= S_IDLE;
        end if;
      when S_READ10 =>
        dirReady10 <= '1';
        if wantDir10 = '0' then
          next_state <= S_IDLE;
        else
          next_state <= S_READ10;
        end if;
      when S_READ01 =>
        dirReady01 <= '1';
        if wantDir01 = '0' then
          next_state <= S_IDLE;
        else
          next_state <= S_READ01;
        end if;
      when S_LATCHADDR =>
        want_addr <= '0';
        if (addr_ready = '0') then
          next_state <= S_IDLE;
        else
          next_state <= S_LATCHADDR;
        end if;
      when S_READ =>
        read_ready <= '0';
        if (want_read = '1') then
          next_state <= S_READ;
        else
          next_state <= S_IDLE;
        end if;
      when S_WRITEBACK =>
        tm3_sram_data <= writebackdata;
        tm3_sram_we <= "00000000";
        tm3_sram_oe <= "11";
        tm3_sram_adsp <= '0';
        tm3_sram_addr <= '0' & writebackaddr;
        writebackAck <= '1';
        next_state <= S_IDLE;
      when S_WRITE =>
        tm3_sram_data <= datain;
        tm3_sram_we <= "00000000";
        tm3_sram_oe <= "11";
        tm3_sram_adsp <= '0';
        want_data <= '0';
        next_state <= S_WAIT;
      when S_WAIT =>
        if data_ready = '1' then
          next_state <= S_WAIT;
        else
          next_state <= S_IDLE;
        end if;
        want_data <= '0';
    end case;
  end process;
end rtl;
**** shifter.vhd ****
--------------------------------------------
-- Variable Combinational Shift Component --
--                                        --
-- B = A shifted right by specified amt   --
--                                        --
-- Factor   Bits Shifted Right            --
--   00            0            1         --
--   01            4            1/16      --
--   10            8            1/256     --
--   11           12            1/4096    --
--------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity shifter is
  generic (
    width : natural := 32);
  port(
    A : in std_logic_vector(width-1 downto 0);
    B : out std_logic_vector(width-1 downto 0);
    factor : in std_logic_vector(1 downto 0));
end;

architecture rtl of shifter is
begin
  process (factor,A)
  begin
    case (factor) is
      when "00" => B <= A;                                      -- no shift
      when "01" => B <= "0000" & A(width-1 downto 4);           -- right by 4, zero filled
      when "10" => B <= "00000000" & A(width-1 downto 8);       -- right by 8, zero filled
      when "11" => B <= "000000000000" & A(width-1 downto 12);  -- right by 12, zero filled
      when others => B <= A;  -- metavalues only; completes the std_logic_vector choice list
    end case;
  end process;
end rtl;
**** sortedstack.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity sortedstack is
  generic (
    keywidth : natural := 32;
    datawidth : natural := 32+16;
    depth : natural := 8);
  port(
    keyin : in std_logic_vector(keywidth-1 downto 0);
    datain : in std_logic_vector(datawidth-1 downto 0);
    write : in std_logic;
    reset : in std_logic;
    peekdata : out std_logic_vector(datawidth*depth-1 downto 0);
    globalreset : in std_logic;
    clk : in std_logic);
end;

architecture rtl of sortedstack is
  type stdlogicarraykey is array(0 to depth-1) of std_logic_vector(keywidth-1 downto 0);
  type stdlogicarraydata is array(0 to depth-1) of std_logic_vector(datawidth-1 downto 0);
  type stdlogicarraybit is array(0 to depth-1) of std_logic;
  signal key : stdlogicarraykey;
  signal data : stdlogicarraydata;
  signal full : stdlogicarraybit;
  signal location : integer range 0 to depth-1;
begin
  peeklp : for k in 0 to depth-1 generate
    peekdata((k+1)*(datawidth)-1 downto k*(datawidth)) <= data(k) when full(k)='1' else (others=>'0');
  end generate peeklp;

  -- Select the proper insertion point
  process (keyin,key,full)
  begin
    location <= depth-1;
    nrst: for k in depth-2 downto 0 loop
      if ((keyin < key(k)) or (full(k) = '0')) then
        location <= k;
      end if;
    end loop nrst;
  end process;

  process (clk,globalreset,reset)
  begin
    if ((globalreset = '1') or (reset = '1')) then
      clr: for k in 0 to depth-1 loop
        full(k) <= '0';
        key(k) <= (others => '0');
        data(k) <= (others => '0');
      end loop clr;
    elsif rising_edge(clk) then
      if (write = '1') then
        key(location) <= keyin;
        data(location) <= datain;
        full(location) <= '1';
        shft: for k in 0 to depth-2 loop
          if (k >= location) then
            key(k+1) <= key(k);
            data(k+1) <= data(k);
            full(k+1) <= full(k);
          end if;
        end loop shft;
      end if;
    end if;
  end process;
end rtl;
**** spram.vhd ****
-------------------------------------------------------
-- Single Ported RAM Module                          --
--                                                   --
-- - Synplify should infer RAM from the coding style --
-- - The depth of the RAM is equal to 2**depth       --
--                                                   --
-------------------------------------------------------
-- Further Reading: RAM Inferencing with Synplify
-- http://www.synplicity.com/literature/pdf/ram_inferencing.pdf
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;

entity spram is
  generic(
    width : natural := 16;
    depth : natural := 4);
  port(
    we : in std_logic;
    addr : in std_logic_vector(depth-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    datain : in std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of spram is
  type memarray is array(2**depth-1 downto 0) of
    std_logic_vector(width-1 downto 0);
  signal mem : memarray;
begin
  dataout <= mem(conv_integer(addr));

  process(clk,we,addr)
  begin
    if (rising_edge (clk)) then
      if (we = '1') then
        mem(conv_integer(addr)) <= datain;
      end if;
    end if;
  end process;
end rtl;
**** spramblock.vhd ****
-------------------------------------------------------
-- Single Ported RAM Module w/Registered Output      --
-- - Synplify should infer RAM from the coding style --
-- - Depth is the number of bits of address          --
--   the true depth is 2**depth                      --
-------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_signed.all;
library synplify;
use synplify.attributes.all;

entity spramblock is
  generic(
    width : natural := 16;
    depth : natural := 8);
  port(
    we : in std_logic;
    addr : in std_logic_vector(depth-1 downto 0);
    datain : in std_logic_vector(width-1 downto 0);
    dataout : out std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of spramblock is
  type memarray is array(2**depth-1 downto 0) of std_logic_vector(width-1 downto 0);
  signal raddr : std_logic_vector(depth-1 downto 0);
  signal mem : memarray;
  attribute syn_ramstyle of mem : signal is "no_rw_check";
begin
  dataout <= mem(conv_integer(raddr));

  process(clk,we,addr)
  begin
    if (rising_edge (clk)) then
      raddr <= addr;
      if (we = '1') then
        mem(conv_integer(addr)) <= datain;
      end if;
    end if;
  end process;
end rtl;
**** sramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity sramcontroller is
  port(
    want_addr : out std_logic;
    addr_ready : in std_logic;
    addrin : in std_logic_vector(17 downto 0);
    want_data : out std_logic;
    data_ready : in std_logic;
    datain : in std_logic_vector(63 downto 0);
    addr : in std_logic_vector(17 downto 0);
    addrvalid : in std_logic;
    data : out std_logic_vector(63 downto 0);
    datavalid : buffer std_logic;
    tm3_sram_data : inout std_logic_vector(63 downto 0);
    tm3_sram_addr : out std_logic_vector(18 downto 0);
    tm3_sram_we : out std_logic_vector(7 downto 0);
    tm3_sram_oe : out std_logic_vector(1 downto 0);
    tm3_sram_adsp : out std_logic;
    globalreset : in std_logic;
    clk : in std_logic;
    statepeek : out std_logic_vector(2 downto 0));
end;

architecture rtl of sramcontroller is
  type state_type is (S_IDLE,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE,S_READ);
  signal state : state_type;
  signal next_state : state_type;
  signal waddress : std_logic_vector(17 downto 0);
begin
  process(state)
  begin
    case state is
      when S_IDLE => statepeek <= "001";
      when S_WRITE1 => statepeek <= "010";
      when S_WRITE2 => statepeek <= "011";
      when S_WRITE3 => statepeek <= "100";
      when S_WRITEDONE => statepeek <= "101";
      when S_READ => statepeek <= "110";
      when others => statepeek <= "000";
    end case;
  end process;

  process(clk,globalreset)
  begin
    if (globalreset = '1') then
      state <= S_IDLE;
      waddress <= (others => '0');
      data <= (others => '0');
      datavalid <= '0';
    elsif (rising_edge (clk)) then
      state <= next_state;
      case (state) is
        when S_IDLE =>
          if (addr_ready = '1') then
            waddress <= addrin;
          end if;
          if addrvalid = '0' then
            datavalid <= '0';
          end if;
        when S_WRITE2 =>
          if (data_ready = '1') then
            waddress <= waddress+1;
          end if;
        when S_READ =>
          data <= tm3_sram_data;
          datavalid <= '1';
        when others =>
      end case;
    end if;
  end process;

  process (state,addr_ready,data_ready,waddress,datain,addrvalid,datavalid,addr)
  begin
    tm3_sram_we <= "11111111";
    tm3_sram_oe <= "11";
    tm3_sram_adsp <= '1';
    tm3_sram_data <= (others => 'Z');
    tm3_sram_addr <= (others => '-');
    want_addr <= '1';
    want_data <= '0';
    case (state) is
      when S_IDLE =>
        if (addr_ready = '1') then
          next_state <= S_WRITE1;
        elsif addrvalid = '1' and datavalid = '0' then
          next_state <= S_READ;
          tm3_sram_addr <= '0' & addr;
          tm3_sram_adsp <= '0';
          tm3_sram_oe <= "01";
        else
          next_state <= S_IDLE;
        end if;
      when S_READ =>
        next_state <= S_IDLE;
      when S_WRITE1 =>
        want_addr <= '0';
        want_data <= '1';
        if (addr_ready = '1') then
          next_state <= S_WRITE1;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITE2 =>
        want_data <= '1';
        tm3_sram_addr <= '0' & waddress;
        tm3_sram_data <= datain;
        if (addr_ready = '1') then
          next_state <= S_WRITEDONE;
        elsif (data_ready = '1') then
          tm3_sram_we <= "00000000";
          tm3_sram_adsp <= '0';
          next_state <= S_WRITE3;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITE3 =>
        if (data_ready = '1') then
          next_state <= S_WRITE3;
        else
          next_state <= S_WRITE2;
        end if;
      when S_WRITEDONE =>
        want_addr <= '0';
        if (addr_ready = '1') then
          next_state <= S_WRITEDONE;
        else
          next_state <= S_IDLE;
        end if;
    end case;
  end process;
end rtl;
**** test.vhd ****library ieee;use ieee.std_logic_1164.all;use ieee.std_logic_arith.all;use ieee.std_logic_signed.all;
library work;use work.complib.all;
entity test isport(
triIDvalid : out std_logic;triID : out std_logic_vector(15 downto 0);wanttriID : in std_logic;raydata : out std_logic_vector(31 downto 0);rayaddr : out std_logic_vector(3 downto 0);raywe : out std_logic_vector(2 downto 0);resultready : in std_logic;resultdata : in std_logic_vector(31 downto 0);globalreset : out std_logic;
want_braddr : out std_logic;braddr_ready : in std_logic;braddrin : in std_logic_vector(9 downto 0);want_brdata : out std_logic;brdata_ready : in std_logic;brdatain : in std_logic_vector(31 downto 0);
want_addr2 : out std_logic;addr2_ready : in std_logic;addr2in : in std_logic_vector(17 downto 0);want_data2 : out std_logic;data2_ready : in std_logic;data2in : in std_logic_vector(63 downto 0);
pglobalreset : in std_logic;tm3_clk_v0 : in std_logic;tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;
-- Bus Signals (To Ray Generator Unit)raygroup01 : in std_logic_vector(1 downto 0);raygroupvalid01 : in std_logic;busy01 : out std_logic;raygroup10 : in std_logic_vector(1 downto 0);raygroupvalid10 : in std_logic;busy10 : out std_logic;
rgData : in std_logic_vector(31 downto 0);rgAddr : in std_logic_vector(3 downto 0);rgWE : in std_logic_vector(2 downto 0);rgAddrValid : in std_logic;rgDone : out std_logic;
89
rgResultData : out std_logic_vector(31 downto 0);rgResultReady : out std_logic;rgResultSource : out std_logic_vector(1 downto 0);
t1a : out std_logic_vector(31 downto 0);t1b : out std_logic_vector(31 downto 0);u1a : out std_logic_vector(15 downto 0);u1b : out std_logic_vector(15 downto 0);v1a : out std_logic_vector(15 downto 0);v1b : out std_logic_vector(15 downto 0);id1a : out std_logic_vector(15 downto 0);id1b : out std_logic_vector(15 downto 0);hit1a : out std_logic;hit1b : out std_logic;
debug1 : out std_logic_vector(31 downto 0);debug2 : out std_logic_vector(31 downto 0);debug3 : out std_logic_vector(31 downto 0);input1 : in std_logic;input2 : in std_logic;input3 : in std_logic_vector(31 downto 0));
end;
architecture rtl of test is
signal max,max01,max10 : std_logic_vector(31 downto 0);
signal maxwe,maxwe01,maxwe10 : std_logic;
signal raygroupwe,raygroupwe01,raygroupwe10 : std_logic;
signal raygroupout,raygroupout01,raygroupout10 : std_logic_vector(1 downto 0);
signal raygroupid, raygroupid01,raygroupid10 : std_logic_vector(1 downto 0);
signal resultid : std_logic_vector(1 downto 0);
signal t1i,t2i,t3i,tf1i,tf2i,tf3i : std_logic_vector(31 downto 0);
signal u1i,u2i,u3i,v1i,v2i,v3i : std_logic_vector(15 downto 0);
signal id1i,id2i,id3i : std_logic_vector(15 downto 0);
signal hit1i,hit2i,hit3i : std_logic;
signal newresult : std_logic;
signal write,reset,reset01,reset10 : std_logic;
signal peekdata,peeklatch : std_logic_vector(871 downto 0);
signal commit01,commit10 : std_logic;
signal baseaddress01,baseaddress10 : std_logic_vector(1 downto 0);
signal done : std_logic_vector(1 downto 0);
signal cntreset,cntreset01,cntreset10 : std_logic;
signal passCTS01, passCTS10 : std_logic;
signal triIDvalid01, triIDvalid10 : std_logic;
signal triID01, triID10 : std_logic_vector(15 downto 0);
signal gnd : std_logic;
signal boundNodeID,BoundNodeID01, BoundNodeID10 : std_logic_vector(9 downto 0);
signal enablenear,enablenear01,enablenear10 : std_logic;
signal max0_01,max1_01,max2_01,max0_10,max1_10,max2_10 : std_logic_vector(31 downto 0);
signal ack01,ack10,empty01,dataready01,empty10,dataready10,lhreset01,lhreset10 : std_logic;
signal boundnodeIDout01,boundnodeIDout10 : std_logic_vector(9 downto 0);
signal level01,level10 : std_logic_vector(1 downto 0);
signal hitmask01,hitmask10 : std_logic_vector(2 downto 0);

-- Offset Block Ram Read Signals
signal ostaddr,addrind01,addrind10 : std_logic_vector(9 downto 0);
signal ostaddrvalid,addrindvalid01,addrindvalid10,ostdatavalid : std_logic;
signal ostdata : std_logic_vector(31 downto 0);
-- Tri List Ram Read Signals
signal tladdr,tladdr01,tladdr10 : std_logic_vector(17 downto 0);
signal tladdrvalid,tladdrvalid01,tladdrvalid10,tldatavalid : std_logic;
signal tldata : std_logic_vector(63 downto 0);
-- Final Result Signals
signal t1_01,t2_01,t3_01,t1_10,t2_10,t3_10 : std_logic_vector(31 downto 0);
signal v1_01,v2_01,v3_01,v1_10,v2_10,v3_10 : std_logic_vector(15 downto 0);
signal u1_01,u2_01,u3_01,u1_10,u2_10,u3_10 : std_logic_vector(15 downto 0);
signal id1_01,id2_01,id3_01,id1_10,id2_10,id3_10 : std_logic_vector(15 downto 0);
signal hit1_01,hit2_01,hit3_01,hit1_10,hit2_10,hit3_10 : std_logic;
signal bcvalid01, bcvalid10 : std_logic;

signal peekoffset1a,peekoffset1b,peekoffset0a,peekoffset0b : std_logic_vector(2 downto 0);
signal peekoffset2a,peekoffset2b : std_logic_vector(2 downto 0);
signal peekaddressa,peekaddressb : std_logic_vector(4 downto 0);

signal doutput,dack : std_logic;
signal state01,state10 : std_logic_vector(4 downto 0);
signal junk1,junk1b : std_logic_vector(2 downto 0);
signal junk2,junk2a : std_logic;
signal junk3,junk4 : std_logic_vector(1 downto 0);
signal d1 : std_logic_vector(31 downto 0);
signal debugstoplevel01,debugstoplevel10 : std_logic_vector(1 downto 0);
signal debugleafbreak : std_logic;
signal debugcount01,debugcount10 : std_logic_vector(13 downto 0);
signal debugsubcount01, debugsubcount10 : std_logic_vector(1 downto 0);
signal statesram : std_logic_vector(2 downto 0);

begin
d1(12 downto 7) <= (others => '0');

debugstoplevel01 <= input3(1 downto 0);
debugstoplevel10 <= input3(3 downto 2);
debugleafbreak <= input3(4);

t1a <= t1_01;
t1b <= t1_10;
u1a <= u1_01;
u1b <= u1_10;
v1a <= v1_01;
v1b <= v1_10;
id1a <= id1_01;
id1b <= id1_10;
hit1a <= hit1_01;
hit1b <= hit1_10;
oc : onlyonecycleport map(input1,doutput,pglobalreset,tm3_clk_v0);
debug1 <= d1;d1(0) <= empty01;d1(1) <= dataready01;d1(3 downto 2) <= level01;d1(25 downto 16) <= boundnodeIDout01;d1(15 downto 13) <= (others => '0');d1(6 downto 4) <= (others => '0');debug2 <= max0_01;debug3 <= max1_01;
-- Real Stuff Starts Here
ostaddr <= addrind01 or addrind10;ostaddrvalid <= addrindvalid01 or addrindvalid10;
offsettable : vblockramcontrollergeneric map(32,10)port map(want_braddr,braddr_ready,braddrin,want_brdata,brdata_ready,brdatain,
ostaddr,ostaddrvalid,ostdata,ostdatavalid,pglobalreset,tm3_clk_v0);
tladdr <= tladdr01 or tladdr10;tladdrvalid <= tladdrvalid01 or tladdrvalid10;
trilist : sramcontrollerport map(want_addr2,addr2_ready,addr2in,want_data2,data2_ready,data2in,
tladdr,tladdrvalid,tldata,tldatavalid,tm3_sram_data,tm3_sram_addr,tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,pglobalreset,tm3_clk_v0, statesram);
globalreset <= pglobalreset;
ri : resultinterfaceport map(t1i,t2i,t3i,tf1i,tf2i,tf3i,u1i,u2i,u3i,
v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,resultID,newresult,resultready,resultdata,pglobalreset,tm3_clk_v0);
rayint : rayinterfaceport map(max,maxwe, raygroupout,raygroupwe,raygroupid,enablenear,
rgData,rgAddr,rgWe,rgAddrvalid,rgDone,raydata,rayaddr,raywe, pglobalreset,tm3_clk_v0);
boundcont01 : boundcontrollergeneric map('1',"01")port map(max01,maxwe01,raygroupout01,raygroupwe01,raygroupid01,
enablenear01,raygroup01,raygroupvalid01,busy01,
triIDvalid01, triID01,wanttriID,reset01,baseaddress01,newresult,boundNodeID01,
resultID,hitmask01,dataready01,empty01,level01,max0_01,max1_01,max2_01,boundNodeIDout01,ack01,lhreset01,addrind01,addrindvalid01,ostdata,ostdatavalid,tladdr01,tladdrvalid01,tldata,tldatavalid,t1i,t2i,t3i,u1i,u2i,u3i,v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,
t1_01,t2_01,t3_01,u1_01,u2_01,u3_01,v1_01,v2_01,v3_01,id1_01,id2_01,id3_01,hit1_01,hit2_01,hit3_01,
bcvalid01,done,cntreset01,passCTS01,passCTS10,pglobalreset,tm3_clk_v0,state01,debugstoplevel01,
debugleafbreak,debugsubcount01,debugcount01);
boundcont10 : boundcontrollergeneric map('0',"10")port map(max10,maxwe10,raygroupout10,raygroupwe10,raygroupid10,
enablenear10, raygroup10, raygroupvalid10, busy10,triIDvalid10, triID10, wanttriID,reset10,
baseaddress10,newresult,BoundNodeID10,resultID,hitmask10,dataready10,empty10,level10,max0_10,max1_10,max2_10,boundNodeIDout10,ack10, lhreset10,addrind10,addrindvalid10,ostdata,ostdatavalid,tladdr10,tladdrvalid10,tldata,tldatavalid,t1i,t2i,t3i,u1i,u2i,u3i,v1i,v2i,v3i,id1i,id2i,id3i,hit1i,hit2i,hit3i,
t1_10,t2_10,t3_10,u1_10,u2_10,u3_10,v1_10,v2_10,v3_10,id1_10,id2_10,id3_10,hit1_10,hit2_10,hit3_10,
bcvalid10,done,cntreset10,passCTS10,passCTS01,pglobalreset,tm3_clk_v0,state10,debugstoplevel10,
debugleafbreak,debugsubcount10,debugcount10);
restransinst : resulttransmitport map(bcvalid01,bcvalid10,id1_01,id2_01,id3_01,id1_10,id2_10,id3_10,
hit1_01,hit2_01,hit3_01,hit1_10,hit2_10,hit3_10,rgResultData,rgResultReady,rgResultSource, pglobalreset,tm3_clk_v0);
gnd <= '0';
raygroupout <= raygroupout01 or raygroupout10;
raygroupwe <= raygroupwe01 or raygroupwe10;
raygroupid <= raygroupid01 or raygroupid10;
triIDvalid <= triIDvalid01 or triIDvalid10;
enablenear <= enablenear01 or enablenear10;
triID <= triID01 or triID10;
cntreset <= cntreset01 or cntreset10;
reset <= reset01 or reset10;
max <= max01 or max10;
maxwe <= maxwe01 or maxwe10;

process (boundNodeID01,boundNodeID10,resultID)
begin
  if resultID = "01" then
    boundNodeID <= BoundNodeID01;
  elsif resultID = "10" then
    boundNodeID <= BoundNodeID10;
  else
    boundNodeID <= (others => '-');
  end if;
end process;

write <= '1' when (newresult = '1') and (resultID /= 0) and
         ((hit1i = '1') or (hit2i = '1') or (hit3i = '1')) else '0';
st : sortedstackgeneric map(32, 109, 8)port map
(t1i,hit3i&hit2i&hit1i&tf3i&tf2i&tf1i&boundNodeID,write,reset,peekdata,pglobalreset,tm3_clk_v0);
commit01 <= '1' when done = "01" else '0';commit10 <= '1' when done = "10" else '0';
dack <= doutput or ack01;
lh01 : listhandlerport map(peeklatch,commit01,hitmask01,dack,max0_01,max1_01,max2_01,
boundnodeIDout01,level01,empty01,dataready01,lhreset01,
pglobalreset,tm3_clk_v0,peekoffset0a,peekoffset1a, peekoffset2a,junk2a,junk4);
lh02 : listhandlerport map(peeklatch,commit10,hitmask10,ack10,max0_10,max1_10,max2_10,
boundnodeIDout10,level10,empty10,dataready10,lhreset10,pglobalreset,tm3_clk_v0,junk1,junk1b,peekoffset2b,junk2,junk3);
process (tm3_clk_v0,pglobalreset)
begin
  -- The reset is only for debugging
  if (pglobalreset = '1') then
    d1(31 downto 26) <= (others => '0');
    peeklatch <= (others => '0');
  elsif rising_edge(tm3_clk_v0) then
    if newresult = '1' then
      d1(31 downto 26) <= d1(31 downto 26) + 1;
    end if;
    if (done /= 0) then
      peeklatch <= peekdata;
    end if;
  end if;
end process;
rc : resultcounterport map(resultID,newresult,done,cntreset,pglobalreset,tm3_clk_v0);
end rtl;
**** top.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;
entity top isport(
want_saddr : out std_logic;saddr_ready : in std_logic;saddrin : in std_logic_vector(17 downto 0);want_sdata : out std_logic;sdata_ready : in std_logic;sdatain : in std_logic_vector(63 downto 0);
tm3_sram_data : inout std_logic_vector(63 downto 0);tm3_sram_addr : out std_logic_vector(18 downto 0);tm3_sram_we : out std_logic_vector(7 downto 0);tm3_sram_oe : out std_logic_vector(1 downto 0);tm3_sram_adsp : out std_logic;
triIDvalid : in std_logic;triID : in std_logic_vector(15 downto 0);wanttriID : out std_logic;raydata : in std_logic_vector(31 downto 0);rayaddr : in std_logic_vector(3 downto 0);raywe : in std_logic_vector(2 downto 0);resultready : out std_logic;resultdata : out std_logic_vector(31 downto 0);
tm3_io_3 : out std_logic_vector(31 downto 0);globalreset : in std_logic;tm3_clk_v0 : in std_logic
);end;
architecture rtl of top istype stdlogicarray32 is array(0 to 2) of std_logic_vector(31 downto 0);type stdlogicarray16 is array(0 to 2) of std_logic_vector(15 downto 0);
-- Memory Interface Signalssignal tridata : std_logic_vector(191 downto 0);signal triID_out : std_logic_vector(15 downto 0);signal cyclenum : std_logic_vector(1 downto 0);
signal masterenable,masterenablel : std_logic_vector(0 downto 0);signal swap : std_logic;
-- Ray Tri Interface Signalssignal tout : std_logic_vector(31 downto 0);signal uout : std_logic_vector(15 downto 0);signal vout : std_logic_vector(15 downto 0);signal triIDout : std_logic_vector(15 downto 0);signal hitout : std_logic;signal origx,origy,origz : std_logic_vector(27 downto 0);signal dirx,diry,dirz : std_logic_vector(15 downto 0);
-- Nearest Unit Signalssignal nt,ft : stdlogicarray32;signal nu,nv,ntriID : stdlogicarray16;signal anyhit : std_logic_vector(2 downto 0);signal n0enable, n1enable, n2enable,nxenable : std_logic;signal enablenear,enablenearl : std_logic_vector(0 downto 0);signal resetl,reset : std_logic_vector(0 downto 0);signal maxdist, maxdistl : std_logic_vector(31 downto 0);signal raygroupID, raygroupIDl : std_logic_vector(1 downto 0);
-- Debug signalssignal pod1 : std_logic_vector(15 downto 1);signal pod2 : std_logic_vector(15 downto 0);signal debugdetneg : std_logic;signal debugsuneg : std_logic;signal debugvneg : std_logic;signal debugsugtdet : std_logic;signal debugvgtdet : std_logic;signal debugtneg : std_logic;signal debughitinter : std_logic;signal debughit : std_logic;begin
tm3_io_3 <= pod2 & '0' & pod1;pod1(1) <= masterenable(0);pod1(2) <= n2enable;pod1(3) <= resetl(0);pod1(4) <= anyhit(2);pod1(5) <= debugdetneg;pod1(6) <= debugsuneg;pod1(7) <= debugvneg;pod1(8) <= debugsugtdet;pod1(9) <= debugvgtdet;pod1(10) <= debugtneg;pod1(11) <= debughitinter;pod1(12) <= debughit;pod1(13) <= hitout;
pod1(15 downto 14) <= tridata(161 downto 160); -- vert0z
pod2(3 downto 0) <= dirx(3 downto 0);pod2(5 downto 4) <= diry(1 downto 0);pod2(7 downto 6) <= dirz(1 downto 0);pod2(11 downto 8) <= tridata(99 downto 96); -- vert0xpod2(13 downto 12) <= tridata(1 downto 0); -- edge1xpod2(15 downto 14) <= tridata(65 downto 64); -- edge2y
mem : memoryinterfaceport map(
want_saddr,saddr_ready,saddrin,want_sdata,sdata_ready,sdatain,tridata, triID_out, masterenable(0), triIDvalid, triID, wanttriID,cyclenum,tm3_sram_data,tm3_sram_addr,tm3_sram_we,tm3_sram_oe,tm3_sram_adsp,globalreset, tm3_clk_v0);
triunit : raytriport map(
tm3_clk_v0,tout,uout,vout,triIDout,hitout,tridata(123 downto 96), tridata(155 downto 128), tridata(187 downto 160),origx,origy,origz, dirx,diry,dirz,tridata(15 downto 0), tridata(31 downto 16), tridata(47 downto 32), tridata(125
downto 124),tridata(63 downto 48), tridata(79 downto 64), tridata(95 downto 80), tridata(157
downto 156),tridata(191 downto 191), swap,triID_out,
debugdetneg,debugsuneg,debugvneg,debugsugtdet,debugvgtdet,debugtneg,debughitinter,debughit);
nc0 : nearcmpspecport map(tout,uout,vout,triIDout,hitout,nt(0),ft(0),nu(0),nv(0),ntriID(0),anyhit(0), maxdistl,n0enable,nxenable,resetl(0),globalreset,tm3_clk_v0);
nc1 : nearcmpport map(
tout,uout,vout,triIDout,hitout,nt(1),ft(1),nu(1),nv(1),ntriID(1),anyhit(1),maxdistl,n1enable,resetl(0),globalreset,tm3_clk_v0);
nc2 : nearcmpport map(
tout,uout,vout,triIDout,hitout,nt(2),ft(2),nu(2),nv(2),ntriID(2),anyhit(2),maxdistl,n2enable,resetl(0),globalreset,tm3_clk_v0);
n0enable <= '1' when (cyclenum = "10") and (masterenablel(0) = '1') else '0';n1enable <= '1' when (cyclenum = "00") and (masterenablel(0) = '1') else '0';n2enable <= '1' when (cyclenum = "01") and (masterenablel(0) = '1') else '0';nxenable <= '1' when (enablenearl(0) = '1') and (masterenablel(0) = '1') else '0';maxdelay : delay
generic map (32,37)port map(maxdist,maxdistl,tm3_clk_v0);
raygroupdelay : delaygeneric map (2,37+1) -- One delay level to account for near cmp internal latchport map(raygroupID,raygroupIDl,tm3_clk_v0);
enableneardelay : delaygeneric map (1,37)port map(enablenear,enablenearl,tm3_clk_v0);
mastdelay : delaygeneric map (1,37)port map(masterenable,masterenablel,tm3_clk_v0);
resetdelay : delaygeneric map (1,37)port map(reset,resetl,tm3_clk_v0);
resstate : resultstateport map (resetl(0),
nt(0),nt(1),nt(2),ft(0),ft(1),ft(2),nu(0),nu(1),nu(2),nv(0),nv(1),nv(2),
ntriID(0),ntriID(1),ntriID(2),anyhit(0),anyhit(1),anyhit(2),raygroupIDl,resultready,resultdata,globalreset, tm3_clk_v0);
raybuff : raybufferport map ( origx, origy, origz, dirx, diry, dirz, maxdist, raygroupID, swap,
reset(0),enablenear(0),raydata, rayaddr, raywe, cyclenum,tm3_clk_v0);
end rtl;
**** vblockramcontroller.vhd ****
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

library work;
use work.complib.all;
entity vblockramcontroller isgeneric(
width : natural := 32;depth : natural := 10);
port(want_addr : out std_logic;addr_ready : in std_logic;addrin : in std_logic_vector(depth-1 downto 0);want_data : out std_logic;data_ready : in std_logic;datain : in std_logic_vector(width-1 downto 0);
addr : in std_logic_vector(depth-1 downto 0);addrvalid : in std_logic;data : out std_logic_vector(width-1 downto 0);datavalid : buffer std_logic;
globalreset : in std_logic;clk : in std_logic);
end;
architecture rtl of vblockramcontroller istype state_type is (S_IDLE,S_WRITE1,S_WRITE2,S_WRITE3,S_WRITEDONE,S_READ);signal state : state_type;signal next_state : state_type;
signal waddr,saddr : std_logic_vector(depth-1 downto 0);signal dataout : std_logic_vector(width-1 downto 0);signal we : std_logic;
begin
saddr <= waddr when state /= S_IDLE else addr;
ramblock : spramblockgeneric map (width,depth)port map(we,saddr,datain,dataout,clk);
process(clk,globalreset)begin
if (globalreset = '1') thenstate <= S_IDLE;waddr <= (others => '0');data <= (others => '0');datavalid <= '0';
elsif (rising_edge (clk)) thenstate <= next_state;
case (state) iswhen S_IDLE =>if (addr_ready = '1') thenwaddr <= addrin;
end if;if addrvalid = '0' thendatavalid <= '0';
end if;when S_WRITE2 =>if (data_ready = '1') thenwaddr <= waddr+1;
end if;when S_READ =>data <= dataout;datavalid <= '1';
when others =>end case;
end if;end process;
process (state,addr_ready,data_ready,addrvalid,datavalid)begin
we <= '0';want_addr <= '1';want_data <= '0';case (state) is
when S_IDLE =>if (addr_ready = '1') thennext_state <= S_WRITE1;
elsif addrvalid = '1' and datavalid = '0' thennext_state <= S_READ;
elsenext_state <= S_IDLE;
end if;when S_READ =>next_state <= S_IDLE;
when S_WRITE1 =>want_addr <= '0';want_data <= '1';if (addr_ready = '1') thennext_state <= S_WRITE1;
elsenext_state <= S_WRITE2;
end if;when S_WRITE2 =>want_data <= '1';if (addr_ready = '1') thennext_state <= S_WRITEDONE;
elsif (data_ready = '1') thenwe <= '1';next_state <= S_WRITE3;
elsenext_state <= S_WRITE2;
end if;when S_WRITE3 =>if (data_ready = '1') thennext_state <= S_WRITE3;
elsenext_state <= S_WRITE2;
end if;when S_WRITEDONE =>want_addr <= '0';if (addr_ready = '1') thennext_state <= S_WRITEDONE;
elsenext_state <= S_IDLE;
end if;end case;
end process;
end rtl;
**** vectdelay.vhd ****
-------------------------------------------
-- Variable Length Vector Shift Register --
-- Provides a specified number of clock  --
-- cycles of delay for three signals     --
-------------------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity vectdelay is
  generic (
    width : natural := 32;
    depth : natural := 1);
  port(
    xin,yin,zin : in std_logic_vector(width-1 downto 0);
    xout,yout,zout : out std_logic_vector(width-1 downto 0);
    clk : in std_logic);
end;

architecture rtl of vectdelay is
  type delayarray is array(0 to depth-1) of std_logic_vector(width-1 downto 0);

  signal bufferx : delayarray;
  signal buffery : delayarray;
  signal bufferz : delayarray;

begin
  xout <= bufferx(depth-1);
  yout <= buffery(depth-1);
  zout <= bufferz(depth-1);

  process(clk)
  begin
    if (rising_edge(clk)) then
      bufferx(0) <= xin;
      buffery(0) <= yin;
      bufferz(0) <= zin;
      if (depth > 1) then
        row : for k in 0 to depth-2 loop
          bufferx(k+1) <= bufferx(k);
          buffery(k+1) <= buffery(k);
          bufferz(k+1) <= bufferz(k);
        end loop row;
      end if;
    end if;
  end process;
end rtl;
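The behaviour of the shift register above can be checked against a small software model. The following is a minimal sketch in C, not part of the original design: the names `delay_line`, `delay_init`, `delay_clock` and the bound `MAX_DEPTH` are hypothetical, only one of the three channels is modelled, and the stages are assumed to start at zero (the VHDL leaves them uninitialised).

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64                      /* hypothetical upper bound */

typedef struct {
    int depth;                            /* number of register stages */
    int stage[MAX_DEPTH];                 /* stage[0] is nearest the input */
} delay_line;

/* Assumption: stages start at zero; the VHDL does not define a reset value. */
void delay_init(delay_line *d, int depth)
{
    assert(depth >= 1 && depth <= MAX_DEPTH);
    d->depth = depth;
    memset(d->stage, 0, sizeof d->stage);
}

/* One rising clock edge: shift every stage along by one, load the input
 * into stage 0, and return the value now visible on the output. */
int delay_clock(delay_line *d, int in)
{
    for (int k = d->depth - 1; k > 0; --k)
        d->stage[k] = d->stage[k - 1];
    d->stage[0] = in;
    return d->stage[d->depth - 1];
}
```

With `depth` set to 3, a value loaded on one clock edge appears on the output two edges later, which mirrors `xout <= bufferx(depth-1)` in the VHDL.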
**** vectexchange.vhd ****
----------------------------------
-- Vector Mux Component         --
-- C = A when ABn = '1' else B  --
----------------------------------
library ieee;
use ieee.std_logic_1164.all;

entity vectexchange is
  generic (
    width : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(width-1 downto 0);
    Bx,By,Bz : in std_logic_vector(width-1 downto 0);
    Cx,Cy,Cz : out std_logic_vector(width-1 downto 0);
    ABn : in std_logic);
end;

architecture rtl of vectexchange is
begin
  Cx <= Ax when (ABn = '1') else Bx;
  Cy <= Ay when (ABn = '1') else By;
  Cz <= Az when (ABn = '1') else Bz;
end rtl;
**** vectsub.vhd ****
-----------------------------------------
-- Signed Vector Subtraction Component --
-- C = A - B                           --
-- The output, C, is latched           --
-----------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity vectsub is
  generic (
    width : natural := 32);
  port(
    Ax,Ay,Az : in std_logic_vector(width-1 downto 0);
    Bx,By,Bz : in std_logic_vector(width-1 downto 0);
    Cx,Cy,Cz : out std_logic_vector(width downto 0);
    clk : in std_logic);
end;

architecture rtl of vectsub is
begin
  process(clk)
  begin
    if (rising_edge(clk)) then
      Cx <= (Ax(width-1) & Ax) - (Bx(width-1) & Bx);
      Cy <= (Ay(width-1) & Ay) - (By(width-1) & By);
      Cz <= (Az(width-1) & Az) - (Bz(width-1) & Bz);
    end if;
  end process;
end rtl;
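The vectsub component widens each operand by one bit, replicating the sign bit, so that the difference of two signed width-bit values can never overflow. A minimal C sketch of that arithmetic is given below. It is an illustration only: the function names `widened_sub` and `fits_signed` are hypothetical, and the model is limited to widths of at most 30 bits so that it fits in a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract two signed `width`-bit values and return the exact difference.
 * The result needs width+1 bits, which is why the VHDL widens both
 * operands by replicating the sign bit before subtracting. */
int32_t widened_sub(int32_t a, int32_t b, int width)
{
    assert(width >= 1 && width <= 30);
    int32_t lo = -((int32_t)1 << (width - 1));
    int32_t hi = ((int32_t)1 << (width - 1)) - 1;
    assert(a >= lo && a <= hi && b >= lo && b <= hi);
    return a - b;                         /* fits in width+1 bits, no wrap */
}

/* Return 1 if v is representable as a signed two's complement number
 * with the given bit count, 0 otherwise. */
int fits_signed(int32_t v, int bits)
{
    int32_t lo = -((int32_t)1 << (bits - 1));
    int32_t hi = ((int32_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}
```

For 16-bit operands the extreme case is -32768 - 32767 = -65535, which does not fit in 16 bits but does fit in 17, matching the `width downto 0` output range of the component.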
Appendix C: C Code
**** load.c **** Raytracing processor interface program
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <math.h>
#include "portutil.h"
#include "framebuf.h"
#include "trilist.h"

#define TM3enable
#define PI 3.1415

typedef struct {
  float x,y,z;
} vect3f;

vect3f normalize(vect3f in) {
  vect3f result;
  float len;

  len = sqrt(in.x*in.x+in.y*in.y+in.z*in.z);
  result.x = in.x / len;
  result.y = in.y / len;
  result.z = in.z / len;
  return result;
}

vect3f cross(vect3f a, vect3f b) {
  vect3f result;
  result.x = a.y*b.z-a.z*b.y;
  result.y = a.z*b.x-a.x*b.z;
  result.z = a.x*b.y-a.y*b.x;
  return result;
}
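
long long int packray(vect3f ray) {
  signed short int x,y,z;
  x = ray.x;
  y = ray.y;
  z = ray.z;

  /* Mask each component to 16 bits before shifting so that the sign
     extension of a negative component cannot alter its neighbour. */
  return (((unsigned long long int) (unsigned short int) x) << 32) +
         (((unsigned long long int) (unsigned short int) y) << 16) +
         (((unsigned long long int) (unsigned short int) z));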
}

void tmSendRays(vect3f orig, vect3f dir, vect3f up, float view_x, float view_y) {
  int rgdatain,rgaddrin,origx,origy,origz;
  unsigned long long int *data;
  int x,y;
  float sx,sy;
  vect3f leftn,dirn,upn;
  vect3f raydir;
  float tanx, tany;

  dirn = normalize(dir);
  upn = normalize(up);
  leftn = normalize(cross(up,dir));
  tanx = tan( (float)view_x/360*PI);
  tany = tan( (float)view_y/360*PI);
  data = (unsigned long long int *) malloc(8*321*240);

  for(y = 0; y < 240; ++y) {
    for(x = 0; x < 320; ++x) {
      sx = 2*tanx*(x-160)/320;
      sy = 2*tany*(y-120)/240;
      raydir.x = dirn.x+leftn.x*sx+upn.x*sy;
      raydir.y = dirn.y+leftn.y*sx+upn.y*sy;
      raydir.z = dirn.z+leftn.z*sx+upn.z*sy;
      raydir = normalize(raydir);
      raydir.x *= 32767;
      raydir.y *= 32767;
      raydir.z *= 32767;
      data[y*321+x] = packray(raydir) +
                      (((unsigned long long int)(y*107+floor(x/3))) << 48);
    }
    data[y*321+320] = 0xffff000000000000l;
  }

#ifdef TM3enable
  origx = openPort("origx","w");
  origy = openPort("origy","w");
  origz = openPort("origz","w");
  x = orig.x; writeIntPort(origx,"OrigX",x);
  x = orig.y; writeIntPort(origy,"OrigY",x);
  x = orig.z; writeIntPort(origz,"OrigZ",x);
  tm_close(origx);
  tm_close(origy);
  tm_close(origz);

  // write ray direction data
  rgaddrin = openPort("rgaddrin","w");
  rgdatain = openPort("rgdatain","w");
  write3BytesPort(rgaddrin,"Addr",0);
  writePort(rgdatain,"Data in",(char *) data,8*321*240);
  printf("Rays written to TM3\n");
  tm_close(rgaddrin);
  tm_close(rgdatain);
#endif
  free(data);
}

void tm3go() {
  int pglobalreset,rgcont,rgstat,rgaddrin,input3;

#ifdef TM3enable
  input3 = openPort("input3","w");
  writeIntPort(input3,"Stop Level",3);
  tm_close(input3);

  printf("Rendering Image\n");
  pglobalreset = openPort("pglobalreset","w");
  toggleBitPort(pglobalreset,"Global Reset",1);
  tm_close(pglobalreset);

  rgaddrin = openPort("rgaddrin","w");
  write3BytesPort(rgaddrin,"Addr",0);
  tm_close(rgaddrin);

  rgcont = openPort("rgcont","w");
  writeIntPort(rgcont,"Control Port",(321*240/3)*2+1);
  writeIntPort(rgcont,"Control Port",(321*240/3)*2);
  tm_close(rgcont);

  rgstat = openPort("rgstat","r");
  while((readPort4(rgstat,"Status Port") & 0x80000000) != 0);
  printf("Total Cycles: %u\n",readPort4(rgstat,"Status Port"));
  tm_close(rgstat);
#endif
}
void writeSRAM0TM3(unsigned long long int *data, int address, int bytes) {int saddrin,sdatain;
saddrin = openPort("saddrin","w");sdatain = openPort("sdatain","w");writeMemoryPort3(saddrin,sdatain,"TriData: addr","TriData: data",address,bytes,(char
*)data);tm_close(saddrin);tm_close(sdatain);
}
void writeSRAM1TM3(unsigned long long int *data, int address, int bytes) {int addr2in,data2in;
addr2in = openPort("addr2in","w");data2in = openPort("data2in","w");writeMemoryPort3(addr2in,data2in,"TriList: addr","TriList: data",address,bytes,(char
*)data);tm_close(addr2in);tm_close(data2in);
}
void writeIndMemTM3(unsigned int *data, int address, int bytes) {int braddrin,brdatain;
braddrin = openPort("braddrin","w");brdatain = openPort("brdatain","w");writeMemoryPort2(braddrin,brdatain,"indmem:addr","indmem:data",address,bytes,(char
*)data);tm_close(braddrin);tm_close(brdatain);
}

void tm3writeTGA() {
  FrameBuffer *buf;
  unsigned long long int *data;
  int x,y;
  int rgaddrin,rgdataout;

#ifdef TM3enable
  rgaddrin = openPort("rgaddrin","w");
  rgdataout = openPort("rgdataout","r");
  write3BytesPort(rgaddrin,"Addr",0x30000);
  data = (unsigned long long int *) malloc(8*107*240);
  readPort(rgdataout,"Data out",(char *) data,8*107*240);
  tm_close(rgaddrin);
  tm_close(rgdataout);

  buf = createFrameBuffer(320,240);
  for (y = 0; y < 240; ++y) {
    for(x = 0; x < 107; ++x) {
      setPixel(buf,(x*3) ,y,
               2*((data[y*107+x] >> 56) & 0x7f),
               2*((data[y*107+x] >> 49) & 0x7f),
               2*((data[y*107+x] >> 42) & 0x7f));
      setPixel(buf,(x*3)+1,y,
               2*((data[y*107+x] >> 35) & 0x7f),
               2*((data[y*107+x] >> 28) & 0x7f),
               2*((data[y*107+x] >> 21) & 0x7f));
      if (x != 106)
        setPixel(buf,(x*3)+2,y,
                 2*((data[y*107+x] >> 14) & 0x7f),
                 2*((data[y*107+x] >> 7) & 0x7f),
                 2*(data[y*107+x] & 0x7f));
    }
  }
  writeTGA(buf,"dataout.tga");
  free(data);
  // destroyFrameBuffer(buf);
#endif
}
int readintn(FILE *f) {int result;fscanf(f,"%d\n",&result);return result;
}
int readint(FILE *f) {int result;fscanf(f,"%d",&result);return result;
}
void loadcamera(FILE *f, int *line) {char buf[50];vect3f orig,dir,up;float view_x,view_y;
while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"origx") == 0) orig.x = readintn(f);else if (strcasecmp(buf,"origy") == 0) orig.y = readintn(f);else if (strcasecmp(buf,"origz") == 0) orig.z = readintn(f);else if (strcasecmp(buf,"dirx") == 0) dir.x = readintn(f);else if (strcasecmp(buf,"diry") == 0) dir.y = readintn(f);else if (strcasecmp(buf,"dirz") == 0) dir.z = readintn(f);else if (strcasecmp(buf,"upx") == 0) up.x = readintn(f);else if (strcasecmp(buf,"upy") == 0) up.y = readintn(f);else if (strcasecmp(buf,"upz") == 0) up.z = readintn(f);else if (strcasecmp(buf,"FOVX") == 0) fscanf(f,"%g\n",&view_x);else if (strcasecmp(buf,"FOVY") == 0) fscanf(f,"%g\n",&view_y);else if (strcasecmp(buf,"endcamera") == 0) {
tmSendRays(orig,dir,up,view_x,view_y);fscanf(f,"\n"); return;
}else {
printf("Line %d: Expected endcamera found %s instead\n",*line,buf);
exit(1);
}
}printf("Line %d: Expected endcamera found EOF instead\n",*line);exit(1);
}
sPolygon loadPoly(FILE *f, char square) {sPolygon result;result.vert0x = readint(f);result.vert0y = readint(f);result.vert0z = readint(f);result.vert1x = readint(f);result.vert1y = readint(f);result.vert1z = readint(f);result.vert2x = readint(f);result.vert2y = readint(f);result.vert2z = readint(f);result.square = square;return result;
}
void loadleaf(FILE *f, int *line) {char buf[50];
while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"poly") == 0) {
addObjectPoly(loadPoly(f,0));} else if (strcasecmp(buf,"square") == 0) {
addObjectPoly(loadPoly(f,1));} else if (strcasecmp(buf,"endleaf") == 0) {
fscanf(f,"\n");return;
} else {printf("Line %d: Expected endleaf found %s instead\n",*line,buf);exit(1);
}}printf("Line %d: Expected endleaf found EOF instead\n",*line);exit(1);
}
void loadlevel2(FILE *f, int *line) {char buf[50];char count = 0;
if (push() == 1) {
  printf("Line %d: Only 8 level2 bounding boxes supported\n",*line);
  exit(1);
}
while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"leaf") == 0) {
loadleaf(f,line);} else if (strcasecmp(buf,"poly") == 0) {
addBoundPoly(loadPoly(f,0)); count++;} else if (strcasecmp(buf,"square") == 0) {
addBoundPoly(loadPoly(f,1)); count++;} else if (strcasecmp(buf,"endlevel2") == 0) {
fscanf(f,"\n");pop();if (count == 6) return;printf("Line %d: A bounding box requires 6 bounding polys\n",*line);exit(1);
} else {printf("Line %d: Expected endlevel2 found %s instead\n",*line,buf);exit(1);
}}printf("Line %d: Expected endlevel2 found EOF instead\n",*line);exit(1);
}
void loadlevel1(FILE *f, int *line) {char buf[50];char count = 0;
if (push() == 1) {
  printf("Line %d: Only 8 level1 bounding boxes supported\n",*line);
  exit(1);
}
while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"level2") == 0) {loadlevel2(f,line);
} else if (strcasecmp(buf,"poly") == 0) {addBoundPoly(loadPoly(f,0)); count++;
} else if (strcasecmp(buf,"square") == 0) {addBoundPoly(loadPoly(f,1)); count++;
} else if (strcasecmp(buf,"endlevel1") == 0) {fscanf(f,"\n");pop();if (count == 6) return;printf("Line %d: A bounding box requires 6 bounding polys\n",*line);exit(1);
} else {printf("Line %d: Expected endlevel1 found %s instead\n",*line,buf);exit(1);
}}printf("Line %d: Expected endlevel1 found EOF instead\n",*line);exit(1);
}
void loadlevel0(FILE *f, int *line) {char buf[50];char count = 0;
if (push() == 1) {
  printf("Line %d: Only 8 level0 bounding boxes supported\n",*line);
  exit(1);
}
while (!feof(f)) {fscanf(f,"%s",buf); (*line)++;if (strcasecmp(buf,"level1") == 0) {
loadlevel1(f,line);} else if (strcasecmp(buf,"poly") == 0) {
addBoundPoly(loadPoly(f,0)); count++;} else if (strcasecmp(buf,"square") == 0) {
addBoundPoly(loadPoly(f,1)); count++;} else if (strcasecmp(buf,"endlevel0") == 0) {
fscanf(f,"\n");pop();if (count == 6) return;printf("Line %d: A bounding box requires 6 bounding polys\n",*line);exit(1);
} else {printf("Line %d: Expected endlevel0 found %s instead\n",*line,buf);exit(1);
}}printf("Line %d: Expected endlevel0 found EOF instead\n",*line);exit(1);
}
void loadSceneData(FILE *f, int *line) {char buf[50];
unsigned long long int *test;int x;
initTriData();initIndirect();
while (!feof(f)) {fscanf(f,"%s\n",buf); (*line)++;
    if (strcasecmp(buf,"level0") == 0) loadlevel0(f,line);
    else if (strcasecmp(buf,"endscenedata") == 0) {
      finalizeIndirect();
#ifdef TM3enable
      printf("Writing Triangle Data\n");
      writeSRAM0TM3(gettridata(),0,65535*4*8);
      printf("Writing Leaf Node Data\n");
      writeSRAM1TM3(getindirect(),0,1024*512*2);
      printf("Writing Indirection Data\n");
      writeIndMemTM3(getindirectcount(),0,512*2*2);
#endif
      return;
    } else {
      printf("Line %d: Expected endscenedata found %s instead\n",*line,buf);
      exit(1);
    }
  }
  printf("Line %d: Expected endscenedata found EOF instead\n",*line);
  exit(1);
}
int loadfile(FILE *f) {char buf[50];int line;
/* Check for ID string */fscanf(f,"%s\n",buf); line = 1;if (strcasecmp(buf,"TMray") != 0) {
printf("Incorrect input format\n");return 1;
}
while (!feof(f)) {fscanf(f,"%s\n",buf); line++;if (strcasecmp(buf,"Camera")==0) loadcamera(f, &line);else if (strcasecmp(buf,"SceneData")==0) loadSceneData(f,&line);else if (strcasecmp(buf,"WriteTGA")==0) tm3writeTGA();else if (strcasecmp(buf,"go")==0) tm3go();else {
printf("Line %d: Expected Valid Keyword found '%s' instead\n",line,buf);return 1;
}}return 0;
}

int main (int argc, char *argv[]) {
  FILE *f;

  if (argc != 2) {
    printf("Usage: %s filename\n",argv[0]);
    return 1;
  }
  if (!(f = fopen(argv[1],"r"))) {
    printf("File %s not found\n",argv[1]);
    exit(1);
  }
#ifdef TM3enable
  tm_init("");
#endif
  if (loadfile(f)) exit(1);

  fclose(f);
  return 0;
}
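
The ray words built in tmSendRays combine a 16-bit tag with three signed 16-bit direction components, with x in bits 47 to 32, y in bits 31 to 16 and z in bits 15 to 0. The following is a minimal sketch of that layout, written independently of load.c: the names `pack_ray_word` and `to_fixed16` are hypothetical, and each field is masked to 16 bits before it is combined so that a negative component cannot disturb the field above it.

```c
#include <assert.h>
#include <stdint.h>

/* Pack three signed 16-bit direction components and a 16-bit tag into
 * one 64-bit word: [63:48] tag, [47:32] x, [31:16] y, [15:0] z.
 * Each field is masked before shifting so that the sign extension of a
 * negative component cannot spill into its neighbour. */
uint64_t pack_ray_word(int16_t x, int16_t y, int16_t z, uint16_t tag)
{
    return ((uint64_t)tag << 48) |
           ((uint64_t)(uint16_t)x << 32) |
           ((uint64_t)(uint16_t)y << 16) |
            (uint64_t)(uint16_t)z;
}

/* Scale one component of a unit vector to the signed 16-bit range,
 * truncating toward zero as the float-to-short conversion does. */
int16_t to_fixed16(float component)
{
    assert(component >= -1.0f && component <= 1.0f);
    return (int16_t)(component * 32767.0f);
}
```

For example, `pack_ray_word(1, -1, 0, 0)` yields 0x00000001FFFF0000: the y field holds 0xFFFF while the x field remains 0x0001.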
**** framebuf.c ****
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include "framebuf.h"
void writeTGA(FrameBuffer *buf, char *name) {unsigned short temp;int x;FILE *fout;
assert(buf != NULL);/* Initialize the File Header */
fout = fopen(name,"wb");assert(fout != NULL);temp = 0; fwrite(&temp,sizeof(temp),1,fout);temp = 2 << 8; fwrite(&temp,sizeof(temp),1,fout);for (x = 0; x < 4; x++) {
temp = 0; fwrite(&temp,sizeof(temp),1,fout);}temp = (buf->Width << 8) + (buf->Width >> 8);fwrite(&temp,sizeof(temp),1,fout);temp = (buf->Height << 8) + (buf->Height >> 8);fwrite(&temp,sizeof(temp),1,fout);temp = 0x1830; fwrite(&temp,sizeof(temp),1,fout);fwrite(buf->Data,buf->Width*buf->Height*3,1,fout);fclose(fout);
}
FrameBuffer *createFrameBuffer(int width, int height) {FrameBuffer *result;
  result = (FrameBuffer *) malloc(sizeof(FrameBuffer));
  assert(result != NULL);
  result->Width = width;
  result->Height = height;
  result->Data = (char *) malloc(width*height*3);
  /* Check the allocation before the buffer is cleared */
  assert(result->Data != NULL);
  memset(result->Data,0,width*height*3);
  return result;
}
void setPixel(FrameBuffer *buf, int x, int y, char red, char green, char blue) {assert(buf != NULL);assert(buf->Data != NULL);assert( (x >= 0) && (x < buf->Width) );assert( (y >= 0) && (y < buf->Height) );buf->Data[(y*buf->Width+buf->Width-x-1)*3+2] = red;buf->Data[(y*buf->Width+buf->Width-x-1)*3+1] = green;buf->Data[(y*buf->Width+buf->Width-x-1)*3] = blue;
}
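
The colour values passed to setPixel by tm3writeTGA in load.c come from 64-bit result words that each hold three pixels of 7-bit red, green and blue. A minimal sketch of that unpacking follows, using the shift amounts that appear in tm3writeTGA (56, 49, 42 for the first pixel, 35, 28, 21 for the second and 14, 7, 0 for the third). The type `rgb8` and the function `unpack_pixel` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned char r, g, b; } rgb8;   /* hypothetical helper type */

/* Extract pixel `index` (0, 1 or 2) from a packed 64-bit result word.
 * Each pixel occupies 21 bits: three 7-bit channels in R, G, B order,
 * with pixel 0 starting at bit 62. Channels are doubled to approximate
 * an 8-bit range, as tm3writeTGA does. */
rgb8 unpack_pixel(uint64_t word, int index)
{
    assert(index >= 0 && index < 3);
    int base = 42 - 21 * index;          /* bit position of the blue channel */
    rgb8 p;
    p.r = (unsigned char)(2 * ((word >> (base + 14)) & 0x7f));
    p.g = (unsigned char)(2 * ((word >> (base + 7)) & 0x7f));
    p.b = (unsigned char)(2 * ((word >> base) & 0x7f));
    return p;
}
```

Because each channel is doubled, the largest value written to the frame buffer is 254 rather than 255.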
**** portutil.c ****
#include <stdio.h>
#include <stdlib.h>
#include "portutil.h"
unsigned char readPort1(int port, char *name) {unsigned char temp;
if(tm_read(port, &temp, 1) != 1) {fprintf(stderr, "ERROR: Error reading %s port\n",name);exit(1);
}return temp;
}
unsigned short int readPort2(int port, char *name) {unsigned short int temp;
if(tm_read(port, &temp, 2) != 2) {fprintf(stderr, "ERROR: Error reading %s port\n",name);exit(1);
}return temp;
}
unsigned int readPort4(int port, char *name) {unsigned int temp;
if(tm_read(port, &temp, 4) != 4) {fprintf(stderr, "ERROR: Error reading %s port\n",name);exit(1);
}return temp;
}
int openPort(char *name, char *mode) {int temp;if ((temp = tm_open(name,mode)) < 0) {
fprintf(stderr,"ERROR: Can't open port %s in mode %s\n",name,mode);exit(1);
}return temp;
}
void writePort(int port, char *name, char *data, int bytes) {if(tm_write(port, data, bytes) != bytes) {
fprintf(stderr, "ERROR: Unable to write %u bytes to port %s [%u]\n",bytes,name,port);exit(1);
}}
void readPort(int port, char *name, char *data, int bytes) {if(tm_read(port, data, bytes) != bytes) {
fprintf(stderr, "ERROR: Unable to read %u bytes from port %s [%u]\n",bytes,name,port);exit(1);
}}
void writeCharPort(int port, char *name, char val) {if(tm_write(port, &val, 1) != 1) {
fprintf(stderr, "ERROR: Unable to write %u to port %s [%u]\n",val,name,port);exit(1);
}}
void writeIntPort(int port, char *name, unsigned int val) {if(tm_write(port, &val, 4) != 4) {
fprintf(stderr, "ERROR: Unable to write %u to port %s [%u]\n",val,name,port);exit(1);
}}
void writeShortIntPort(int port, char *name, unsigned short int val) {if(tm_write(port, &val, 2) != 2) {
fprintf(stderr, "ERROR: Unable to write %u to port %s [%u]\n",val,name,port);exit(1);
}}
void write3BytesPort(int port, char *name, unsigned int val) {int temp;
temp = val << 8;if(tm_write(port, &temp, 3) != 3) {
fprintf(stderr, "ERROR: Unable to write %u to port %s [%u]\n",val,name,port);exit(1);
}}
void toggleBitPort(int port, char *name, char val) {writeCharPort(port,name,val);if (val == 0)
writeCharPort(port,name,1);else
writeCharPort(port,name,0);}
/* Writes using standard memory interface method with a 3 byte address */void writeMemoryPort3(int addrport, int dataport, char *addrname, char *dataname,
unsigned int addr, int bytes, char *data) {write3BytesPort(addrport,addrname,addr);writePort(dataport,dataname,data,bytes);write3BytesPort(addrport,addrname,addr);
}
/* Writes using standard memory interface method with a 2 byte address */void writeMemoryPort2(int addrport, int dataport, char *addrname, char *dataname,
unsigned short int addr, int bytes, char *data) {writeShortIntPort(addrport,addrname,addr);writePort(dataport,dataname,data,bytes);writeShortIntPort(addrport,addrname,addr);
}
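
write3BytesPort shifts the value left by eight bits and transmits the first three bytes of the resulting integer as stored in memory. On a big-endian host (an assumption here; the listing does not state the byte order of the host machine) those are the three low-order bytes of the address, most significant first. A byte-order-independent sketch of the same encoding is shown below; the function name `encode_addr24` is hypothetical and it fills a caller-supplied buffer rather than calling `tm_write`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialise the low 24 bits of addr into out[0..2], most significant
 * byte first, regardless of the byte order of the host.
 * Returns the number of bytes written. */
size_t encode_addr24(uint32_t addr, unsigned char out[3])
{
    assert(addr <= 0xFFFFFFu);           /* only 24 bits are transmitted */
    out[0] = (unsigned char)((addr >> 16) & 0xff);
    out[1] = (unsigned char)((addr >> 8) & 0xff);
    out[2] = (unsigned char)(addr & 0xff);
    return 3;
}
```

For the result base address 0x30000 used in tm3writeTGA, this produces the bytes 0x03, 0x00, 0x00.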
**** trilist.c ****
#include <stdio.h>
#include "assert.h"
#include "trilist.h"

/* Note: I may have to rearrange the data such that endianness is correct */
unsigned short int indirect[512][1024];
unsigned long long int *tridata = NULL;

/* Note: [0] is the count but it must be shifted left by 2 bits */
/* The count must be larger than 8 or so to prevent result collision */
unsigned short int indirectcount[512][2];
int activelevel,trinum0,trinum1,trinum2,level0,level1,level2,triindex;
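unsigned long long int *gettridata() {
    return tridata;
}

/* The tables are arrays of unsigned short; the casts make the pointer
   conversions to the declared return types explicit. */
unsigned long long int *getindirect() {
    return (unsigned long long int *) indirect;
}

unsigned int *getindirectcount() {
    return (unsigned int *) indirectcount;
}
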
void initIndirect() {
    int x;

    for (x = 0; x < 512; x++) {
        indirectcount[x][0] = 0; /* Count *4 */
        indirectcount[x][1] = x*1024 / 4;
    }
    level0 = 0; level1 = 0; level2 = 0;
    activelevel = 0;
    trinum0 = 0;
    trinum1 = 0;
    trinum2 = 0;
    triindex = 65534;
}

void finalizeIndirect() {
    int x, y;

    for (x = 0; x < 512; x++) {
        /* Make sure every node meets the minimum requirement of 8 triangles */
        /* Pad if Necessary (Empty nodes are exempt) */
        if ((indirectcount[x][0] > 0) && (indirectcount[x][0] < 8)) {
            for (y = indirectcount[x][0]; y < 8; y++) indirect[x][y] = 0xffff;
            indirectcount[x][0] = 7;
        }
        if (indirectcount[x][0] != 0) {
            indirectcount[x][0] += 1;
        }
        indirectcount[x][0] *= 4; /* Scale to fit proper bit position */
    }
}

/* Appends triID to the list of leaf node nodeID (leaf node IDs start at 72) */
void addTritoNode(unsigned short int triID, unsigned short int nodeID) {
    indirect[nodeID-72][ indirectcount[nodeID-72][0]++ ] = triID;
}

/* Packs a triangle into 24 bytes */
void packTriangle(unsigned long long int *data,
                  int vert0x, int vert0y, int vert0z,
                  int edge1x, int edge1y, int edge1z,
                  int edge2x, int edge2y, int edge2z, char square) {
    /* Word 0: edge1x, edge1y, edge1z, edge2x (16 bits each) */
    data[0] = (((long long int)edge1x) & 0xffff) +
              (((long long int)edge1y << 16) & 0xFFFF0000) +
              (((long long int)edge1z << 32) & 0xFFFF00000000) +
              (((long long int)edge2x << 48) );
    /* printf("Pack 0: e1x %x e1y %x e1z %x e2x %d Packed:%016llx\n",edge1x,edge1y,edge1z,edge2x,data[0]);*/

    /* Word 1: edge2y, edge2z (16 bits each), vert0x (28 bits starting at bit 32) */
    data[1] = (((long long int) edge2y) & 0xFFFF) +
              (((long long int) edge2z << 16) & 0xFFFF0000) +
              (((long long int) vert0x << 32) & 0x0FFFFFFF00000000LL);
    /* printf("Pack 1: e2y %x e2z %x v0x %x Packed:%016llx\n",edge2y,edge2z,vert0x,data[1]);*/

    /* Word 2: vert0y (28 bits), vert0z (28 bits starting at bit 32), square flag in bit 63 */
    data[2] = (((long long int) vert0y) & 0x0FFFFFFF) +
              ((((long long int) vert0z) << 32) & 0x0fffffff00000000LL) +
              ((long long int)square << 63);

    /* Word 3 is unused */
    data[3] = 0;
}

void clearTriData() {
    int x;

    for (x = 0; x < 65536; x++) {
        tridata[x*4] = 0;
        tridata[x*4+1] = 0;
        tridata[x*4+2] = 0;
        tridata[x*4+3] = 0;
    }
}

/* Packs a triangle given by three vertices: vert1 becomes the base vertex
   and the two edges are taken relative to it */
void packVTriangle(unsigned long long int *data,
                   int vert0x, int vert0y, int vert0z,
                   int vert1x, int vert1y, int vert1z,
                   int vert2x, int vert2y, int vert2z, char square) {
    packTriangle(data, vert1x, vert1y, vert1z,
                 vert2x-vert1x, vert2y-vert1y, vert2z-vert1z,
                 vert0x-vert1x, vert0y-vert1y, vert0z-vert1z, square);
    /* printf("%016llx\n",data[0]);
       printf("%016llx\n",data[1]);
       printf("%016llx\n",data[2]);*/
}

void initTriData() {
    tridata = (unsigned long long int *) malloc(sizeof(long long int) * 65536 * 4);
    assert(tridata != NULL);
    clearTriData(); /* takes no arguments; it clears the global tridata */
}

void addBoundPoly(sPolygon p) {
    int polyID;

    switch (activelevel) {
    case(1): polyID = (level0-1)*6+trinum0; trinum0++; break;
    case(2): polyID = (level0-1)*8*6+(level1-1)*6+trinum1+48; trinum1++; break;
    case(3): polyID = (level0-1)*8*8*6+(level1-1)*8*6+(level2-1)*6+trinum2+432;
             trinum2++; break;
    }
    /* printf("Adding Bound Poly Level %d ID %d\n",activelevel,polyID);
       printf(" (%d %d %d) (%d %d %d) (%d %d %d)\n",
              p.vert0x,p.vert0y,p.vert0z,p.vert1x,p.vert1y,p.vert1z,p.vert2x,p.vert2y,p.vert2z);
    */
    packVTriangle(&(tridata[polyID*4]),
                  p.vert0x, p.vert0y, p.vert0z,
                  p.vert1x, p.vert1y, p.vert1z,
                  p.vert2x, p.vert2y, p.vert2z, p.square);
}

void addObjectPoly(sPolygon p) {
    int nodeID;

    nodeID = (level0-1)*8*8+(level1-1)*8+(level2-1)+72;
    /* printf("Adding Object Poly %d to node %d\n",triindex, nodeID);
       printf(" (%d %d %d) (%d %d %d) (%d %d %d)\n",
              p.vert0x,p.vert0y,p.vert0z,p.vert1x,p.vert1y,p.vert1z,p.vert2x,p.vert2y,p.vert2z);
    */
    addTritoNode(triindex, nodeID);
    packVTriangle(&(tridata[triindex*4]),
                  p.vert0x, p.vert0y, p.vert0z,
                  p.vert1x, p.vert1y, p.vert1z,
                  p.vert2x, p.vert2y, p.vert2z, p.square);
    triindex--; /* object triangles are allocated from the top of tridata downwards */
}

/* Descends one level in the tree; returns 1 if the current level already has 8 children */
int push() {
    switch(activelevel) {
    case(0):
        if (level0 == 8) return 1;
        level0++;
        level1 = 0;
        trinum0 = 0;
        break;
    case(1):
        if (level1 == 8) return 1;
        level1++;
        level2 = 0;
        trinum1 = 0;
        break;
    case(2):
        if (level2 == 8) return 1;
        level2++;
        trinum2 = 0;
        break;
    }
    activelevel++;
    return 0;
}

/* Ascends one level in the tree */
void pop() {
    activelevel--;
}
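
The index arithmetic in addBoundPoly and addObjectPoly can be read as a flattening of a three level tree with eight children per node. The sketch below restates that arithmetic as two pure functions. The names boundPolyID, leafNodeID, POLYS_PER_BOX and FANOUT are hypothetical and do not appear in the original source, and the reading that each box stores six bounding polygons is inferred from the constant 6 in the listing. The offsets 48 and 432 equal 8*6 and 8*6 + 64*6, and the leaf offset 72 equals 8 + 64.

```c
#include <assert.h>

#define POLYS_PER_BOX 6  /* inferred: bounding polygons stored per box */
#define FANOUT 8         /* children per tree node */

/* Hypothetical helper: slot in tridata (in units of 4 words) of the n-th
   bounding polygon of the box reached by the 1-based child indices
   l0, l1, l2. depth is 1, 2 or 3; unused indices are ignored. */
int boundPolyID(int depth, int l0, int l1, int l2, int n) {
    assert(depth >= 1 && depth <= 3);
    assert(n >= 0 && n < POLYS_PER_BOX);
    switch (depth) {
    case 1:
        return (l0 - 1) * POLYS_PER_BOX + n;
    case 2:
        return ((l0 - 1) * FANOUT + (l1 - 1)) * POLYS_PER_BOX + n + 48;
    default:
        return (((l0 - 1) * FANOUT + (l1 - 1)) * FANOUT + (l2 - 1))
               * POLYS_PER_BOX + n + 432;
    }
}

/* Hypothetical helper: ID of the leaf node reached by the 1-based child
   indices l0, l1, l2. Leaf IDs start at 72, after the 8 + 64 inner nodes. */
int leafNodeID(int l0, int l1, int l2) {
    return ((l0 - 1) * FANOUT + (l1 - 1)) * FANOUT + (l2 - 1) + 72;
}
```

Under this reading, the 512 leaf IDs run from 72 to 583, which matches the 512 rows of the indirect table once the offset of 72 is subtracted in addTritoNode.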
Appendix D: Brute Force Test Images