Packet-based Ray Tracing of Catmull-Clark Subdivision Surfaces

Packet-based Ray Tracing of Catmull-Clark Subdivision Surfaces

Carsten Benthin† Solomon Boulos� Dylan Lacewell‡ Ingo Wald∓†

†Intel Corporation �Stanford University ‡Walt Disney Animation Studios ∓SCI Institute, University of Utah

Figure 1: Direct ray tracing of a complex subdivision surface scene containing 1.79M base faces, from Disney’s Meet the Robinsons c©.Left: For reference, production rendering using Pixar’s PRMan. Center: Pure ray casting of the subdivision surfaces (no water or skydome)with simple shading. Using 5 levels of uniform subdivision, we can render this frame at 2.2 frames per second on an 8-Core 2.0 GHz Core2 Quad MacPro (1384 × 757 pixels). Right: Using an adaptive subdivision scheme at comparable quality at 4.8 fps (red = 5 subdivisions;green = 4, blue = 3, cyan is crack fixing).

Abstract

Efficient ray tracing of subdivision surfaces is an important prob-lem in production rendering, and for interactive applications in thenear future. The current hardware trends for both CPUs and GPUssuggest that compute power is outpacing bandwidth. Despite this,current approaches for ray tracing subdivision surfaces favor ge-ometry caches or full pre-tessellation. We demonstrate that directlyray tracing subdivision surfaces using ray packets uses much lessbandwidth, while still providing amortization benefits. Our pro-posed method performs competitively with pre-tessellation even oncurrent hardware, outperforms a single-ray implementation by upto 16× and Pixar’s PRMan 13.0 geometry caching by up to 23.1×.

1 Introduction

Subdivision surfaces (Subds) have become the most widely usedorganic modeling primitive in the animation industry, and we ex-pect games to follow this trend in the future. The advantage ofsubds is that smooth surfaces can be described at a very coarselevel, with the generation of a temporary, dense polygonal mesh de-ferred until late in the rendering pipeline (e.g., as in a REYES ren-derer [CCC87]). Deferred tessellation places the strain on computepower rather than on the memory system, which will have increas-ing benefits over geometry caching as compute power continues toincrease more quickly than memory bandwidth.

We also expect future games to demand increasing amounts of raytracing. At first ray tracing might be used for occlusion queries(e.g. exact hard shadows, easy soft shadows, collision detection),and later for indirect lighting effects such as reflections and globalillumination. Not having the ability to ray trace subds will quicklybecome a barrier to the adoption of ray tracing in games.

In this paper, we show that it is possible to efficiently support di-rect ray tracing of subds, specifically Catmull-Clark surfaces, with-out a geometry cache. We instead rely on ray packets to provide asimilar amortization benefit but with lower memory bandwidth; weuse only kilobytes of memory per packet, which fits in the proces-sor’s local cache hierarchy. Compared to either geometry cachingschemes or single-ray intersection, our packet traversal providessignificant speedups even for real-world film production scenes (seeFigure 1).

2 Background

2.1 Subdivision surfaces

Subdivision surfaces define a smooth limit surface by recursivelysubdividing a polygonal mesh, called the control mesh. Vari-ous subdivision rules have been proposed (e.g., [DS78; CC78;Loo87; ZSS96]), but Catmull-Clark subdivision has become themost common scheme in production rendering since it was adoptedby Pixar [DKT98].

Catmull-Clark subdivision

Catmull-Clark (CC) subdivision surfaces generalize bicubic B-spline surfaces. For detailed background information, the recentSIGGRAPH course notes [ZSD∗00] are an excellent primer. Wewill focus on the properties most relevant to our implementation.

To refine one mesh Mi into another Mi+1, the Catmull-Clark sub-division rules are as follows (also see, e.g., [DKT98]):

1. Create new face points. For each face, compute it’s centroid,and add it as a new face point f i+1.

2. Create new edge points. For each edge compute a new edgepoint ei+1

j by averaging it’s two vertices vij , vi

j+1 and the mid-points f i+1

j and f i+1j+1 of its two neighboring faces.

3. Refine vertices. For each vertex vi, the new vertex vi+1 isthe weighted average of the adjacent edge points ei

j , all facepoints f i+1

j of faces incident to this vertex and vi itself.

4. Generate new mesh by connecting each new face point f i+1

to all of its surrounding new edge ei+1 and vertex points vi+1.

Though not obvious from the rules above, CC subdivision has anumber of useful properties for purposes of efficient ray tracing.After one round of subdivision all faces become quads, and exceptfor boundaries, at most one vertex of a quad can be an extraordinaryvertex (EV), with valence not equal to 4. Also, CC subdivision isinvariant under affine transformations, which means any numberof data “channels” (e.g., position plus texture coordinates) can bestored at the vertices, and every channel can be subdivided inde-pendently. Most importantly for our purposes, subdivision is local:only a small neighborhood of adjacent faces (the 1-ring) is need tosubdivide a face.

a) b) c) d)

Figure 2: a) 1-ring for a single vertex. The 1-ring of a quad consists of the union of the 1-rings of its four vertices. b) 1-ring for a regularface. The sixteen control vertices can be stored in an efficient 4x4 layout. Explicit connectivity is not required. c) After one subdivision step,a quadrilateral can only contain one EV. Even for irregular faces, the 4x4 layout is maintained and the 1-ring for the EV is stored separately.d) One-ring for a regular boundary face. Vertices 0 through 3 are extrapolated, e.g., v0 = 2v4 − v8.

2.2 Interactive ray tracing

Researchers have demonstrated the feasibility of interactive raytracing in many settings over the last decade. With the supercom-puter work by Parker et al. [PMS∗99] and then later for commod-ity hardware with SIMD by Wald et al. [WSBW01]. Extendingthis trend beyond simple hardware gains, Reshetov et al. [RSH05]demonstrated a novel use of packets that allowed kd-trees to achieveamortization beyond SIMD speedups alone. Wald et al. [WIK∗06;WBS07] then presented a pair of algorithms that handle dynamicscenes while also including new packet based approaches that gobeyond SIMD speedups alone.

In the packet based approaches, gains beyond single ray are onlyachieved when the rays within packets follow the same traversalpath. In the case of primary rays and coherent shadow rays, the pre-vious papers demonstrate excellent amortization results for packettracing. As noted by Reshetov [Res06], the speedups gained byrecent work in packets may not apply to more general ray trac-ing. Boulos et al. [BEL∗07] recently demonstrated, however, thatbounding volume hierarchies still allow for both SIMD and algo-rithmic speedups for both Whitted style and distribution ray trac-ing [CPC84].

In all of these packet approaches, the gains of using packets isstrongly linked to the cost of the repeated operation. If a packetis used to amortize only a few inexpensive operations (e.g., kd-treeplane tests), there is not very much benefit in amortization. In thispaper, we demonstrate that ray packets provide an efficient way toamortize subdivision cost, and significantly outperform a single rayimplementation.

2.3 Ray tracing higher-order surfaces

Methods for ray tracing higher-order surfaces tend to fall into twobroad categories: tessellation and direct intersection.

In a tessellation approach, the higher-order surface is diced up intosmall polygons (triangles or quads). This works well for stream-ing geometry in the REYES system [CCC87] but usually requires acaching system to support less coherent ray intersection. In Pharr etal. [PKGH97], rays are coherently marched through a coarse scenegrid and geometry is tessellated on demand and reused. Chris-tensen et al. [CLF∗03] achieve high hit rates with a multi-resolutioncaching scheme, using ray differentials to determine which level ofthe cache to access.

In contrast to tessellation, researchers have also demonstrated di-rect ray intersection of smooth surfaces including Bezier surfaces,trimmed NURBS surfaces, and subdivision surfaces. Direct raytracing of trimmed NURBS surfaces were demonstrated in the UtahInteractive Ray Tracer [PMS∗99; MCFS00]. More recently, Ben-thin et al. [BWS04] demonstrated ray tracing of Bezier surfaces

and Loop subdivision with a focus on SIMD parallelism. Usingmulti-core systems it is now possible to directly ray trace trimmedNURBS surfaces interactively [OA06].

In Whitted’s original ray tracing paper [Whi80], support forCatmull-Clark subdivision was provided through a simple recur-sive method, chosen for simplicity rather than efficiency. Kobbelt etal. [KDS98] present a basis function approach to building a bound-ing hierarchy over a subdivision surface and directly intersect thissurface. Mueller et al. [MTF03] present a method to adaptively raytrace subdivision surfaces. Both methods use single ray implemen-tations and were far from interactive.

3 Ray Tracing Subdivision Surfaces

In this section, we explain how we directly intersect a ray with aCatmull-Clark subdivision surface. We first sketch this for a singleray and then discuss important technical issues such as handlingof boundaries, termination criteria, and tightness of bounding vol-umes. Finally, we discuss the extension to ray packets for improvedperformance.

3.1 Catmull-Clark patch intersection

We begin by separating the base mesh into individual faces andtheir 1-rings , which we call “patches”. Note that the 1-ring ofa quad consists of the union of the 1-rings of its four vertices, seeFigure 2 (a). Due to the local support of Catmull-Clark subdivision,the 1-ring is the only information required to subdivide a face.

Next we build an acceleration structure over the initial set of patches(we use a BVH), and begin tracing rays. Once we have deter-mined that a ray has intersected a patch’s bounding box (more onthat later), we either decide we have reached a sufficient refine-ment level or continue to subdivide. Subdivision produces foursub-patches assuming we have subdivided the mesh once on input.

For each of these sub-patches, we recurse until a termination cri-terion is satisfied. A constant maximum depth is the simplest cri-terion, which we use as a baseline, but we also describe a sim-ple adaptive scheme in Section 3.7. When we reach the maximumdepth, a final intersection test is performed. We split the final quad(which may be non-planar) into two triangles and perform intersec-tion tests on each of them. Note that we do not retain any of thesubdivided patches in memory once the ray terminates.

3.2 Efficient data layout

The vast majority of faces on a mesh are regular (each vertex hasvalence 4) and form a 4× 4 grid as shown in Figure 2 (b). For thiscommon case, connectivity is implicit and we only need to store the16 vertices.

Irregular faces as shown in Figure 2 (c) are relatively rare, and theirrelative occurrence decreases with the amount of subdivision ap-plied. Furthermore, a face can have at most one EV, so the caseillustrated is really the only case, up to rotation and valence of theEV. We only need to store the index and 1-ring of the EV separately.

For quads on the boundary of the mesh, we take advantage of ex-trapolation to store the “virtual” 1-ring in our efficient 4 × 4 for-mat (see Figure 2d). Extrapolation produces a B-spline boundaryby canceling terms in the bicubic B-spline; this allows us to treatboundary cases as regular subdivision and avoid more complicatedcontrol flow.

To ensure high data locality we store the 1-ring for each patch in acontinuous memory region. Moreover, all vertices are stored usingaligned SIMD vectors (each of which is 4 floats). As we wish tosupport texture coordinates, we allow each vertex to represent a 5-tuple (x, y, z, s, t). Because of the SIMD requirement, we end upstoring this as 2 SIMD vectors or 8-tuple.

Sometimes, however, rays only need to compute intersections with-out regard to texture coordinates (e.g. shadow rays). In this case, wesubdivide only the first SIMD vector corresponding to the positioninformation. As detailed in Section 4, this can lead to almost a 2×speedup for these rays.

3.3 Tight bounding volumes

As with any acceleration structure, tighter bounding volumes pro-duce better traversal results. In this case we seek a tight boundingbox for the limit surface of a patch. From subdivision surface the-ory, the only guarantee that can be made is that the convex hull ofthe patch bounds the limit surface. By extension, an axis alignedbounding box of the patch also bounds the limit surface. However,the convex hull contains the full one-ring around the current patch,and so is usually very large (for a regular mesh, it’s roughly 9× thesize of the eventual limit surface). Since the probability of a ran-dom ray hitting a box is proportional to its surface area, these boxesare not very efficient to use for ray tracing.

In our system, we instead utilize a “look-ahead” scheme, based ona nesting property: since we seek to bound the limit surface, wemay replace the bounds for a patch with the combined bounds forits children. As part of reading the model, we subdivide each patcha fixed number of times before calculating its bounding box. (Wedo not retain the child patches in memory, only the bounding boxand the base patch). The tighter bounding boxes greatly reduce theoverlap of boxes for neighboring patches and reduces the averagenumber of intersections per packet by up to an order or magnitudefor some scenes.

3.4 Amortization using packets

The cost of a subdivision step in the modified Catmull-Clarkscheme is fairly high. Even in the regular case (see Figure 2b),we must compute 9 new face vertices, 12 new edge vertices, and 4new vertex positions. This maps well to an efficient SIMD imple-mentation, but the computational costs are still large.

Extending the BVH packet traversal algorithm [WBS07] is fairlystraightforward for the recursive subdivision method. The behaviorwith respect to culling and first hit probabilities seems to remainroughly the same as in the triangular case (comparing the cullingprobabilities in Table 1 with those in [WBS07]). The benefit ofearly culling is also higher for subdivision surfaces, as we savemore by avoiding a subdivision step than avoiding a triangle test.

Scene Killerroo Forest Disney# BVH Traversal Steps 25.09 54.39 72.21# BVH Leaf Intersections 2.71 7.77 30.19# Subdivisions 12.54 35.49 803.48- regular 11.99 35.26 797.55- irregular 0.54 0.23 5.93# AABB Culling Tests 79.87% 61.04% 48.98%- Early Hit Tests 25.82% 26.77% 17.72%- IA Culling Tests 54.06% 34.27% 31.26%

# Final Intersections 8.64 20.66 119.64- Active SIMD packets 3.65 2.14 1.71

Table 1: Traversal stats for a ray packet size of 64 rays (max.16 active SIMD packets) using a predefined subdivision level of5 (1+4). Each frame is rendered using ray casting (primary raysonly) and uses approximately 1M rays.

As with any packet-based method, the packet size strongly corre-lates with performance. If the number of rays within the packet istoo small, the costs for subdivision cannot be efficiently amortized.On the other hand, a larger number of rays in a packet usually ex-hibit less coherence and culling efficiency is greatly reduced. Inour case a packet size of 64 rays shows the best relation betweenculling efficiency and amortization, which is also in line with pre-vious findings for triangular scenes [WBS07].

3.5 Final intersection

When the termination criteria is reached a final intersection step isperformed. Our approach separates the final quadrilateral, whichmight not be planar, into two triangles which are intersected se-quentially. As triangle tests are typically more expensive than ray-box tests, we first shrink the packet by determining both the firstand the last index of rays in the packet which intersect the currentbounding box [WBS07]. Triangle intersections are then only per-formed for these active rays. It is still more likely for a ray to missa triangle than to hit it, so we perform an inside-outside test beforethe distance test [Ben06], typically resulting in a 5-8% speedup.

3.6 Pseudo-code and implementation details

Algorithm 1 shows the pseudo code for our patch intersection algo-rithm. As vertices are stored in a SIMD-friendly layout, the bound-ing box for each patch can be computed by a sequence of SSE-min/max operations. In combination with the SIMD implementa-tion of the BVH-first hit and BVH-interval culling test, a patch canbe either chosen for subdivision or quickly culled.

During intersection, we must maintain a stack of patches that needto be subdivided. Essentially these patches are like nodes in a stan-dard BVH, but the subdivision rules lead to a branching factor of4. Each patch is fairly small, however, so the full stack for a rea-sonable number of subdivisions requires only a few KB of storage.Therefore we don’t bother pre-allocating the stack for each thread.As regular and irregular patches have a slightly different memorylayout, the subdivision code checks the type and branches to differ-ent optimized versions of the “Subdivide” method.

3.7 Adaptive subdivision

A fixed subdivision level is not efficient for scenes with a largerange of depth values, where a single resolution is too low for ob-jects near the camera, but too high for distant objects. When ren-dering from a known camera position it may be possible (althoughtedious) for an artist to assign a subdivision level to each object;this is not possible in interactive applications.

Algorithm 1 Pseudo C++ code for our on-the-fly subdivision in-tersection. Sub-patches are processed in a depth-first manner whichlimits the size of the subdivision stack to four times the maximumsubdivision level. As the stack typically requires only kilobytes ofmemory, it can be resident within the CPU’s cache hierarchy.

while true doif stack.isEmpty() == true then

breakcurrentPatch = stack.pop()Bounds box = currentPatch.GetBounds()if box.CulledByIATest(rayPacket) == true then

continueif box.FirstHitTest(rayPacket) == false then

if box.AnyRayBoxIntersection(rayPacket) == false thencontinue

if currentPatch.depth == maximumDepth thencurrentPatch.FinalIntersection(rayPacket)

elsePatch sub[4]currentPatch.Subdivide(sub)stack.push(sub,4);

Some researchers have used adaptive metrics based on ray distanceto assign a subdivision level automatically to small units of geom-etry (e.g., patches in our system). With discrete subdivision levels,however, cracks can appear between adjacent patches with differentlevels. Cracking becomes worse if the subdivision level is allowedto vary along a ray; this can produce “tunneling”, or cracks betweensub-patches within a single initial patch [SMD∗06].

Like Christensen et al. [CLF∗03] we avoid tunneling by using afixed subdivision level for an entire patch, although instead ofray differentials we use a cheaper ad hoc distance metric: level= log2 (αd/D), where d is the diameter of the scene boundingbox, D is the minimum distance from the camera to the center ofthe patch, and α is a user parameter for the scene. We threshold thisvalue to determine which of three levels to use: coarse, medium, orfine. The finest level corresponds to the same value as uniformsubdivision, while medium and coarse reduce that value by 1 leveleach. If adjacent patches have different subdivision levels then westitch cracks on the lower resolution patch (see Figure 3). Note thatour scheme ensures that we never have to stitch patches that aremore than one level apart.

This simple adaptive scheme allows our crack fixing logic to beboth simple and efficient. While more advanced adaptive subdivi-sion may result in larger gains, we present our method as a simpleexample of what might be gained through adaptivity.

Figure 3: Left: Adjacent patches are subdivided to differentdepths, so cracks might appear at T-vertices. Right: We fix cracksby creating triangle fans at border sub-patches. No crack fixingneeds to be done for interior sub-patches (red).

4 Results

In this section, we evaluate our algorithm’s performance for a setof test scenes against some competing approaches for renderingsubdivision surfaces. We will compare each approach under threerendering types: ray casting with simple shading, ray casting withshadows, and Whitted style ray tracing with 1 bounce of reflections.

Figure 4: The “Killerroo” (12.1K base quads), the “Forest” scene(122K base quads) and the “Disney” scene (1.8M base quads). Inthese examples, we use a fixed subdivision depth of 5 with 1 sub-division performed as a model preprocess. The scenes then havegeometric complexity equivalent to 6.2M, 62.5M, and 918M trian-gles. On a single 2 GHz core, these examples run at 1.7, 3.92, and6.98 seconds per frame, casting primary rays and performing sub-division for 8-tuples. For 4-tuples the scenes render at 0.96, 1.92,3.54 seconds per frame (of approximately 1M rays).

The ray casting results will establish our best possible performance.Each refinement upon that (shadows and then reflections) will stepalong the continuum from coherent towards incoherent ray distri-butions. As with any packet based method, it is expected that per-formance advantages present in the ray casting case will decreaseas the ray distributions become incoherent.

4.1 Comparison methodology

For a fair comparison, we have chosen a mix of scenes with vary-ing complexity: a simple floating object (the Killerroo, 12K basequads), the forest scene from the Razor paper (122K base quads),and a large, real-world production scene (1.8M base quads) asshown in Figure 4. The Killerroo and Forest scenes are both ren-dered at a resolution of 1024×1024 pixels, while the Disney sceneis rendered at a resolution of 1384× 757 pixels to maintain an ap-propriate aspect ratio for film.

We have identified three competing approaches for ray trac-ing subdivision surfaces: PRMan’s multi-resolution geometrycaching [CFLB06], the Razor system [DHW∗07], and full pre-tessellation of the surfaces. Since pre-tessellation is conceptuallydifferent, we will discuss it in Section 4.7, and focus first on “true”subdivision surface based ray tracing. We also present the resultsfor our simple adaptive subdivision scheme in Section 4.6, whichwe consider an additional improvement over our basic method.

PRMan. PRMan does not directly trace primary rays from the cam-era, so we place a transparent “window” in front of the camera todo so using the trace shade-op. This also allows us to accuratelyseparate tracing time from setup time as PRMan reports statisticsfor the total “shading time” which now includes the trace shade-op. All time spent in “shading” is essentially ray tracing time.

PRMan uses a multi-resolution geometry cache which uses adap-tive subdivision of the surfaces based on the REYES shadingrate [CFLB06] and does not have a uniform subdivision mode. In-stead of attempting to match their adaptive subdivision method, wehave chosen to match their rendering quality even when using uni-form subdivision. This comparison results in a large disadvantagefor our system, but we feel quality is the only adequate metric forcomparison. To further improve our performance, yet retain thesame quality requirements, we use our own adaptive subdivisionscheme. In Section 4.6 we provide a comparison between our twoapproaches.

To determine which uniform subdivision level matches the qualityof PRMan’s adaptive subdivision, we rendered the same scene inboth PRMan and our system using a highly specular shader with ahigh frequency procedural environment map. Using simple diffuseshaders instead would have allowed us to use a lower subdivisionlevel (2 levels lower in most cases), as diffuse shaders obfuscatehigh-frequency detail.

We attempted to use the multi-threading feature in PRMan 13.0,but found that employing more than 1 thread reduces performance.Consequently, all comparisons between our approach and PRManare done using a single thread for both systems.

Razor. As we do not have access to the Razor system to run com-parisons, we rely on the data available from the original paper.The data in the most recent Razor paper [DHW∗07] only presentsCatmull-Clark subdivision results for the “Forest” scene. The ren-dering method used for that setup is ray casting with shadows, sowe will only have a single point of comparison with the Razor sys-tem. We hope to be able to conduct a more detailed comparisonwith Razor at some point in the future.

Razor’s adaptive subdivision scheme is different from ours, just asis the case when comparing to PRMan. We compare our uniformsubdivision to Razor’s adaptive subdivision, while trying to matchthe same quality. To do so, we ensured that for a given subdivisionlevel, none of the final triangles in our method exceeded 8 pixelsin area (matching the Razor quality metric); however, our subdivi-sion amount applies to all rays, whereas Razor is able to decide asubdivision amount per ray regardless of type. While Razor maychoose a higher subdivision amount for some shadow rays, we feelit is unlikely that the amount is vastly different.

4.2 Hardware Configuration

We run all of our results on a system with 9GB of memory. How-ever, our scenes do not utilize much of that memory. For example,the Disney scene (which is the largest) uses approximately 1.1GBfor the control mesh after it has been subdivided once and split intopatches.

Our system has two Core 2 Quad 2.0GHz processors. As notedearlier, when comparing our results to PRMan we use only a singlecore for performance reasons. When comparing our system to itselfor to Razor, we use all 8 cores.

4.3 Ray casting with simple shading

We begin by comparing ray casting performance with only simplelocal shading. As demonstrated in Table 2, packets of rays allow usto perform up to 16.6× faster than a single ray implementation andup to 5.6× faster than a single ray geometry cache. The reason forthis improvement is similar to those in previous BVH approaches:the early hit and the interval culling test reduce the traversal costof the implicit BVH (see Table 1). Note that each traversal step in

our subdivision surface intersection is much more expensive than aregular BVH system would use, so the savings are larger.

As mentioned earlier, the size of the tuples being subdivided cangreatly increase the cost of subdivision. In particular, on a 4-wideSIMD machine, a 4-tuple is particularly well suited to the architec-ture. We show the cost for both 4 and 8-tuples for our ray castingresults to demonstrate the possible optimization for rays that don’tneed more than position information (e.g. shadow and other binaryvisibility queries). It is also interesting to note that for wider SIMDarchitectures, it may be possible to subdivide more parameters thansimply position and a single texture coordinate.

Even for the more expensive subdivision of 8-tuples we are 2.4 −5.6× faster than PRMan’s geometry caching, and about 5 − 15×faster than single ray traversal. In particular, our performance ad-vantage increases for realistically complex scenes, as we outper-form PRMan for the Disney scene by 5.6× and 11× for 8-tuplesand 4-tuples, respectively.

Scene Absolute time speedup overpacket single PRMan single PRMan

4-tupleKillerroo 0.96 5.50 4.23 5.7x 4.4xForest 1.92 5.62 19.43 2.9x 10.1xDisney 3.54 37.70 39.03 10.7x 11.0x

8-tupleKillerroo 1.70 16.40 4.23 9.7x 2.4xForest 3.92 19.47 19.43 5.0x 5.0xDisney 6.98 115.67 39.03 16.6x 5.6x

Table 2: Ray casting performance in seconds per frame for 1Mrays with 4-tuple and 8-tuple subdivision, respectively (using 1core). For both packet and single-ray implementations we use apredefined subdivision level of 5 (1+4) for all scenes to match thevisual quality of the PRMan renderings.

4.4 Ray casting with shadows

Primary rays tend to exhibit higher coherence than other types ofrays. In Table 3, we investigate our performance for secondary rayswith simple shadows from 2 point light sources. The results demon-strate that packets are already beginning to lose some ground inperformance as compared to both single ray and PRMan.

The loss of performance versus single rays is due simply to coher-ence, while the performance loss to PRMan is due to both coher-ence and PRMan’s reuse of its cached geometry for shadow rays.This is the intended use of PRMan’s geometry cache, however, sothis result is expected. For the Disney scene there are still someareas of large coherence due to large control meshes on some of thebuildings. It should be noted, however, that this is a “real world”instance of subdivision surface modeling: large and simple controlmeshes can be used to describe otherwise complicated smooth sur-faces. This allows the packets to remain more competitive with thesingle ray and PRMan approaches for this scene.


Killerroo 4.48 21.43 5.47 4.8x 1.2xForest 9.40 30.54 27.99 3.2x 3.0xDisney 13.35 174.56 53.64 13.1x 4.0x

Table 3: Rendering performance in seconds per frame includingshadows from 2 point lights. For all scenes, shadow rays use 4-tuples for subdivision while primary rays use 8-tuples.

Figure 5: The Disney scene rendered using our adaptive termination criterion. In these examples, we set the maximum subdivision depth to5 with 1 subdivision performed as a model preprocess. With all 8 cores these examples run at 2.67, 3.14, and 4.78 fps, casting primary raysand performing subdivision for 4-tuples. For 8-tuples the scenes render at 1.2 fps, 1.4 fps, and 2.0 fps. By comparison, uniform subdivisionruns at 2.2 and 1.1 fps for 4-tuple and 8-tuple subdivision, respectively.

This particular setup—ray casting with two point lights—is also theonly one that allows for direct comparison to Razor. This scene isthe only one presented in the Razor paper that uses Catmull-Clarksubdivision. While Razor does provide three renderings of thisscene using different quality setups, our system has not yet beenextended for distribution ray tracing so we can only compare to theray casting with shadows case.

We modify our standard test case of 1024× 1024 rays to match theoriginal Razor test of 512 × 512. In Table 4 we demonstrate per-formance for this test for a number of subdivision levels (for com-pleteness). Due to the subdivision performed on input, we believethat a subdivision level of 3 matches the Razor subdivision amountmost closely (based on the visible micropolygon area at level 3).

As compared to Razor [DHW∗07], we are using a similar proces-sor but a lower clock rate (2.0GHz vs 2.66GHz). To take into ac-count this frequency difference, we scale their result to match ourclock. Depending on which level of subdivision is chosen, we areanywhere between 2.7 and 16.8× faster than Razor for this setup.Though the entire comparison is too “apples-and-oranges” to beconclusive, the overall point is that our approach is at least highlycompletive for this rendering style. In particular we note that thisis only one scene and that Razor is designed to be a distribution raytracer not a ray caster with point light shadows.

Level Frames per second speeduppacket Razor

1 13.81 .82 16.8x2 8.69 .82 10.6x3 4.29 .82 5.2x4 2.23 .82 2.7x

Table 4: Ray casting with shadows for the Forest scene used byRazor. The frame size is 512 × 512 using 8-cores with shadowsfrom two point lights. All values are frames per second and theRazor result has been scaled to match our 2.0GHz processor.

4.5 Whitted style ray tracing

While primary and shadow rays are commonly believed to behighly coherent, ray distributions including specular reflections be-have quite differently [Res06; BEL∗07]. While our system does nothave actual production shaders to compare to, we have ported thediffuse/specular model used by Boulos et al. [BEL∗07]. We havecurrently only focused on single bounce reflections, but hope to domore extensive secondary ray comparisons in the future.

As can be seen from Table 5 we outperform PRMan by about1− 2.5×. As expected, PRMan’s multi-resolution geometry cacheperforms better and better in relation to our approach as coherence


Killerroo 7.70 38.08 7.58 4.9x 1.0xForest 21.28 66.99 52.63 3.1x 2.5xDisney 33.88 268.02 66.59 7.9x 2.0x

Table 5: Rendering performance in seconds per frame includingshadows from 2 point lights and 1 bounce reflections. Note thatshadow rays are cast for both primary and reflected hit points.

decreases. Compared to our single-ray implementation, the packetperformance advantage is roughly 3− 8×. This is somewhat lowerthan for primary rays, but clearly indicates that the our approach isnot limited to primary rays only. In particular, packets have an 8×performance advantage over single rays for the Disney scene, in thepresence of both less coherent packets and incredibly fine geometry(adaptive subdivision is not used).

4.6 Adaptive subdivision

In comparing to PRMan and our single ray implementation, wehave only considered uniform subdivision so far. To demonstratethe possible benefit that adaptive subdivision may provide in ad-dition to the speedups already demonstrated, we compare uniformsubdivision to our simple adaptive heuristic for varying subdivisiondepths in Table 6. We have chosen to focus on the Disney scene asit has enough depth variability to allow for a useful demonstration.

ray casting shadowsLevel uniform adaptive speedup uniform adaptive speedup

2 0.04 0.036 1.2x 0.11 0.10 1.1x3 0.09 0.053 1.8x 0.26 0.17 1.5x4 0.22 0.10 2.1x 0.61 0.36 1.7x5 0.45 0.21 2.1x 1.24 0.73 1.7x

Table 6: Uniform vs adaptive subdivision depth for the Disneyscene, using 4-tuple subdivision for ray casting (left half) and raycasting with shadows (right half). The adaptive results correspondto the rendering in Figure 1 and Figure 5 right.

Table 6 also demonstrates the linear increase in rendering time forour subdivision method. This is due to ray tracing’s logarithmicbehavior applied to a scene that grows by a factor of 4× for everysubdivision step. Our simple adaptive criteria seems to regain abouta 2× speedup which suggests it is only performing approximatelyone level of subdivision less than the uniform case on average (i.e. itis choosing the medium level). Following the trend in the previoustables, adding shadows decreases the overall gain to approximately1.7× (while not explicitly shown in the table, adding reflectionsdecreases it further to approximately 1.5×).

If we combine our adaptive rendering speedup with the uniformsubdivision tables already presented against PRMan, our methodis approximately 23.1× faster for ray casting with 4-tuples, 6.8×faster for ray casting with shadows, and 3.0× faster for ray trac-ing with single bounce reflections. The gains over our single rayimplementation are even more pronounced.

As we can see from the chosen levels (Figure 5 right), a large por-tion of the geometry uses the finest level of subdivision (5) match-ing the uniform case. Another reasonable portion uses the mediumlevel (4), and only very distant objects use the coarsest level (3).The crack fixing logic introduces overhead, so our currently mod-est gains are understandable but encouraging.

4.7 Comparison to pre-tessellation

For small models it may be feasible to subdivide the model as apreprocess and directly ray trace the resulting triangles. An obvi-ous question for our on-the-fly scheme is how close performancecompares to ray tracing a pre-tessellated result.

Since each additional level of subdivision quadruples the trianglecount of the pre-tessellated model, we can only compare with arelatively coarse base mesh–the Killerroo. At 12168 base quads,the equivalent number of triangles after 4 levels of subdivision isalready 6.23M triangles (12168 ∗ 44 ∗ 2). Assuming a (rather low)size of 40 bytes per triangle, pre-tessellation for this (rather simple)model would 250 MB storage space; for the Disney scene, it wouldbe roughly 40GB of triangle data alone.

4.7.1 Performance

As can be seen from Table 7, our implementation is slower thantracing a pre-tessellated model. As a reminder, our claim is that ouralgorithm will provide benefits over caching and pre-tessellation inthe future as compute power exceeds bandwidth. Given that we arerunning on a current CPU our lower ray casting performance is notsurprising, since we eventually intersect exactly the same triangles,but have to generate them on the fly. Furthermore, the bandwidthrequired even for the tessellated version of this scene per frame isnot enough at current frame rates to be prohibitive on a modernCPU.

number of Frames per second slowdownLevel triangles tessellated direct

1 97K 10.4 9.87 1.052 398K 6.15 4.99 1.233 1.55M 3.17 2.19 1.444 6.23M 1.75 1.04 1.68

Table 7: Our direct Catmull-Clark subdivision surface ray trac-ing method vs. ray tracing a pre-tessellated model, for the Killerrooscene with ray casting and 4-tuple subdivision. For every subdivi-sion step the number of triangles quadruple. For this setup, ourperformance is always slower but the slowdown is always less than2×.

The BVH over the subdivision control meshes is naturally looserand has more subtree overlap than in the tessellated case. With thisin mind, the performance difference is surprisingly small (5-70%depending on subdivision amount). This is particularly interestingwhen we consider that the memory use for our approach is only asmall fraction of that used for tessellation. For “real” scenes likethe Disney scene, full pre-tessellation would far exceed availablememory.

4.7.2 Off-cache bandwidth

As our algorithm directly intersects a subdivision surface, the onlymodel data required is the initial patch. Because we recurse in adepth-first manner, only a small amount of stack space for “to besubdivided” patches needs to be maintained. Current CPUs canhold this stack mostly in their L2 cache which significantly reducesoff-cache (external memory) bandwidth. In order to measure theoff-cache bandwidth we used the cachegrind cache profiler [Val07]to track L2 cache misses.

Table 8 shows that direct ray tracing requires a very low andconstant off-cache bandwidth. Compared to ray tracing a pre-tessellated scene, it requires 2.4 − 64.7 times less bandwidth (perframe), depending on the subdivision level.

number of Bandwidth (MB) reductionLevel triangles tessellated direct

1 97K 5.2 2.2 2.4x2 398K 18.5 2.2 8.4x3 1.55M 60.4 2.2 27.5x4 6.23M 142.4 2.2 64.7x

Table 8: Required off-cache (main memory) bandwidth for directCatmull-Clark subdivision surface and ray tracing a pre-tessellatedmodel, for the Killerroo scene with ray casting and 4-tuple subdi-vision. For high subdivision levels on-the-fly subdivision is able toreduce the bandwidth by a factor of 64.

On future architectures that have more compute resources thancaching, less bandwidth should translate into a significant perfor-mance advantage. For now, however, a modern CPU memory hier-archy can easily sustain the required bandwidth for this scene evenfor the highest level of subdivision.

4.8 Impact of ray packet size

Like any packet method, reduced coherence will reduce the amor-tization benefits (see Section 4.5 for evidence). As the coherenceis typically related to the number of rays per packet we tested ourmethod using a varying ray packet sizes as shown in Table 9. Likeprevious work [WBS07] we achieve the best performance for a raypacket size of 8× 8 = 64 rays.

packet sizeLevel 4x4 8x8 16x16 32x32

1 .77 (.89) 1.0 (1.0) .94 (.87) .48 (.39)2 .74 (.80) 1.0 (1.0) .95 (.90) .36 (.27)3 .70 (.75) 1.0 (1.0) .97 (.92) .27 (.21)4 .70 (.76) 1.0 (1.0) .99 (.93) .20 (.20)

Table 9: Ray casting (Whitted style ray tracing) performance forvarying ray packet size and subdivision level (Killerroo scene). Allresults are shown to the normalized performance for 8x8 = 64 raysper packet: Having less than 64 rays per packet makes amortizationof subdivision less effective, while performance for larger packetssuffers under reduced ray coherence. For our setup, a size of 8x8rays has been determined to be the most efficient.

5 Summary and Conclusion

We have proposed an approach to ray tracing subdivision surfacesusing on-the-fly tessellation. Whereas other systems like Razor orPRMan amortize the cost for patch subdivisions by caching ge-ometry, we instead use large packets of rays coupled with an ef-ficient traversal algorithm. Our approach is competitive with ge-ometry caching, and in addition allows for all the other advantages

of packet techniques. Consequently, we not only need less memorythan geometry cache approaches, but are also faster than both Razorand PRMan.

For scenes with varying depth complexity, we have proposed anadaptive subdivision method. Though crack fixing adds complexityto the system, for the Disney scene it provides additional speedupsof up to 2.1×. Adaptive subdivision also becomes particularly in-teresting when considering packets of less coherent secondary raysby using a coarser scene representation [CLF∗03; DHW∗07].

Performance-wise, our uniform subdivision approach outperformsboth Razor and PRMan by up to 5.2× and 5.6×, and a single-rayimplementation of the same algorithm by 16.6×. Our adaptive sub-division is even more competitive as it adds an additional factor of2× on top of these results. Compared to pre-tessellated modelswith pre-built acceleration structures we achieve roughly compet-itive performance, but require only a fraction of the storage andbandwidth requirements. We also demonstrated our approach for acomplex film scene which would not fit into memory if tessellated.

The compute power of future graphics architectures, regardless ofbeing more CPU or GPU oriented, is expected to grow much fasterthan their memory bandwidth. This makes it essential to reduce theoff-chip bandwidth to achieve optimal utilization. Our approach iswell suited for these architectures as it requires only a small amountof on-chip memory per core.

Future Work. Potential extensions to our system abound. Weare particularly interested in supporting displacement maps, whichare important for real-world production rendering. More advancedmethods for adaptive subdivision would further reduce the numberof required subdivisions per patch. Apart from that, the biggestissue for supporting real-world rendering is support for complexshaders and textures. Using packets of secondary rays for dynamicambient occlusion and indirect diffuse lighting, would further stressthe coherence needs of our method.

Acknowledgments

We would like to thank Walt Disney Animation Studios for provid-ing us with a scene from Disney’s Meet the Robinsons. For model-ing and providing us the Razor scene we would like to thank JefferyA. Williams, Headus (Metamorphosis), Phil Dench, Martin Rezard,Jonathan Dale, and the DAZ studio team.

ReferencesBOULOS S., EDWARDS D., LACEWELL J. D., KNISS J., KAUTZ J.,

SHIRLEY P., WALD I.: Packet-based Whitted and Distribution RayTracing. In Proceedings of Graphics Interface 2007 (May 2007).

BENTHIN C.: Realtime Ray Tracing on current CPU Architectures. PhDthesis, Saarland University, 2006.

BENTHIN C., WALD I., SLUSALLEK P.: Interactive Ray Tracing of Free-Form Surfaces. In Proceedings of Afrigraph (November 2004), pp. 99–106.

CATMULL E., CLARK J.: Behavior of recursive division surfaces nearextraordinary points. In Computer Aided Design 10(6) (1978), pp. 350–355.

COOK R. L., CARPENTER L., CATMULL E.: The REYES Image Render-ing Architecture. Computer Graphics (Proceedings of ACM SIGGRAPH1987) (July 1987), 95–102.

CHRISTENSEN P. H., FONG J., LAUR D. M., BATALI D.: Ray tracing forthe movie ’Cars’. In Proc. IEEE Symposium on Interactive Ray Tracing(2006), pp. 1–6.

CHRISTENSEN P. H., LAUR D. M., FONG J., WOOTEN W. L., BATALID.: Ray Differentials and Multiresolution Geometry Caching for Distri-bution Ray Tracing in Complex Scenes. In Computer Graphics Forum

(Eurographics 2003 Conference Proceedings) (September 2003), Black-well Publishers, pp. 543–552.

COOK R., PORTER T., CARPENTER L.: Distributed Ray Tracing. Com-puter Graphics (Proceeding of SIGGRAPH 84) 18, 3 (1984), 137–144.

DJEU P., HUNT W., WANG R., ELHASSAN I., STOLL G., MARK W. R.:Razor: An Architecture for Dynamic Multiresolution Ray Tracing. Tech.Rep. UTCS TR-07-52, University of Texas at Austin Dept. of Comp.Sciences, Jan. 2007. Conditionally accepted to ACM Transactions onGraphics.

DEROSE T. D., KASS M., TRUONG T.: Subdivision surfaces in characteranimation. In Proceedings of SIGGRAPH 98 (July 1998), ComputerGraphics Proceedings, Annual Conference Series, pp. 85–94.

DOO D., SABIN M.: Behavior of recursive division surfaces near extraor-dinary points. In Computer Aided Design 10(6) (1978), pp. 356–360.

KOBBELT L., DAUBERT K., SEIDEL H.-P.: Ray Tracing of SubdivisionSurfaces. Proceedings of the 9th Eurographics Workshop on Rendering(1998), 69–80.

LOOP C.: Smooth subdivision surfaces based on triangles. Master’s thesis,University of Utah, 1987.

MARTIN W., COHEN E., FISH R., SHIRLEY P.: Practical Ray Tracing ofTrimmed NURBS Surfaces. Journal of Graphics Tools: JGT 5 (2000),27–52.

MUELLER K., TECHMANN T., FELLNER D.: Adaptive Ray Tracing ofSubdivision Surfaces. Computer Graphics Forum (Proceedings of Euro-graphics ’03) (2003), 553–562.

OLIVER ABERT MARKUS GEIMER S. M.: Direct and Fast Ray Tracingof NURBS Surfaces. In Proceedings of the 2006 IEEE Symposium onInteractive Ray Tracing (2006).

PHARR M., KOLB C., GERSHBEIN R., HANRAHAN P.: Rendering Com-plex Scenes with Memory-Coherent Ray Tracing. Computer Graphics31, Annual Conference Series (Aug. 1997), 101–108.

PARKER S. G., MARTIN W., SLOAN P.-P. J., SHIRLEY P., SMITS B. E.,HANSEN C. D.: Interactive ray tracing. In Proceedings of Interactive3D Graphics (1999), pp. 119–126.

RESHETOV A.: Omnidirectional ray tracing traversal algorithm for kd-trees. In Proceedings of the 2006 IEEE Symposium on Interactive RayTracing (2006), pp. 57–60.

RESHETOV A., SOUPIKOV A., HURLEY J.: Multi-Level Ray Tracing Al-gorithm. ACM Transaction on Graphics 24, 3 (2005), 1176–1185. (Pro-ceedings of ACM SIGGRAPH 2005).

STOLL G., MARK W. R., DJEU P., WANG R., ELHASSAN I.: Razor:An Architecture for Dynamic Multiresolution Ray Tracing. Tech. Rep.06-21, University of Texas at Austin Dep. of Comp. Science, 2006.

VALGRIND TOOL SUITE: Cachegrind. http://valgrind.org/info/tools.html,2007.

WALD I., BOULOS S., SHIRLEY P.: Ray Tracing Deformable Scenes usingDynamic Bounding Volume Hierarchies. ACM Transactions on Graph-ics 26, 1 (2007), 1–18.

WHITTED T.: An Improved Illumination Model for Shaded Display. Com-munications of the ACM 23, 6 (1980), 343–349.

WALD I., IZE T., KENSLER A., KNOLL A., PARKER S. G.: Ray TracingAnimated Scenes using Coherent Grid Traversal. ACM Transactions onGraphics 25, 3 (2006), 485–493. (Proceedings of ACM SIGGRAPH).

WALD I., SLUSALLEK P., BENTHIN C., WAGNER M.: Interactive Render-ing with Coherent Ray Tracing. Computer Graphics Forum 20, 3 (2001),153–164. (Proceedings of Eurographics).

ZORIN D., SCHROEDER P., DEROSE T., KOBBELT L., LEVIN A.,SWELDENS W.: Subdivision for modeling and animation. SIGGRAPHcourse notes, 2000.

ZORIN D., SCHRODER P., SWELDENS W.: Interpolating subdivision formeshes with arbitrary topology. In Computer Graphics (1996), vol. 30,pp. 189–192.

Documents

Packet-based Ray Tracing of Catmull-Clark Subdivision Surfaces