
Binned SAH Kd-Tree Construction on a GPU

Piotr Danilewski 1, Stefan Popov 1, Philipp Slusallek 1,2,3

1 Saarland University, Saarbrücken, Germany
2 Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
3 Intel Visual Computing Institute, Saarbrücken, Germany

Contact: [email protected]

June 2010

Abstract

In ray tracing, kd-trees are often regarded as the best acceleration structures in the majority of cases. However, due to their large construction times they have been problematic for dynamic scenes. In this work, we try to overcome this obstacle by building the kd-tree in parallel on many cores of a GPU. A new algorithm ensures close to optimal parallelism during every stage of the build process. The approach uses the SAH and samples the cost function at discrete (bin) locations.

This approach constructs kd-trees faster than any other known GPU implementation, while maintaining competitive quality compared to serial high-quality CPU builders. Further tests have shown that our scalability with respect to the number of cores is better than that of other available GPU and CPU implementations.

1 Introduction

Ray tracing is a method of generating realistic-looking images by mimicking light transport, which involves traversing a ray through the scene and finding the nearest intersection point with the scene triangles. A naive algorithm iterating through all triangles in the scene in order to find the intersection point would be prohibitively slow. That is why various acceleration structures have been designed: grids, octrees, bounding volume hierarchies (BVH), and kd-trees, to name a few. In most cases, kd-trees are the best in terms of ray tracing speedup, but are computationally expensive to construct.

To overcome this problem, we are interested in building the kd-tree in parallel on a GPU using the general-purpose CUDA programming language. A GPU is a powerful, highly parallel machine and, in order to be used effectively, the program must spawn tens of thousands of threads and distribute the work evenly among them. That is the main challenge for the design of our algorithm.

Our kd-tree algorithm builds the tree in a top-down, breadth-first fashion, processing all nodes of the same level in parallel, with each thread handling one or more triangles from a node. The program consists of several stages, each handling nodes of different primitive counts and distributing work differently among all cores of the GPU in order to keep them all evenly occupied.

At all stages we use the Surface Area Heuristic to find the splitting planes and never revert to simpler approaches, like splitting at the median object. While these simpler approaches are faster and make it easier to distribute the work evenly between processors, they also produce trees of lower quality.

The result of our work is a highly scalable implementation which can construct the kd-tree faster than any other currently existing program, while its quality remains comparable to the best non-parallel builders.

In the following sections we first focus on previous approaches to kd-tree construction. We describe the Surface Area Heuristic and how the kd-tree construction algorithm is mapped onto multiprocessor architectures. Later we present our algorithm, followed by a more detailed look at the practical implementation on a GPU using the CUDA environment, with primary focus on work distribution, synchronisation, and communication between processors. Finally, we put our program to the test and present the results both in terms of construction speed and tree quality.

2 Previous Work

Originally kd-trees were designed for fast searching of points in k-dimensional space [Ben75] (hence the name), but they quickly found their way into the field of ray tracing [Kap85].

In this section we first describe the most common approach to kd-tree construction for ray tracing, which uses the Surface Area Heuristic (SAH). Later we describe various attempts to parallelise the algorithm.

2.1 The Surface Area Heuristic

Kd-trees are typically built in a top-down manner. Each node is recursively split by an axis-aligned plane, creating two child nodes. The position of the plane is crucial for the quality of the whole tree. MacDonald and Booth [MB90] defined a good estimate of the split quality by means of a cost function:

f(x) = C_t + C_i · (SA_L(x) · N_L(x) + SA_R(x) · N_R(x)) / SA_parent    (1)

where the particular components are:

• C_t — cost of traversing an inner node,

• C_i — cost of intersecting a triangle,

• SA_L, SA_R — surface area of the left/right bounding box,

• N_L, N_R — number of primitives on the left/right side,

• SA_parent — surface area of the parent bounding box.

The above function is piecewise linear in x, with breakpoints only where x coincides with a lower or upper bound of a triangle. Therefore it suffices to evaluate the cost function at those points to find the minimum. This approach is known as exact Surface Area Heuristic construction (exact SAH) and it builds the kd-tree in O(n log n) on a sequential machine, as shown by Wald and Havran [WH06].

A significantly faster approach is to compute the cost at a few sampled positions only, which reduces quality only slightly. This sampled SAH approach was first proposed by MacDonald and Booth [MB90]. A more extensive study has been done by Hunt et al. [HMS06]. Popov et al. described an efficient algorithm for computing the cost using bins (hence binned SAH) in [PGSS06].

The SAH can also be used to determine when not to split a node any further. The cost of a node which is not split is:

f_0 = C_i · N    (2)

If f_0 < f(x) for all x in the domain, we do not split. The values C_i and C_t influence the quality of the produced tree and its construction time. Optimal values depend heavily on the ray tracer, hardware, and algorithms used.
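For illustration, Equations 1 and 2 can be sketched in Python; the constants Ct = 1.0 and Ci = 1.5 and the `surface_area` helper are assumptions for this example, not values prescribed by the paper:

```python
def surface_area(box):
    # box = ((x0, y0, z0), (x1, y1, z1)); total area of the six faces
    (x0, y0, z0), (x1, y1, z1) = box
    dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def split_cost(sa_left, n_left, sa_right, n_right, sa_parent, Ct=1.0, Ci=1.5):
    # Equation 1: traversal cost plus expected intersection cost of both children
    return Ct + Ci * (sa_left * n_left + sa_right * n_right) / sa_parent

def no_split_cost(n, Ci=1.5):
    # Equation 2: cost of intersecting all N primitives in a leaf
    return Ci * n

def should_split(best_split_cost, n):
    # a node becomes a leaf when no split candidate beats the no-split cost
    return best_split_cost < no_split_cost(n)
```

In a builder, `split_cost` would be evaluated for every split candidate and the minimum compared against `no_split_cost` before creating children.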

2.2 Towards Parallelism

In recent years we have observed a major change in hardware development: the speed of a single computing unit no longer grows as fast as in the past; instead we get more and more cores within the processor. This new architecture requires adjustments to programs to fully take advantage of the cores. The most extreme example is a GPU, which is capable of running thousands of threads in parallel.

One notable approach to parallel kd-tree construction was taken by Shevtsov et al. [SSK07], who implemented a binned SAH kd-tree builder for multicore CPUs. In their work, the whole scene is first partitioned into p clusters, where p is the number of cores. Each cluster has approximately the same number of primitives. Later on, each cluster is processed independently, in parallel to all other clusters.

Popov et al. [PGSS06] also introduced parallelism in their implementation, but it is used only in the deeper stages of the kd-tree construction, with each processor handling a different subtree. In the initial steps, with only few nodes, the construction is done sequentially on a single CPU. On the other hand, they use the SAH on all levels of the tree. Choi et al. [CKL∗09] gave a completely parallel SAH algorithm that includes the initial steps as well.

The above approaches work well as long as the number of cores is relatively small. While they map well to multicore CPUs, they do not scale well enough to saturate a GPU.

The first and, to our knowledge, only algorithm for GPU kd-tree construction was given by Zhou et al. [ZHWG08]. Their algorithm is divided into two stages: big-node and small-node. During the big-node stage either a part of the empty space is cut off, or a spatial median split is chosen. Only later, when there are fewer than 64 primitives per node, is the exact SAH algorithm used.

To sum up, so far nearly every algorithm for kd-trees is either sequential at the beginning or sacrifices tree quality by using median splits at the top levels (or both). In order to obtain good kd-trees it is important to use the SAH at the top levels of the tree as well. Using median splits instead may reduce rendering performance significantly [Hav00].

3 The Algorithm

We construct the kd-tree in a top-down, breadth-first fashion. The algorithm works in steps, one for each level of the tree. Each step consists of 3 basic phases:

• Compute-Cost phase. Here we search for the optimal split plane using the sampled SAH for all active nodes (that is, for all leaf nodes that need to be processed). We also compute how many primitives are on each side of the plane and how many primitives must be duplicated.


• Split phase. As the name suggests, we then split the node and create two children. In this phase we allocate memory and rearrange triangles.

• Triangle-splitter phase. Some triangles may partially lie on both sides of the plane. In order to increase the quality of the subsequent splits, we compute the exact size of two bounding boxes, each encapsulating the respective part of the triangle (Figure 1). This is an important step which may increase the resulting rendering performance by over 30% [Hav00].

Figure 1: 2D example of a triangle split. The left bounding box is spanned over {t1, t2, p23, p31}, the right bounding box over {p23, t3, p31}.

During the Compute-Cost phase and the Split phase we perform all operations on the axis-aligned bounding boxes of the triangles only. The Triangle-splitter phase is the only one where we access vertices directly. That is why we keep the triangle bounds in separate arrays, and the first two phases work on those arrays only.

To keep a good utilisation of the GPU, the whole program is divided into several stages, each with a different implementation of the phases, suited for different triangle counts in a node and different node counts in a tree level. If a given node does not meet the requirements for a given stage, it is scheduled for later processing. The implementation and differences between the stages are discussed in Section 4.

3.1 Compute-Cost

The first task in each step of the construction is to find an optimal split plane. We use a binned SAH approach with b = 33 bins and c = b − 1 = 32 split candidates. For each split candidate we need to compute the surface area cost as given by Equation 1, and to that end we need to know how many primitives are on each side of the plane. We follow the algorithm introduced by Popov et al. in [PGSS06].

We are interested in the positions of the bounding boxes of all triangles along a given axis — the bounding box's lower and higher coordinates, which we refer to as lower and higher events.

We create 2 arrays of b bins (see Figures 2 and 3, lines 6-7). One bin array (L) collects lower events and the other (H) collects higher events (lines 10-17). Bin i stores an event occurring between the (i−1)-th and the i-th split candidate.

The next step is to compute a suffix sum on the L array and a prefix sum on the H array. After that, for a given split candidate i, we can immediately compute the number of primitives entirely on the left and right side, and the number of primitives which are partially on both sides (lines 25-27), using the following formulas:

N_only_left = H[i]
N_only_right = L[i + 1]
N_both = N_total − L[i + 1] − H[i]
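A serial Python sketch of this counting scheme (the bin mapping and test data are illustrative; the real kernel does the same with atomic increments and parallel scans):

```python
def binned_counts(boxes, lo, hi, b=33):
    # boxes: list of (low, high) extents along one axis; b bins, b-1 split candidates
    width = (hi - lo) / b
    choose_bin = lambda x: min(int((x - lo) / width), b - 1)
    L = [0] * b  # lower events per bin
    H = [0] * b  # higher events per bin
    for low, high in boxes:
        L[choose_bin(low)] += 1
        H[choose_bin(high)] += 1
    # suffix sum on L, prefix sum on H
    for i in range(b - 2, -1, -1):
        L[i] += L[i + 1]
    for i in range(1, b):
        H[i] += H[i - 1]
    return L, H

def side_counts(L, H, n_total, i):
    # split candidate i is the boundary between bin i and bin i+1
    n_left_only = H[i]       # boxes whose higher event is left of the plane
    n_right_only = L[i + 1]  # boxes whose lower event is right of the plane
    n_both = n_total - n_left_only - n_right_only
    return n_left_only, n_right_only, n_both
```

After the two scans, every one of the c candidates can be evaluated in constant time, which is what makes the binned approach cheap.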


count:
L: 1 2 3 1 0 0 1 1 2 0
H: 0 1 1 2 1 1 1 1 1 2

after suffix sum on L / prefix sum on H:
L: 11 10 8 5 4 4 4 3 2 0
H: 0 1 2 4 5 6 7 8 9 11

Figure 2: Example run of the binning algorithm: we collect the lower and higher bounds of the triangles in the L and H arrays. This is followed by a suffix sum on L and a prefix sum on H. At the end we can instantly get the number of primitives on each side of every split candidate.

In addition, in this phase we compute the no-split cost (Equation 2). We may also decide that the node contains too few triangles for the given stage, in which case we will process it only later, at one of the next stages (lines 2-4).

3.2 Split

At this point we know for every node if and where we want to split it, and how many primitives are located on each side of the split plane. In this phase we want to rearrange the triangle bounding boxes so that all entries of a given node always occupy a single continuous part of memory. We do not rearrange triangle vertices, as they are accessed only for a few primitives and only in the Triangle-splitter phase.

 1 function computeCost(node)
 2   if (node.triangleCount < minimumNodeSize)
 3     result.{x,y,z}.cost := infinity
 4     return noSplit
 5   for dim in {x,y,z} do
 6     L[0..33] := 0
 7     H[0..33] := 0
 8     synchronize
 9     triangleIdx := node.trianglesBegin + threadIdx
10     while (triangleIdx < node.trianglesEnd)
11       lowBound := BBox[triangleIdx].dim.low
12       binIdx := chooseBin(lowBound)
13       atomic { L[binIdx] += 1 }
14       highBound := BBox[triangleIdx].dim.high
15       binIdx := chooseBin(highBound)
16       atomic { H[binIdx] += 1 }
17       triangleIdx += threadCount
18     synchronize
19     suffixSum(L)
20     prefixSum(H)
21     cost[0..32] := SAH(L, H, 0..32)  // Equation 1
22     i := pick { i ∈ {0..32} : cost[i] is minimal }
23     result.dim.cost := cost[i]
24     result.dim.splitPosition := splitPosition(i)
25     result.dim.leftCount := H[i]
26     result.dim.rightCount := L[i+1]
27     result.dim.bothCount := T − L[i+1] − H[i]
28   result.noSplit.cost := SAH(noSplit)  // Equation 2
29   return pick { i ∈ {x, y, z, noSplit} : result.i.cost is minimal }

Figure 3: Compute-Cost phase

By performing a parallel prefix sum on the triangle counts of each node we can allocate the memory where we will move the entries. For each node three memory locations (base pointers) are computed — BL for left primitives, BR for right primitives, and BD for primitives falling into both child nodes. These references to primitives need to be duplicated and their bounding boxes recomputed (Section 3.3).

If a node is not to be split, the data is moved to the auxiliary array to be processed in the next stage (Figure 4, lines 4-8). If we do split the node, we move the data to the destination array using the base pointers.


 1 function split(node, dim, splitPlane, BA, BL, BR, BD)
 2   triangleIdx := node.trianglesBegin + threadIdx
 3   p := splitPlane
 4   if (dim = noSplit)
 5     while (triangleIdx < T)
 6       auxiliary[BA + triangleIdx] := source[triangleIdx]
 7       triangleIdx += threadCount
 8     return
 9   while (triangleIdx < node.trianglesEnd)
10     OffL[0..threadCount−1] := 0
11     OffR[0..threadCount−1] := 0
12     OffD[0..threadCount−1] := 0
13     low := source[triangleIdx].dim.low
14     high := source[triangleIdx].dim.high
15     if (low < p ∧ high < p) type := L
16     if (low < p ∧ high ≥ p) type := D
17     if (low ≥ p ∧ high ≥ p) type := R
18     Off_type[threadIdx] := 1
19     synchronize
20     prefixSum(OffL)
21     prefixSum(OffR)
22     prefixSum(OffD)
23     synchronize
24     destination[B_type + Off_type[threadIdx] − 1] := source[triangleIdx]
25     BL += OffL[threadCount−1]
26     BR += OffR[threadCount−1]
27     BD += OffD[threadCount−1]
28     triangleIdx += threadCount

Figure 4: Split phase

To correctly copy the primitives we introduce 3 offset arrays: Off_L (left), Off_R (right), and Off_D (duplicated) (Figure 4, lines 10-12). Each thread t handles one primitive. Depending on the location of the primitive, one of the cells Off_L[t], Off_R[t], or Off_D[t] is set to 1, the other two are set to 0 (line 18). Finally we perform a prefix sum on the arrays (lines 20-22) and move the bounding box data to the cell indexed by the base pointer incremented by the offset (line 24).
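Serially, the classification and prefix-sum compaction of the Split phase can be sketched as follows (helper names are ours; the real kernel performs the scans in shared memory, one per destination array):

```python
def exclusive_prefix_sum(flags):
    # returns the running totals before each element, plus the grand total
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def split_compact(boxes, plane):
    # classify each interval (low, high): 'L' entirely left of the plane,
    # 'R' entirely right, 'D' straddling (goes to both children)
    kind = ['L' if high < plane else 'R' if low >= plane else 'D'
            for low, high in boxes]
    out = [None] * len(boxes)
    base = 0                          # running base pointer
    for k in 'LDR':                   # contiguous layout: left, duplicated, right
        flags = [1 if t == k else 0 for t in kind]
        off, total = exclusive_prefix_sum(flags)
        for i, t in enumerate(kind):
            if t == k:
                out[base + off[i]] = boxes[i]  # base pointer + per-element offset
        base += total
    return out, kind
```

Each destination address is a base pointer plus the element's scan offset, so no two entries ever collide, which is exactly what the parallel version needs.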

3.3 Triangle-Splitter

The last phase is dedicated to handling triangles that have been intersected by the split plane. Since only a small portion of the triangles is intersected (usually around √N), we first need to assign threads to the triangles in question. This can be achieved by a single parallel prefix sum over the number of middle triangles of each active node, computed in the Compute-Cost phase (Figure 3, line 27).

Consider the triangle (t1, t2, t3) which is intersected by a plane (Figure 1). The algorithm projects a line through each pair of vertices (t1, t2), (t2, t3) and (t3, t1) and finds the intersection points with the split plane (p12, p23, p31 respectively). If an intersection point does not lie between the triangle's vertices it is discarded and not considered further. We then span two bounding boxes around the remaining intersection points and around the vertices which are on the respective side of the split plane. Finally, the obtained boxes are intersected with the previous bounding box, which could already be smaller due to previous splits of the triangle.

In some rare cases the resulting bounding box may not be the tightest box covering the triangle fragment; however, our approach is less demanding on memory, simpler to compute, and maps better to the CUDA architecture than more precise algorithms, e.g. the Sutherland-Hodgman algorithm [SH74].
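A 2D Python sketch of this box computation, following the description above rather than the actual CUDA code (the function name and interface are assumptions):

```python
def clip_boxes(tri, axis, p):
    # tri: tuple of vertices; returns (left_box, right_box), each as
    # a (min_corner, max_corner) pair around the triangle parts
    left_pts, right_pts = [], []
    n = len(tri)
    for i in range(n):
        a, b = tri[i], tri[(i + 1) % n]
        (left_pts if a[axis] < p else right_pts).append(a)
        # edge crosses the plane: keep the intersection point for both sides
        if (a[axis] < p) != (b[axis] < p):
            t = (p - a[axis]) / (b[axis] - a[axis])
            q = tuple(a[k] + t * (b[k] - a[k]) for k in range(len(a)))
            left_pts.append(q)
            right_pts.append(q)
    def bbox(pts):
        dims = range(len(pts[0]))
        return (tuple(min(c[k] for c in pts) for k in dims),
                tuple(max(c[k] for c in pts) for k in dims))
    return bbox(left_pts), bbox(right_pts)
```

In the builder each resulting box would additionally be intersected with the triangle's current (possibly already shrunk) bounding box.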

4 Implementation details

So far we have discussed the general algorithm of our binned SAH kd-tree construction. Now we focus on its efficient implementation in CUDA. CUDA is a C-like programming environment [NBGS08]; code written in it can be compiled and run on all recent NVIDIA graphics hardware.

Execution of CUDA code is split into blocks, with the GPU usually capable of running several of them in parallel in MIMD fashion. Each block may run independently of all others, and communication between blocks is slow, handled through main GPU memory. Each block consists of several threads, all of which are executed on a single Stream Multiprocessor (SM) of the GPU. Communication and synchronisation between the threads of a single block is much faster thanks to additional on-chip memory and a barrier primitive. Threads of a block are bundled into 32-thread groups called warps. All threads of the same warp execute in SIMD fashion.

In our implementation we always launch as many blocks as can run in parallel on the GPU. Each block, after performing the task it was assigned, checks if more nodes need to be processed and, if that is the case, restarts with the new input data. This idea of persistent blocks follows the concept of persistent threads coined by Aila et al. in [AL09].

4.1 Program stages

The efficiency of the different parallel implementations of the algorithm's components depends on how many primitives the nodes contain and how many nodes there are to process. To reach maximum speed our program consists of five different implementations and performs the kd-tree construction in five stages: huge, big, medium, small, and tiny (see Table 1).

• The Huge stage is used at the beginning of the construction, when very few nodes are processed but these contain many triangles. At this stage several blocks handle a single node.

• The Big stage is used when there are enough nodes to saturate the whole GPU with every block working on a different node.

• At the Medium stage we launch smaller blocks handling smaller nodes. Because of the reduced resource demand of each block, more of them can run in parallel on a single multiprocessor.

• Small stage blocks process 4 nodes at once. Each node is handled by a single warp (32 threads).

• The Tiny stage handles very small nodes consisting of only a few primitives. At this stage an exact SAH approach is used and single blocks handle several nodes at a time.

4.2 Big stage

At the huge and big stages we are interested in nodes consisting of at least M = 512 primitives (see Table 1). This particular value is the maximal number of threads that a block can hold. We first focus on the big stage, which is simpler in its construction than the huge stage.

4.2.1 Big Compute-Cost kernel

For the Compute-Cost kernel, 3 blocks handle a single node, with each block binning along only one of the axes, so that the main loop (Figure 3, lines 5-27) is distributed between the blocks. Only the last instruction (line 29) requires combining the results, which is achieved by moving that task to a separate kernel.

A major performance hazard of the Compute-Cost algorithm, as described in Section 3.1, is the atomic operations (Figure 3, lines 13 and 16). In the worst case, if all events fall into the same bin, this may lead to a complete serialisation of the execution.


Stage    Node size   th/bl (CC)   th/bl (Split)   bl/SM   N_bl   N_SM
Huge     >512        512          512             2       <1     ≤1
Big      >512        512          512             2       1      2
Medium   33-512      96           128             8       1      8
Small    33-512      128          128             8       4      32
Tiny     ≤32         512          512             2       >14    >28

Table 1: Configuration differences between the stages. "Node size" is the number of triangles that a node must contain. "th/bl" is the number of threads in each block of the CC (Compute-Cost) and Split phases. "bl/SM" is the number of blocks that can run in parallel on a single Stream Multiprocessor (SM). N_bl is the number of nodes a single block handles in parallel, and N_SM is the number of nodes processed in parallel by a single SM on a GTX 285.

To reduce the number of collisions we introduce an expanded version of the array, consisting of W = 32 counters per bin, where W is the size of a warp — the CUDA SIMD unit. With the expanded bin array we are guaranteed not to have any collisions between threads of the same warp.

Due to memory limitations on the multiprocessor chip, we keep only one expanded array. First we read all lower events (Section 3.1) in the loop (Figure 3, lines 11-13) and store them in the expanded array. Then we compress the array to the simple one-counter-per-bin representation, copy the results, and reuse the expanded array for the higher events (lines 14-16).
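The idea can be illustrated serially: each SIMD lane owns one counter column per bin, so warp-mates never collide, and the columns are summed during compression (illustrative Python, no real atomics involved):

```python
W = 32  # warp width
B = 33  # number of bins

def bin_events(events, choose_bin):
    # expanded array: W counters per bin, one column per SIMD lane
    expanded = [[0] * W for _ in range(B)]
    for lane, e in enumerate(events):
        # threads of one warp write to different columns -> no intra-warp collisions
        expanded[choose_bin(e)][lane % W] += 1
    # compress back to one counter per bin
    return [sum(col) for col in expanded]
```

The trade-off is the 32-fold memory cost of the expanded array, which is why only one such array fits and must be reused for the higher events.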

4.2.2 Big Split kernel

For the big split phase we assign exactly one block per node, consisting of M = 512 threads. Because of the block size, no element of the offset arrays can exceed M after the prefix sum has been computed (Figure 4, after line 22). With that in mind we can use a single offset array of 32-bit integers, with every number holding three 10-bit integer values, one for each of the Off arrays.

This trick not only reduces memory requirements but also simplifies the code, as only one prefix sum must be computed (Figure 5). The prefix sum call at line 15 computes the desired values for all 3 components at once, without any interference between them.
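The packing trick can be illustrated in Python; the shifts 0/10/20 match lines 10-12 of Figure 5 (function names are ours):

```python
MASK = (1 << 10) - 1               # each counter occupies 10 bits
SHIFT = {'L': 0, 'D': 10, 'R': 20}  # lane for left / duplicated / right counts

def packed_prefix_sum(kinds):
    # one inclusive prefix sum over packed words computes all three
    # offset arrays at once; 10-bit fields cannot overflow for <= 512 threads
    packed, run = [], 0
    for k in kinds:
        run += 1 << SHIFT[k]
        packed.append(run)
    return packed

def unpack(word, kind):
    # recover one of the three counters from a packed word
    return (word >> SHIFT[kind]) & MASK
```

Since the three fields never carry into each other (each count stays below 1024), a single integer addition updates all three counters simultaneously.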

4.3 Huge stage

There is one major difference between the otherwise similar big and huge stages. At the huge stage we have very few nodes to process, so in order to keep all multiprocessors busy we want to assign several blocks to a single active node. Let P denote the number of blocks of the huge stage kernels that can run in parallel at a time on a given GPU, taking into account their thread counts and resource usage (Table 1). Let x := P/N, where N is the number of active nodes. To best utilise the capability of the GPU, we launch C := ⌈x/3⌉ blocks per axis of a node for the Compute-Cost phase and S := ⌈x⌉ blocks per node for the Split phase.
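As a worked example with hypothetical numbers, P = 30 resident blocks and N = 4 active nodes give x = 7.5, so C = 3 Compute-Cost blocks per axis and S = 8 Split blocks per node:

```python
import math

def huge_stage_blocks(P, N):
    # P: blocks that can run concurrently on the GPU; N: active nodes
    x = P / N
    C = math.ceil(x / 3)  # Compute-Cost blocks per axis (3 axes per node)
    S = math.ceil(x)      # Split blocks per node
    return C, S
```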


 1 function bigSplit(node, dim, splitPlane, BA, BL, BR, BD)
 2   triangleIdx := node.trianglesBegin + threadIdx
 3   p := splitPlane
 4   bitMask := 2^10 − 1
 5   if (dim = noSplit) [...]
 6     return
 7   while (triangleIdx < node.trianglesEnd)
 8     low := source[triangleIdx].dim.low
 9     high := source[triangleIdx].dim.high
10     if (low < p ∧ high < p) type := L, shift := 0
11     if (low < p ∧ high ≥ p) type := D, shift := 10
12     if (low ≥ p ∧ high ≥ p) type := R, shift := 20
13     offset[threadIdx] := 1 shl shift
14     synchronize
15     prefixSum(offset)
16     synchronize
17     addr := (offset[threadIdx] shr shift) & bitMask
18     destination[B_type + addr − 1] := source[triangleIdx]
19     last := offset[threadCount−1]
20     BL += last & bitMask
21     BD += (last shr 10) & bitMask
22     BR += (last shr 20) & bitMask
23     triangleIdx += threadCount

Figure 5: Split phase with only one offset array

Since many blocks handle a single node, they need to communicate. Unfortunately, this process is expensive, so we want to minimise the amount of exchanged data.

4.3.1 Huge Compute-Cost kernel

The main while loop of the Huge Compute-Cost kernel (Figure 3, lines 10-17) can run completely independently in the blocks working on the same node. We just make sure, by correct indexing, that each triangle is accessed by exactly one thread across all blocks.

At the synchronisation barrier (line 18) the contents of the L and H arrays are stored to global GPU memory, and then all the data is read by the last block to reach this point in the program. The prefix sums and the following cost computation (lines 19-29) are performed by that single block.

4.3.2 Huge Split kernel

In the Split kernel we must ensure that two different blocks never copy triangle bounds to the same memory. To that end, the base pointers (BL, BR, BD) (Figure 4) are stored in the global memory of the GPU, which can be accessed by all blocks. At each iteration of the main loop of the kernel, after the prefix sum has been computed (right after line 22), the block atomically increments the global base pointers. As a result, the correct amount of memory is reserved exclusively for this block.

4.4 Medium stage

It would be inefficient to use big stage blocks, consisting of 512 threads, for nodes having fewer than 512 triangles. That is why, at the medium stage, we want to launch more, smaller blocks, so that a single multiprocessor can handle more nodes in parallel (Table 1). This imposes harder memory constraints on each block.

As a result, the Compute-Cost kernel can no longer use the expanded bin arrays (Section 4.2.1). Only 3 counters per bin are used, which only partially reduces atomic conflicts and may lead to higher execution serialisation. Our experiments have shown that this is a good tradeoff for being able to process more nodes in parallel.

Because the Split block now consists of only 128 threads (Table 1), that is the maximum value an entry of the Off arrays can reach. Therefore we use 8-bit offset counters bundled into one 32-bit integer, similarly to the Big Split kernel (Figure 5), but we replace the bitwise operations with type casts.

4.5 Small stage

After a few iterations of the medium stage, the active nodes become even smaller and we want to handle even more of them at once in parallel, without any threads being idle. That is why at the small stage we use a single warp per node. Due to hardware limitations we cannot launch more than 8 blocks on each Stream Multiprocessor; that is why the blocks still consist of 128 threads — 4 warps, but each warp works independently of all other warps.

In addition to the higher parallelism, this allows us to get rid of all barriers in the kernel, because all inter-thread communication is restricted to a single SIMD unit, which is always implicitly synchronised.

4.6 Tiny stage

In the tiny stage we process nodes of size up to W = 32 triangles. It would be a waste of resources to dedicate a whole warp to a single node, especially when such nodes can contain only a few primitives. That is why at the tiny stage we try to pack as many nodes as possible into a single M = 512-thread block, regardless of warp boundaries, while keeping a one-to-one mapping between threads and triangles.

There are M triangles assigned to each block, one per thread, but we overlap the blocks with each other by W primitives (Figure 6). This configuration ensures that each node, consisting of at most W triangles, can be handled entirely by one of the blocks, with minimal impact on the overall performance.

Since there are only a few primitives in each node, using a binned algorithm would introduce a needlessly big overhead coming from the bins themselves. Secondly, as Hunt et al. showed in [HMS06], the cost function over the domain of these small nodes is much less smooth and the approximation less effective. That is why we use an exact SAH approach, similar to what is done in [HMS06]. Each thread loads the triangle's lower and higher events and treats them as split candidates. Then each thread counts the number of other events on each side of the split plane in linear time.

At the split phase, a single block handles several nodes, as was done in ComputeCost. We use a single offset array for all handled triangles, but the prefix sum algorithm is modified so that values are not added up across node boundaries (a segmented sum).

5 Results

In order to evaluate the quality of the produced kd-tree versus the time needed for its construction, we measure the build time and the rendering time for four scenes using different kd-trees. We decided not to write our own GPU ray tracer, as its efficiency could influence the results. Instead, we integrated our builder into RTFact [GS08] and compare our resulting trees with others, all in the same framework.

We are not interested in the time to transfer the initial data from the RTFact structures to the GPU and the constructed tree back, nor in the overhead of initialising the CUDA context. We only measure the kd-tree construction time and the resulting CPU ray tracing performance of the scene using RTFact.

We have tested our algorithm against four scenes: Sponza, Conference, one frame of Fairy Forest, and Venice (Figure 7), running on an Intel Core2 Quad Q9400 2.66GHz computer with 4GB of RAM and an NVIDIA GeForce GTX 285 GPU.



Figure 6: Work distribution at the tiny stage. Each block (e.g. the green one) covers M triangles, but the first W triangles are also covered by the previous block (blue) and the last W triangles by the next block (violet). Nodes 54, 55 and 56 are assigned to the green block. Node 57, although still in range of the green block, is already covered entirely by the violet one and is assigned there.

Figure 7: The four scenes we tested our algorithm with: Sponza (67 000 triangles), Fairy Forest (174 000 triangles), Conference (283 000 triangles), and Venice (996 000 triangles).

5.1 Overall performance

The execution time and tree quality heavily depend on the chosen traverse-to-intersection cost ratio Ct/Ci. Throughout our program we set Ci to 1 and modify only Ct. Depending on the number of rays in a packet, in our environment we found values of Ct between 2 and 4 to be a good trade-off between tree quality and construction time (Table 2).

For three of the tested scenes and low Ct values, our program performs 30-60 times faster than a full-quality CPU construction, with only a minor loss of rendering performance (≤ 15%). In addition, construction time may be reduced a few times further by using a high traverse cost Ct, but as a result rendering takes about twice as much time.

So far the only successful kd-tree builders on a GPU that we are aware of are the one of Zhou et al. [ZHWG08] and its memory-bound derivative by Hou et al. [HSZ∗09]. As already described in Section 2.2, their algorithm uses spatial median splits or cuts off empty space on the top levels of the tree until a threshold of 64 triangles per node is reached. This median split approach, especially for big scenes with unevenly distributed triangles, may generally lead to a performance drop in ray tracing by a factor of 3 to 5 when compared to the SAH [Hav00], but we have no explicit data to support that this is the case with their algorithm.

In [ZHWG08] their algorithm was tested on another NVIDIA card, a GeForce 8800 ULTRA. That card offers only around half the number of multiprocessors of the GTX 285, but its multiprocessors are slightly faster. On the GeForce 8800 ULTRA their program needs 77 milliseconds to build the kd-tree for the Fairy Forest. Hou et al. [HSZ∗09] test their GPU algorithm on a GTX 280 and produce the tree in 58 milliseconds.

Scene     | Full  | Zhou   | Ct=1.3 | Ct=2.0 | Ct=4.0 | Ct=16.0
Sponza    | 1072  | -      | 37     | 29     | 21     | 11
  render  | 189   | -      | 204    | 212    | 244    | 384
Fairy     | 3596  | 77*/58 | 57     | 48     | 38     | 32
  render  | 244   | 245    | 238    | 243    | 277    | 434
Conf.     | 5816  | -      | 113    | 93     | 69     | 39
  render  | 196   | -      | 208    | 217    | 238    | 344
Venice    | 21150 | -      | N/A    | N/A    | 272    | 183
  render  | 250   | -      | N/A    | N/A    | 416    | 656

Table 2: Building performance (the top number of each pair of rows) and RTFact CPU rendering performance (the bottom number), depending on the algorithm used and the traverse cost Ct. All values are in milliseconds. For rendering, 3 arbitrarily chosen viewpoints per scene were tested, shooting only primary rays with no shading procedure; the table reports the average results. The "Full" column represents the full-quality SAH CPU builder used in RTFact. "Zhou" is the kd-tree builder from [ZHWG08] and [HSZ∗09]; their construction times were measured on a GeForce 8800 ULTRA (*) and a GTX 280. For Ct equal to 2.0 and 1.3, the Venice scene tree and duplicated data did not fit our GTX 285, so no construction times are available.

As shown in Table 2, using our algorithm we can choose Ct = 2.0 and construct a tree of slightly higher quality 25% faster. By setting Ct to 4 we can obtain the tree 60% faster, but with ray tracing performance reduced by 13%.

5.2 Scalability

We tested our program with respect to parallel efficiency. The efficiency is defined as

E(n) = t1 / (n · tn),

where n is the number of processors running in parallel, and tx is the execution time of the computation performed on x processors.

The CUDA environment and NVIDIA's GPUs provide two sources of parallelism: at the top level, the GPU consists of several multiprocessors which work in MIMD fashion; at the lower level, each multiprocessor consists of several cores working in SIMD. We are interested in the efficiency at the top level, that is, with respect to the number of multiprocessors.

Unfortunately, we are unable to shut down just a part of the GPU. Instead, we can carefully configure which blocks work and which stay idle, so that only the desired number of multiprocessors are working. We cannot, however, control the bandwidth of the memory controller, which can bias our results downwards.

Kun Zhou et al. [ZHWG08] also analyse the scalability of their implementation. As a comparison base they test the program on 16 cores and compare it with a 128-core run (which is 8 times more, thus N := 8) of their GeForce 8800 ULTRA. With the exception of smaller scenes, where their parallelism is less efficient, their speedup is between 4.9 and 6.1 (an efficiency of 61%-76%). Our algorithm for N = 8 shows a significantly better speedup (6.2-6.9) and parallel efficiency (around 80%).

N  | Sponza       | Fairy        | Conference
1  | 289.0 ms     | 594.4 ms     | 1237.9 ms
2  | 0.96 (×1.9)  | 0.95 (×1.9)  | 0.97 (×1.9)
4  | 0.89 (×3.6)  | 0.89 (×3.6)  | 0.92 (×3.7)
8  | 0.77 (×6.2)  | 0.80 (×6.4)  | 0.86 (×6.9)
16 | 0.61 (×9.7)  | 0.65 (×10.4) | 0.73 (×11.7)
30 | 0.46 (×13.8) | 0.52 (×15.6) | 0.60 (×17.9)

Table 3: Parallel efficiency and speedup for a given number of working multiprocessors N. For N = 1, the absolute run time is given and the efficiency is 1 by definition.

The parallel CPU SAH builder of Choi et al. [CKL∗09] was tested only on a few initial steps of the kd-tree construction. They report speedups between 4.8 and 7.1 for 16 cores (hence an efficiency of 30%-44%), which is also much lower than our performance for N = 16.

Still, for higher N values our efficiency is far from unity. The main reason is that many control functions take too little data to be efficiently processed on multiple processors. Most control functions perform a prefix sum operation (e.g. the computation of base pointers in Section 3.2). While fast prefix sum algorithms exist (e.g. [BOA09]), they are optimised for data sets of several million elements, whereas in our case the input data consists of a few to a few thousand elements.

Tests have shown that for N = 1, on the Fairy Forest, these control functions take 18ms (3% of the total run time), while for N = 30 they take 11ms (29% of the total run time), becoming the bottleneck and the source of the loss of efficiency. As expected, the efficiency improves with the scene size, since there are more primitives to be processed in the main parallel code.

5.3 Memory usage

Since all memory allocation operations are prohibitively slow on the GPU, we preallocate memory before the algorithm starts. Although memory-aware algorithms for kd-tree construction exist (e.g. [HSZ∗09], [WK07]), we simply assume we never cross the predefined limits. If N is the number of input triangles, we assume that the total tree size and the number of triangle references never exceed 4N, and that there are never more than 2N active nodes at each step of the algorithm to be processed in parallel.

Table 4 shows how much memory is needed for the particular objects used by the program. Since triangles are only referenced, never duplicated, no additional memory needs to be allocated for their vertices. The bounding boxes keep a separate entry for each triangle instance. In addition, as shown in Figure 1, we use three such arrays, which triples our memory demand.

Taking into account all additional structures needed by the algorithm, the total memory usage can be estimated at around 0.85KB per triangle. This means that on a 1GB GPU, only scenes with up to 1.2 million triangles can be processed.

Object                     | Element size | Count  | Venice
Triangle vertices          | 48 B         | N      | 45 MB
Triangle bounding boxes    | 28 B         | 3 × 4N | 320 MB
Kd-tree data               | 32 B         | 4N     | 122 MB
Huge stage bins            | 260 B        | 8P     | 61 KB
Active node control arrays | 52 B         | 3 × 2N | 297 MB
Prefix sum arrays          | 20 B         | 2N     | 38 MB
Total                      |              |        | 822 MB

Table 4: Memory usage of the particular components of the algorithm. As an example, the memory usage for the Venice test scene is given.

6 Future work

Although the above basic algorithm is fast, it consumes a lot of memory. Several techniques can be explored to significantly reduce the memory requirements. For example, after a few initial steps, a part of the scene could be scheduled for later processing, with the algorithm focusing only on another part. This would allow us to significantly reduce the limits on the number of nodes and triangles being processed at a time, with only a minimal impact on the overall performance.

Another option for handling big scenes would be to split the work between several GPUs. The split could be performed after a few iterations of the presented algorithm with very low limits, or, if that already uses too much memory, by performing a fast low-quality split on a CPU and uploading the data to separate GPUs.

So far we have discussed the kd-tree builder as a separate entity which, in our case, is used with a CPU ray tracer. A much more efficient solution would be to closely integrate the builder and the renderer, so that data would not have to be transferred or transformed from one representation to another. Close builder-renderer integration also opens the possibility of lazy construction of the kd-tree, which could speed up animations even further.

7 Conclusion

We have presented an efficient kd-tree construction algorithm. The builder uses the SAH on all levels, producing high-quality trees while outperforming all other known CPU and GPU implementations. The program is highly scalable, so we believe it will also map well to future GPU architectures, taking advantage of most of their computing power.

References

[AL09] Aila T., Laine S.: Understanding the efficiency of ray traversal on GPUs. In Proceedings of High-Performance Graphics 2009 (2009).

[Ben75] Bentley J. L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18, 9 (1975), 509–517.

[BOA09] Billeter M., Olsson O., Assarsson U.: Efficient stream compaction on wide SIMD many-core architectures. In Conference on High Performance Graphics (2009), pp. 159–166.

[CKL∗09] Choi B., Komuravelli R., Lu V., Sung H., Bocchino R. L., Adve S. V., Hart J. C.: Parallel SAH k-D Tree Construction for Fast Dynamic Scene Ray Tracing. Tech. rep., University of Illinois, 2009.

[GS08] Georgiev I., Slusallek P.: RTfact: Generic Concepts for Flexible and High Performance Ray Tracing. In IEEE/Eurographics Symposium on Interactive Ray Tracing (Aug. 2008).

[Hav00] Havran V.: Heuristic Ray Shooting Algorithms. Ph.D. thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, November 2000.

[HMS06] Hunt W., Mark W. R., Stoll G.: Fast kd-tree construction with an adaptive error-bounded heuristic. In IEEE Symposium on Interactive Ray Tracing (September 2006), pp. 81–88.

[HSZ∗09] Hou Q., Sun X., Zhou K., Lauterbach C., Manocha D., Guo B.: Memory-Scalable GPU Spatial Hierarchy Construction. Tech. rep., Tsinghua University, 2009.

[Kap85] Kaplan W. N.: Space-tracing: A constant time ray-tracer. Computer Graphics 19 (1985).

[MB90] MacDonald D. J., Booth K. S.: Heuristics for ray tracing using space subdivision. The Visual Computer 6, 3 (1990), 153–166.

[NBGS08] Nickolls J., Buck I., Garland M., Skadron K.: Scalable parallel programming with CUDA. Queue 6, 2 (2008), 40–53.

[PGSS06] Popov S., Günther J., Seidel H.-P., Slusallek P.: Experiences with streaming construction of SAH KD-trees. In IEEE Symposium on Interactive Ray Tracing (September 2006), pp. 89–94.

[SH74] Sutherland I. E., Hodgman G. W.: Reentrant polygon clipping. Communications of the ACM 17, 1 (1974), 32–42.

[SSK07] Shevtsov M., Soupikov A., Kapustin A.: Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes. Computer Graphics Forum 26, 3 (2007), 395–404.

[WH06] Wald I., Havran V.: On building fast kd-trees for ray tracing, and on doing that in O(N log N). In Symposium on Interactive Ray Tracing (2006), pp. 61–69.

[WK07] Wächter C., Keller A.: Terminating spatial hierarchies by a priori bounding memory. In IEEE Symposium on Interactive Ray Tracing (2007), pp. 41–46.

[ZHWG08] Zhou K., Hou Q., Wang R., Guo B.: Real-time kd-tree construction on graphics hardware. In ACM SIGGRAPH Asia 2008 papers (2008), ACM, pp. 1–11.
